SoFAIR

Making Software FAIR: A machine-assisted workflow for the research software lifecycle

Project main objective

A key issue hindering the discoverability, attribution and reusability of open research software is that its existence often remains hidden within the text of research manuscripts. For these resources to become first-class bibliographic records, they first need to be identified and then registered with persistent identifiers (PIDs) so that they become FAIR (Findable, Accessible, Interoperable and Reusable).

This project will extend the capabilities of critical and widely used open scholarly infrastructures (CORE, Software Heritage, HAL) and tools (GROBID) operated by the consortium partners, delivering and deploying an effective solution for the management of the research software lifecycle.

We are working with millions of full-text documents, millions of metadata records and thousands of repositories.

To incentivise good practices of software assets curation, we need to treat research software as first-class bibliographic records.

Petr Knoth – Founder & Head of CORE

The key innovations of the project are:

A novel machine-assisted workflow for software asset lifecycle management

New machine learning models for software mention extraction and disambiguation

Scalable application of the technology across open repositories and relevance to both pre-existing and new software assets

SoFAIR will address software mention extraction and disambiguation by extending the training data and the Softcite models to new domains and by experimenting with recent supervised machine learning techniques for entity disambiguation, in particular graph-based similarity techniques for entity matching and alignment.
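To make the idea of graph-based similarity for entity matching more concrete, here is a minimal sketch in Python. It is not the SoFAIR or Softcite implementation: it swaps the supervised models for a simple unsupervised stand-in (Jaccard similarity over character trigrams) and groups mentions by taking connected components of the resulting similarity graph. The function names, the trigram normalisation and the 0.5 threshold are all assumptions chosen for illustration.

```python
from itertools import combinations


def char_ngrams(name: str, n: int = 3) -> set[str]:
    """Return the character n-grams of a lowercased, alphanumeric-only name."""
    text = "".join(ch for ch in name.lower() if ch.isalnum())
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_mentions(mentions: list[str], threshold: float = 0.5) -> list[set[str]]:
    """Group software mentions that likely refer to the same entity.

    Mentions are nodes; an edge links two mentions whose trigram similarity
    reaches `threshold` (a hypothetical value). Each connected component of
    that graph is treated as one software entity.
    """
    # Union-find structure over the distinct mention strings.
    parent = {m: m for m in mentions}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    grams = {m: char_ngrams(m) for m in parent}
    for a, b in combinations(parent, 2):
        if jaccard(grams[a], grams[b]) >= threshold:
            parent[find(a)] = find(b)  # merge the two components

    clusters: dict[str, set[str]] = {}
    for m in parent:
        clusters.setdefault(find(m), set()).add(m)
    return list(clusters.values())
```

Under these assumptions, calling cluster_mentions on "GROBID", "Grobid", "grobid 0.7", "Softcite", "softcite" and "SciPy" yields three groups, one per tool. A production system would replace the trigram score with a learned similarity and use richer context than the mention string alone.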

SoFAIR will make possible:

ML-assisted identification of research software assets in research manuscripts.

Validation of the identified assets by authors.

Registration of software assets with PIDs and their archival.