About
SoFAIR project
To this day, much open research software fails to meet FAIR principles and software resources are mostly not explicitly linked from the manuscripts that introduced them or used them.
SoFAIR will improve and semi-automate the process for identifying, describing, registering and archiving research software, ensuring it has received a Software Heritage persistent identifier (SWH-ID). The solution extensively builds on enhancing existing widely used open source tools (GROBID) and open scholarly infrastructures (CORE, Europe PMC, HAL, Software Heritage), which are operated by the consortium members. This ensures fast and wide adoption of the project’s outputs across the global scholarly ecosystem of open repositories, offering tangible pathways to impact.
Petr Knoth – SoFAIR Project Coordinator
Long-term vision
- To incentivise good practice in the curation of software assets, we need to treat research software as first-class bibliographic records.
- The most widely used approaches to sharing research software typically provide no guarantee of permanence or reproducibility.
- SoFAIR will establish a machine-assisted workflow embedded into widely used open scholarly infrastructures to assist researchers in identifying, describing, registering, linking and archiving research software.
Ambition
- Develop a machine-learning-assisted workflow for the software asset lifecycle, covering all the steps: 1) identifying software mentions in research manuscripts, 2) having authors validate them, and 3) registering them with PIDs and archiving them where needed.
- Embed this workflow into established scholarly infrastructures, making the solution available to the global network of open repositories, covering tens of millions of open access research papers originating from more than 10k repository systems.
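The three steps above can be pictured as a small state machine in which a software mention only moves forward once the previous step has succeeded. The following is a minimal sketch, not the SoFAIR implementation: the `SoftwareMention` record, the `Stage` names and the `validate` and `register` helpers are all hypothetical, and the only external fact it relies on is the core SWHID syntax (`swh:1:<type>:<40 hex digits>`).

```python
import re
from dataclasses import dataclass
from enum import Enum
from typing import Optional

# Core SWHID syntax: scheme, version 1, object type, 40 hex digits.
SWHID_RE = re.compile(r"^swh:1:(cnt|dir|rev|rel|snp):[0-9a-f]{40}$")


class Stage(Enum):
    IDENTIFIED = 1  # step 1: mention extracted from a manuscript
    VALIDATED = 2   # step 2: confirmed by an author
    REGISTERED = 3  # step 3: linked to a persistent identifier


@dataclass
class SoftwareMention:
    """Hypothetical record for illustration; not a SoFAIR data model."""
    paper_id: str
    software_name: str
    stage: Stage = Stage.IDENTIFIED
    swhid: Optional[str] = None


def validate(mention: SoftwareMention, author_confirms: bool) -> bool:
    """Step 2: advance only if an author confirms the mention."""
    if mention.stage is Stage.IDENTIFIED and author_confirms:
        mention.stage = Stage.VALIDATED
        return True
    return False


def register(mention: SoftwareMention, swhid: str) -> None:
    """Step 3: attach a PID, but only to an already validated mention."""
    if mention.stage is not Stage.VALIDATED:
        raise ValueError("mention must be validated before registration")
    if not SWHID_RE.match(swhid):
        raise ValueError(f"not a well-formed core SWHID: {swhid!r}")
    mention.swhid = swhid
    mention.stage = Stage.REGISTERED
```

Keeping author validation as an explicit gate before registration reflects the ordering in the list above: a machine-extracted mention is treated as a candidate until a person has confirmed it.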
Tools and infrastructures
- GROBID [developed by Patrice Lopez, subcontracted in this project by INRIA]: an open source machine learning library for extracting, parsing and re-structuring raw documents such as PDFs into structured XML/TEI-encoded documents, with a particular focus on technical and scientific publications.
- CORE (core.ac.uk) [developed and operated by OU]: CORE (Knoth & Zdrahal, 2012) is the world’s largest aggregator of open access research papers from repositories and journals, used by around 30 million monthly active users.
- HAL [operated by INRIA]: the primary open repository in France for scientific articles. It also serves as a repository for INRIA researchers, hosting over a million full-text research papers.
- Software Heritage [developed and operated by INRIA]: collects, preserves, and shares all software that is publicly available in source code form.
Solution
- The solution will be optimised for deployment over open content available through the global network of open repositories aggregated by CORE (core.ac.uk). With over 32 million full texts and 250m+ metadata records from over 10k repositories, CORE is currently the world’s largest collection of open access documents.
- Our ML software for extraction and disambiguation of software assets will be realised as an extension of the state-of-the-art GROBID tool. We will build on established protocols, such as OpenAIRE Guidelines v4.0, RIOXX v3 and CodeMeta, to encode information about software assets and their links to research manuscripts, establishing an interoperable and extensible workflow that connects open repositories (represented by HAL), aggregators (represented by CORE) and software archives (represented by Software Heritage).
- The efficacy of the developed tools and workflow will be validated in three use cases: 1) a life sciences demonstrator (for Europe PMC), 2) a multi-disciplinary demonstrator for institutional repositories (represented by HAL) and 3) a digital humanities case study (with links to DARIAH and EOSC).
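To make the idea of encoding a software asset and its link to a manuscript more concrete, here is a minimal sketch of a CodeMeta-style JSON-LD record built in Python. It is an illustration under stated assumptions, not the encoding the project has settled on: the `build_codemeta` helper is hypothetical, every value passed to it is a placeholder, and only a handful of CodeMeta and schema.org terms (`name`, `codeRepository`, `identifier`, `referencePublication`) are used.

```python
import json

# Public CodeMeta 2.0 JSON-LD context.
CODEMETA_CONTEXT = "https://doi.org/10.5063/schema/codemeta-2.0"


def build_codemeta(name, repository_url, swhid, paper_url):
    """Hypothetical helper: describe one software asset and the paper that mentions it."""
    return {
        "@context": CODEMETA_CONTEXT,
        "@type": "SoftwareSourceCode",
        "name": name,
        "codeRepository": repository_url,
        # Persistent identifier of the archived source code.
        "identifier": swhid,
        # Link from the software back to the research manuscript.
        "referencePublication": {
            "@type": "ScholarlyArticle",
            "url": paper_url,
        },
    }


# All values below are placeholders, not real records.
record = build_codemeta(
    name="ExampleTool",
    repository_url="https://example.org/code/exampletool",
    swhid="swh:1:dir:" + "0" * 40,
    paper_url="https://example.org/papers/1",
)
print(json.dumps(record, indent=2))
```

A record of this shape could in principle travel between a repository, an aggregator and a software archive unchanged, which is the interoperability property the paragraph above is aiming for.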