Publications
Rosiński, Cezary; Rudnicka, Ewa; Pride, David; Monteil, Alain; Oleksy, Marcin; Foppiano, Luca; Docekal, Martin; Cancellieri, Matteo; Scalbert, Samuel; Romary, Laurent; Umerle, Tomasz and Knoth, Petr
This is a multidisciplinary dataset of research papers, annotated with software mentions, built by the SoFAIR project in collaboration with the CLARIN-PL team. The SoFAIR project aims to support the reproducibility of research and the reuse of data and to recognise the contribution of software developers. The dataset was created through a workflow consisting of text selection, automatic pre-annotation of documents using the Softcite…
Rosiński, Cezary; Rudnicka, Ewa; Pride, David; Monteil, Alain; Oleksy, Marcin; Foppiano, Luca; Docekal, Martin; Cancellieri, Matteo; Scalbert, Samuel; Romary, Laurent; Umerle, Tomasz and Knoth, Petr (2025). SoFAIR Dataset: A Multidisciplinary Dataset of Research Papers Annotated with Software Mentions. Submitted to Nature Scientific Data, November 2025.
Identifying and Classifying Software Mentions in Full-Text Scholarly Documents
Pride, David; Guenci, Matteo; Docekal, Martin; Peroni, Silvio and Knoth, Petr
In this paper, we evaluate several LLM-based approaches using three gold-standard corpora, comparing prompting strategies and configurations against established baselines. Our contributions are threefold: (1) we provide the first systematic evaluation of LLMs for software mention extraction, (2) we analyse their strengths and weaknesses relative to prior techniques, and (3) we discuss implications for reproducibility and open science. Results show that LLMs significantly improve extraction accuracy and adaptability, advancing efforts to integrate software into the scholarly record.
Pride, David; Guenci, Matteo; Docekal, Martin; Peroni, Silvio and Knoth, Petr (2025). Identifying and Classifying Software Mentions in Full-Text Scholarly Documents. In: 25th ACM/IEEE Joint Conference on Digital Libraries (JCDL 2025), 15-19 December 2025.
Cancellieri, Matteo; Docekal, Martin; Pride, David; Gruenpeter, Morane; Douard, David and Knoth, Petr
The discoverability, attribution, and reusability of open research software are often hindered by its obscurity within academic manuscripts. To address this, the SoFAIR project (2024–2025) introduces a comprehensive workflow leveraging machine learning tools for extracting software mentions from research papers. The project integrates repository systems, authors, and services like HAL and Software Heritage to ensure proper archiving, citation, and accessibility of research software in alignment with FAIR principles. To enable interoperable communication across the various systems we present an integration of the COAR Notify Protocol…
Cancellieri, Matteo; Docekal, Martin; Pride, David; Gruenpeter, Morane; Douard, David and Knoth, Petr (2025). Interoperable verification and dissemination of software assets in repositories using COAR Notify. In: The 20th International Conference on Open Repositories (Pride, David; Cancellieri, Matteo and Knoth, Petr eds.), 18-21 Jun 2025, Chicago, Illinois, USA.
Knoth, Petr; Romary, Laurent; Lopez, Patrice; Di Cosmo, Roberto; Smrz, Pavel; Umerle, Tomasz; Harrison, Melissa; Monteil, Alain; Cancellieri, Matteo and Pride, David
A key issue hindering discoverability, attribution and reusability of open research software is that its existence often remains hidden within the manuscript of research papers. For these resources to become first-class bibliographic records, they first need to be identified and subsequently registered with persistent identifiers (PIDs) to be made FAIR (Findable, Accessible, Interoperable and Reusable). To this day, much open research software fails to meet FAIR principles and software resources are mostly not explicitly linked from the manuscripts that introduced…
Knoth, Petr; Romary, Laurent; Lopez, Patrice; Di Cosmo, Roberto; Smrz, Pavel; Umerle, Tomasz; Harrison, Melissa; Monteil, Alain; Cancellieri, Matteo and Pride, David (2024). Making Software FAIR: A machine-assisted workflow for the research software lifecycle. In: 19th International Conference on Open Repositories (OR2024), 3-6 Jun 2024, Göteborg, Sweden.