SoFAIR Dataset in the Making: Challenges of Human Annotation of Software Mentions
The SoFAIR dataset, planned for publication in a data repository in the upcoming weeks, is a multidisciplinary dataset of manually annotated mentions of software in scientific publications. It will be unique not only for its multidisciplinary nature, but also for how densely and comprehensively the software mentions have been annotated. To carry out this task, the SoFAIR project established a collaboration with the Polish node of CLARIN ERIC (CLARIN-PL).
Annotating software mentions in scientific publications is a complex endeavor. Here are some of the key challenges SoFAIR encountered.
- Definition. Establishing a precise definition of a “software mention” proved ambiguous across scientific contexts. Annotators had to decide whether every software name should be included or only those that play a central role in the research described. Similar issues arose with the category of “databases,” where the context of usage often dictated whether they should be classified as software. Moreover, identifying “implicit software” (software that is used in the research but never explicitly named) presented another significant challenge.
- Guidelines vs. reality. The annotation guidelines were a key element of the process; however, their practical application revealed ambiguities that required continuous refinement based on team feedback and discussion. The annotation itself followed a 2+1 model: two independent annotators plus a super-annotator who resolved discrepancies, all designed to produce a high-quality gold standard. This approach proved effective across diverse scientific domains. Inter-annotator agreement varied by field and iteration, and these differences highlighted the nuanced ways in which disciplinary contexts shape the identification and interpretation of software mentions. Notably, shifts in agreement often aligned with changes in the dominant discipline, offering valuable insight into how domain-specific characteristics affect annotation outcomes. (A minimal sketch of how such agreement can be measured follows this list.)
- Expertise demands. It became evident that effective annotation demands interdisciplinary expertise, encompassing not only linguistics but also computer science and other scientific disciplines. Initially, working on lengthy texts without pre-annotation proved time-consuming. In response, the team developed a list of “triggers” (keywords that frequently appear in the context of software mentions) to streamline the identification process; see the trigger-matching sketch after this list.
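
To make the agreement discussion above more concrete, here is a minimal Python sketch of Cohen's kappa computed over token-level labels from two annotators. The BIO-style labels and the example data are hypothetical, assumed purely for illustration, and are not taken from the SoFAIR annotations.

```python
# Minimal sketch: chance-corrected agreement (Cohen's kappa) between two
# annotators who labeled the same tokens. Labels and data are illustrative.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two label sequences over the same tokens."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: share of tokens with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical BIO-style labels over the same ten tokens.
ann_1 = ["O", "O", "B-software", "I-software", "O", "O", "B-software", "O", "O", "O"]
ann_2 = ["O", "O", "B-software", "O",          "O", "O", "B-software", "O", "O", "O"]
print(f"kappa = {cohens_kappa(ann_1, ann_2):.2f}")
```

A chance-corrected measure like kappa is a common choice here because raw percent agreement is inflated when most tokens carry the majority "O" label.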
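The trigger idea from the last bullet can likewise be illustrated with a small sketch that surfaces sentences containing keywords often found near software mentions. The trigger list, function names, and example text below are illustrative assumptions, not the project's actual list or tooling.

```python
import re

# Illustrative trigger list; the real list would come from the annotation team.
TRIGGERS = ["software", "package", "toolkit", "implemented in", "version", "library", "plugin"]
PATTERN = re.compile("|".join(re.escape(t) for t in TRIGGERS), re.IGNORECASE)

def candidate_sentences(text):
    """Yield sentences containing at least one trigger, for annotators to review first."""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if PATTERN.search(sentence):
            yield sentence

example = ("We aligned the reads with a custom script. "
           "Statistical analysis was performed in R version 4.2 using the lme4 package.")
for sentence in candidate_sentences(example):
    print(sentence)
```

Keeping the matching deliberately broad is the point of such a filter: annotators still verify every candidate, but they no longer have to scan long passages that contain no likely mentions at all.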
Analysis of the results revealed that manual annotation significantly increased the granularity of the identified software mention types compared to automatic pre-annotation with the Softcite model, enabling the differentiation of more detailed categories.
We will deliver a more comprehensive description of the annotation challenges in a paper submitted to the CLARIN 2025 Annual Conference. Stay tuned for that, and for the publication of the dataset in the upcoming weeks!