In the following, we first describe the data sets used in this work. Afterwards, the six applied algorithms are briefly described.
Data sets

In this section, we first describe the two main disease NER data sets (NCBI and BC5CDR), which follow the same annotation guidelines and are of comparable size. Thereafter, the three additional data sets are described, of which only one follows the same guidelines. The NCBI and BC5CDR corpora both consist of PubMed abstracts with manually curated disease annotations. The NCBI corpus, accompanied by detailed annotation guidelines, was released first. For the creation of the BC5CDR corpus, the previously published NCBI disease guidelines were re-used; the authors stated that “whenever possible, we will follow closely the guidelines of constructing NCBI disease corpus for annotating disease mentions” [21].
The NCBI Disease corpus was released by the National Center for Biotechnology Information (NCBI) and is “fully annotated at the mention and concept level to serve as a research resource for the biomedical natural language processing community” [2]. It contains 739 PubMed abstracts with a total of 6,892 disease mentions, annotated by a total of 14 annotators. Two annotators were given the same data so that a double-annotation could be performed. The inter-annotator agreement was determined by means of the F1-score (see Section 5.3) for each pair of annotators. The average F1-score amounts to 88% [2].
The BioCreative V Chemical Disease Relation (BC5CDR) corpus was released within the BioCreative initiative. It contains 1,500 abstracts with disease and chemical annotations at mention level as well as their interactions (relations). In total, the data set contains 12,848 disease mentions [3]. For the present work, only the disease mentions are used. Here, the inter-annotator agreement was determined by means of the Jaccard index, which divides the overlap of both annotation sets by the size of their union [22]; the Jaccard distance is obtained by subtracting the index from one. The inter-annotator agreement amounts to 87.49% [3].
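As an illustration of this agreement measure, the following minimal sketch (our own example with made-up annotation spans, not data from the corpus) computes the Jaccard index and distance between two annotators' span sets:

```python
# Minimal sketch: Jaccard index/distance between two annotators' span sets.
# Spans are (start offset, end offset) pairs and purely illustrative.
ann_a = {(10, 18), (42, 49)}   # annotator A: two disease mentions
ann_b = {(10, 18), (60, 66)}   # annotator B: one shared, one different mention

jaccard_index = len(ann_a & ann_b) / len(ann_a | ann_b)   # overlap / union
jaccard_distance = 1 - jaccard_index

print(jaccard_index, jaccard_distance)  # 0.333..., 0.666...
```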
The NCBI training set consists of 593 abstracts, the BC5CDR training set of 500. They are also very similar in terms of unique mentions and concepts: the NCBI training set contains 632 unique concepts, the BC5CDR training set 649. The test sets, however, differ considerably in size. The NCBI Disease test set consists of only 100 abstracts, whereas the BC5CDR test set again comprises 500 abstracts and therefore contains significantly more unique mentions and concepts. A detailed overview is given in Table 5.
Table 5 Statistics of the used disease entity recognition data sets

In our work, we analyze and compare these data sets on different levels: on mention level, on concept level and based on the whole corpus. For the latter, we apply the tool scattertext to visualize the linguistic variations [13]. In addition, we use the randomly generated PubMed corpus to perform a linguistic variation analysis between the annotated corpora and PubMed. This corpus was generated by randomly choosing 500 abstracts from all PubMed abstracts with a publication date between 1990 and 2021 (a total of 23,631,092 articles).
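A minimal sketch of such a corpus-level comparison with scattertext (our own illustration, not the authors' analysis script; the abstract lists, data frame columns, category names and the spaCy model are assumptions):

```python
# Hedged sketch: visualize term usage differences between two corpora with
# scattertext. The two abstract lists below are placeholders.
import pandas as pd
import spacy
import scattertext as st

ncbi_abstracts = ["Mutations in BRCA1 are linked to breast cancer ...", "..."]
bc5cdr_abstracts = ["Carbamazepine-induced cardiac dysfunction ...", "..."]

df = pd.DataFrame({
    "text": ncbi_abstracts + bc5cdr_abstracts,
    "corpus": ["NCBI"] * len(ncbi_abstracts) + ["BC5CDR"] * len(bc5cdr_abstracts),
})

nlp = spacy.load("en_core_web_sm")  # any spaCy model with a tokenizer
corpus = st.CorpusFromPandas(df, category_col="corpus", text_col="text",
                             nlp=nlp).build()

html = st.produce_scattertext_explorer(
    corpus,
    category="NCBI",
    category_name="NCBI Disease",
    not_category_name="BC5CDR",
)
with open("ncbi_vs_bc5cdr.html", "w") as f:
    f.write(html)  # interactive scatter plot of corpus-specific terms
```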
As three further, related data sets, we use the miRNA-disease corpus [6], the BioNLP13-CG corpus [7] and the COVID Disease corpus [12]. The miRNA-disease data set is split into a training set of 200 abstracts and a test set of 100 abstracts. The training set contains a total of 461 unique disease mentions, the test set 224. In contrast to the NCBI and BC5CDR corpora, this corpus was released with different, simplified annotation guidelines that, for example, restrict annotations to nouns. The BioNLP13-CG corpus consists of a total of 600 abstracts, split into training, development and test set; its test set contains 260 unique mentions. The COVID Disease data set is the smallest, consisting of 50 annotated abstracts. It was developed as an independent test set for disease named entity recognition models on COVID-19 related articles. Due to its focus on COVID-19, it only contains 68 unique mentions.
NER Algorithms

In this work, we investigated six different publicly available algorithms for disease named entity recognition, which are described in the following. Whereas we trained BioBERT and HUNER ourselves, we applied the other algorithms “as is”. An overview of the sources is provided in the Availability Section.
BioBERT [11] is based on Bidirectional Encoder Representations from Transformers (BERT) [10]. As pre-trained model, we used BioBERT-Base v1.0 (+ PubMed 200K + PMC 270K) published by Lee et al. [11]. For fine-tuning, we used the Transformers library [23] and PyTorch. In total, we trained five different models. First, we trained on the NCBI and BC5CDR training corpora, both individually and on their combination. For the latter setting, the batches were shuffled randomly to avoid a higher influence of one data set over the other. The training parameters, determined via cross-validation, are listed in Table 6. Additionally, we used default parameters to train two further models on the miRNA-disease and BioNLP13-CG corpora (see also Table 6).
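A minimal fine-tuning sketch (our own illustration, not the authors' training script), assuming word-level IOB disease tags; the checkpoint identifier, label set and hyperparameter values shown are placeholders, the actual hyperparameters are those listed in Table 6:

```python
# Hedged sketch: token classification fine-tuning with Transformers + PyTorch.
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          DataCollatorForTokenClassification,
                          TrainingArguments, Trainer)

labels = ["O", "B-Disease", "I-Disease"]
checkpoint = "dmis-lab/biobert-base-cased-v1.1"  # stand-in for BioBERT-Base v1.0

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForTokenClassification.from_pretrained(checkpoint,
                                                        num_labels=len(labels))

def encode(example):
    # Align word-level IOB tags to word pieces; special tokens get -100 so
    # that they are ignored by the loss function.
    enc = tokenizer(example["tokens"], is_split_into_words=True,
                    truncation=True, max_length=128)
    enc["labels"] = [-100 if w is None else example["tags"][w]
                     for w in enc.word_ids()]
    return enc

# train_data / dev_data: e.g. `datasets.Dataset` objects with the fields
# "tokens" (list of words) and "tags" (integer indices into `labels`).
args = TrainingArguments(output_dir="biobert-disease", learning_rate=5e-5,
                         per_device_train_batch_size=32, num_train_epochs=10)
collator = DataCollatorForTokenClassification(tokenizer)
# trainer = Trainer(model=model, args=args, data_collator=collator,
#                   train_dataset=train_data.map(encode),
#                   eval_dataset=dev_data.map(encode))
# trainer.train()
```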
Table 6 Hyperparameters used for fine-tuning BioBERT

scispaCy is based on the Python library spaCy [24], which includes tools for text processing in several languages. The text processing steps include, for example, sentence detection, tokenization, POS tagging and NER, for which a convolutional neural network is used. scispaCy is trained on top of spaCy for POS tagging, dependency parsing and NER using biomedical training data. The authors provide a model trained on the BC5CDR corpus to recognize diseases and chemicals. We used this model and filtered out the chemical annotations.
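A brief usage sketch (our own illustration) of the published scispaCy BC5CDR model, keeping only the disease annotations; the package en_ner_bc5cdr_md must be installed separately and the example sentence is made up:

```python
# Hedged sketch: apply the scispaCy BC5CDR model and drop chemical entities.
import spacy

nlp = spacy.load("en_ner_bc5cdr_md")  # recognizes DISEASE and CHEMICAL entities

text = "Cisplatin treatment was associated with acute renal failure."
doc = nlp(text)

diseases = [ent for ent in doc.ents if ent.label_ == "DISEASE"]  # filter CHEMICAL
for ent in diseases:
    print(ent.text, ent.start_char, ent.end_char)
```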
DNorm is a disease recognition and normalization tool [25]. It is a sequential pipeline that first applies the entity recognition tool BANNER [26], which is based on conditional random fields (CRFs), followed by an abbreviation detection tool and a normalizer. Normalization is learned following a pairwise learning-to-rank approach. We apply the provided model trained on the NCBI Disease corpus.
TaggerOne is a joint named entity recognition and normalization model consisting of “a semi-Markov structured linear classifier, with a rich feature approach for NER and supervised semantic indexing for normalization” [27]. The authors provide three different models: one trained on the NCBI Disease corpus, one trained on the BC5CDR Disease corpus and one trained on both of them simultaneously.
Stanza is a Python package for building machine learning-based NLP pipelines (including, for example, tokenizers, POS taggers and NER modules) for 70 different languages [28]. Zhang et al. published biomedical and clinical English model packages [29]. Optimized models exist for both the NCBI and the BC5CDR corpus and are used in this study.
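A hedged usage sketch (our own illustration): the package and NER model identifiers below follow the Stanza biomedical documentation as we recall it and should be treated as assumptions, as should the example sentence.

```python
# Hedged sketch: run a Stanza biomedical NER model for disease mentions.
import stanza

# Assumed identifiers: "craft" syntactic package, "ncbi_disease" NER model.
stanza.download("en", package="craft", processors={"ner": "ncbi_disease"})
nlp = stanza.Pipeline("en", package="craft", processors={"ner": "ncbi_disease"})

doc = nlp("The patient developed severe pulmonary hypertension.")
for ent in doc.entities:
    print(ent.text, ent.type, ent.start_char, ent.end_char)
```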
HUNER, developed by Weber et al., uses an LSTM-CRF-based architecture that is pre-trained in a semi-supervised manner and afterwards fine-tuned on a specific corpus/entity class [30]. To apply HUNER to our use case, we downloaded the disease-all model and fine-tuned it on both the NCBI and the BC5CDR corpus, following the instructions of the authors (https://github.com/hu-ner/huner).
An overview of all available and/or trained models is given in Table 1.
Evaluation Metrics

We determine precision, recall and F1-score to evaluate the models. The equations are given below, where TP stands for true positive, FP for false positive and FN for false negative. To ensure consistency, we use a publicly available evaluation script (the CoNLLEval script) that was released by the Conference on Computational Natural Language Learning (CoNLL) together with a shared task; it is available at https://github.com/sighsmile/conlleval. The script requires the input data to be in the IOB format, where each token is labeled as B for beginning, I for inside or O for outside of an entity. Evaluation is done on entity level only, not on concept level, and we only take exact matches into account.
$$\begin{aligned} precision = \frac{TP}{TP + FP} \end{aligned}$$
(1)
$$\begin{aligned} recall = \frac{TP}{TP + FN} \end{aligned}$$
(2)
$$\begin{aligned} F1\text{-}score = 2 * \frac{precision * recall}{precision + recall} \end{aligned}$$
(3)
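To make the IOB format and the exact-match evaluation concrete, the following minimal sketch (our own illustration with made-up tags and counts) labels one sentence in IOB format and computes the metrics for a prediction that misses the exact gold span:

```python
# Hedged sketch: IOB labels and exact-match precision/recall/F1 counts.
tokens = ["Hereditary", "breast", "cancer", "is", "rare", "."]
gold   = ["B", "I", "I", "O", "O", "O"]   # one gold disease entity
pred   = ["O", "B", "I", "O", "O", "O"]   # predicted span is not an exact match

# With exact matching this yields TP = 0, FP = 1 (spurious predicted entity)
# and FN = 1 (missed gold entity).
tp, fp, fn = 0, 1, 1
precision = tp / (tp + fp) if tp + fp else 0.0
recall = tp / (tp + fn) if tp + fn else 0.0
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
print(precision, recall, f1)  # 0.0 0.0 0.0
```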