Evaluating gold standard corpora against gene/protein tagging solutions and lexical resources


Rebholz-Schuhmann, Dietrich; Kafkas, Senay; Kim, Jee-hyub; Li, Chen; Yepes, Antonio Jimeno; Hoehndorf, Robert; Backofen, Rolf; Lewin, Ian (2013). Evaluating gold standard corpora against gene/protein tagging solutions and lexical resources. Journal of Biomedical Semantics, 4:28.

Abstract

MOTIVATION: The identification of protein and gene names (PGNs) from the scientific literature requires semantic resources: terminological and lexical resources deliver the term candidates into PGN tagging solutions, and the gold standard corpora (GSC) train them to identify term parameters and contextual features. Ideally all three resources, i.e. corpora, lexica and taggers, cover the same domain knowledge, and thus support identification of the same types of PGNs and cover all of them. Unfortunately, none of the three serves as a predominant standard, and for this reason it is worth exploring how these three resources comply with each other. We systematically compare different PGN taggers against publicly available corpora and analyze the impact of the included lexical resource on their performance. In particular, we determine the performance gains through false positive filtering, which contributes to the disambiguation of identified PGNs.

RESULTS: In general, machine learning approaches (ML-Tag) for PGN tagging show higher F1-measure performance against the BioCreative-II and Jnlpba GSCs (exact matching), whereas the lexicon-based approaches (LexTag) in combination with disambiguation methods show better results on FsuPrge and PennBio. The ML-Tag solutions balance precision and recall, whereas the LexTag solutions have different precision and recall profiles at the same F1-measure across all corpora. Higher recall is achieved with larger lexical resources, which also introduce more noise (false positive results). The ML-Tag solutions perform best if the test corpus is from the same GSC as the training corpus. As expected, the false negative errors characterize the test corpora and, on the other hand, the profiles of the false positive mistakes characterize the tagging solutions. LexTag solutions that are based on a large terminological resource in combination with false positive filtering produce better results and, in addition, provide concept identifiers from a knowledge source, in contrast to ML-Tag solutions.

CONCLUSION: The standard ML-Tag solutions achieve high performance, but not across all corpora, and thus should be trained using several different corpora to reduce possible biases. The LexTag solutions have different profiles for their precision and recall performance, but with similar F1-measure. This result is surprising and suggests that they cover a portion of the most common naming standards, but cope differently with the term variability across the corpora. The false positive filtering applied to LexTag solutions does improve the results by increasing their precision without significantly compromising their recall. The harmonisation of the annotation schemes in combination with standardized lexical resources in the tagging solutions will enable their comparability and will pave the way for a shared standard.
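
The exact-matching evaluation and the effect of false positive filtering described in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code: the span representation, the function name `exact_match_scores`, and the toy annotations are all assumptions made for illustration. An annotation counts as correct only if its document and both boundaries agree with the gold standard.

```python
from typing import NamedTuple, Set, Tuple

# Hypothetical span representation: (document id, start offset, end offset).
Span = Tuple[str, int, int]


class Scores(NamedTuple):
    precision: float
    recall: float
    f1: float


def exact_match_scores(gold: Set[Span], predicted: Set[Span]) -> Scores:
    """Score predicted spans against gold spans using exact boundary matching."""
    tp = len(gold & predicted)   # spans found in both sets
    fp = len(predicted - gold)   # predicted but not in the gold standard
    fn = len(gold - predicted)   # gold spans the tagger missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return Scores(precision, recall, f1)


# Toy data (invented): four gold annotations, six lexicon-based hits.
gold = {("doc1", 0, 4), ("doc1", 10, 15), ("doc2", 3, 8), ("doc2", 20, 26)}
lexicon_hits = {
    ("doc1", 0, 4), ("doc1", 10, 15), ("doc2", 3, 8),
    ("doc1", 30, 33), ("doc2", 40, 43), ("doc2", 50, 52),
}

# Assumed false positive filter: drops hits flagged as ambiguous terms.
flagged_ambiguous = {("doc1", 30, 33), ("doc2", 40, 43)}
filtered_hits = lexicon_hits - flagged_ambiguous

before = exact_match_scores(gold, lexicon_hits)
after = exact_match_scores(gold, filtered_hits)
print(before)
print(after)
```

In this toy case the filter removes only spurious hits, so precision rises while recall stays the same, which mirrors the pattern the abstract reports for LexTag solutions. A real filter may also discard some correct hits.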

Statistics

Citations

5 citations in Web of Science®
7 citations in Scopus®

Downloads

78 downloads since deposited on 23 Oct 2013
23 downloads in the past 12 months

Additional indexing

Item Type: Journal Article, refereed, original work
Communities & Collections: 03 Faculty of Economics > Department of Informatics
Dewey Decimal Classification: 000 Computer science, knowledge & systems
Language: English
Date: 2013
Deposited On: 23 Oct 2013 08:16
Last Modified: 08 Aug 2017 03:38
Publisher: BioMed Central
ISSN: 2041-1480
Free access at: PubMed ID. An embargo period may apply.
Publisher DOI: https://doi.org/10.1186/2041-1480-4-28
PubMed ID: 24112383

Download

Content: Accepted Version
Filetype: PDF
Size: 5MB

Content: Published Version
Filetype: PDF
Size: 2MB
Licence: Creative Commons: Attribution 2.0 Generic (CC BY 2.0)
