ZORA (Zurich Open Repository and Archive)

Evaluating gold standard corpora against gene/protein tagging solutions and lexical resources

Rebholz-Schuhmann, Dietrich; Kafkas, Senay; Kim, Jee-hyub; Li, Chen; Yepes, Antonio Jimeno; Hoehndorf, Robert; Backofen, Rolf; Lewin, Ian (2013). Evaluating gold standard corpora against gene/protein tagging solutions and lexical resources. Journal of Biomedical Semantics, 4:28.

Abstract

MOTIVATION: The identification of protein and gene names (PGNs) from the scientific literature requires semantic resources: terminological and lexical resources deliver the term candidates into PGN tagging solutions and the gold standard corpora (GSC) train them to identify term parameters and contextual features. Ideally all three resources, i.e. corpora, lexica and taggers, cover the same domain knowledge, and thus support identification of the same types of PGNs and cover all of them. Unfortunately, none of the three serves as a predominant standard and for this reason it is worth exploring how these three resources comply with each other. We systematically compare different PGN taggers against publicly available corpora and analyze the impact of the included lexical resource in their performance. In particular, we determine the performance gains through false positive filtering, which contributes to the disambiguation of identified PGNs.

RESULTS: In general, machine learning approaches (ML-Tag) for PGN tagging show higher F1-measure performance against the BioCreative-II and Jnlpba GSCs (exact matching), whereas the lexicon based approaches (LexTag) in combination with disambiguation methods show better results on FsuPrge and PennBio. The ML-Tag solutions balance precision and recall, whereas the LexTag solutions have different precision and recall profiles at the same F1-measure across all corpora. Higher recall is achieved with larger lexical resources, which also introduce more noise (false positive results). The ML-Tag solutions certainly perform best if the test corpus is from the same GSC as the training corpus. As expected, the false negative errors characterize the test corpora and, on the other hand, the profiles of the false positive mistakes characterize the tagging solutions. LexTag solutions that are based on a large terminological resource in combination with false positive filtering produce better results, which, in addition, provide concept identifiers from a knowledge source in contrast to ML-Tag solutions.

CONCLUSION: The standard ML-Tag solutions achieve high performance, but not across all corpora, and thus should be trained using several different corpora to reduce possible biases. The LexTag solutions have different profiles for their precision and recall performance, but with similar F1-measure. This result is surprising and suggests that they cover a portion of the most common naming standards, but cope differently with the term variability across the corpora. The false positive filtering applied to LexTag solutions does improve the results by increasing their precision without compromising significantly their recall. The harmonisation of the annotation schemes in combination with standardized lexical resources in the tagging solutions will enable their comparability and will pave the way for a shared standard.
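
To make the evaluation terms in the abstract concrete, the following is a minimal sketch of exact-match scoring (precision, recall, F1-measure) for a lexicon-based tagger, with and without false positive filtering. It is not the authors' implementation: the function names, the toy sentence, the three-term lexicon and the stop-list filter are all assumptions chosen for illustration, and the abstract does not specify which filtering method was used.

```python
import re
from typing import Iterable

Span = tuple[int, int]  # (start, end) character offsets, end exclusive


def lexicon_tag(text: str, lexicon: Iterable[str]) -> set[Span]:
    """Tag every whole-word, case-insensitive occurrence of a lexicon term."""
    spans: set[Span] = set()
    for term in lexicon:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.add(match.span())
    return spans


def filter_false_positives(text: str, spans: set[Span],
                           stop_terms: Iterable[str]) -> set[Span]:
    """Drop spans whose surface form is on a list of ambiguous terms.

    Assumption: a plain stop list stands in for whatever disambiguation
    a real tagging solution would apply.
    """
    stop = {term.lower() for term in stop_terms}
    return {(s, e) for s, e in spans if text[s:e].lower() not in stop}


def exact_match_scores(predicted: set[Span],
                       gold: set[Span]) -> tuple[float, float, float]:
    """Precision, recall and F1 where a hit requires identical boundaries."""
    tp = len(predicted & gold)   # true positives
    fp = len(predicted - gold)   # false positives
    fn = len(gold - predicted)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def span_of(text: str, term: str) -> Span:
    """Offsets of the first occurrence of term (helper for the toy gold set)."""
    start = text.index(term)
    return (start, start + len(term))


# Toy data (hypothetical): "WAS" is a gene symbol that collides with the
# English word "was", so a large lexicon tags it as a false positive here.
text = "BRCA1 was shown to interact with p53."
gold = {span_of(text, "BRCA1"), span_of(text, "p53")}
lexicon = ["BRCA1", "p53", "WAS"]

raw = lexicon_tag(text, lexicon)
filtered = filter_false_positives(text, raw, stop_terms=["was"])

print("unfiltered:", exact_match_scores(raw, gold))
print("filtered:  ", exact_match_scores(filtered, gold))
```

On this toy sentence the unfiltered tagger finds both gold mentions plus one spurious match, so recall is perfect while precision is two thirds; removing the spurious match raises precision and leaves recall unchanged. A stop list this crude would also discard genuine mentions of the filtered symbol, which is why context-based disambiguation is normally preferred.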

Additional indexing

Item Type: Journal Article, refereed, original work
Communities & Collections: 03 Faculty of Economics > Department of Informatics
Dewey Decimal Classification: 000 Computer science, knowledge & systems
Scopus Subject Areas:
  • Physical Sciences > Information Systems
  • Physical Sciences > Computer Science Applications
  • Health Sciences > Health Informatics
  • Physical Sciences > Computer Networks and Communications
Scope: Discipline-based scholarship (basic research)
Language: English
Date: 2013
Deposited On: 23 Oct 2013 08:16
Last Modified: 09 Mar 2025 02:41
Publisher: BioMed Central
ISSN: 2041-1480
OA Status: Gold
Free access at: PubMed ID. An embargo period may apply.
Publisher DOI: https://doi.org/10.1186/2041-1480-4-28
PubMed ID: 24112383
Other Identification Number: merlin-id:8483
Project Information:
  • Funder: FP7
  • Grant ID: 231727
  • Project Title: CALBC - Collaborative Annotation of a Large Biomedical Corpus
Full text:
  • Content: Accepted Version
  • Content: Published Version (Licence: Creative Commons: Attribution 2.0 Generic (CC BY 2.0))

Statistics

Citations

10 citations in Web of Science®
13 citations in Scopus®

Downloads

145 downloads since deposited on 23 Oct 2013
9 downloads in the past 12 months