ZORA (Zurich Open Repository and Archive)

Assessment of NER solutions against the first and second CALBC Silver Standard Corpus

Abstract

Background Competitions in text mining have been used to measure the performance of automatic text processing solutions against a manually annotated gold standard corpus (GSC). The preparation of a GSC is time-consuming and costly, and the final corpus consists of at most a few thousand documents annotated with a limited set of semantic groups. To overcome these shortcomings, the CALBC project partners (PPs) have produced a large-scale annotated biomedical corpus covering four semantic groups through the harmonisation of annotations from automatic text mining solutions: the first version of the Silver Standard Corpus (SSC-I). The four semantic groups are chemical entities and drugs (CHED), genes and proteins (PRGE), diseases and disorders (DISO), and species (SPE). This corpus was used for the First CALBC Challenge, which asked participants to annotate the corpus with their text processing solutions.

Results All four PPs from the CALBC project and, in addition, 12 challenge participants (CPs) contributed annotated data sets for an evaluation against the SSC-I. CPs could ignore the training data and deliver annotations from their own annotation systems, or could train a machine-learning approach on the provided pre-annotated data. In general, the performance of the annotation solutions was lower for entities from the categories CHED and PRGE than for entities categorised as DISO and SPE. The best performance over all semantic groups was achieved by two annotation solutions that had been trained on the SSC-I. Data sets from participants who had not used the SSC-I annotations for training were harmonised to generate the Silver Standard Corpus II (SSC-II), and the participants' solutions were again measured against the SSC-II. The annotation solutions again showed better results for DISO and SPE than for CHED and PRGE.

Conclusions The SSC-I delivers a large set of annotations (1,121,705) for a large number of documents (100,000 Medline abstracts). The annotations cover four different semantic groups and are sufficiently homogeneous to be reproduced with a trained classifier, leading to an average F-measure of 85%. Benchmarking the annotation solutions against the SSC-II leads to better performance for the CPs' solutions than benchmarking against the SSC-I.
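The benchmarking described above scores each NER solution by precision, recall and F-measure against the silver standard annotations. A minimal sketch of such an entity-level evaluation, assuming exact span and semantic-group matching (the spans, offsets and group labels below are illustrative, not the CALBC evaluation code or data):

```python
def f_measure(gold, predicted):
    """Entity-level precision, recall and F-measure over
    (start, end, semantic_group) annotation tuples."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # exact-span, same-group matches
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Silver-standard annotations for one abstract (illustrative values only)
silver = [(0, 7, "DISO"), (15, 22, "PRGE"), (30, 38, "SPE")]
# A system's annotations: the middle entity is mislabelled CHED
system = [(0, 7, "DISO"), (15, 22, "CHED"), (30, 38, "SPE")]

p, r, f = f_measure(silver, system)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.667 0.667
```

A real evaluation against a harmonised silver standard would typically also report relaxed (overlap-based) matching alongside exact matching, since the harmonisation of several systems' boundaries makes exact span agreement harder to reach.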

Additional indexing

Item Type:Journal Article, refereed, original work
Communities & Collections:06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification:000 Computer science, knowledge & systems
410 Linguistics
Scopus Subject Areas:Physical Sciences > Information Systems
Physical Sciences > Computer Science Applications
Health Sciences > Health Informatics
Physical Sciences > Computer Networks and Communications
Language:English
Date:2011
Deposited On:24 Feb 2011 14:53
Last Modified:15 Jan 2025 04:31
Publisher:BioMed Central
Series Name:Journal of Biomedical Semantics
ISSN:2041-1480
Additional Information:Presented at the Fourth International Symposium on Semantic Mining in Biomedicine
OA Status:Gold
Free access at:PubMed ID. An embargo period may apply.
Publisher DOI:https://doi.org/10.1186/2041-1480-2-S5-S11
Related URLs:http://www.smbm.eu
PubMed ID:22166494
Content:Published Version
Licence:Creative Commons: Attribution 2.0 Generic (CC BY 2.0)
