A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC


Kors, Jan A; Clematide, Simon; Akhondi, Saber A; van Mulligen, Erik M; Rebholz-Schuhmann, Dietrich (2015). A multilingual gold-standard corpus for biomedical concept recognition: the Mantra GSC. Journal of the American Medical Informatics Association (JAMIA), 22(5):948-956.

Abstract

Objective: To create a multilingual gold-standard corpus for biomedical concept recognition.

Materials and methods: We selected text units from different parallel corpora (Medline abstract titles, drug labels, biomedical patent claims) in English, French, German, Spanish, and Dutch. Three annotators per language independently annotated the biomedical concepts, based on a subset of the Unified Medical Language System and covering a wide range of semantic groups. To reduce the annotation workload, automatically generated preannotations were provided. Individual annotations were automatically harmonized and then adjudicated, and cross-language consistency checks were carried out to arrive at the final annotations.

Results: The number of final annotations was 5530. Inter-annotator agreement scores indicate good agreement (median F-score 0.79), and are similar to those between individual annotators and the gold standard. The automatically generated harmonized annotation set for each language performed as well as the best annotator for that language.

Discussion: The use of automatic preannotations, harmonized annotations, and parallel corpora helped to keep the manual annotation efforts manageable. The inter-annotator agreement scores provide a reference standard for gauging the performance of automatic annotation techniques.

Conclusion: To our knowledge, this is the first gold-standard corpus for biomedical concept recognition in languages other than English. Other distinguishing features are the wide variety of semantic groups covered and the diversity of text genres that were annotated.
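The abstract reports inter-annotator agreement as a median F-score. The sketch below illustrates one common way such a figure can be computed: pairwise F-scores over exact-match annotation sets, then the median across annotator pairs. It is not taken from the paper. The annotation format (start offset, end offset, concept identifier), the exact-match criterion, and the names `f_score` and `median_pairwise_f` are all assumptions made for illustration.

```python
from itertools import combinations
from statistics import median
from typing import Dict, Set, Tuple

# Hypothetical annotation format: (start_offset, end_offset, concept_id).
Annotation = Tuple[int, int, str]


def f_score(a: Set[Annotation], b: Set[Annotation]) -> float:
    """F-score between two annotation sets under exact matching.

    With one set treated as reference and the other as response,
    F = 2 * TP / (|a| + |b|), which is symmetric in a and b.
    """
    if not a and not b:
        return 1.0  # assumption: two empty sets count as full agreement
    true_positives = len(a & b)
    return 2 * true_positives / (len(a) + len(b))


def median_pairwise_f(annotators: Dict[str, Set[Annotation]]) -> float:
    """Median F-score over all unordered pairs of annotators."""
    scores = [
        f_score(annotators[x], annotators[y])
        for x, y in combinations(sorted(annotators), 2)
    ]
    return median(scores)


# Toy data for three hypothetical annotators of one text unit.
toy = {
    "A1": {(0, 5, "C1"), (10, 15, "C2"), (20, 25, "C3")},
    "A2": {(0, 5, "C1"), (10, 15, "C2"), (30, 35, "C4")},
    "A3": {(0, 5, "C1")},
}
toy_median = median_pairwise_f(toy)
```

The same `f_score` function could be applied between an individual annotator and a gold-standard set, which is the second comparison the abstract mentions; whether the paper used exact or relaxed span matching is not stated here.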

Statistics

Citations

16 citations in Web of Science®
21 citations in Scopus®

Downloads

23 downloads since deposited on 04 Oct 2018
9 downloads in the last 12 months

Additional indexing

Item Type: Journal Article, refereed, original work
Communities & Collections: National licences > 142-005
Dewey Decimal Classification: Unspecified
Scopus Subject Areas: Health Sciences > Health Informatics
Uncontrolled Keywords: concept identification; gold-standard corpus; inter-annotator agreement; multilinguality; semantic enrichment
Language: English
Date: 1 September 2015
Deposited On: 04 Oct 2018 19:18
Last Modified: 15 Apr 2021 14:47
Publisher: BMJ Publishing Group
ISSN: 1067-5027
OA Status: Hybrid
Free access at: PubMed ID. An embargo period may apply.
Publisher DOI: https://doi.org/10.1093/jamia/ocv037
PubMed ID: 25948699

Download

Hybrid Open Access

Content: Published Version
Language: English
Filetype: PDF (Nationallizenz 142-005)
Size: 762kB