From historic books to annotated XML: Building a large multilingual diachronic corpus


Jitca, M; Sennrich, R; Volk, M (2011). From historic books to annotated XML: Building a large multilingual diachronic corpus. In: Conference of the German Society for Computational Linguistics and Language Technology (GSCL) 2011, Hamburg, Germany, 28 September 2011 - 30 September 2011. Universität Hamburg, 75-80.

Abstract

This paper introduces our approach to annotating a large heritage corpus that spans over 100 years of alpine literature. The corpus consists of over 16,000 articles from the yearbooks of the Swiss Alpine Club, of which 60% are German texts, 38% French, 1% Italian, and the remaining 1% Swiss German and Romansh. The present work describes the inherent difficulties in processing a multilingual corpus by referring to the most challenging annotation phases, such as article identification, correction of optical character recognition (OCR) errors, tokenization, and language identification. The paper aims to raise awareness of the effort involved in building and annotating multilingual corpora rather than to evaluate each individual annotation phase.

Keywords: multilingual corpora, cultural heritage, corpus annotation, text digitization

Additional indexing

Item Type: Conference or Workshop Item (Paper), refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 000 Computer science, knowledge & systems; 410 Linguistics
Language: English
Event End Date: 30 September 2011
Deposited On: 21 Nov 2011 08:45
Last Modified: 27 Nov 2020 07:13
Publisher: Universität Hamburg
Series Name: Arbeiten zur Mehrsprachigkeit, Folge B. Working Papers in Multilingualism, Series B
Number: 96
ISSN: 0176-599X
OA Status: Green
Official URL: http://www.corpora.uni-hamburg.de/gscl2011/downloads/AZM96.pdf
Related URLs: http://www.corpora.uni-hamburg.de/gscl2011/en/
Content: Accepted Version