ZORA (Zurich Open Repository and Archive)

Reducing OCR errors by combining two OCR systems

Volk, Martin; Marek, T; Sennrich, R (2010). Reducing OCR errors by combining two OCR systems. In: ECAI 2010 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2010), Lisbon, Portugal, 16 August 2010, 61-65.

Abstract

This paper describes our efforts in building a heritage corpus of Alpine texts. We have already digitized the yearbooks of the Swiss Alpine Club from 1864 until 1982. This corpus poses special challenges since the yearbooks are multilingual and vary in orthography and layout. We discuss methods to improve OCR performance and experiment with combining two different OCR programs with the goal of reducing the number of OCR errors. We describe a merging procedure that uses a unigram language model trained on the uncorrected corpus itself to select the best alternative, and report evaluation results showing that the merging procedure helps to improve OCR quality.
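
The merging idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the two OCR outputs are already aligned token by token, uses add-one smoothing, and breaks ties in favour of the first system, none of which is stated in the abstract. The names `train_unigram` and `merge_outputs` and the toy corpus are hypothetical.

```python
from collections import Counter
import math


def train_unigram(tokens):
    """Return a function giving add-one-smoothed log-probabilities of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def logprob(token):
        # Counter returns 0 for tokens that never occurred in the corpus.
        return math.log((counts[token] + 1) / (total + vocab))

    return logprob


def merge_outputs(tokens_a, tokens_b, logprob):
    """Pick, position by position, the candidate the unigram model prefers.

    Assumption: both OCR outputs are already aligned token by token.
    Ties go to system A.
    """
    if len(tokens_a) != len(tokens_b):
        raise ValueError("outputs must be aligned to the same length")
    return [
        a if a == b or logprob(a) >= logprob(b) else b
        for a, b in zip(tokens_a, tokens_b)
    ]


# Toy corpus standing in for the uncorrected OCR text (hypothetical data).
corpus = "der Gipfel der Berge und der Gletscher und der Gipfel".split()
lp = train_unigram(corpus)

# Each system misreads a different token; the model picks the frequent form.
merged = merge_outputs(["der", "Gipfcl", "und"], ["dcr", "Gipfel", "und"], lp)
print(merged)
```

Training on the uncorrected corpus works in this sketch because a correct word form tends to occur more often than any single misrecognition of it, so the more frequent candidate is usually the right one.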

Additional indexing

Item Type: Conference or Workshop Item (Paper), refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 410 Linguistics; 000 Computer science, knowledge & systems
Language: English
Event End Date: 16 August 2010
Deposited On: 30 Jul 2010 12:05
Last Modified: 28 Jun 2022 10:13
Funders: Swiss National Science Foundation
OA Status: Green
Official URL: http://ilk.uvt.nl/LaTeCH2010/paperlist.html
Project Information:
  • Funder: SNSF
  • Grant ID:
  • Project Title: Swiss National Science Foundation
Content: Accepted Version