Reducing OCR errors by combining two OCR systems


Volk, Martin; Marek, T; Sennrich, R (2010). Reducing OCR errors by combining two OCR systems. In: ECAI 2010 Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH 2010), Lisbon, Portugal, 16 August 2010, 61-65.

Abstract

This paper describes our efforts to build a heritage corpus of Alpine texts. We have already digitized the yearbooks of the Swiss Alpine Club from 1864 to 1982.
This corpus poses special challenges because the yearbooks are multilingual and vary in orthography and layout. We discuss methods to improve OCR performance and experiment with combining two different OCR programs with the goal of reducing the number of OCR errors. We describe a merging procedure that uses a unigram language model, trained on the uncorrected corpus itself, to select the best alternative, and we report evaluation results which show that the merging procedure helps to improve OCR quality.
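
The merging idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes the two OCR outputs are already aligned token by token (the abstract does not describe the alignment step), and the names train_unigram, pick_best, and merge, as well as the toy corpus, are hypothetical. Where the two systems disagree, the sketch keeps the token that occurs more often in the uncorrected corpus.

```python
from collections import Counter


def train_unigram(tokens):
    """Count how often each token occurs in the (uncorrected) corpus."""
    return Counter(tokens)


def pick_best(candidates, counts):
    """Return the candidate with the highest corpus frequency.

    Unseen tokens count as 0; on a tie the first candidate is kept.
    """
    return max(candidates, key=lambda w: counts[w])


def merge(out_a, out_b, counts):
    """Merge two token-aligned OCR outputs position by position.

    Assumption: both outputs have the same length and are already aligned.
    """
    return [a if a == b else pick_best([a, b], counts)
            for a, b in zip(out_a, out_b)]


# Toy data, invented for illustration only.
corpus = "der Gipfel der Berge und der Gipfel des Berges".split()
counts = train_unigram(corpus)

system_a = ["der", "Gipfel", "dcr", "Berge"]   # hypothetical output of system A
system_b = ["der", "Gipfe1", "der", "Berge"]   # hypothetical output of system B

merged = merge(system_a, system_b, counts)
print(merged)
```

Raw frequency is the simplest possible unigram score. A real system would likely need smoothing and a proper alignment of the two outputs, neither of which is specified in the abstract.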

Additional indexing

Item Type: Conference or Workshop Item (Paper), refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 410 Linguistics; 000 Computer science, knowledge & systems
Language: English
Event End Date: 16 August 2010
Deposited On: 30 Jul 2010 12:05
Last Modified: 15 Dec 2017 08:06
Funders: Swiss National Science Foundation
Official URL: http://ilk.uvt.nl/LaTeCH2010/paperlist.html

Download

Content: Accepted Version
Filetype: PDF
Size: 1MB