UZH-Logo

Maintenance Infos

Strategies for reducing and correcting OCR error


Volk, Martin; Furrer, Lenz; Sennrich, Rico (2011). Strategies for reducing and correcting OCR error. In: Sporleder, Caroline; van den Bosch, Antal; Zervanou, Kalliopi. Language Technology for Cultural Heritage. Berlin: Springer, 3-22.

Abstract

In this paper we describe our efforts in reducing and correcting OCR errors in the context of building a large multilingual heritage corpus of Alpine texts which is based on digitizing the publications of various Alpine clubs. We have already digitized the yearbooks of the Swiss Alpine Club from its start in 1864 until 1995 with more than 75,000 pages resulting in 29 million running words. Since these books have come out continuously, they represent a unique basis for historical, cultural and linguistic research. We used commercial OCR systems for the conversion from the scanned images to searchable text. This poses several challenges. For example, the built-in lexicons of the OCR systems do not cover the 19th century German spelling, the Swiss German spelling variants and the plethora of toponyms that are characteristic of our text genre. We also realized that different OCR systems make different recognition errors. We therefore run two OCR systems over all our scanned pages and merge the output. Merging is especially tricky at spots where both systems result in partially correct word groups. We describe our strategies for reducing OCR errors by enlarging the systems’ lexicons and by two post-correction methods namely, merging the output of two OCR systems and auto- correction based on additional lexical resources.

In this paper we describe our efforts in reducing and correcting OCR errors in the context of building a large multilingual heritage corpus of Alpine texts which is based on digitizing the publications of various Alpine clubs. We have already digitized the yearbooks of the Swiss Alpine Club from its start in 1864 until 1995 with more than 75,000 pages resulting in 29 million running words. Since these books have come out continuously, they represent a unique basis for historical, cultural and linguistic research. We used commercial OCR systems for the conversion from the scanned images to searchable text. This poses several challenges. For example, the built-in lexicons of the OCR systems do not cover the 19th century German spelling, the Swiss German spelling variants and the plethora of toponyms that are characteristic of our text genre. We also realized that different OCR systems make different recognition errors. We therefore run two OCR systems over all our scanned pages and merge the output. Merging is especially tricky at spots where both systems result in partially correct word groups. We describe our strategies for reducing OCR errors by enlarging the systems’ lexicons and by two post-correction methods namely, merging the output of two OCR systems and auto- correction based on additional lexical resources.

Citations

Altmetrics

Downloads

1157 downloads since deposited on 03 Jan 2012
147 downloads since 12 months
Detailed statistics

Additional indexing

Item Type:Book Section, refereed, original work
Communities & Collections:03 Faculty of Economics > Department of Informatics
Dewey Decimal Classification:000 Computer science, knowledge & systems
Date:2011
Deposited On:03 Jan 2012 17:05
Last Modified:05 Apr 2016 15:19
Publisher:Springer
Series Name:Theory and Applications of Natural Language Processing
ISBN:978-3-642-20226-1
Additional Information:The final publication is available at www.springerlink.com
Publisher DOI:10.1007/978-3-642-20227-8_1
Official URL:http://www.springer.com/computer/ai/book/978-3-642-20226-1
Other Identification Number:merlin-id:6206
Permanent URL: http://doi.org/10.5167/uzh-54277

Download

[img]
Preview
Content: Accepted Version
Filetype: PDF
Size: 835kB
View at publisher

TrendTerms

TrendTerms displays relevant terms of the abstract of this publication and related documents on a map. The terms and their relations were extracted from ZORA using word statistics. Their timelines are taken from ZORA as well. The bubble size of a term is proportional to the number of documents where the term occurs. Red, orange, yellow and green colors are used for terms that occur in the current document; red indicates high interlinkedness of a term with other terms, orange, yellow and green decreasing interlinkedness. Blue is used for terms that have a relation with the terms in this document, but occur in other documents.
You can navigate and zoom the map. Mouse-hovering a term displays its timeline, clicking it yields the associated documents.

Author Collaborations