Crowdsourcing an OCR Gold Standard for a German and French Heritage Corpus


Clematide, Simon; Furrer, Lenz; Volk, Martin (2016). Crowdsourcing an OCR Gold Standard for a German and French Heritage Corpus. In: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, 23 May 2016 - 28 May 2016, 975-982.

Abstract

Crowdsourcing approaches for post-correction of Optical Character Recognition (OCR) output have been successfully applied to several historical text collections. We report on our crowd-correction platform Kokos, which we built to improve the OCR quality of the digitized yearbooks of the Swiss Alpine Club (SAC) from the 19th century. This multilingual heritage corpus consists of Alpine texts written mainly in German and French, all typeset in Antiqua. Finding and engaging volunteers to correct large numbers of pages into high-quality text requires a carefully designed user interface, an easy-to-use workflow, and continuous effort to keep the participants motivated. More than 180,000 characters on about 21,000 pages were corrected by volunteers in about 7 months, achieving an OCR gold standard with a systematically evaluated accuracy of 99.7% on the word level. The crowdsourced OCR gold standard and the corresponding original OCR recognition results from Abbyy FineReader 7 for each page are available as a resource. Additionally, the scanned images (300 dpi) of all pages are included to facilitate tests with other OCR software.
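To make the notion of word-level accuracy concrete, the following is a minimal sketch of how an OCR page could be scored against its gold-standard counterpart. It is not the evaluation procedure used in the paper, which this record does not describe. The function name `word_accuracy`, the whitespace tokenization, and the made-up sample sentence are all assumptions for illustration. The sketch aligns the two token sequences with Python's standard `difflib.SequenceMatcher` and reports the share of gold tokens that the OCR text reproduces.

```python
from difflib import SequenceMatcher


def word_accuracy(gold_text: str, ocr_text: str) -> float:
    """Fraction of gold-standard word tokens reproduced by the OCR text.

    Assumption: words are whitespace-separated tokens; the paper may
    tokenize and score differently.
    """
    gold = gold_text.split()
    ocr = ocr_text.split()
    if not gold:
        # An empty gold page is matched perfectly only by an empty OCR page.
        return 1.0 if not ocr else 0.0
    # autojunk=False switches off the heuristic that ignores frequent tokens.
    matcher = SequenceMatcher(a=gold, b=ocr, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(gold)


# Hypothetical sample: one OCR error ("dcs" instead of "des") in six words.
gold_page = "Die Besteigung des Gipfels war mühsam"
ocr_page = "Die Besteigung dcs Gipfels war mühsam"
accuracy = word_accuracy(gold_page, ocr_page)
print(f"{accuracy:.3f}")  # prints 0.833
```

Applied to the released resource, a function of this kind could compare each page's Abbyy FineReader 7 output, or the output of another OCR system run on the 300 dpi scans, with the crowd-corrected text.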

Statistics

Downloads

81 downloads since deposited on 05 Jul 2016
51 downloads in the past 12 months

Additional indexing

Item Type: Conference or Workshop Item (Paper), original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 000 Computer science, knowledge & systems; 410 Linguistics
Language: English
Event End Date: 28 May 2016
Deposited On: 05 Jul 2016 13:30
Last Modified: 20 Sep 2018 04:15
Publisher: European Language Resources Association (ELRA)
ISBN: 978-2-9517408-9-1
OA Status: Green
Free access at: Official URL. An embargo period may apply.
Official URL: http://www.lrec-conf.org/proceedings/lrec2016/pdf/917_Paper.pdf

Download

Download PDF 'Crowdsourcing an OCR Gold Standard for a German and French Heritage Corpus'.
Content: Published Version
Language: English
Filetype: PDF
Size: 460kB
Licence: Creative Commons: Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)