ZORA (Zurich Open Repository and Archive)

How Much Data Do You Need? About the Creation of a Ground Truth for Black Letter and the Effectiveness of Neural OCR

Ströbel, Phillip Benjamin; Clematide, Simon; Volk, Martin (2020). How Much Data Do You Need? About the Creation of a Ground Truth for Black Letter and the Effectiveness of Neural OCR. In: Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, 1 May 2020 - 2 May 2020. ACL Anthology, 3551-3559.

Abstract

Recent advances in Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) have led to more accurate text recognition of historical documents. The Digital Humanities heavily profit from these developments, but they still struggle when choosing from the plethora of OCR systems available on the one hand and when defining workflows for their projects on the other hand. In this work, we present our approach to build a ground truth for a historical German-language newspaper published in black letter. We also report how we used it to systematically evaluate the performance of different OCR engines. Additionally, we used this ground truth to make an informed estimate as to how much data is necessary to achieve high-quality OCR results. The outcomes of our experiments show that HTR architectures can successfully recognise black letter text and that a ground truth size of 50 newspaper pages suffices to achieve good OCR accuracy. Moreover, our models perform equally well on data they have not seen during training, which means that additional manual correction for diverging data is superfluous.

Additional indexing

Item Type: Conference or Workshop Item (Paper), refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 000 Computer science, knowledge & systems; 410 Linguistics
Scopus Subject Areas: Social Sciences & Humanities > Language and Linguistics; Social Sciences & Humanities > Education; Social Sciences & Humanities > Library and Information Sciences; Social Sciences & Humanities > Linguistics and Language
Language: English
Event End Date: 2 May 2020
Deposited On: 21 Jan 2021 17:18
Last Modified: 24 Apr 2022 07:22
Publisher: ACL Anthology
OA Status: Green
Free access at: Official URL. An embargo period may apply.
Official URL: https://www.aclweb.org/anthology/2020.lrec-1.436
Project Information:
  • Funder: Schweizerischer Nationalfonds
  • Grant ID: CR-SII5 173719.
  • Project Title: impresso - Media Monitoring of the Past
  • Project Website: https://www.impresso-project.ch
Content: Published Version

Statistics

Citations

3 citations in Web of Science®
9 citations in Scopus®

Downloads

207 downloads since deposited on 21 Jan 2021
70 downloads in the past 12 months
