Header

UZH-Logo

Maintenance Infos

How Much Data Do You Need? About the Creation of a Ground Truth for Black Letter and the Effectiveness of Neural OCR


Ströbel, Phillip Benjamin; Clematide, Simon; Volk, Martin (2020). How Much Data Do You Need? About the Creation of a Ground Truth for Black Letter and the Effectiveness of Neural OCR. In: Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, 1 May 2020 - 2 May 2020. ACL Anthology, 3551-3559.

Abstract

Recent advances in Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) have led to more accurate textrecognition of historical documents. The Digital Humanities heavily profit from these developments, but they still struggle whenchoosing from the plethora of OCR systems available on the one hand and when defining workflows for their projects on the other hand.In this work, we present our approach to build a ground truth for a historical German-language newspaper published in black letter. Wealso report how we used it to systematically evaluate the performance of different OCR engines. Additionally, we used this ground truthto make an informed estimate as to how much data is necessary to achieve high-quality OCR results. The outcomes of our experimentsshow that HTR architectures can successfully recognise black letter text and that a ground truth size of 50 newspaper pages suffices toachieve good OCR accuracy. Moreover, our models perform equally well on data they have not seen during training, which means thatadditional manual correction for diverging data is superfluous.

Abstract

Recent advances in Optical Character Recognition (OCR) and Handwritten Text Recognition (HTR) have led to more accurate textrecognition of historical documents. The Digital Humanities heavily profit from these developments, but they still struggle whenchoosing from the plethora of OCR systems available on the one hand and when defining workflows for their projects on the other hand.In this work, we present our approach to build a ground truth for a historical German-language newspaper published in black letter. Wealso report how we used it to systematically evaluate the performance of different OCR engines. Additionally, we used this ground truthto make an informed estimate as to how much data is necessary to achieve high-quality OCR results. The outcomes of our experimentsshow that HTR architectures can successfully recognise black letter text and that a ground truth size of 50 newspaper pages suffices toachieve good OCR accuracy. Moreover, our models perform equally well on data they have not seen during training, which means thatadditional manual correction for diverging data is superfluous.

Statistics

Downloads

18 downloads since deposited on 21 Jan 2021
18 downloads since 12 months
Detailed statistics

Additional indexing

Item Type:Conference or Workshop Item (Paper), refereed, original work
Communities & Collections:06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification:000 Computer science, knowledge & systems
410 Linguistics
Scopus Subject Areas:Social Sciences & Humanities > Language and Linguistics
Social Sciences & Humanities > Education
Social Sciences & Humanities > Library and Information Sciences
Social Sciences & Humanities > Linguistics and Language
Language:English
Event End Date:2 May 2020
Deposited On:21 Jan 2021 17:18
Last Modified:29 Oct 2021 07:06
Publisher:ACL Anthology
OA Status:Green
Free access at:Official URL. An embargo period may apply.
Official URL:https://www.aclweb.org/anthology/2020.lrec-1.436
Project Information:
  • : FunderSchweizerischer Nationalfonds
  • : Grant IDCR-SII5 173719.
  • : Project Titleimpresso - Media Monitoring of the Past
  • : Project Websitehttps://www.impresso-project.ch

Download

Green Open Access

Download PDF  'How Much Data Do You Need? About the Creation of a Ground Truth for Black Letter and the Effectiveness of Neural OCR'.
Preview
Content: Published Version
Filetype: PDF
Size: 1MB