Evaluation of HTR models without Ground Truth Material


Ströbel, Phillip; Clematide, Simon; Volk, Martin; Schwitter, Raphael; Hodel, Tobias; Schoch, David (2022). Evaluation of HTR models without Ground Truth Material. In: LREC 2022, Marseille, 21-23 June 2022. European Language Resources Association.

Abstract

The evaluation of Handwritten Text Recognition (HTR) models during development is straightforward: because HTR is a supervised problem, the usual split of the data into training, validation, and test sets allows models to be evaluated in terms of accuracy or error rates. However, evaluation becomes tricky as soon as we switch from development to application. Compiling a new (and necessarily smaller) ground truth (GT) from a sample of the data to which we want to apply the model, and then evaluating models on it, only provides hints about the quality of the recognised text, as do the confidence scores that the models return (if available). Moreover, if we have several models at hand, we face a model selection problem, since we want to obtain the best possible result during the application phase. This calls for GT-free metrics for selecting the best model, which is why we (re-)introduce and compare different metrics, from simple, lexicon-based ones to more elaborate ones using standard language models and masked language models (MLMs). We show that MLM-based evaluation can compete with lexicon-based methods, with the advantage that large and multilingual transformers are readily available, making the compilation of lexical resources for other metrics superfluous.
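
To make the contrast between the two metric families concrete, here is a minimal sketch of GT-free scorers for ranking HTR output, assuming the HuggingFace transformers API; the checkpoint name, the pseudo-log-likelihood formulation, and the toy hypotheses are illustrative choices, not the paper's exact setup.

```python
# Sketch of two GT-free metrics for ranking HTR output without a reference
# transcription: lexicon coverage and MLM pseudo-log-likelihood.
# Assumptions (not from the paper): the "bert-base-multilingual-cased"
# checkpoint and the toy sentences below are for illustration only.
import string

import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-multilingual-cased")
model.eval()


def lexicon_coverage(sentence: str, lexicon: set) -> float:
    """Share of tokens found in a reference word list (simple baseline)."""
    tokens = [t.strip(string.punctuation).lower() for t in sentence.split()]
    tokens = [t for t in tokens if t]
    return sum(t in lexicon for t in tokens) / max(len(tokens), 1)


def pseudo_log_likelihood(sentence: str) -> float:
    """Mask each token in turn and average the log-probability the MLM
    assigns to the original token; higher means more plausible text."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, len(ids) - 1):  # skip the [CLS] and [SEP] tokens
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked.unsqueeze(0)).logits[0, i]
        total += torch.log_softmax(logits, dim=-1)[ids[i]].item()
    return total / max(len(ids) - 2, 1)


# Rank the output of two hypothetical HTR models on the same line image:
hyp_a = "Dear brother, I received your letter yesterday."
hyp_b = "Dear brvther, I reccived yuor lctter yesterdoy."
print(pseudo_log_likelihood(hyp_a) > pseudo_log_likelihood(hyp_b))  # expect True
```

Note the trade-off the abstract alludes to: lexicon coverage requires a language-specific word list, whereas the MLM score only needs a readily available pretrained multilingual checkpoint.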

Additional indexing

Item Type: Conference or Workshop Item (Paper), refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics; 06 Faculty of Arts > Zurich Center for Linguistics
Dewey Decimal Classification: 000 Computer science, knowledge & systems; 410 Linguistics
Uncontrolled Keywords: Handwritten Text Recognition, OCR, Digital Humanities, Ground Truth
Language: English
Event End Date: 23 June 2022
Deposited On: 11 Jul 2022 09:26
Last Modified: 26 Feb 2023 10:10
Publisher: European Language Resources Association
OA Status: Green
Free access at: Official URL. An embargo period may apply.
Official URL: http://www.lrec-conf.org/proceedings/lrec2022/pdf/2022.lrec-1.467.pdf
  • Content: Published Version
  • Language: English