Semi-supervised Contextual Historical Text Normalization


Makarov, Peter; Clematide, Simon (2020). Semi-supervised Contextual Historical Text Normalization. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 1 July 2020, Association for Computational Linguistics.

Abstract

Historical text normalization, the task of mapping historical word forms to their modern counterparts, has recently attracted a lot of interest (Bollmann, 2019; Tang et al., 2018; Lusetti et al., 2018; Bollmann et al., 2018; Robertson and Goldwater, 2018; Bollmann et al., 2017; Korchagina, 2017). Yet, virtually all approaches suffer from two limitations: 1) They consider a fully supervised setup, often with impractically large manually normalized datasets; 2) Normalization happens on words in isolation. By utilizing a simple generative normalization model and obtaining powerful contextualization from the target-side language model, we train accurate models with unlabeled historical data. In realistic training scenarios, our approach often reduces the amount of manually normalized data required at the same accuracy levels.

Statistics

Downloads

34 downloads since deposited on 03 Feb 2021
12 downloads in the past 12 months

Additional indexing

Item Type: Conference or Workshop Item (Paper), refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 000 Computer science, knowledge & systems; 410 Linguistics
Language: English
Event End Date: 1 July 2020
Deposited On: 03 Feb 2021 07:06
Last Modified: 27 Jan 2022 05:22
Publisher: Association for Computational Linguistics
OA Status: Hybrid
Free access at: Publisher DOI. An embargo period may apply.
Publisher DOI: https://doi.org/10.18653/v1/2020.acl-main.650
  • Content: Published Version
  • Licence: Creative Commons: Attribution 4.0 International (CC BY 4.0)