Spelling normalisation of Late Modern English: comparison and combination of VARD and character-based statistical machine translation


Schneider, Gerold (2020). Spelling normalisation of Late Modern English: comparison and combination of VARD and character-based statistical machine translation. In: Kytö, Merja; Smitterberg, Erik. Late Modern English: novel encounters. Amsterdam: John Benjamins Publishing, 243-268.

Abstract

Spelling normalisation is an important step towards applying natural language processing (NLP) tools to historical text. We first compare and then combine two approaches: VARD, a rule-based system that relies on dictionary lookup and rules with non-probabilistic but trainable weights, and a language-independent approach to spelling normalisation based on statistical machine translation (SMT) techniques. The rule-based system reaches the best accuracy, up to 94% precision at 74% recall, while the SMT system improves results for each tested period. We obtain the best system by combining both approaches. Re-training VARD on specific time periods and domains is beneficial, and both systems benefit from a language sequence model using collocation strength.
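
As an illustration only, and not the implementation described in the chapter, the following minimal Python sketch shows one simple way two normalisers could be combined (a back-off from one system to the other) and how precision and recall figures of the kind quoted above might be computed. The lookup tables are toy stand-ins for a rule-based and a character-based system, the function names are hypothetical, and the precision/recall definitions (counted over tokens the system changed and tokens that needed changing) are an assumption about the evaluation setup.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A normaliser proposes a modern spelling, or None if it has no proposal.
Normaliser = Callable[[str], Optional[str]]


def lookup_normaliser(table: Dict[str, str]) -> Normaliser:
    """Toy stand-in for a real system: a plain lookup table (assumption)."""
    return lambda token: table.get(token)


def combine(primary: Normaliser, fallback: Normaliser) -> Callable[[str], str]:
    """Back-off combination: use the primary proposal, else the fallback,
    else leave the token unchanged."""
    def normalise(token: str) -> str:
        for system in (primary, fallback):
            proposal = system(token)
            if proposal is not None:
                return proposal
        return token
    return normalise


def precision_recall(originals: List[str], predictions: List[str],
                     gold: List[str]) -> Tuple[float, float]:
    """Precision: share of changed tokens that match the gold form.
    Recall: share of tokens needing a change that were correctly changed."""
    triples = list(zip(originals, predictions, gold))
    changed = sum(o != p for o, p, _ in triples)
    needed = sum(o != g for o, _, g in triples)
    correct = sum(o != p and p == g for o, p, g in triples)
    precision = correct / changed if changed else 0.0
    recall = correct / needed if needed else 0.0
    return precision, recall


# Illustrative toy data, not taken from the chapter.
originals = ["the", "publick", "did", "chuse", "to", "shew", "surprize"]
gold = ["the", "public", "did", "choose", "to", "show", "surprise"]

rule_based = lookup_normaliser({"publick": "public", "shew": "show"})
# The second table contains one deliberate error ("to" -> "too").
char_based = lookup_normaliser({"chuse": "choose", "to": "too"})

normalise = combine(rule_based, char_based)
predictions = [normalise(token) for token in originals]
precision, recall = precision_recall(originals, predictions, gold)
print(predictions)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy setup the back-off order decides which system wins when both have a proposal; other combination schemes, such as voting or confidence weighting, are equally conceivable.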

Statistics

Downloads

1 download since deposited on 17 Feb 2020
0 downloads in the past 12 months

Additional indexing

Item Type: Book Section, refereed, original work
Communities & Collections: 06 Faculty of Arts > English Department
    06 Faculty of Arts > Institute of Computational Linguistics
    08 Research Priority Programs > Digital Society Initiative
Dewey Decimal Classification: 820 English & Old English literatures
Uncontrolled Keywords: Late Modern English, Spelling Normalisation, VARD, Ensemble Learning, Character-based Machine Translation
Language: English
Date: March 2020
Deposited On: 17 Feb 2020 14:32
Last Modified: 07 Apr 2020 07:26
Publisher: John Benjamins Publishing
Series Name: Studies in language companion series
Number: 214
ISSN: 0165-7763
ISBN: 9789027261434
OA Status: Closed
Publisher DOI: https://doi.org/10.1075/slcs.214.11sch
Related URLs: https://benjamins.com/catalog/slcs.214.11sch (Publisher)

Download

Closed Access: Download allowed only for UZH members