Comparing Rule-based and SMT-based Spelling Normalisation for English Historical Texts


Schneider, Gerold; Pettersson, Eva; Percillier, Michael (2017). Comparing Rule-based and SMT-based Spelling Normalisation for English Historical Texts. In: Proceedings of the NoDaLiDa 2017 Workshop on Processing Historical Language, Gothenburg, 22 May 2017. Linköping University Electronic Press, Linköpings universitet, 40-46.

Abstract

To use existing natural language processing tools for analysing historical text, an important preprocessing step is spelling normalisation: converting the original spelling to present-day spelling before applying tools such as taggers and parsers. In this paper, we compare a probabilistic, language-independent approach to spelling normalisation, based on statistical machine translation (SMT) techniques, with a rule-based system combining dictionary lookup with rules and non-probabilistic weights. The rule-based system reaches the best accuracy, up to 94% precision at 74% recall, while the SMT system yields improvements for each tested period.

Statistics

Downloads

62 downloads since deposited on 30 May 2017
8 downloads in the past 12 months

Additional indexing

Item Type: Conference or Workshop Item (Paper), original work
Communities & Collections: 06 Faculty of Arts > English Department
06 Faculty of Arts > Institute of Computational Linguistics
06 Faculty of Arts > Zurich Center for Linguistics
Dewey Decimal Classification: 820 English & Old English literatures
Language: English
Event End Date: 22 May 2017
Deposited On: 30 May 2017 13:45
Last Modified: 03 Dec 2020 15:19
Publisher: Linköping University Electronic Press, Linköpings universitet
Number: 133
OA Status: Green
Free access at: Official URL. An embargo period may apply.
Official URL: http://www.ep.liu.se/ecp/article.asp?issue=133&article=008&volume=#
Related URLs: https://spraakbanken.gu.se/swe/processing-historical-language (Organisation)
  • Content: Published Version