
Neural text normalization with adapted decoding and POS features


Stark, Elisabeth; Ruzsics, Tatyana; Lusetti, Massimo; Göhring, Anne; Samardžić, Tanja (2019). Neural text normalization with adapted decoding and POS features. Natural Language Engineering, 25(5):585-605.

Abstract

Text normalization is the task of mapping noncanonical language, typical of speech transcription and computer-mediated communication, to a standardized written form. This task is especially important for languages such as Swiss German, with strong regional variation and no written standard. In this paper, we propose a novel solution for normalizing Swiss German WhatsApp messages using the encoder–decoder neural machine translation (NMT) framework. We enhance the performance of a plain character-level NMT model with the integration of a word-level language model and linguistic features in the form of part-of-speech (POS) tags. The two components are intended to improve the performance by addressing two specific issues: the former is intended to improve the fluency of the predicted sequences, whereas the latter aims at resolving cases of word-level ambiguity. Our systematic comparison shows that our proposed solution results in an improvement over a plain NMT system and also over a comparable character-level statistical machine translation system, which was considered the state of the art for this task until recently. We perform a thorough analysis of the compared systems’ output, showing that our two components indeed produce the intended, complementary improvements.
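The abstract describes combining a character-level translation model with a word-level language model to resolve word-level ambiguity. As a rough illustration only (not the paper's implementation), the toy sketch below shows how such a combination could work as a log-linear interpolation of the two models' scores; all candidate words, probabilities, and the `lm_weight` parameter are invented for the example.

```python
# Toy sketch (assumed log-linear interpolation, not the authors' exact method):
# rerank candidate normalizations by combining a character-model score with a
# word-level language-model score. All probabilities here are made up.

import math


def combined_score(char_logprob, lm_logprob, lm_weight=0.5):
    """Interpolate character-model and word-LM log-probabilities."""
    return char_logprob + lm_weight * lm_logprob


def rerank(candidates, lm_weight=0.5):
    """Pick the candidate with the best combined score.

    `candidates` maps each candidate word to (char_logprob, lm_logprob).
    """
    return max(
        candidates,
        key=lambda w: combined_score(*candidates[w], lm_weight=lm_weight),
    )


# Hypothetical ambiguous token: the character model alone slightly prefers
# one normalization, but the word-level LM shifts the choice in context.
candidates = {
    "sie": (math.log(0.40), math.log(0.50)),
    "sind": (math.log(0.45), math.log(0.05)),
}
print(rerank(candidates))  # prints "sie"
```

The weight controls how strongly the word-level model can override the character model; with `lm_weight=0` the character model decides alone.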


Additional indexing

Item Type: Journal Article, refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Romance Studies
Dewey Decimal Classification: 800 Literature, rhetoric & criticism; 470 Latin & Italic languages; 410 Linguistics; 440 French & related languages; 460 Spanish & Portuguese languages; 450 Italian, Romanian & related languages
Scopus Subject Areas: Physical Sciences > Software; Social Sciences & Humanities > Language and Linguistics; Social Sciences & Humanities > Linguistics and Language; Physical Sciences > Artificial Intelligence
Language: English
Date: September 2019
Deposited On: 27 Nov 2019 08:58
Last Modified: 29 Jul 2020 11:56
Publisher: Cambridge University Press
ISSN: 1351-3249
OA Status: Closed
Publisher DOI: https://doi.org/10.1017/S1351324919000391
Related URLs: https://www.cambridge.org/core/journals/natural-language-engineering (Publisher)

Download

Full text not available from this repository.