Character-level Chinese-English Translation through ASCII Encoding


Nikolov, Nikola I; Hu, Yuhuang; Tan, Mi Xue; Hahnloser, Richard H R (2018). Character-level Chinese-English Translation through ASCII Encoding. arXiv.org 1805.03330, Institute of Neuroinformatics.

Abstract

Character-level Neural Machine Translation (NMT) models have recently achieved impressive results on many language pairs. They perform well mainly on Indo-European language pairs, where the languages share the same writing system. However, for translating between Chinese and English, the gap between the two different writing systems poses a major challenge because of a lack of systematic correspondence between the individual linguistic units. In this paper, we enable character-level NMT for Chinese by breaking down Chinese characters into linguistic units similar to those of Indo-European languages. We use the Wubi encoding scheme, which preserves the original shape and semantic information of the characters, while also being reversible. We show promising results from training Wubi-based models at the character and subword levels with recurrent as well as convolutional models.
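
The reversibility mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the table `TOY_WUBI`, the helper names `encode` and `decode`, and the space-separated output format are all assumptions made for illustration, and the ASCII codes in the table are placeholders that have not been checked against an official Wubi table.

```python
# Toy character-to-ASCII table. The codes are illustrative placeholders,
# NOT a verified Wubi table; a real system would load a complete mapping.
TOY_WUBI = {"我": "trnt", "你": "wqiy", "好": "vbg"}

# Reversibility requires a one-to-one mapping, so check before inverting.
assert len(set(TOY_WUBI.values())) == len(TOY_WUBI)
TOY_WUBI_INV = {code: ch for ch, code in TOY_WUBI.items()}


def encode(text: str) -> str:
    """Map each character to its ASCII code, separating codes with spaces.

    Raises KeyError for characters missing from the toy table.
    """
    return " ".join(TOY_WUBI[ch] for ch in text)


def decode(encoded: str) -> str:
    """Invert encode(): map each space-separated code back to its character."""
    if not encoded:
        return ""
    return "".join(TOY_WUBI_INV[code] for code in encoded.split(" "))


if __name__ == "__main__":
    sentence = "你好"
    ascii_form = encode(sentence)
    print(ascii_form)
    print(decode(ascii_form) == sentence)
```

Once the text is in ASCII form, a character-level or subword-level model can consume it like any alphabetic input, and the inverse table restores Chinese characters on the output side. How collisions between characters that share a code are resolved is left out of this sketch.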

Statistics

Downloads

20 downloads since deposited on 08 Mar 2019
19 downloads in the past 12 months

Additional indexing

Item Type: Working Paper
Communities & Collections: 07 Faculty of Science > Institute of Neuroinformatics
Dewey Decimal Classification: 570 Life sciences; biology
Language: English
Date: 2018
Deposited On: 08 Mar 2019 11:15
Last Modified: 25 Sep 2019 00:27
Publisher: Arxiv - Computer Science
Series Name: arXiv.org
ISSN: 2331-8422
OA Status: Green
Free access at: Official URL. An embargo period may apply.
Official URL: https://arxiv.org/abs/1805.03330

Download

Green Open Access

Content: Accepted Version
Filetype: PDF
Size: 404kB
Licence: Creative Commons: Attribution 4.0 International (CC BY 4.0)