Treatment of Markup in Statistical Machine Translation


Müller, Mathias (2017). Treatment of Markup in Statistical Machine Translation. In: Third Workshop on Discourse in Machine Translation, Copenhagen, Denmark, 8 September 2017. Association for Computational Linguistics, 36-46.

Abstract

We present work on handling XML markup in Statistical Machine Translation (SMT). The methods we propose can be used to effectively preserve markup (for instance inline formatting or structure) and to place markup correctly in a machine-translated segment. We evaluate our approaches with parallel data that naturally contains markup or where markup was inserted to create synthetic examples. In our experiments, hybrid reinsertion has proven the most accurate method to handle markup, while alignment masking and alignment reinsertion should be regarded as viable alternatives. We provide implementations of all the methods described and they are freely available as an open-source framework.

Statistics

Downloads

94 downloads since deposited on 03 Oct 2017
11 downloads since 12 months

Additional indexing

Item Type: Conference or Workshop Item (Other), refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 000 Computer science, knowledge & systems; 410 Linguistics
Language: English
Event End Date: 8 September 2017
Deposited On: 03 Oct 2017 13:44
Last Modified: 13 Oct 2023 13:36
Publisher: Association for Computational Linguistics
OA Status: Green
Free access at: Official URL. An embargo period may apply.
Official URL: http://www.aclweb.org/anthology/W/W17/W17-4804.pdf
Related URLs: https://gitlab.cl.uzh.ch/mt/mtrain