Domain adaptation for translation models in statistical machine translation - Zurich Open Repository and Archive


Sennrich, Rico. Domain adaptation for translation models in statistical machine translation. 2013, University of Zurich, Faculty of Arts.

Abstract

We investigate methods to adapt translation models in SMT to a specific target domain. We discuss two major problems: unknown words caused by data sparseness in the (in-domain) training data, and ambiguities arising from out-of-domain parallel texts with different domain-specific translations. We propose novel solutions to both problems.
The main contributions of this thesis are as follows:
* We present a novel translation model architecture that supports domain adaptation at decoding time from a vector of component models. The combination is implemented through instance weighting, and all statistics necessary for the computation of translation probabilities are stored in the models.
* We present an architecture to combine multiple MT systems, using techniques and ideas from domain adaptation. The hypotheses by external MT systems are treated as out-of-domain knowledge, and combined with in-domain data through instance weighting.
* We introduce a sentence alignment algorithm that is able to robustly align even noisy parallel texts. We found that higher-quality sentence alignment of the in-domain parallel text has a significant effect on translation quality in our target domain.
* We propose new translation model features that express how flexible, or general, translation units are, in order to prevent translations that only occur in the context of multiword expressions from being overgeneralised.
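
The instance-weighting combination named in the first contribution can be illustrated with a small sketch. It is not the thesis implementation: the function name `weighted_translation_probs`, the toy counts, and the weights are all hypothetical, and it assumes each component model stores raw co-occurrence counts for (source, target) pairs. Under that assumption, a combined probability is the weighted sum of pair counts divided by the weighted sum of source counts, so the weights can be changed at decoding time without retraining the components.

```python
from collections import defaultdict


def weighted_translation_probs(component_counts, weights):
    """Combine count-based component models into p(target | source).

    component_counts: list of dicts mapping (source, target) -> count,
                      one dict per component model (hypothetical format).
    weights:          one non-negative weight per component model.
    """
    joint = defaultdict(float)     # weighted count of each (source, target) pair
    marginal = defaultdict(float)  # weighted count of each source phrase
    for counts, weight in zip(component_counts, weights):
        for (src, tgt), count in counts.items():
            joint[(src, tgt)] += weight * count
            marginal[src] += weight * count
    return {
        pair: value / marginal[pair[0]]
        for pair, value in joint.items()
        if marginal[pair[0]] > 0
    }


# Toy counts (invented for illustration): the small in-domain model prefers
# "Ufer" for "bank", the large out-of-domain model prefers "Bank".
in_domain = {("bank", "Ufer"): 8, ("bank", "Bank"): 2}
out_of_domain = {("bank", "Bank"): 90, ("bank", "Ufer"): 10}

uniform = weighted_translation_probs([in_domain, out_of_domain], [1.0, 1.0])
adapted = weighted_translation_probs([in_domain, out_of_domain], [10.0, 1.0])
```

With uniform weights the out-of-domain counts dominate; raising the in-domain weight shifts probability mass towards the in-domain translation while every component's statistics stay unchanged.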

Downloads

358 downloads since deposited on 14 Jan 2014
93 downloads in the past 12 months

Additional indexing

Item Type: Dissertation
Referees: Volk M, Schwenk H
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 000 Computer science, knowledge & systems; 410 Linguistics
Language: English
Date: 2013
Deposited On: 14 Jan 2014 15:50
Last Modified: 05 Apr 2016 17:23
Number of Pages: 148

Download

Content: Published Version
Language: English
Filetype: PDF
Size: 976kB
