Permanent URL to this publication: http://dx.doi.org/10.5167/uzh-38464
Sennrich, R; Volk, M (2010). MT-based sentence alignment for OCR-generated parallel texts. In: The Ninth Conference of the Association for Machine Translation in the Americas (AMTA 2010), Denver, 31 October 2010 - 4 November 2010.
The performance of current sentence alignment tools varies with the texts to be aligned. We have found existing tools unsuitable for hard-to-align parallel texts and describe an alternative alignment algorithm. The basic idea is to use machine translations of a text, with BLEU as a similarity score, to find reliable alignments that serve as anchor points. The gaps between these anchor points are then filled using BLEU-based and length-based heuristics. We show that this approach outperforms state-of-the-art algorithms in our alignment task, and that this improvement in alignment quality translates into better SMT performance. Furthermore, we show that even length-based alignment algorithms profit from having a machine translation as a point of comparison.
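The anchor-point idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `simple_bleu` and `find_anchors`, the bigram-only smoothed score, the greedy monotonicity filter, and the threshold of 0.5 are all assumptions chosen for brevity. The sketch assumes the source sentences have already been machine-translated into the target language, and it leaves out the gap-filling step.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Smoothed, BLEU-like sentence score in [0, 1] (illustrative, not standard BLEU)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams, ref_ngrams = ngrams(cand, n), ngrams(ref, n)
        overlap = sum((cand_ngrams & ref_ngrams).values())
        if n == 1 and overlap == 0:
            return 0.0  # no shared words at all
        total = sum(cand_ngrams.values())
        # Add-one smoothing (an assumption) so short sentences are not zeroed out.
        log_precision += math.log((overlap + 1) / (total + 1))
    brevity = min(1.0, math.exp(1 - len(ref) / len(cand)))
    return brevity * math.exp(log_precision / max_n)


def find_anchors(translated_src, tgt, threshold=0.5):
    """Return monotone (source index, target index) pairs scoring at least `threshold`."""
    if not tgt:
        return []
    anchors, last_tgt = [], -1
    for i, sentence in enumerate(translated_src):
        scores = [simple_bleu(sentence, t) for t in tgt]
        j = max(range(len(tgt)), key=scores.__getitem__)
        # Greedy filter: keep only confident matches that move forward in the target.
        if scores[j] >= threshold and j > last_tgt:
            anchors.append((i, j))
            last_tgt = j
    return anchors


# Hypothetical data: machine-translated source sentences and target-side sentences.
translated = ["the cat sits on the mat", "it is raining today", "we walk to the hut"]
target = ["the cat sat on the mat", "page 12", "it rains today", "we walk to the hut"]
print(find_anchors(translated, target))
```

In this toy example the first and last sentences become anchors, while the second source sentence scores below the threshold and stays unaligned. The paper fills such gaps between anchors with BLEU-based and length-based heuristics, which this sketch does not attempt.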
|Item Type:||Conference or Workshop Item (Paper), refereed, original work|
|Communities & Collections:||06 Faculty of Arts > Institute of Computational Linguistics|
|DDC:||000 Computer science, knowledge & systems|
|Event End Date:||4 November 2010|
|Deposited On:||14 Dec 2010 13:48|
|Last Modified:||09 Jul 2012 04:29|
|Funders:||Swiss National Science Foundation|