Permanent URL to this publication: http://dx.doi.org/10.5167/uzh-38464
Sennrich, R; Volk, M (2010). MT-based sentence alignment for OCR-generated parallel texts. In: The Ninth Conference of the Association for Machine Translation in the Americas (AMTA 2010), Denver, 31 October 2010 - 04 November 2010.
Abstract
The performance of current sentence alignment tools varies with the texts to be aligned. We have found existing tools unsuitable for hard-to-align parallel texts and describe an alternative alignment algorithm. The basic idea is to use machine translations of a text and BLEU as a similarity score to find reliable alignments, which then serve as anchor points. The gaps between these anchor points are filled using BLEU-based and length-based heuristics. We show that this approach outperforms state-of-the-art algorithms in our alignment task, and that this improvement in alignment quality translates into better SMT performance. Furthermore, we show that even length-based alignment algorithms profit from having a machine translation as a point of comparison.
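The anchor-point idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`sentence_bleu`, `find_anchors`), the add-one smoothing, the bigram limit, and the `threshold` value are all assumptions chosen for illustration, and the gap-filling step with BLEU-based and length-based heuristics is left out. The sketch assumes the source text has already been machine-translated into the target language, scores every translated sentence against every target sentence with a simplified BLEU-style measure, and keeps the highest-scoring set of pairs that does not cross.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=2):
    """Simplified, smoothed sentence-level BLEU-style score in [0, 1].

    Assumption: whitespace tokenization, n-grams up to max_n, add-one smoothing.
    """
    hyp, ref = hypothesis.lower().split(), reference.lower().split()
    if not hyp or not ref:
        return 0.0
    if not (ngrams(hyp, 1) & ngrams(ref, 1)):
        return 0.0  # no shared word at all: treat as unrelated
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((hyp_counts & ref_counts).values())  # clipped matches
        total = sum(hyp_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    brevity_penalty = min(1.0, math.exp(1 - len(ref) / len(hyp)))
    return brevity_penalty * math.exp(log_precision)


def find_anchors(translated_source, target, threshold=0.5):
    """Return monotonic (source_index, target_index) anchor pairs.

    Each translated source sentence proposes its best-scoring target sentence;
    proposals below `threshold` (a hypothetical cut-off) are dropped, and the
    highest-scoring chain with strictly increasing target indices is kept.
    """
    if not target:
        return []
    candidates = []
    for i, hyp in enumerate(translated_source):
        scores = [sentence_bleu(hyp, ref) for ref in target]
        j = max(range(len(target)), key=scores.__getitem__)
        if scores[j] >= threshold:
            candidates.append((i, j, scores[j]))
    if not candidates:
        return []
    # Dynamic programming over candidates (already sorted by source index).
    best = [score for _, _, score in candidates]
    prev = [-1] * len(candidates)
    for k, (_, jk, sk) in enumerate(candidates):
        for m in range(k):
            if candidates[m][1] < jk and best[m] + sk > best[k]:
                best[k], prev[k] = best[m] + sk, m
    k = max(range(len(candidates)), key=best.__getitem__)
    chain = []
    while k != -1:
        chain.append(candidates[k][:2])
        k = prev[k]
    return chain[::-1]


if __name__ == "__main__":
    translated = [
        "the cat sits on the mat",
        "completely garbled ocr noise here",
        "we climbed the mountain in the morning",
    ]
    target = [
        "the cat sits on the mat",
        "an extra sentence without counterpart",
        "something entirely different",
        "we climbed the mountain in the morning",
    ]
    print(find_anchors(translated, target))
```

In this toy input the garbled second sentence shares no words with any target sentence, so it yields no anchor and would be left to whatever gap-filling step follows. A practical system would likely use a full BLEU implementation and a tuned threshold rather than the fixed values assumed here.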
| Item Type: | Conference or Workshop Item (Paper), refereed, original work |
|---|---|
| Communities & Collections: | 06 Faculty of Arts > Institute of Computational Linguistics |
| DDC: | 000 Computer science, knowledge & systems; 410 Linguistics |
| Language: | English |
| Event End Date: | 04 November 2010 |
| Deposited On: | 14 Dec 2010 14:48 |
| Last Modified: | 09 Jul 2012 06:29 |
| Funders: | Swiss National Science Foundation |
| Official URL: | http://amta2010.amtaweb.org/AMTA/papers/2-14-SennrichVolk.pdf |