Parallel subtitle corpora and their applications in machine translation and translatology


Bywood, Lindsay; Volk, Martin; Fishel, Mark; Georgakopoulou, Panayota (2013). Parallel subtitle corpora and their applications in machine translation and translatology. Perspectives: Studies in Translatology, 21(4):595-610.

Abstract

SUMAT is a project funded through the EU ICT Policy Support Programme (2011–2014). It involves four subtitling companies (InVision, DDS, Titelbild, VSI) and five technical partners (ALS, ATC, TextShuttle, University of Maribor, Vicomtech).

For the SUMAT project, translated subtitles for seven language pairs have been collected. Four subtitling companies have contributed to this effort, which has so far resulted in collections numbering between 200,000 and 2 million subtitles per language pair. This paper describes the process of converting, classifying and aligning the subtitles. Conversion to a common text format and cross-language alignment were automatically done, using specially built converters, whilst classifying the subtitles according to text genre was a manual process, performed by the teams harvesting the subtitles.

The resulting subtitle corpora are perfectly suited for various applications. The focus of the SUMAT project is to use them as training material for statistical machine translation systems, and this paper will report on the initial experiences with some of the language pairs. In addition, the parallel corpora may serve as input data for parallel concordancing systems. As part of the project, a small prototype has been built which shows how word-aligned parallel subtitles offer new insights for translation science.

Additional indexing

Item Type: Journal Article, refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 000 Computer science, knowledge & systems; 410 Linguistics
Language: English
Date: 2013
Deposited On: 23 Dec 2013 08:28
Last Modified: 16 Feb 2018 18:41
Publisher: Taylor & Francis
ISSN: 0907-676X
OA Status: Closed
Publisher DOI: https://doi.org/10.1080/0907676X.2013.831920
