Zurich Open Repository and Archive

Permanent URL to this publication: http://dx.doi.org/10.5167/uzh-63885

Plamada, Magdalena; Volk, Martin (2012). Towards a Wikipedia-extracted alpine corpus. In: The Fifth Workshop on Building and Using Comparable Corpora, Istanbul, Turkey, 26 May 2012.

Published Version: PDF (142 KB)

Abstract

This paper describes a method for extracting parallel sentences from comparable texts. We present the main challenges in creating a German-French corpus for the Alpine domain. We demonstrate that the Wikipedia categorization is difficult to use for extracting domain-specific articles from Wikipedia, and we therefore introduce an alternative information retrieval approach. Sentence alignment algorithms are used to identify semantically equivalent sentences across the Wikipedia articles. Using this approach, we create a corpus of sentence-aligned Alpine texts, which is evaluated both manually and automatically. Results show that even a small collection of extracted texts (approximately 10,000 sentence pairs) can partially improve the performance of a state-of-the-art statistical machine translation system. Thus, the approach is worth pursuing on a larger scale, as well as for other language pairs and domains.

Item Type: Conference or Workshop Item (Paper), refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
DDC: 000 Computer science, knowledge & systems; 410 Linguistics
Language: English
Event End Date: 26 May 2012
Deposited On: 27 Jul 2012 09:06
Last Modified: 21 Oct 2012 04:06
Free access at: Official URL. An embargo period may apply.
Official URL: http://www.lrec-conf.org/proceedings/lrec2012/workshops/16.BUCC2012%20Proceedings.pdf
Related URLs: http://hnk.ffzg.hr/5bucc2012/
Citations: Google Scholar™
