Challenges in the alignment, management and exploitation of large and richly annotated multi-parallel corpora


Graën, Johannes; Clematide, Simon (2015). Challenges in the alignment, management and exploitation of large and richly annotated multi-parallel corpora. In: 3rd Workshop on the Challenges in the Management of Large Corpora, Lancaster, 20 July 2015, 15-20.

Abstract

The availability of large multi-parallel corpora offers an enormous wealth of material to contrastive corpus linguists, translators and language learners, provided that the data can be exploited properly. Necessary preparation steps include sentence and word alignment across multiple languages. Additionally, linguistic annotation such as part-of-speech tagging, lemmatisation, chunking, and dependency parsing facilitates precise querying of linguistic properties and can be used to extend word alignment to sub-sentential groups. Such highly inter-connected data is stored in a relational database to allow for efficient retrieval and linguistic data mining, which may include the statistics-based selection of good example sentences. The varying information needs of contrastive linguists require a flexible linguistic query language for ad hoc searches. Such queries, written in the format of generalised treebank query languages, will be automatically translated into SQL queries.
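
To make the idea of storing aligned, annotated tokens in a relational database more concrete, the following is a minimal sketch and not the authors' actual schema. The table names `token` and `word_alignment`, their columns, the toy data and the helper `aligned_lemmas` are all assumptions introduced for illustration; SQLite is used only because it ships with Python.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical token/alignment schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE token (
            id          INTEGER PRIMARY KEY,
            lang        TEXT NOT NULL,     -- language code, e.g. 'en'
            sentence_id INTEGER NOT NULL,
            position    INTEGER NOT NULL,  -- position within the sentence
            form        TEXT NOT NULL,
            lemma       TEXT NOT NULL,
            pos         TEXT NOT NULL      -- part-of-speech tag
        );
        CREATE TABLE word_alignment (
            src_token INTEGER NOT NULL REFERENCES token(id),
            tgt_token INTEGER NOT NULL REFERENCES token(id)
        );
    """)
    # Toy data (invented): one English sentence aligned with one German sentence.
    tokens = [
        (1, "en", 1, 1, "The", "the", "DET"),
        (2, "en", 1, 2, "houses", "house", "NOUN"),
        (3, "de", 1, 1, "Die", "die", "DET"),
        (4, "de", 1, 2, "Häuser", "Haus", "NOUN"),
    ]
    conn.executemany("INSERT INTO token VALUES (?, ?, ?, ?, ?, ?, ?)", tokens)
    conn.executemany("INSERT INTO word_alignment VALUES (?, ?)", [(1, 3), (2, 4)])
    return conn


def aligned_lemmas(conn, src_lang, src_lemma, src_pos, tgt_lang):
    """Return target-language lemmas aligned to a source lemma with a given POS tag."""
    rows = conn.execute(
        """
        SELECT t.lemma
        FROM token AS s
        JOIN word_alignment AS a ON a.src_token = s.id
        JOIN token AS t ON t.id = a.tgt_token
        WHERE s.lang = ? AND s.lemma = ? AND s.pos = ? AND t.lang = ?
        ORDER BY t.id
        """,
        (src_lang, src_lemma, src_pos, tgt_lang),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_demo_db()
    print(aligned_lemmas(conn, "en", "house", "NOUN", "de"))
```

A query layer of the kind the abstract describes would generate SQL such as the `SELECT` above from a higher-level treebank-style query; here the SQL is written by hand.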

Additional indexing

Item Type: Conference or Workshop Item (Paper), refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 000 Computer science, knowledge & systems
    410 Linguistics
Language: English
Event End Date: 20 July 2015
Deposited On: 04 Aug 2015 09:25
Last Modified: 21 Nov 2017 17:59
Publisher: Institut für Deutsche Sprache
Additional Information: URN: urn:nbn:de:bsz:mh39-38261
Free access at: Official URL. An embargo period may apply.
Official URL: http://ids-pub.bsz-bw.de/files/3826/Graen_Clematide_Challenges_in_the_Alignment_management_and_exploitation_2015.pdf
Related URLs: http://corpora.ids-mannheim.de/cmlc.html
    http://ids-pub.bsz-bw.de/files/3826/cmlc3-proceedings_2015.pdf
    http://ids-pub.bsz-bw.de/frontdoor/index/index/docId/3826

Download

Content: Published Version
Filetype: PDF
Size: 222kB
Licence: Creative Commons: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)