Challenges in the alignment, management and exploitation of large and richly annotated multi-parallel corpora


Graën, Johannes; Clematide, Simon (2015). Challenges in the alignment, management and exploitation of large and richly annotated multi-parallel corpora. In: 3rd Workshop on the Challenges in the Management of Large Corpora, Lancaster, 20 July 2015. Institut für Deutsche Sprache, 15-20.

Abstract

The availability of large multi-parallel corpora offers an enormous wealth of material to contrastive corpus linguists, translators and language learners, if we can exploit the data properly. Necessary preparation steps include sentence and word alignment across multiple languages. Additionally, linguistic annotation such as part-of-speech tagging, lemmatisation, chunking, and dependency parsing facilitates precise querying of linguistic properties and can be used to extend word alignment to sub-sentential groups. Such highly inter-connected data is stored in a relational database to allow for efficient retrieval and linguistic data mining, which may include the statistics-based selection of good example sentences. The varying information needs of contrastive linguists require a flexible linguistic query language for ad hoc searches. Such queries in the format of generalised treebank query languages will be automatically translated into SQL queries.

Additional indexing

Item Type: Conference or Workshop Item (Paper), refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 000 Computer science, knowledge & systems; 410 Linguistics
Language: English
Event End Date: 20 July 2015
Deposited On: 04 Aug 2015 09:25
Last Modified: 27 Nov 2020 07:23
Publisher: Institut für Deutsche Sprache
Additional Information: URN: urn:nbn:de:bsz:mh39-38261
OA Status: Green
Free access at: Official URL. An embargo period may apply.
Official URL: http://ids-pub.bsz-bw.de/files/3826/Graen_Clematide_Challenges_in_the_Alignment_management_and_exploitation_2015.pdf
Related URLs:
http://corpora.ids-mannheim.de/cmlc.html
http://ids-pub.bsz-bw.de/files/3826/cmlc3-proceedings_2015.pdf
http://ids-pub.bsz-bw.de/frontdoor/index/index/docId/3826
  • Content: Published Version
  • Licence: Creative Commons: Attribution-NonCommercial-NoDerivs 3.0 Unported (CC BY-NC-ND 3.0)