Challenges in the Management of Large Corpora (CMLC-6)


Challenges in the Management of Large Corpora (CMLC-6). Edited by: Banski, Piotr; Kupietz, Marc; Barbaresi, Adrien; Biber, Hanno; Breiteneder, Evelyn; Clematide, Simon; Witt, Andreas (2018). Paris: European Language Resources Association (ELRA).

Abstract

Large corpora require careful design, licensing, collecting, cleaning, encoding, annotation, management, storage, retrieval, analysis, and curation to unfold their potential for a wide range of research questions and users, across a number of disciplines. Apart from the usual CMLC topics that fall into these areas, the 6th edition of the CMLC workshop features a special focus on corpus query and analysis systems and specifically on goals concerning their interoperability.
In the past 5 years, a whole new generation of corpus query engines that overcome limitations on the number of tokens and annotation layers has started to emerge at several research centers. While there seems to be a consensus that there can be no single corpus tool that fulfills the need of all communities and that a degree of heterogeneity is required, the time seems ripe to discuss whether (further, unrestricted) divergence should be avoided in order to allow for some interoperability and reusability – and how this can be achieved. The two most prominent areas where interoperability seems highly desirable are query languages and software components for corpus analysis. The former issue is already partially addressed by the proposed ISO standard Corpus Query Lingua Franca (CQLF). Components for corpus analysis and further processing of results (e.g. for visualization), on the other hand, should in an ideal world be exchangeable and reusable across different platforms, not only to avoid redundancies, but also to foster replicability and a canonization of methodology in NLP and corpus linguistics.
The 6th edition of the workshop is meant to address these issues, notably by including an expert panel discussion with representatives of tool development teams and power users.

Additional indexing

Item Type: Edited Scientific Work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 000 Computer science, knowledge & systems; 410 Linguistics
Language: English
Date: May 2018
Deposited On: 25 Jan 2019 13:14
Last Modified: 17 Sep 2019 19:57
Publisher: European Language Resources Association (ELRA)
Number of Pages: 43
ISBN: 979-10-95546-14-6
OA Status: Green
Free access at: Official URL. An embargo period may apply.
Official URL: http://lrec-conf.org/workshops/lrec2018/W17/index.html

Download

Download PDF  'Challenges in the Management of Large Corpora (CMLC-6)'.
Content: Published Version
Language: English
Filetype: PDF
Size: 2MB
Licence: Creative Commons: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)