Challenges in the Management of Large Corpora (CMLC-7) 2019


Challenges in the Management of Large Corpora (CMLC-7) 2019. Edited by: Banski, Piotr; Barbaresi, Adrien; Biber, Hanno; Breiteneder, Evelyn; Clematide, Simon; Kupietz, Marc; Lüngen, Harald; Iliadi, Caroline (2019). Mannheim: Leibniz-Institut für Deutsche Sprache.

Abstract

Large textual datasets require careful design, collection, cleaning, encoding, annotation, storage, retrieval, and curation to be of use for a wide range of research questions and to users across a number of disciplines. A growing number of national and other very large corpora are being made available, many historical archives are being digitised, numerous publishing houses are opening their textual assets for text mining, and many billions of words can be quickly sourced from the web and online social media. A number of key themes and questions emerge that are of interest to the contributing research communities: (a) what can be done to deal with IPR and data protection issues? (b) what sampling techniques can we apply? (c) what quality issues should we be aware of? (d) what infrastructures and frameworks are being developed for the efficient storage, annotation, analysis and retrieval of large datasets? (e) what affordances do visualisation techniques offer for the exploratory analysis of corpora? (f) what kinds of APIs or other means of access would make the corpus data as widely usable as possible without interfering with legal restrictions? (g) how can we guarantee that corpus data remain available and usable in a sustainable way? This year’s event focused primarily on huge and complex datasets, across the entire spectrum of their life cycle: from the selection of data (including organisational and legal issues) and modelling of the eventual resources, through curation and all the way to analysis and visualisation. Attention was also paid to the ecosystem in which datasets thrive and interact – with interoperability being one of the meeting’s leitmotifs.

Downloads

13 downloads since deposit on 17 Feb 2020
13 downloads in the past 12 months

Additional indexing

Item Type: Edited Scientific Work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 000 Computer science, knowledge & systems; 400 Language
Uncontrolled Keywords: comparable corpora; corpus infrastructures; corpus linguistics; corpus management; corpus processing; deduplication; parallel corpora; web corpora
Language: English
Date: 22 July 2019
Deposited On: 17 Feb 2020 09:42
Last Modified: 17 Feb 2020 09:42
Publisher: Leibniz-Institut für Deutsche Sprache
Number of Pages: 43
OA Status: Green
Free access at: Publisher DOI. An embargo period may apply.
Publisher DOI: https://doi.org/10.14618/ids-pub-8998
Official URL: http://corpora.ids-mannheim.de/cmlc-2019.html
Other Identification Number: urn:nbn:de:bsz:mh39-89986

Download

Green Open Access

Download PDF: 'Challenges in the Management of Large Corpora (CMLC-7) 2019'
Content: Published Version
Language: English
Filetype: PDF
Size: 2MB
Licence: Creative Commons: Attribution 4.0 International (CC BY 4.0)