ZORA (Zurich Open Repository and Archive)

Quality at a Glance: An Audit of Web-Crawled Multilingual Datasets

Abstract

With the success of large-scale pre-training and multilingual modeling in Natural Language Processing (NLP), recent years have seen a proliferation of large, Web-mined text datasets covering hundreds of languages. We manually audit the quality of 205 language-specific corpora released with five major public datasets (CCAligned, ParaCrawl, WikiMatrix, OSCAR, mC4). Lower-resource corpora have systematic issues: At least 15 corpora have no usable text, and a significant fraction contains less than 50% sentences of acceptable quality. In addition, many are mislabeled or use nonstandard/ambiguous language codes. We demonstrate that these issues are easy to detect even for non-proficient speakers, and supplement the human audit with automatic analyses. Finally, we recommend techniques to evaluate and improve multilingual corpora and discuss potential risks that come with low-quality data releases.

Additional indexing

Item Type: Journal Article, refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification:
  • 410 Linguistics
  • 000 Computer science, knowledge & systems
Scopus Subject Areas:
  • Social Sciences & Humanities > Communication
  • Physical Sciences > Human-Computer Interaction
  • Social Sciences & Humanities > Linguistics and Language
  • Physical Sciences > Computer Science Applications
  • Physical Sciences > Artificial Intelligence
Language: English
Date: 31 January 2022
Deposited On: 04 Dec 2024 14:15
Last Modified: 05 Dec 2024 21:00
Publisher: Massachusetts Institute of Technology Press
ISSN: 2307-387X
OA Status: Gold
Publisher DOI: https://doi.org/10.1162/tacl_a_00447
Full text (PDF):
  • Content: Published Version
  • Language: English
  • Licence: Creative Commons: Attribution 4.0 International (CC BY 4.0)

Citations

55 citations in Web of Science®
145 citations in Scopus®