ZORA (Zurich Open Repository and Archive)

A Corpus for Automatic Readability Assessment and Text Simplification of German

Battisti, Alessia; Pfütze, Dominik; Säuberli, Andreas; Kostrzewa, Marek; Ebling, Sarah (2020). A Corpus for Automatic Readability Assessment and Text Simplification of German. In: 12th Edition of the Language Resources and Evaluation Conference, Marseille, 11 May 2020 - 16 May 2020, European Language Resources Association.

Abstract

In this paper, we present a corpus for use in automatic readability assessment and automatic text simplification for German. The corpus is compiled from web sources and consists of parallel as well as monolingual-only (simplified German) data amounting to approximately 6,200 documents (nearly 211,000 sentences). As a unique feature, the corpus contains information on text structure (e.g., paragraphs, lines), typography (e.g., font type, font style), and images (content, position, and dimensions). While the importance of considering such information in machine learning tasks involving simplified language, such as readability assessment, has repeatedly been stressed in the literature, we provide empirical evidence for its benefit. We also demonstrate the added value of leveraging monolingual-only data for automatic text simplification via machine translation through applying back-translation, a data augmentation technique.
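The back-translation idea named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it only shows the general technique, in which target-side-only text (here, simplified German) is passed through a reverse model to produce synthetic source sentences, and the resulting pairs are added to the real parallel data. The names `back_translate` and `toy_reverse_model`, the `<synthetic>` marker, and the example sentences are all hypothetical; in practice the reverse model would be a trained simplified-to-standard translation system.

```python
from typing import Callable, List, Tuple

# (standard German source, simplified German target)
Pair = Tuple[str, str]


def back_translate(
    parallel: List[Pair],
    monolingual_simplified: List[str],
    reverse_model: Callable[[str], str],
) -> List[Pair]:
    """Augment parallel data with synthetic pairs built from target-only text."""
    # Each real simplified sentence becomes the target of a new pair whose
    # source side is generated by the reverse (simplified -> standard) model.
    synthetic = [(reverse_model(simple), simple) for simple in monolingual_simplified]
    return parallel + synthetic


def toy_reverse_model(simple: str) -> str:
    # Hypothetical stand-in for a trained simplified-to-standard MT model.
    return "<synthetic> " + simple


# Invented example data, for illustration only.
parallel = [
    ("Die Veranstaltung findet bei jeder Witterung statt.",
     "Das Fest findet bei jedem Wetter statt."),
]
mono = ["Der Eintritt ist gratis."]

augmented = back_translate(parallel, mono, toy_reverse_model)
for source, target in augmented:
    print(source, "->", target)
```

The target side of every synthetic pair is genuine simplified text, so the forward simplification model still learns to produce authentic output even when its synthetic input is noisy.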

Additional indexing

Item Type: Conference or Workshop Item (Paper), refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 000 Computer science, knowledge & systems; 410 Linguistics
Scopus Subject Areas: Social Sciences & Humanities > Language and Linguistics; Social Sciences & Humanities > Education; Social Sciences & Humanities > Library and Information Sciences; Social Sciences & Humanities > Linguistics and Language
Language: English
Event End Date: 16 May 2020
Deposited On: 01 Dec 2020 16:51
Last Modified: 25 Oct 2022 09:37
Publisher: European Language Resources Association
OA Status: Green
Free access at: Official URL. An embargo period may apply.
Official URL: https://www.aclweb.org/anthology/2020.lrec-1.404.pdf
Content: Published Version

Statistics

Citations

9 citations in Web of Science®
19 citations in Scopus®

Downloads

84 downloads since deposited on 01 Dec 2020
5 downloads in the past 12 months