A Corpus for Automatic Readability Assessment and Text Simplification of German


Battisti, Alessia; Pfütze, Dominik; Säuberli, Andreas; Kostrzewa, Marek; Ebling, Sarah (2020). A Corpus for Automatic Readability Assessment and Text Simplification of German. In: 12th Edition of the Language Resources and Evaluation Conference (LREC 2020), Marseille, 11 May 2020 - 16 May 2020.

Abstract

In this paper, we present a corpus for use in automatic readability assessment and automatic text simplification for German. The corpus is compiled from web sources and consists of parallel as well as monolingual-only (simplified German) data amounting to approximately 6,200 documents (nearly 211,000 sentences). As a unique feature, the corpus contains information on text structure (e.g., paragraphs, lines), typography (e.g., font type, font style), and images (content, position, and dimensions). While the importance of considering such information in machine learning tasks involving simplified language, such as readability assessment, has repeatedly been stressed in the literature, we provide empirical evidence for its benefit. We also demonstrate the added value of leveraging monolingual-only data for automatic text simplification via machine translation through applying back-translation, a data augmentation technique.

Additional indexing

Item Type: Conference or Workshop Item (Paper), refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 000 Computer science, knowledge & systems; 410 Linguistics
Scopus Subject Areas: Social Sciences & Humanities > Language and Linguistics; Social Sciences & Humanities > Education; Social Sciences & Humanities > Library and Information Sciences; Social Sciences & Humanities > Linguistics and Language
Language: English
Event End Date: 16 May 2020
Deposited On: 01 Dec 2020 16:51
Last Modified: 08 Feb 2021 10:37
Publisher: European Language Resources Association
OA Status: Green
Free access at: Official URL. An embargo period may apply.
Official URL: https://www.aclweb.org/anthology/2020.lrec-1.404.pdf

Content: Published Version
Filetype: PDF
Size: 219kB