Automatic Cluster Analysis of Texts in Simplified German


Battisti, Alessia. Automatic Cluster Analysis of Texts in Simplified German. 2019, University of Zurich, Faculty of Arts.

Abstract

Text simplification is the process of reducing the lexical and syntactic complexity of a text while preserving most of the original information content [Saggion, 2017, 1]. This process aims at making texts accessible for everyone, including persons with low literacy skills, cognitive or learning disabilities, aphasia or dementia, among others. Because of the heterogeneity of the target users, simplified German as an instance of simplified language has been conceptualised at multiple complexity levels [Bredel and Maaß, 2016; Bock, 2014; Kellermann, 2014]. However, to date neither guidelines nor evidence support this claim. In this master's thesis, I present an approach to automatically analyse existing texts in simplified German, with the goal of investigating evidence of multiple complexity levels. This approach was tested with two different corpora in simplified German. The first task in my analysis is to address a key question in text simplification research, namely the identification of complexity structures of given texts. This includes the creation of a feature framework reflecting the linguistic and structural characteristics of texts in simplified German. The second task is to cluster documents by exploring various unsupervised algorithms and combinations of the previously extracted features. In the third task, the output of the cluster analysis is validated to calculate its robustness; finally, the clustering results are linguistically interpreted to identify feature behaviours. The results show that clustering techniques are able to discriminate among texts in simplified German, suggesting that some groups of texts share a high degree of linguistic similarity. This thesis emphasises the necessity of exploring not only linguistic features but also structural and layout characteristics of simplified language in order to meet the requirements of the various target users.
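
The three tasks named in the abstract (feature extraction, unsupervised clustering, validation) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: this page does not state which features, algorithms, or validation measures the thesis uses, so the sketch substitutes two toy surface features (mean sentence length and mean word length), plain k-means, and the mean silhouette coefficient. The function names and the four sample texts are hypothetical.

```python
import math
import re
from statistics import mean, pstdev


def surface_features(text):
    """Two toy features: mean sentence length (words), mean word length (chars)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"\w+", text)
    return (len(words) / len(sentences), mean(len(w) for w in words))


def standardise(points):
    """Z-score each feature so that no single scale dominates the distances."""
    columns = list(zip(*points))
    means = [mean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard against constant features
    return [tuple((x - m) / s for x, m, s in zip(p, means, stds)) for p in points]


def kmeans(points, k, iterations=20):
    """Plain k-means with deterministic farthest-first initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(mean(col) for col in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Mean silhouette coefficient; assumes at least two non-empty clusters."""
    scores = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:  # singleton cluster: silhouette is taken to be 0
            scores.append(0.0)
            continue
        a = mean(own)  # mean distance to the point's own cluster
        b = min(       # mean distance to the nearest other cluster
            mean(math.dist(p, q) for q, lab in zip(points, labels) if lab == other)
            for other in set(labels) if other != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return mean(scores)


# Hypothetical sample texts: two short and simple, two long and complex.
docs = [
    "Das ist ein Haus. Das Haus ist rot. Wir gehen rein.",
    "Ich habe einen Hund. Der Hund ist lieb. Er mag Ball.",
    "Die Vereinfachung sprachlicher Strukturen ermöglicht unterschiedlichen "
    "Zielgruppen einen barrierefreien Zugang zu behördlichen Informationen.",
    "Wissenschaftliche Untersuchungen belegen, dass syntaktische Komplexität das "
    "Leseverständnis erheblich beeinträchtigen kann, insbesondere bei "
    "eingeschränkter Lesekompetenz.",
]
points = standardise([surface_features(d) for d in docs])
labels = kmeans(points, k=2)
score = mean_silhouette(points, labels)
print(labels, round(score, 2))
```

In this toy setting the two short texts fall into one cluster and the two long ones into the other. A silhouette value close to 1 indicates well-separated clusters; on real corpora, a richer feature set and several clustering algorithms would need to be compared, as the abstract describes.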

Statistics

Downloads

31 downloads since deposited on 05 Feb 2021
31 downloads in the past 12 months

Additional indexing

Item Type: Master's Thesis
Referees: Ebling Sarah, Volk Martin
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 000 Computer science, knowledge & systems; 410 Linguistics
Uncontrolled Keywords: Simplified German, automatic readability assessment, automatic text simplification, Simplified
Language: English
Date: 2019
Deposited On: 05 Feb 2021 16:35
Last Modified: 08 Feb 2021 14:23
OA Status: Green

Download

Green Open Access

Download PDF: 'Automatic Cluster Analysis of Texts in Simplified German'
Content: Published Version
Language: English
Filetype: PDF
Size: 4MB