Automatic authorship attribution based on character n-grams in Swiss German


Oppliger, Rahel (2016). Automatic authorship attribution based on character n-grams in Swiss German. In: KONVENS 2016, Bochum, 19 September 2016 - 21 September 2016, 177-185.

Abstract

Automatic authorship attribution aims to train computers to identify the author of a disputed text based on idiolectal language features. When confronted with non-standard data – in the present study Swiss German instant messages – language-specific NLP toolkits are often unavailable, limiting the availability of features to classify texts. Thus, the approach I propose for Swiss German is based on character n-grams, which not only avoids the problem of a lack of available NLP tools, but – in addition to being a proven successful feature for authorship attribution – allows the capturing of orthographical idiosyncrasies. It thus allows the exploitation of Swiss German’s lack of standardised spelling rules, turning the challenge that Swiss German presents as non-standard data into an advantage. Different lengths of n-grams as features of a Naïve Bayes classifier combined with varying sizes of training and test corpora were tested, and 6- and 7-grams were found to faultlessly identify authors for all combinations considered. The number of distinctive n-grams in an author’s data set was found to be a determining factor for the classifier’s success, highlighting the benefits of exploiting Swiss German’s non-standard nature for authorship identification.
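
To make the approach in the abstract concrete, the following is a minimal sketch of a Naïve Bayes classifier over character n-grams. It is not the paper's implementation: the abstract does not state the classifier variant or smoothing, so a multinomial model with add-one smoothing is assumed here. The names `char_ngrams` and `NgramNaiveBayes`, the author labels "A" and "B", and the toy messages are all hypothetical and invented for illustration; they are not taken from the study's corpus. The sketch uses n = 3 only because the toy texts are very short, whereas the abstract reports 6- and 7-grams as the most successful lengths.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text` (spaces included)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


class NgramNaiveBayes:
    """Multinomial Naive Bayes over character n-gram counts.

    Assumption: add-one (Laplace) smoothing; the abstract does not specify this.
    """

    def __init__(self, n):
        self.n = n
        self.counts = defaultdict(Counter)  # author -> n-gram counts
        self.docs = Counter()               # author -> number of training texts
        self.vocab = set()                  # all n-grams seen in training

    def fit(self, texts, authors):
        for text, author in zip(texts, authors):
            grams = char_ngrams(text, self.n)
            self.counts[author].update(grams)
            self.vocab.update(grams)
            self.docs[author] += 1
        return self

    def log_scores(self, text):
        """Log prior plus summed smoothed log likelihoods, per author."""
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.docs.values())
        vocab_size = len(self.vocab)
        scores = {}
        for author, counts in self.counts.items():
            total = sum(counts.values())
            score = math.log(self.docs[author] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            scores[author] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy messages: two hypothetical authors who spell the same words differently.
train_texts = [
    "ich chume hüt nöd hei",
    "ich ha kei ziit gha",
    "i chumä hüt nid hei",
    "i ha ke zyt gha",
]
train_authors = ["A", "A", "B", "B"]

model = NgramNaiveBayes(n=3).fit(train_texts, train_authors)
print(model.predict("ich bi nöd dehei"))  # prints A
print(model.predict("i bi nid dehei"))    # prints B
```

In this toy setting the decision is driven by n-grams that occur in only one author's training data (for example "nöd" versus "nid"), which loosely mirrors the abstract's observation that distinctive n-grams are a determining factor for the classifier's success.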

Statistics

Downloads

4 downloads since deposited on 19 Mar 2019
4 downloads since 12 months

Additional indexing

Item Type: Conference or Workshop Item (Paper), refereed, original work
Communities & Collections: 06 Faculty of Arts > English Department
Dewey Decimal Classification: 820 English & Old English literatures
Language: English
Event End Date: 21 September 2016
Deposited On: 19 Mar 2019 16:16
Last Modified: 19 Mar 2019 20:30
Publisher: Sprachwissenschaftliches Institut, Ruhr-Universität Bochum
Series Name: Bochumer Linguistische Arbeitsberichte
Number: 16
ISSN: 2190-0949
Additional Information: Title of the publication: Proceedings of the 13th Conference on Natural Language Processing (KONVENS) Bochum, Germany, September 19–21, 2016
OA Status: Green
Official URL: https://www.linguistics.rub.de/konvens16/pub/22_konvensproc.pdf
Related URLs: https://www.linguistics.rub.de/konvens16/index.html (Organisation)
https://www.linguistics.rub.de/forschung/arbeitsberichte/ (Organisation)

Download

Download PDF  'Automatic authorship attribution based on character n-grams in Swiss German'.
Content: Published Version
Language: English
Filetype: PDF
Size: 281kB