Corpora and Processing Tools for Non-Standard Contemporary and Diachronic Balkan Slavic


Vukovic, Teodora; Muheim, Nora; Winistörfer, Olivier-Andreas; Makarova, Anastasia; Šimko, Ivan; Bradjan, Sanja (2019). Corpora and Processing Tools for Non-Standard Contemporary and Diachronic Balkan Slavic. In: The 12th International Conference on Recent Advances in Natural Language Processing (RANLP 2019), Varna, Bulgaria, 2 September 2019 - 4 September 2019, 62-68.

Abstract

The paper describes three corpora of different Balkan Slavic varieties that are currently being developed to provide data for the analysis of diatopic and diachronic variation in non-standard Balkan Slavic. The corpora include spoken materials from Torlak and Macedonian dialects, as well as manuscripts of pre-standardized Bulgarian. Apart from the texts, tools for PoS annotation and lemmatization are being created for all varieties, as well as syntactic parsing for the Torlak and Bulgarian varieties. The corpora are built using a unified methodology, relying on best practices and state-of-the-art methods from the field. The uniform methodology allows contrastive analysis of the data from different varieties. The corpora under construction can be considered a crucial contribution to linguistic research on the languages of the Balkans, as they provide the data so far lacking for studies of linguistic variation in Balkan Slavic and enable the comparison of these varieties with other neighbouring languages.

Statistics

Downloads

24 downloads since deposited on 15 Oct 2019
24 downloads in the past 12 months

Additional indexing

Item Type: Conference or Workshop Item (Paper), refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Slavonic Studies
Dewey Decimal Classification: 490 Other languages; 410 Linguistics
Language: English
Event End Date: 4 September 2019
Deposited On: 15 Oct 2019 15:29
Last Modified: 15 Oct 2019 15:29
Publisher: INCOMA
OA Status: Green
Free access at: Official URL. An embargo period may apply.
Official URL: http://lml.bas.bg/ranlp2019/proceedings-RANLPStud-2019.pdf
Related URLs: https://www.slav.uzh.ch/de/institut/mitarbeitende/sprachwiss/tvukovic.html (Author); https://uzh.academia.edu/TeodoraVuković (Author)

Download

Green Open Access

Download PDF  'Corpora and Processing Tools for Non-Standard Contemporary and Diachronic Balkan Slavic'.
Content: Published Version
Language: English
Filetype: PDF
Size: 309kB
Licence: Creative Commons: Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)