Representing variation in a spoken corpus of an endangered dialect: the case of Torlak


Vuković, Teodora (2021). Representing variation in a spoken corpus of an endangered dialect: the case of Torlak. Language Resources and Evaluation: Epub ahead of print.

Abstract

The paper presents a spoken corpus of the endangered Torlak dialect from the Timok area of Southeast Serbia. The dialect exhibits a great deal of variation in the use of non-standard features under the influence of standard Serbian (SSr). To account for this variation, a specific methodology was selected for collection, sampling, transcription and annotation. Between 2015 and 2017, semi-structured interviews were conducted in the field to elicit spontaneous speech in the form of long narratives about traditional culture and history. The corpus comprises 500,697 tokens of semi-orthographic transcripts representing 80 h of recordings from locations evenly distributed across the Timok area of the Torlak dialect zone, thus enabling spatial contrastive analysis. The majority of speakers in the corpus are older people whose language represents the highly non-standard variety. To allow for analysis of language change under the influence of SSr, the corpus also includes a number of younger people whose speech is closer to SSr. Tools for automatic PoS annotation and lemmatization, which were previously lacking, were developed based on existing resources for SSr. For tagger training, a dialect sample of 27,000 manually verified tokens was merged with an existing training set for SSr.

Additional indexing

Item Type: Journal Article, refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Slavonic Studies
Dewey Decimal Classification: 490 Other languages; 410 Linguistics
Uncontrolled Keywords: Linguistics and Language, Education, Library and Information Sciences, Language and Linguistics
Language: English
Date: 9 January 2021
Deposited On: 15 Jan 2021 10:49
Last Modified: 20 Jan 2021 11:10
Publisher: Springer
ISSN: 1574-020X
OA Status: Hybrid
Free access at: Publisher DOI. An embargo period may apply.
Publisher DOI: https://doi.org/10.1007/s10579-020-09522-4
Official URL: https://link.springer.com/article/10.1007/s10579-020-09522-4
Related URLs: https://www.clarin.si/repository/xmlui/handle/11356/1281 (Research Data)
Project Information:
  • Funder: SNSF
  • Grant ID: IZRPZ0_177557
  • Project Title: (Dis-)entangling traditions on the Central Balkans: Performance and perception (TraCeBa)
  • Funder: FP7
  • Grant ID: 200307
  • Project Title:

Download

Hybrid Open Access

Content: Published Version
Language: English
Filetype: PDF
Size: 1MB
Licence: Creative Commons: Attribution 4.0 International (CC BY 4.0)