Publication:

Subword Evenness (SuE) as a Predictor of Cross-lingual Transfer to Low-resource Languages

Date
2022
Conference or Workshop Item
Published version
cris.lastimport.scopus: 2025-06-20T03:30:25Z
cris.virtual.orcid: 0000-0003-3124-190X
cris.virtualsource.orcid: 8992495d-fe06-4048-a30d-fa13159f83b8
dc.contributor.institution: University of Zurich
dc.date.accessioned: 2023-02-22T11:06:05Z
dc.date.available: 2023-02-22T11:06:05Z
dc.date.issued: 2022-12-11
dc.description.abstract:

Pre-trained multilingual models, such as mBERT, XLM-R and mT5, are used to improve performance on various tasks in low-resource languages via cross-lingual transfer. In this framework, English is usually seen as the most natural choice of transfer language (for fine-tuning or continued training of a multilingual pre-trained model), but recent work has shown that it is often not the best choice. The success of cross-lingual transfer appears to depend on properties of languages that are currently hard to explain: transfer often succeeds between unrelated languages and often cannot be explained by data-dependent factors. In this study, we show that languages written in non-Latin, non-alphabetic scripts (mostly Asian languages) are the best choices for improving performance on the task of Masked Language Modelling (MLM) in a diverse set of 30 low-resource languages, and that the success of the transfer is well predicted by our novel measure of Subword Evenness (SuE). Transferring language models from languages that score low on our measure results in the lowest average perplexity over the target low-resource languages. Our correlation coefficients, obtained with three different pre-trained multilingual models, are consistently higher than those of all other predictors, including text-based measures (type-token ratio, entropy) and linguistically motivated choices (genealogical and typological proximity).
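The abstract names two text-based baseline predictors, type-token ratio and entropy. A minimal sketch of how such measures are typically computed, using their standard definitions (the paper's exact preprocessing and the SuE measure itself are not reproduced here):

```python
import math
from collections import Counter

def type_token_ratio(tokens):
    """Type-token ratio: number of distinct tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)

def char_entropy(text):
    """Shannon entropy (in bits) of the character distribution of a text."""
    counts = Counter(text)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

tokens = "the cat sat on the mat".split()
print(type_token_ratio(tokens))  # 5 distinct types / 6 tokens ≈ 0.833
print(char_entropy("abab"))      # uniform over {a, b} -> 1.0 bit
```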

dc.identifier.scopus: 2-s2.0-85149440220
dc.identifier.uri: https://www.zora.uzh.ch/handle/20.500.14742/205769
dc.language.iso: eng
dc.subject.ddc: 000 Computer science, knowledge & systems
dc.subject.ddc: 410 Linguistics
dc.title:

Subword Evenness (SuE) as a Predictor of Cross-lingual Transfer to Low-resource Languages

dc.type: conference_item
dcterms.accessRights: info:eu-repo/semantics/openAccess
dcterms.bibliographicCitation.url: https://aclanthology.org/2022.emnlp-main.503.pdf
dspace.entity.type: Publication
oairecerif.event.country: United Arab Emirates
oairecerif.event.endDate: 2022-12-11
oairecerif.event.place: Abu Dhabi
oairecerif.event.startDate: 2022-12-07
uzh.contributor.author: Pelloni, Olga
uzh.contributor.author: Shaitarova, Anastassia
uzh.contributor.author: Samardžić, Tanja
uzh.contributor.correspondence: Yes
uzh.contributor.correspondence: No
uzh.contributor.correspondence: No
uzh.document.availability: published_version
uzh.eprint.datestamp: 2023-02-22 11:06:05
uzh.eprint.lastmod: 2023-12-29 08:28:58
uzh.eprint.statusChange: 2023-02-22 11:06:05
uzh.event.presentationType: paper
uzh.event.title: 2022 Conference on Empirical Methods in Natural Language Processing
uzh.event.type: conference
uzh.harvester.eth: Yes
uzh.harvester.nb: No
uzh.identifier.doi: 10.5167/uzh-231103
uzh.oastatus.zora: Green
uzh.publication.citation: Pelloni, Olga; Shaitarova, Anastassia; Samardžić, Tanja (2022). Subword Evenness (SuE) as a Predictor of Cross-lingual Transfer to Low-resource Languages. In: 2022 Conference on Empirical Methods in Natural Language Processing, Abu Dhabi, United Arab Emirates, 7 December 2022 - 11 December 2022.
uzh.publication.freeAccessAt: relatedurl
uzh.publication.originalwork: original
uzh.publication.publishedStatus: final
uzh.scopus.impact: 7
uzh.workflow.eprintid: 231103
uzh.workflow.fulltextStatus: public
uzh.workflow.revisions: 17
uzh.workflow.rightsCheck: off
uzh.workflow.status: archive
Files

Original bundle

Name:
2022.emnlp_main.503.pdf
Size:
1.29 MB
Format:
Adobe Portable Document Format