Does mBERT Understand Romansh? Evaluating Word Embeddings Using Word Alignment


Dolev, Eyal Liron (2023). Does mBERT Understand Romansh? Evaluating Word Embeddings Using Word Alignment. In: SwissText 2023, Neuchâtel, 12 June 2023 - 14 June 2023. Association for Computational Linguistics, 41-53.

Abstract

We test similarity-based word alignment models (SimAlign and awesome-align) in combination with word embeddings from mBERT and XLM-R on parallel sentences in German and Romansh. Since Romansh is an unseen language, we are dealing with a zero-shot setting. Using embeddings from mBERT, both models reach an alignment error rate of 0.22, which outperforms fast_align, a statistical model, and is on par with similarity-based word alignment for seen languages. We interpret these results as evidence that mBERT contains information that can be meaningful and applicable to Romansh.
To evaluate performance, we also present a new trilingual corpus, which we call the DERMIT (DE-RM-IT) corpus, containing press releases made by the Canton of Grisons in German, Romansh and Italian in the past 25 years. The corpus contains 4 547 parallel documents and approximately 100 000 sentence pairs in each language combination. We additionally present a gold standard for German-Romansh word alignment. The data is available at https://github.com/eyldlv/DERMIT-Corpus.
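
For readers unfamiliar with the metric, the alignment error rate (AER) reported above is conventionally computed from a predicted alignment A and a gold standard split into sure links S and possible links P (with S contained in P), as AER = 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|). The following is a minimal sketch of that standard definition, not the authors' evaluation script; the function name and the toy link sets are hypothetical, and whether the DERMIT gold standard distinguishes sure from possible links is not stated in this record.

```python
def alignment_error_rate(predicted, sure, possible):
    """Standard AER: 1 - (|A ∩ S| + |A ∩ P|) / (|A| + |S|).

    Each argument is an iterable of (source_index, target_index) links.
    Sure links are treated as possible links too, so S ⊆ P always holds.
    """
    predicted = set(predicted)
    sure = set(sure)
    possible = set(possible) | sure
    denominator = len(predicted) + len(sure)
    if denominator == 0:
        # No predicted and no sure links: nothing to get wrong.
        return 0.0
    matched = len(predicted & sure) + len(predicted & possible)
    return 1.0 - matched / denominator


# Hypothetical toy data (not taken from the DERMIT gold standard).
sure_links = {(0, 0), (1, 1), (2, 2)}
possible_links = {(3, 2)}
predicted_links = {(0, 0), (1, 1), (2, 3), (3, 2)}

aer = alignment_error_rate(predicted_links, sure_links, possible_links)
```

Lower values are better: an AER of 0 means every predicted link is at least a possible link and every sure link was found.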


Additional indexing

Item Type: Conference or Workshop Item (Speech), not_refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
06 Faculty of Arts > Zurich Center for Linguistics
Dewey Decimal Classification: 410 Linguistics
Language: English
Event End Date: 14 June 2023
Deposited On: 04 Jan 2024 13:02
Last Modified: 20 Mar 2024 13:08
Publisher: Association for Computational Linguistics
Series Name: Proceedings of the Swiss Text Analytics Conference
Additional Information: 8th edition
OA Status: Green
Official URL: https://aclanthology.org/2023.swisstext-1.5
  • Content: Accepted Version
  • Language: English
  • Licence: Creative Commons: Attribution 4.0 International (CC BY 4.0)