Abstract
Computing word alignments from multilingual word embeddings has been shown to be competitive with statistical word alignment methods. However, the languages used in those experiments were all “seen” languages, i.e., they were part of the training data for the embeddings. In this thesis, I show that multilingual word embeddings taken from mBERT can be used to compute word alignments for the “unseen” language Romansh, aligned against German. The performance is on par with that of a baseline statistical model (fast_align). I also describe the creation of a gold standard for evaluating the quality of word alignments for German–Romansh, as well as the data collection process for compiling a trilingual corpus of press releases in German, Italian and Romansh, published by the Swiss Canton of Grisons. From this corpus, I extracted around 80,000 unique sentence pairs for each language combination.