
Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures


Tang, Gongbo; Müller, Mathias; Rios, Annette; Sennrich, Rico (2018). Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures. In: Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, 2 November 2018 - 4 November 2018, ACL.

Abstract

Recently, non-recurrent architectures (convolutional, self-attentional) have outperformed RNNs in neural machine translation. CNNs and self-attentional networks can connect distant words via shorter network paths than RNNs, and it has been speculated that this improves their ability to model long-range dependencies. However, this theoretical argument has not been tested empirically, nor have alternative explanations for their strong performance been explored in-depth. We hypothesize that the strong performance of CNNs and self-attentional networks could also be due to their ability to extract semantic features from the source text, and we evaluate RNNs, CNNs and self-attention networks on two tasks: subject-verb agreement (where capturing long-range dependencies is required) and word sense disambiguation (where semantic feature extraction is required). Our experimental results show that: 1) self-attentional networks and CNNs do not outperform RNNs in modeling subject-verb agreement over long distances; 2) self-attentional networks perform distinctly better than RNNs and CNNs on word sense disambiguation.

Statistics

Citations

97 citations in Web of Science®
119 citations in Scopus®

Downloads

97 downloads since deposited on 02 Nov 2018
7 downloads in the past 12 months

Additional indexing

Item Type: Conference or Workshop Item (Speech), refereed, original work
Communities & Collections: 06 Faculty of Arts > Institute of Computational Linguistics
Dewey Decimal Classification: 000 Computer science, knowledge & systems; 410 Linguistics
Language: English
Event End Date: 4 November 2018
Deposited On: 02 Nov 2018 14:13
Last Modified: 13 Apr 2022 07:10
Publisher: ACL
OA Status: Green
Official URL: http://aclweb.org/anthology/D18-1458
Related URLs: https://arxiv.org/pdf/1808.08946.pdf
Project Information:
  • Funder: SNSF; Grant ID: 105212_169888; Project Title: Rich Context in Neural Machine Translation
  • Funder: Chinese Scholarship Council; Grant ID: 201607110016
Content: Published Version