Enabling Energy-Efficient Inference for Self-Attention Mechanisms in Neural Networks


Chen, Qinyu; Sun, Congyi; Lu, Zhonghai; Gao, Chang (2022). Enabling Energy-Efficient Inference for Self-Attention Mechanisms in Neural Networks. In: 2022 IEEE 4th International Conference on Artificial Intelligence Circuits and Systems (AICAS), Incheon, South Korea, 13-15 June 2022. IEEE.

Abstract

The study of specialized accelerators tailored for neural networks has become a promising topic in recent years. Existing neural network accelerators are usually designed for convolutional neural networks (CNNs) or recurrent neural networks (RNNs); less attention has been paid to the attention mechanism, an emerging neural network primitive with the ability to identify relations within input entities. Self-attention-oriented models such as the Transformer have achieved great performance on natural language processing, computer vision, and machine translation. However, the self-attention mechanism has intrinsically expensive computational workloads, which increase quadratically with the number of input entities. Therefore, in this work, we propose a software-hardware co-design solution for energy-efficient self-attention inference. A prediction-based approximate self-attention mechanism is introduced to substantially reduce the runtime as well as power consumption, and a specialized hardware architecture is then designed to further increase the speedup. The design is implemented on a Xilinx XC7Z035 FPGA, and the results show that energy efficiency is improved by 5.7x with less than 1% accuracy loss.
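
The quadratic workload noted in the abstract comes from the n-by-n attention score matrix. As a rough, generic illustration of where that cost arises (standard scaled dot-product self-attention, not the paper's prediction-based approximation), a minimal NumPy sketch:

    import numpy as np

    def self_attention(x, w_q, w_k, w_v):
        # Standard scaled dot-product self-attention.
        # x: (n, d) input entities. The score matrix q @ k.T is (n, n),
        # so compute and memory grow quadratically with n.
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.T / np.sqrt(q.shape[-1])         # (n, n): quadratic term
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
        return weights @ v                              # (n, d)

    # Toy usage with n = 8 entities of width d = 4 (names are illustrative).
    rng = np.random.default_rng(0)
    n, d = 8, 4
    x = rng.standard_normal((n, d))
    w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
    print(self_attention(x, w_q, w_k, w_v).shape)       # (8, 4)

Doubling n quadruples the work in the (n, n) score computation, which is why the paper targets this stage for approximation.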

Statistics

Citations

2 citations in Web of Science®
1 citation in Scopus®

Downloads

63 downloads since deposited on 16 Feb 2023
63 downloads in the last 12 months

Additional indexing

Item Type: Conference or Workshop Item (Paper), refereed, original work
Communities & Collections: 07 Faculty of Science > Institute of Neuroinformatics
Dewey Decimal Classification: 570 Life sciences; biology
Scopus Subject Areas: Physical Sciences > Artificial Intelligence
Physical Sciences > Computer Science Applications
Physical Sciences > Computer Vision and Pattern Recognition
Physical Sciences > Hardware and Architecture
Physical Sciences > Human-Computer Interaction
Physical Sciences > Electrical and Electronic Engineering
Language: English
Event End Date: 15 June 2022
Deposited On: 16 Feb 2023 16:24
Last Modified: 17 Feb 2023 21:00
Publisher: IEEE
OA Status: Green
Publisher DOI: https://doi.org/10.1109/aicas54282.2022.9869924
Content: Accepted Version