Energy-Efficient Convolutional Neural Network Accelerators for Edge Intelligence
Aimar, Alessandro. Energy-Efficient Convolutional Neural Network Accelerators for Edge Intelligence. 2021, University of Zurich, Faculty of Science.
Abstract
Over the last ten years, the rise of deep learning has redefined the state of the art in many computer vision and natural language processing tasks, with applications ranging from automated personal assistants and social network filtering to self-driving cars and drug development. The growing popularity of these algorithms is rooted in the exponential increase in computing power available for their training that followed the widespread adoption of GPUs. The resulting gains in accuracy created demand for faster, more power-efficient hardware suited for deployment on edge devices. In this thesis, we propose a set of innovations and technologies belonging to one of the many research lines sparked by this demand, focusing on energy-efficient hardware for convolutional neural networks. We first study how a standard 28 nm CMOS process performs in the context of deep learning accelerator design, giving special consideration to the power and area of standard-cell circuits when reduced-precision arithmetic and short SRAM memory words are used. This analysis shows that the power-efficiency gain from reducing bit precision is non-linear and saturates at a precision of 16 bits. We propose Nullhop, an accelerator that pioneers the combined use of quantization and of the feature-map sparsity typical of convolutional neural networks to boost hardware capabilities. Nullhop's novelty is its ability to skip every multiplication involving a zero-valued activation. It reaches a power efficiency of 3 TOP/s/W with a throughput of almost 0.5 TOP/s in 6.3 mm². We then present a neural network quantization algorithm based on a hardware-software co-design approach and demonstrate its capabilities by training several networks on tasks such as classification, object detection, segmentation, and image generation. The quantization scheme is implemented in Elements, a convolutional neural network accelerator architecture that supports variable weight bit precision as well as sparsity. We demonstrate Elements' capabilities with multiple design parameterizations suited to a wide range of applications; one of them, called Deuterium, reaches an energy efficiency of over 4 TOP/s/W using only 3.3 mm². We further explore the concept of sparsity with a third convolutional neural network accelerator architecture, TwoNullhop, which skips over the zeros of both feature maps and kernels. We tested the TwoNullhop architecture with Carbon, an accelerator that, despite having only 128 multiply-accumulate units and running at a frequency of only 500 MHz, achieves more than 2.4 TOP/s with an energy efficiency of 10.2 TOP/s/W in only 4 mm². The thesis ends with an overview of the challenges and opportunities we foresee in deep learning hardware development, attempting to predict the themes that will dominate the field in the years to come.
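To make the zero-skipping idea in the abstract concrete, the following is a minimal Python sketch, not taken from the thesis and with all names hypothetical, of a convolution loop that skips any multiply-accumulate whose activation operand is zero (as in Nullhop) or whose weight operand is zero (as in TwoNullhop). A real accelerator implements these skips in hardware using compressed sparse encodings rather than per-element branches.

```python
import numpy as np

def zero_skipping_conv2d(activations: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Naive single-channel 2-D convolution (valid padding) that skips every
    multiply-accumulate (MAC) with a zero operand.

    Skipping zero activations mirrors the feature-map sparsity exploited by
    Nullhop; skipping zero kernel weights as well mirrors the dual sparsity
    of TwoNullhop. This is a conceptual software sketch only.
    """
    h, w = activations.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            acc = 0.0
            for ky in range(kh):
                for kx in range(kw):
                    a = activations[y + ky, x + kx]
                    wgt = kernel[ky, kx]
                    if a == 0.0 or wgt == 0.0:
                        continue  # zero operand: the MAC is never issued
                    acc += a * wgt
            out[y, x] = acc
    return out

# ReLU layers leave feature maps highly sparse, so many MACs are skipped.
feature_map = np.maximum(0.0, np.random.randn(8, 8))  # roughly half zeros after ReLU
weights = np.random.randn(3, 3)
weights[np.abs(weights) < 0.5] = 0.0                   # pruned (sparse) kernel
print(zero_skipping_conv2d(feature_map, weights))
```

Because every skipped MAC still counts toward the workload's nominal operation count, exploiting sparsity this way raises effective throughput and energy efficiency well above what the raw number of multiply-accumulate units would suggest.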