Abstract
The past decade has seen a resurgence of Deep Learning (DL) driven by the rapid advancement of computational power and the explosion of data. The massive parallel processing capacity of Graphics Processing Unit (GPU) and Application Specific Integrated Circuit (ASIC) clusters in the cloud has enabled the training of large-scale Deep Neural Network (DNN) models, but such cloud training consumes a considerable amount of power and risks leaking private data. Local learning on edge devices is therefore becoming increasingly important in privacy-sensitive applications. However, the limited power budget and computational resources of edge devices pose challenges to edge training.
Efficient edge computing for DL requires optimized hardware architectures on customizable platforms such as the Field Programmable Gate Array (FPGA) and the ASIC. Previous work has also investigated several optimization techniques to alleviate the memory bottleneck and speed up training computations, such as quantizing data to lower bit precision or creating sparsity in the network inputs and weights. While the acceleration of Convolutional Neural Networks (CNN) has been extensively studied, less work has focused on the training of Recurrent Neural Networks (RNN), which are useful for applications that involve the processing of temporal sequences, such as voice wake-up, Keyword Spotting (KWS), and speech recognition on the edge. Because of the fully-connected neurons in RNNs, the computation consists mainly of Matrix-Vector Multiplication (MxV), which is a memory-bound operation. The limited weight reuse available in RNNs also impedes parallelization. Moreover, the backpropagation-based training process requires the transposed weight matrix, which makes it difficult to maintain consistent throughput and to compute with sparse data during training.
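To illustrate why training stresses memory in both directions, the following minimal NumPy sketch shows a single fully-connected (MxV) layer: the forward pass streams the weight matrix row-wise with little reuse, while the backward pass needs the transposed matrix. The layer sizes and variable names are illustrative only and do not correspond to any specific network in this thesis.

```python
import numpy as np

# Illustrative sizes only: one fully-connected (MxV) layer.
n_in, n_out = 256, 128
W = np.random.randn(n_out, n_in).astype(np.float32)  # weight matrix
x = np.random.randn(n_in).astype(np.float32)         # input vector

# Forward pass: y = W x. Each output element reads one full row of W
# and reuses it only once, so the operation is memory-bound.
y = W @ x

# Backward pass: propagating the output gradient dy to the input needs the
# *transposed* weight matrix, dx = W^T dy, which reverses the memory access
# pattern (column-wise instead of row-wise) unless the hardware or the data
# layout is adapted.
dy = np.random.randn(n_out).astype(np.float32)        # gradient w.r.t. y
dx = W.T @ dy                                         # gradient w.r.t. x
dW = np.outer(dy, x)                                  # weight gradient
```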
In this thesis, we address these challenges using an algorithm-hardware co-design approach. We first designed a reconfigurable Processing Element (PE) architecture to overcome the transposed-matrix problem. The associated accelerator, called Efficient Incremental Learning on the Edge (EILE), showed nearly 100% PE utilization and preserved an efficient burst-mode Dynamic Random Access Memory (DRAM) access pattern across the different stages of training. Next, we proposed a sparse RNN training algorithm that exploits bio-inspired temporal sparsity in both the forward and backward propagation phases of training. Experimental results demonstrated that this algorithm can significantly reduce computational and memory costs without sacrificing accuracy. For example, training a 56k-parameter Long Short-Term Memory (LSTM) RNN on the Google Speech Commands dataset with this algorithm showed a reduction of ∼80% in matrix operations with negligible accuracy loss, and a reduction of ∼80% in memory access for batch-1 training. Finally, we presented the Recurrent Neural Network Training Accelerator with Temporal Sparsity (RENETA), the first RNN training accelerator that exploits temporal sparsity in the training process. By adopting the sparse RNN training algorithm, RENETA can efficiently exploit temporal sparsity, and simulation results showed a speedup factor of 4.9X-9.6X for sparsity ranging from 80% to 90%. The pre-layout ASIC synthesis results of RENETA with 16 PEs operating at 200 MHz in a 65 nm technology node showed a theoretical throughput of 6.4 GOp/s without sparsity, a power consumption of 19.1 mW, and an area of 0.28 mm². Combining these synthesis results with the simulated speedup, RENETA achieved an effective throughput of 31.4 GOp/s to 61.5 GOp/s (80%-90% sparsity) and an effective core energy efficiency of 1.64 TOp/s/W to 3.22 TOp/s/W. A larger version of RENETA with 256 PEs showed a theoretical throughput of 102.4 GOp/s, a power consumption of 231.5 mW, and an area of 3.32 mm². With the simulated speedup factor of 4.9X-9.6X, the effective throughput is 0.50 TOp/s to 0.98 TOp/s and the effective core energy efficiency is 2.17 TOp/s/W to 4.25 TOp/s/W. RENETA significantly reduces DRAM access for batch-1 training and saves substantial energy, making it particularly useful for local learning on edge devices.
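For reference, the effective figures quoted above follow from scaling the theoretical throughput by the simulated speedup factor and dividing by the core power; the sketch below shows this arithmetic for the 16-PE configuration at 80% sparsity, under the assumption that the core power is unchanged when sparsity is exploited.

```latex
% Sketch of the arithmetic behind the quoted effective figures (16 PEs, 80% sparsity),
% assuming core power is unchanged when sparsity is exploited.
\begin{align*}
  \text{Effective throughput} &= \text{theoretical throughput} \times \text{speedup}
    = 6.4\,\text{GOp/s} \times 4.9 \approx 31.4\,\text{GOp/s},\\
  \text{Core energy efficiency} &= \frac{\text{effective throughput}}{\text{core power}}
    = \frac{31.4\,\text{GOp/s}}{19.1\,\text{mW}} \approx 1.64\,\text{TOp/s/W}.
\end{align*}
```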