National Natural Science Foundation of China (Grant No. 61832018)
National Natural Science Foundation of China (Grant No. 61572025)
Figure 1 (Color online) Evolution of CNN algorithms
Figure 2 (Color online) Algorithm model of LeNet5
Figure 3 (Color online) Multidimensional convolution algorithm
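As a reference point for the multidimensional convolution in Figure 3, below is a minimal NumPy sketch of the direct loop-nest formulation. The NCHW layout, the absence of padding, and the function name `conv2d_direct` are assumptions made for illustration, not details taken from the figure.

```python
import numpy as np

def conv2d_direct(x, w, stride=1):
    """Direct multidimensional convolution (no padding), NCHW layout assumed.
    x: input  (N, C, H, W); w: kernels (K, C, R, S) -> output (N, K, H_out, W_out)."""
    N, C, H, W = x.shape
    K, _, R, S = w.shape
    H_out = (H - R) // stride + 1
    W_out = (W - S) // stride + 1
    y = np.zeros((N, K, H_out, W_out), dtype=x.dtype)
    for n in range(N):                       # batch
        for k in range(K):                   # output channel
            for i in range(H_out):           # output row
                for j in range(W_out):       # output column
                    patch = x[n, :, i*stride:i*stride+R, j*stride:j*stride+S]
                    y[n, k, i, j] = np.sum(patch * w[k])
    return y
```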
Figure 4 (Color online) Data repetition in convolution
Figure 5 (Color online) Relationship among precision, bitwidth, and accuracy in the LeNet algorithm
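Figure 5 relates bitwidth to accuracy; as a rough sketch of the kind of uniform quantization such experiments typically use (the symmetric, max-scaled scheme and the random weights below are assumptions, not the paper's exact setup):

```python
import numpy as np

def quantize_symmetric(x, bits):
    """Uniformly quantize x to a signed grid of the given bitwidth and
    return the dequantized (simulated low-precision) values."""
    qmax = 2 ** (bits - 1) - 1
    peak = np.max(np.abs(x))
    scale = peak / qmax if peak > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

# Example: quantization error of random "weights" at several bitwidths.
w = np.random.randn(1000).astype(np.float32)
for bits in (8, 6, 4, 2):
    err = np.mean(np.abs(w - quantize_symmetric(w, bits)))
    print(f"{bits}-bit mean abs error: {err:.4f}")
```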
Figure 6 (Color online) Acceleration of convolution with Toeplitz matrix
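A minimal sketch of the Toeplitz (im2col) lowering illustrated in Figure 6: each input patch becomes one row of a matrix, so the convolution reduces to a single matrix multiplication. The single-sample CHW layout and the helper names are illustrative assumptions.

```python
import numpy as np

def im2col(x, R, S, stride=1):
    """Unfold a (C, H, W) input into a (H_out*W_out, C*R*S) Toeplitz-style matrix."""
    C, H, W = x.shape
    H_out = (H - R) // stride + 1
    W_out = (W - S) // stride + 1
    cols = np.empty((H_out * W_out, C * R * S), dtype=x.dtype)
    for i in range(H_out):
        for j in range(W_out):
            patch = x[:, i*stride:i*stride+R, j*stride:j*stride+S]
            cols[i * W_out + j] = patch.ravel()
    return cols, H_out, W_out

def conv2d_im2col(x, w, stride=1):
    """Convolution as GEMM: (K, C*R*S) x (C*R*S, H_out*W_out)."""
    K, C, R, S = w.shape
    cols, H_out, W_out = im2col(x, R, S, stride)
    y = w.reshape(K, -1) @ cols.T
    return y.reshape(K, H_out, W_out)
```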
Figure 7 (Color online) Acceleration of convolution with FFT
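For Figure 7, a sketch of FFT-based acceleration of a single-channel 2D convolution: pointwise multiplication in the frequency domain replaces the spatial loop nest. Padding to the full linear-convolution size and cropping the valid region are choices made here for clarity; the cross-check flips the kernel because the FFT computes true convolution, whereas CNN libraries usually implement cross-correlation (the same thing without the flip).

```python
import numpy as np

def conv2d_fft(x, k):
    """Linear 2D convolution of x (H, W) with kernel k (R, S) via the FFT.
    Returns the 'valid' region, matching a sliding-window convolution."""
    H, W = x.shape
    R, S = k.shape
    fh, fw = H + R - 1, W + S - 1          # full linear-convolution size
    X = np.fft.rfft2(x, s=(fh, fw))
    K = np.fft.rfft2(k, s=(fh, fw))
    full = np.fft.irfft2(X * K, s=(fh, fw))
    return full[R-1:H, S-1:W]              # positions with full kernel overlap

# Cross-check against a direct sliding-window (flipped-kernel) convolution.
x = np.random.randn(8, 8)
k = np.random.randn(3, 3)
direct = np.array([[np.sum(x[i:i+3, j:j+3] * k[::-1, ::-1])
                    for j in range(6)] for i in range(6)])
assert np.allclose(conv2d_fft(x, k), direct)
```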
Figure 8 (Color online) Systolic array in TPU
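To make the dataflow of Figure 8 concrete, here is a small cycle-by-cycle model of an output-stationary systolic array computing C = A x B. The skewed operand injection and the array dimensions are generic textbook choices, not TPU specifics (the TPU itself is weight-stationary).

```python
import numpy as np

def systolic_matmul(A, B):
    """Cycle-level model of an output-stationary systolic array.
    PE(i, j) accumulates C[i, j]; A operands travel rightward and B operands
    downward, one PE per cycle, injected at the edges with a diagonal skew."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    a_reg = np.zeros((M, N))   # A value currently held by each PE
    b_reg = np.zeros((M, N))   # B value currently held by each PE
    for t in range(M + N + K - 2):           # enough cycles to drain the skew
        a_reg = np.roll(a_reg, 1, axis=1)    # hop right (wrapped column is overwritten)
        b_reg = np.roll(b_reg, 1, axis=0)    # hop down  (wrapped row is overwritten)
        for i in range(M):                   # inject skewed A at the left edge
            k = t - i
            a_reg[i, 0] = A[i, k] if 0 <= k < K else 0.0
        for j in range(N):                   # inject skewed B at the top edge
            k = t - j
            b_reg[0, j] = B[k, j] if 0 <= k < K else 0.0
        C += a_reg * b_reg                   # every PE performs one MAC per cycle
    return C

A, B = np.random.randn(4, 6), np.random.randn(6, 5)
assert np.allclose(systolic_matmul(A, B), A @ B)
```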
Figure 9 (Color online) Requirements of deep learning hardware acceleration
Figure 10 (Color online) Architecture of CEVA NeuPro
Figure 11 (Color online) Architecture of Synopsys EV6x
Figure 12 (Color online) Cadence Tensilica Vision C5
Figure 13 (Color online) Architecture of STM DSP
Figure 14 (Color online) Multi-core architecture of FT-M7002 DSP
Figure 15 (Color online) Software support for FT-M7002 DSP
Figure 16 (Color online) Architecture of a single core of FT-Matrix
Figure 17 (Color online) Matrix multiplication on FT-Matrix DSP
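Figure 17 maps matrix multiplication onto the 16-lane vector unit. As a rough functional sketch only (the blocking factor of 16 and the scalar-broadcast-plus-vector-FMA scheme are assumptions, not the DSP's actual microcode), the inner kernel can be pictured as:

```python
import numpy as np

LANES = 16  # vector width assumed for illustration

def matmul_vector_lanes(A, B):
    """C = A @ B computed 16 output columns at a time: scalar A[i, k] is
    broadcast to all lanes and multiplied into a 16-wide slice of B's row k,
    mimicking broadcast + vector FMA on a 16-lane VPU."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for j0 in range(0, N, LANES):             # one vector tile of output columns
        j1 = min(j0 + LANES, N)
        for i in range(M):
            acc = np.zeros(j1 - j0)           # vector accumulator (one slot per lane)
            for k in range(K):
                acc += A[i, k] * B[k, j0:j1]  # broadcast scalar, vector FMA
            C[i, j0:j1] = acc
    return C

A, B = np.random.randn(5, 7), np.random.randn(7, 20)
assert np.allclose(matmul_vector_lanes(A, B), A @ B)
```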
Figure 18 (Color online) Matrix transposed transmission of DMA in FT-Matrix
Figure 19 (Color online) Problem of sparse matrices in deep learning hardware
Figure 20 (Color online) Configurable deep learning acceleration based on FT-Matrix
Figure 21 (Color online) Configurable computing array based on FT-Matrix VPU
Figure 22 (Color online) Comparison of MAC structures before and after optimization
Figure 23 (Color online) VGather/VScatter instructions of FT-Matrix
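The VGather/VScatter instructions in Figure 23 move one element per lane to or from arbitrary addresses. A functional model is simply an indexed load and an indexed store; the 16-lane width, the memory size, and the random indices below are illustrative assumptions.

```python
import numpy as np

LANES = 16  # assumed vector width

def vgather(memory, indices):
    """Gather: each lane loads memory[indices[lane]]."""
    return memory[indices]

def vscatter(memory, indices, values):
    """Scatter: each lane stores values[lane] to memory[indices[lane]].
    Duplicate indices collide; in this model one of the conflicting lanes wins."""
    memory[indices] = values

mem = np.arange(64, dtype=np.float32)
idx = np.random.randint(0, 64, size=LANES)
vec = vgather(mem, idx)          # indexed load into a 16-element vector
vscatter(mem, idx, vec * 2.0)    # indexed store back to memory
```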
Hardware platform | Advantages | Disadvantages | Technical solutions |
CPU/GPU | Flexible programming, support for multiple algorithms, and extensive technical accumulation. | Aimed at general-purpose computing; high energy consumption for deep learning acceleration. | CPU/GPU + function expansion/specialized processing modules. |
FPGA | Configurable; short design cycle. | Relatively high unit energy consumption and delay. | Customized for the algorithm. |
ASIC | Customized; advantages in performance, power consumption, and latency. | Long development cycle requiring substantial manpower and material resources; not flexible enough. | Customized for the algorithm. |
Brain-inspired chips | Low power consumption; consistent with the neural network prototype. | Low precision; limited by current technology. | Imitate biological neural networks using new technologies and materials. |
Chips | Convolution: 144$\times$5 | Convolution: 16$\times$5 | Matrix$\times$matrix: 144$\times$144 | Matrix$\times$vector: 144 |
CPU | 0.0013558 | 0.0001026 | 0.000802 | 0.0004747 |
GPU | 0.0002323 | 0.0001902 | 0.0002809 | 0.0002488 |
Matrix | 0.000218406 | 0.000005787 | 0.000026067 | 0.000000541 |
Matrix/CPU | 6.207704917 | 17.72939347 | 30.76686999 | 877.4491682 |
Matrix/GPU | 1.063615468 | 32.86677035 | 10.77607703 | 459.8890943 |
Instruction type | Main function |
1. Flow control | Scalar branch, vector branch, wait, nop, etc. |
2. Scalar load/store | Scalar load/store of half/one/double/quad word with linear or circular addressing. |
3. Scalar MAC1 | Basic operations (+/$-$, $\times$, /), FMA, dot product, complex multiplication, square root, elementary |
4. Scalar MAC2 | functions (sine/cosine/exp/log), format conversion, floating-point logic ops, etc. |
5. Scalar BP | Fixed-point +/$-$, shift, test (=, !=, $>$, $<$, etc.), logical ops, bit ops, broadcast ops, etc. |
6. Vector load/store 1 | (16$\times$) vector load/store of half/one/double/quad word with linear or circular addressing. |
7. Vector load/store 2 | |
8. Vector MAC1 | (16$\times$) vector basic operations (+/$-$, $\times$), FMA, dot product, complex multiplication, |
9. Vector MAC2 | data format conversion, floating-point logic ops, etc. |
10. Vector MAC3 | |
11. Vector BP | (16$\times$) vector fixed-point +/$-$, shift, test, logical ops, bit ops, shuffle, reduction, etc. |
Banks | MCPC $\leq$ 2 | MCPC $\leq$ 3 | MCPC $\leq$ 4 | MCPC $\leq$ 5 | Expected MCPC
4 | 0.7969 | 0.9844 | 0.9999 | 0.9999 | 2.1249 |
8 | 0.5008 | 0.9101 | 0.9902 | 0.9993 | 2.5934 |
16 | 0.1948 | 0.7688 | 0.9629 | 0.9956 | 3.0515 |
32 | 0.0293 | 0.5456 | 0.9073 | 0.9866 | 3.4508 |
64 | 0.0000 | 0.2740 | 0.8041 | 0.9682 | 3.7608 |
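The table above is consistent with a simple access model in which each of B lanes issues one access to a uniformly random one of B banks and every bank serves one access per cycle, so that MCPC is read as the memory cycles needed to serve one vector gather/scatter, i.e., the maximum number of accesses landing on a single bank (this interpretation is an assumption inferred from the table; it matches the exact values for the 4- and 8-bank rows). The sketch below estimates P(MCPC <= k) and E[MCPC] by Monte Carlo under that assumed model.

```python
import numpy as np

def mcpc_stats(banks, trials=100000, seed=0):
    """Monte Carlo estimate of the MCPC distribution, assuming `banks` lanes
    each issue one access to a uniformly random bank and each bank serves
    one access per cycle (MCPC = worst-case accesses on a single bank)."""
    rng = np.random.default_rng(seed)
    hits = rng.integers(0, banks, size=(trials, banks))                 # bank index per access
    counts = np.apply_along_axis(np.bincount, 1, hits, minlength=banks) # per-bank loads
    mcpc = counts.max(axis=1)                                           # cycles needed per trial
    cdf = {k: float(np.mean(mcpc <= k)) for k in (2, 3, 4, 5)}
    return cdf, float(mcpc.mean())

for b in (4, 8, 16, 32, 64):
    cdf, mean = mcpc_stats(b)
    print(b, {k: round(v, 4) for k, v in cdf.items()}, round(mean, 4))
```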