This work was supported by the National Key Research and Development Program of China (Grant No. 2016YFB0200500).
Figure 1. (Color online) The mapping from exascale demands to the Sunway architecture.
Figure 2. (Color online) The computing capabilities of the top three supercomputers from 2008 to 2019. (a) Peak performance; (b) total number of cores; (c) peak performance of each core.
Figure 3. Heterogeneous architecture of the SW many-core processor.
Figure 4. (Color online) The computing super-node of the Sunway supercomputer.
Figure 5. (Color online) The interconnect architecture of the Sunway supercomputer.
Figure 6. (Color online) The energy efficiency of the first-ranked system in the TOP500 and Green500 lists from 2008 to 2019.
Figure 7. (Color online) Schematic illustration of low-power compilation.
Figure 8. (Color online) The M/C ratio of the No. 1 system from 2003 to 2019.
Figure 9. (Color online) The on-chip hierarchical memory architecture of SW many-core processors. GPR: general-purpose register; LDM: local data memory; DMA: direct memory access; CPM: coherence process module; MM: main memory.
Figure 10. (Color online) The pan-tree structure of the interconnect.
Figure 11. (Color online) Accelerated computing model with heterogeneous fusion.
Figure 12. (Color online) Schematic of the initiative-passive fault-tolerance control mechanism.
Figure 13. (Color online) Hardware architecture of the Sunway exascale supercomputer.
Scale | Super-node (256 CPUs) | 1 cabinet (1024 CPUs) | 4 cabinets (4096 CPUs) | Total system (40960 CPUs)
Linpack efficiency (%) | 82.52 | 80.8 | 77.9 | 74.15
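Read as the standard HPL metric, Linpack efficiency is the ratio of sustained to theoretical peak performance,
$\text{Linpack efficiency} = R_{\max}/R_{\rm peak} \times 100\%$,
so the full 40960-CPU configuration sustains roughly 74% of its theoretical peak, compared with about 83% within a single super-node.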
Mainstream processors | SW 26010 | Intel Xeon Phi Knight | NVIDIA Kepler GK110B | AMD GCN 2nd-gen Grenada XT | Intel Xeon E7-8890 v4
Type | CPU | CPU | GPU | GPU | CPU
Core num | 260 | 72 | 2880 | 2816 | 24
Frequency (GHz) | 1.5 | 1.05 | 0.875 | 0.93 | 2.4
Peak FP performance (TFlops) | 3.618 | 1.01 | 1.68 | 2.62 | 0.92
Power efficiency (GF/W) | 10.559 | 4.49 | 7.15 | 9.53 | 5.58
Time (year) | 2014 | 2012 | 2014 | 2015 | 2016
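If the power-efficiency row is interpreted as peak floating-point performance divided by chip power draw (an assumption about how the metric was derived, not a figure stated in the table), the implied chip power can be read back, e.g. for the SW 26010:
$P \approx 3618~{\rm GFlops} / 10.559~{\rm GF/W} \approx 343~{\rm W}$.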
Means | Shallow core hibernation | Array sleep | Full chip sleep |
Granularity | Single core | Computing element array | Full chip
Control mode | OS independent control | Out-of-band control | Out-of-band control |
Control overhead | ms | s | About 1 min |
Power saving (%) | 2 | 80 | 90 |
System | Sunway TaihuLight | Tianhe-2 | Titan | Sequoia | K |
Peak performance (PFlops) | 125.436 | 54.90 | 27.11 | 20.13 | 11.28 |
Linpack performance (PFlops) | 93.015 | 33.86 | 17.59 | 17.17 | 10.51 |
Performance per watt (MFlops/W) | 6051.3 | 1901.54 | 2142.77 | 2176.58 | 1062.69
Performance per cubic meter (TFlops/m$^3$) | 523.1 | 174.1 | 69.9 | 67.8 | 10
Node architecture | One 260-core SW CPU with 4 MPEs and 256 CPEs | Two 12-core Intel CPUs and three 57-core Intel Xeon Phi Coprocessors | One 16-core AMD CPU and one K20x NVIDIA GPU (2688 CUDA cores) | One 16-core PowerPC CPU | One 8-core SPARC64 CPU |
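Assuming the performance-per-watt row is computed from the Linpack figure, the implied total power draw of Sunway TaihuLight is
$P \approx 93.015~{\rm PFlops} / 6051.3~{\rm MFlops/W} \approx 15.4~{\rm MW}$,
which is in line with the roughly 15.4 MW reported for the system on the TOP500 list.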
Nodes | 256 | 256 | 512 | 512 | 1024 | 1024 |
Implementation | Software algorithm | Hardware implementation | Software algorithm | Hardware implementation | Software algorithm | Hardware implementation |
8B full reduction (AND synchronization) | 54.55 | 7.71 | 79.64 | 10.30 | 107.97 | 13.49 |
1 kB broadcast | 28.00 | 9.29 | 33.93 | 9.66 | 37.58 | 11.67 |
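Dividing the software rows by the hardware rows gives the gain from offloading collectives to the interconnect: at 1024 nodes the 8 B full reduction improves by $107.97/13.49 \approx 8.0\times$ and the 1 kB broadcast by $37.58/11.67 \approx 3.2\times$, with the hardware latency growing only mildly as the node count quadruples.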
Type | Typical application and representative algorithm | Application scale on Sunway TaihuLight | Application performance on Sunway TaihuLight | Ranking of application performance on Sunway TaihuLight | Predicted exascale application scale | Predicted exascale application performance
Dense linear algebra | LINPACK | 12.288 million | 93 PFlops | 1 | About 20 million | About 700 PFlops
Sparse linear algebra | HPCG | 343.5 billion nonzero elements | 480 TFlops | 3 | About $10^{12}$ nonzero elements | Over 3 PFlops
Spectral methods | FFT | 16384$^3$ | 97 s/step | Maximum scale | 32768$^3$ | About 90 s/step
Multi-body problem | Universe evolution | 11.2 trillion particles | 21.3 PFlops | Maximum scale | Hundreds of trillions of particles | About 160 PFlops
Structured grids | Stencil computing | $5.1\times10^{11}$ grids | 25.96 PFlops | 2016 Gordon Bell Prize | About 10$^{12}$ grids | About 200 PFlops
Unstructured grids | Throughput computing | 10$^{10}$ grids | – | Parallelism of tens of millions of cores | About 10$^{11}$ grids | Parallelism of tens of millions of cores
MapReduce | MapReduce | 10$^7$ tasks | 12.5 PFlops | Maximum scale | About $2\times10^7$ tasks | About 100 PFlops
Traversal of graphs | BFS | 2$^{40}$ vertices | 23755.7 GTEPS | 2 | 2$^{43}$ vertices | About 14000 GTEPS
Dynamic programming | Sequence comparison | 800 GB gene sequences | Fixed-point computing | Parallelism of tens of millions of cores | Parallelism of tens of millions of cores | Fixed-point computing
Graphical models | Convolutional neural network | – | Core FP computing efficiency of 94% | – | – | Core FP computing efficiency of over 90%
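For scale, GTEPS counts $10^9$ traversed edges per second. For a Graph500-style graph with $2^{40}$ vertices and, assuming the standard edge factor of 16, about $2^{44}$ edges, the measured 23755.7 GTEPS corresponds to roughly $2^{44}/(23755.7\times10^9) \approx 0.74$ s per BFS; whether the reported run used these exact generator settings is an assumption.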