
  • Received: August 18, 2020
  • Accepted: November 3, 2020
  • Published: February 3, 2021

Abstract


Acknowledgment

This work was supported by National Key Research and Development Project of China (Grant No. 2016YFB0200500).


References

[1] Moore G E. Cramming more components onto integrated circuits (reprinted from Electronics, volume 38, number 8, April 19, 1965, pp. 114 ff.). IEEE Solid-State Circuits Soc Newsl, 2006, 11: 33-35

[2] Dennard R H, Gaensslen F H, Yu H N. Design of ion-implanted MOSFET's with very small physical dimensions. IEEE J Solid-State Circuits, 1974, 9: 256-268

[3] Agerwala T. Challenges on the road to exascale computing. In: Proceedings of the 22nd Annual International Conference on Supercomputing, 2008. 2

[4] Alvin K, Barrett B, Brightwell R. On the path to exascale. Int J Distributed Syst Technologies, 2010, 1: 1-22

[5] Beckman P. Looking toward exascale computing. In: Proceedings of the 9th International Conference on Parallel and Distributed Computing, Applications and Technologies, 2008. 3

[6] Balaprakash P, Buntinas D, Chan A, et al. Exascale workload characterization and architecture implications. In: Proceedings of IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2013. 120-121

[7] Dally B. Power, programmability, and granularity: the challenges of exascale computing. In: Proceedings of IEEE International Test Conference, 2011. 12

[8] Hluchy L, Bobák M, Müller H, et al. Heterogeneous exascale computing. In: Recent Advances in Intelligent Engineering. Cham: Springer, 2020. 81-110

[9] Kogge P, Shalf J. Exascale computing trends: adjusting to the "new normal" for computer architecture. Comput Sci Eng, 2013, 15: 16-26

[10] Lu Y. Paving the way for China exascale computing. CCF Trans HPC, 2019, 1: 63-72

[11] Shalf J, Dosanjh S S, Morrison J P. Exascale computing technology challenges. In: Proceedings of the 9th International Conference on High Performance Computing for Computational Science, 2010. 1-25

[12] Vijayaraghavan T, Eckert Y, Loh G H, et al. Design and analysis of an APU for exascale computing. In: Proceedings of IEEE International Symposium on High Performance Computer Architecture (HPCA), 2017. 85-96

[13] Feng J Q, Gu W D, Pan J S. Parallel implementation of BP neural network for traffic prediction on Sunway BlueLight supercomputer. Appl Mech Mater, 2014, 614: 521-525

[14] Tian M, Gu W, Pan J, et al. Performance analysis and optimization of PalaBos on petascale Sunway BlueLight MPP supercomputer. In: Proceedings of International Conference on Parallel Computing in Fluid Dynamics, 2013. 311-320

[15] Chen Y, Li K, Yang W. Performance-aware model for sparse matrix-matrix multiplication on the Sunway TaihuLight supercomputer. IEEE Trans Parallel Distrib Syst, 2019, 30: 923-938

[16] Fang J, Fu H, Zhao W, et al. swDNN: a library for accelerating deep learning applications on Sunway TaihuLight. In: Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2017. 615-624

[17] Fu H, Liao J, Yang J. The Sunway TaihuLight supercomputer: system and applications. Sci China Inf Sci, 2016, 59: 072001

[18] Zhang J, Zhou C, Wang Y, et al. Extreme-scale phase field simulations of coarsening dynamics on the Sunway TaihuLight supercomputer. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2016. 4

[19] Xu Y, Xie X H, Li H L. An autonomous many-core processor architecture for high-performance computing (in Chinese). Sci Sin-Inf, 2015, 45: 523-534

[20] Lin H, Zhu X, Yu B, et al. ShenTu: processing multi-trillion edge graphs on millions of cores in seconds. In: Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis, 2018. 56

[21] Meng D L, Wen M H, Wei J W, et al. Porting and optimizing OpenFOAM on Sunway TaihuLight system. Comput Sci, 2017, 44: 64-70. doi: 10.11896/j.issn.1002-137X.2017.10.012

[22] Fu H, Liu W, Wang L, et al. Redesigning CAM-SE for peta-scale climate modeling performance and ultra-high resolution on Sunway TaihuLight. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2017. 1

[23] Fu H, Yin W, Yang G, et al. 18.9-Pflops nonlinear earthquake simulation on Sunway TaihuLight: enabling depiction of 18-Hz and 8-meter scenarios. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2017. 2

[24] Williams S W, Patterson D A, Oliker L, et al. The roofline model: a pedagogical tool for auto-tuning kernels on multicore architectures. 2008

[25] Oral S, Vazhkudai S S, Wang F, et al. End-to-end I/O portfolio for the Summit supercomputing ecosystem. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019. 1-14

[26] Shi X, Li M, Liu W, et al. SSDUP: a traffic-aware SSD burst buffer for HPC systems. In: Proceedings of the International Conference on Supercomputing, 2017. 1-10

[27] Shi X, Liu W, He L, et al. Optimizing the SSD burst buffer by traffic detection. ACM Trans Archit Code Optim, 2020, 17: 1-26

[28] He W Q, L Y, Fang Y F, Wei D, Qi F B. Design and implementation of parallel C programming language for domestic heterogeneous many-core systems (in Chinese). Ruan Jian Xue Bao/Journal of Software, 2017, 28: 764-785. doi: 10.13328/j.cnki.jos.005197

[29] Schroeder B, Gibson G A. A large-scale study of failures in high-performance computing systems. IEEE Trans Dependable Secure Comput, 2010, 7: 337-350

[30] Cappello F. Resilience: one of the main challenges for exascale computing. Technical Report of the INRIA-Illinois Joint Laboratory, 2011

[31] Kusnezov D. DOE exascale initiative. 2013

[32] Asanovic K, Bodik R, Catanzaro B C, et al. The landscape of parallel computing research: a view from Berkeley. Technical Report UCB/EECS-2006-183, University of California, Berkeley, 2006

[33] Chao Y, Wei X, Fu H, et al. 10M-core scalable fully-implicit solver for nonhydrostatic atmospheric dynamics. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2016. 6

[34] Qiao F, Zhao W, Yin X, et al. A highly effective global surface wave numerical simulation with ultra-high resolution. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2016. 5

[35] Fu H, Liao J, Xue W, et al. Refactoring and optimizing the community atmosphere model (CAM) on the Sunway TaihuLight supercomputer. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2016. 83

[36] Liu J, Qin H, Wang Y, et al. Largest particle simulations downgrade the runaway electron risk for ITER. 2016. ArXiv preprint

[37] IEEE 2nd International Conference on Data Science and Systems (HPCC/SmartCity/DSS), 2016. 443-450

[38] Duan X, Xu K, Chan Y, et al. S-Aligner: ultrascalable read mapping on Sunway TaihuLight. In: Proceedings of IEEE International Conference on Cluster Computing (CLUSTER), 2017

[39] Yao W J, Chen J S, Su Z C, et al. Porting and optimizing of NAMD on Sunway TaihuLight system. Computer Engineering & Science, 2017

  • Figure 1

    (Color online) The mapping from exascale demands to Sunway architecture.

  • Figure 2

    (Color online) The computing capabilities of top 3 supercomputers from 2008 to 2019. (a) Peak performance; (b) total number of cores; (c) peak performance of each core.

  • Figure 3

Heterogeneous architecture of SW many-core processor.

  • Figure 4

    (Color online) The computing super-node of Sunway supercomputer.

  • Figure 5

    (Color online) The interconnect architecture of Sunway supercomputer.

  • Figure 6

(Color online) The energy efficiency of the No. 1 systems on the TOP500 and Green500 lists from 2008 to 2019.

  • Figure 7

    (Color online) Schematic illustration of low-power compilation.

  • Figure 8

    (Color online) The ratio of M/C of the No.1 system from 2003 to 2019.

  • Figure 9

    (Color online) The on-chip hierarchical memory architecture of SW many-core processors. GPR: general purpose register, LDM: local data memory, DMA: direct memory access, CPM: coherence process module, MM: main memory.

  • Figure 10

    (Color online) The pan-tree structure of interconnect.

  • Figure 11

    (Color online) Accelerated computing model with heterogeneous fusion.

  • Figure 12

    (Color online) Schematic of initiative-passive fault-tolerance control mechanism.

  • Figure 13

    (Color online) Hardware architecture of Sunway exascale supercomputer.

  • Table 1  

    Table 1  The efficiency of computing node, super-node and the whole system of Sunway TaihuLight

    System scale              Super-node (256 CPUs)   1 cabinet (1024 CPUs)   4 cabinets (4096 CPUs)   Total system (40960 CPUs)
    Linpack efficiency (%)    82.52                   80.8                    77.9                     74.15
  • Table 2  

    Table 2  Comparison of energy efficiency ratios of several mainstream processors

    Processor                 SW 26010   Intel Xeon Phi Knight   NVIDIA Kepler GK110B   AMD GCN 2nd gen Grenada XT   Intel Xeon E7-8890 v4
    Type                      CPU        CPU                     GPU                    GPU                          CPU
    Core num                  260        72                      2880                   2816                         24
    Frequency (GHz)           1.5        1.05                    0.875                  0.93                         2.4
    Peak FP (TFlops)          3.618      1.01                    1.68                   2.62                         0.92
    Power efficiency (GF/W)   10.559     4.49                    7.15                   9.53                         5.58
    Year                      2014       2012                    2014                   2015                         2016
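    A minimal sketch (not from the paper, assuming the GF/W row equals peak FP throughput divided by chip power) showing how the two numeric rows of Table 2 relate and what chip power each entry implies; the computed wattages are inferred, not stated in the table.

        # Implied chip power from Table 2: peak FP (TFlops) / power efficiency (GF/W).
        peak_tflops = {
            "SW 26010": 3.618,
            "Intel Xeon Phi Knight": 1.01,
            "NVIDIA Kepler GK110B": 1.68,
            "AMD GCN 2nd gen Grenada XT": 2.62,
            "Intel Xeon E7-8890 v4": 0.92,
        }
        gflops_per_watt = {
            "SW 26010": 10.559,
            "Intel Xeon Phi Knight": 4.49,
            "NVIDIA Kepler GK110B": 7.15,
            "AMD GCN 2nd gen Grenada XT": 9.53,
            "Intel Xeon E7-8890 v4": 5.58,
        }
        for name, tflops in peak_tflops.items():
            implied_watts = tflops * 1000 / gflops_per_watt[name]  # GFlops / (GF/W) = W
            print(f"{name}: ~{implied_watts:.0f} W implied chip power")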
  • Table 3  

    Table 3  Comparison of various sleep measures

    Measure             Shallow core hibernation   Array sleep                Full chip sleep
    Granularity         Single core                Computing element array    Full chip
    Control mode        OS independent control     Out-of-band control        Out-of-band control
    Control overhead    ms                         s                          About 1 min
    Power saving (%)    2                          80                         90
  • Table 4  

    Table 4  A brief comparison between the Sunway TaihuLight and other large-scale systems (June 2016)

    System                                     Sunway TaihuLight   Tianhe-2   Titan     Sequoia   K
    Peak performance (PFlops)                  125.436             54.90      27.11     20.13     11.28
    Linpack performance (PFlops)               93.015              33.86      17.59     17.17     10.51
    Performance per watt (MFlops/W)            6051.3              1901.54    2142.77   2176.58   1062.69
    Performance per cubic meter (TFlops/m$^3$) 523.1               174.1      69.9      67.8      10

    Node architecture: Sunway TaihuLight, one 260-core SW CPU with 4 MPEs and 256 CPEs; Tianhe-2, two 12-core Intel CPUs and three 57-core Intel Xeon Phi coprocessors; Titan, one 16-core AMD CPU and one K20x NVIDIA GPU (2688 CUDA cores); Sequoia, one 16-core PowerPC CPU; K, one 8-core SPARC64 CPU.
  • Table 5  

    Table 5  Comparison of aggregated message latency implemented by the hardware and the software (unit: $\mu$s)

    Nodes                                       256                                        512                                        1024
    Implementation                              Software algorithm   Hardware              Software algorithm   Hardware              Software algorithm   Hardware
    8 B full reduction (AND synchronization)    54.55                7.71                  79.64                10.30                 107.97               13.49
    1 kB broadcast                              28.00                9.29                  33.93                9.66                  37.58                11.67
  • Table 6  

    Table 6  Scales and performances of ten types of scientific applications

    • Dense linear algebra (LINPACK): TaihuLight scale 12.288 million; TaihuLight performance 93 PFlops (rank: No. 1); predicted exascale scale about 20 million; predicted exascale performance about 700 PFlops.
    • Sparse linear algebra (HPCG): TaihuLight scale 343.5 billion nonzero elements; TaihuLight performance 480 TFlops (rank: No. 3); predicted exascale scale about 10$^{12}$ nonzero elements; predicted exascale performance over 3 PFlops.
    • Spectral methods (FFT): TaihuLight scale 16384$^3$; TaihuLight performance 97 s/step (maximum scale); predicted exascale scale 32768$^3$; predicted exascale performance about 90 s/step.
    • Multi-body problem (universe evolution): TaihuLight scale 11.2 trillion particles; TaihuLight performance 21.3 PFlops (maximum scale); predicted exascale scale hundreds of trillions of particles; predicted exascale performance about 160 PFlops.
    • Structured grids (stencil computing): TaihuLight scale $5.1\times10^{11}$ grids; TaihuLight performance 25.96 PFlops (2016 Gordon Bell Prize [33]); predicted exascale scale about 10$^{12}$ grids; predicted exascale performance about 200 PFlops.
    • Unstructured grids (throughput computing): TaihuLight scale 10$^{10}$ grids; TaihuLight performance: parallelism of tens of millions of cores; predicted exascale scale about 10$^{11}$ grids; predicted exascale performance: parallelism of tens of millions of cores.
    • MapReduce (MapReduce): TaihuLight scale 10$^7$ tasks; TaihuLight performance 12.5 PFlops (maximum scale [36]); predicted exascale scale about $2\times10^7$ tasks; predicted exascale performance about 100 PFlops.
    • Traversal of graphs (BFS): TaihuLight scale 2$^{40}$ vertices; TaihuLight performance 23755.7 GTEPS [20] (rank: No. 2); predicted exascale scale 2$^{43}$ vertices; predicted exascale performance about 14000 GTEPS.
    • Dynamic programming (sequence comparison): TaihuLight scale 800 GB of gene sequences; TaihuLight performance: fixed-point computing with parallelism of tens of millions of cores; predicted exascale scale: parallelism of tens of millions of cores; predicted exascale performance: fixed-point computing.
    • Graphical models (convolutional neural network): TaihuLight core FP computing efficiency of 94% [16]; predicted exascale core FP computing efficiency of over 90%.
