
SCIENTIA SINICA Informationis, Volume 46, Issue 9: 1175-1210 (2016) https://doi.org/10.1360/N112016-00147

Emerging High-Performance Computing Systems and Technology

  • Received Jun 26, 2016
  • Accepted Aug 25, 2016
  • Published Sep 18, 2016

Abstract


Funded by

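National Natural Science Foundation of China (61433019)

National Natural Science Foundation of China (61402503)

National Natural Science Foundation of China (61170288)

National Natural Science Foundation of China (61332003)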

