
SCIENCE CHINA Information Sciences, Volume 64, Issue 6: 160404 (2021). https://doi.org/10.1007/s11432-020-3227-1

Breaking the von Neumann bottleneck: architecture-level processing-in-memory technology

  • Received: Dec 31, 2020
  • Accepted: Mar 23, 2021
  • Published: Apr 27, 2021

Abstract


Acknowledgment

This work was supported by National Key R&D Program of China (Grant No. 2018YFA0701500), Zhejiang Lab (Grant No. 2019KC0AB010), Key Research Program of Frontier Sciences, CAS (Grant No. ZDBS-LY-JSC012), Strategic Priority Research Program of CAS (Grant No. XDB44000000), Youth Innovation Promotion Association, CAS, Beijing Academy of Artificial Intelligence (BAAI), Anhui Natural Science Foundation (Grant No. 2008085QF330), and Research Program of Anhui Normal University (Grant No. 751968).

