
SCIENCE CHINA Information Sciences, Volume 64, Issue 6: 160409 (2021) https://doi.org/10.1007/s11432-020-3248-y

Towards efficient allocation of graph convolutional networks on hybrid computation-in-memory architecture

  • Received: Dec 31, 2020
  • Accepted: Apr 15, 2021
  • Published: May 10, 2021

Abstract


Acknowledgment

This work was supported in part by the National Natural Science Foundation of China (Grant No. 61972259) and the Guangdong Basic and Applied Basic Research Foundation (Grant Nos. 2019B151502055, 2017B030314073, 2018B030325002).


References

[1] Kipf T N, Welling M. Semi-supervised classification with graph convolutional networks. In: Proceedings of the 5th International Conference on Learning Representations (ICLR), 2017. 1--14.

[2] Xie P, Sun G, Wang F, et al. V-PIM: an analytical overhead model for processing-in-memory architectures. In: Proceedings of IEEE 7th Non-Volatile Memory Systems and Applications Symposium (NVMSA), 2018. 107--108.

[3] Roy A, Mihailovic I, Zwaenepoel W. X-Stream: edge-centric graph processing using streaming partitions. In: Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP), 2013. 472--488.

[4] Yuan P, Zhang W, Xie C, et al. Fast iterative graph computation: a path centric approach. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC), 2014. 401--412.

[5] Liu M, Gao H, Ji S. Towards deeper graph neural networks. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2020. 338--348.

[6] Xu B, Shen H, Cao Q, et al. Graph wavelet neural network. In: Proceedings of the 7th International Conference on Learning Representations (ICLR), 2019. 1--13.

[7] Wu F, Zhang T Y, de Souza J A H, et al. Simplifying graph convolutional networks. In: Proceedings of the 36th International Conference on Machine Learning (ICML), 2019. 6861--6871.

[8] Kim D, Kung J, Chai S, et al. Neurocube: a programmable digital neuromorphic architecture with high-density 3D memory. In: Proceedings of the ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016. 380--392.

[9] Arai J, Shiokawa H, Yamamuro T, et al. Rabbit order: just-in-time parallel reordering for fast graph analysis. In: Proceedings of IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2016. 22--31.

[10] Paszke A, Gross S, Massa F, et al. PyTorch: an imperative style, high-performance deep learning library. In: Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2019. 8026--8037.

[11] Pawlowski J T. Hybrid memory cube (HMC). In: Proceedings of IEEE Hot Chips 23 Symposium (HCS), 2011. 1--24.

[12] Fey M, Lenssen J E. Fast graph representation learning with PyTorch Geometric. In: Proceedings of ICLR Workshop on Representation Learning on Graphs and Manifolds, 2019. 1--9.

[13] Topcuoglu H, Hariri S, Wu M Y. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Trans Parallel Distrib Syst, 2002, 13: 260--274.

[14] Xu D, Liao Y, Wang Y, et al. Selective off-loading to memory: task partitioning and mapping for PIM-enabled heterogeneous systems. In: Proceedings of the Computing Frontiers Conference (CF), 2017. 255--258.

[15] Chang F, Dean J, Ghemawat S. Bigtable: a distributed storage system for structured data. ACM Trans Comput Syst, 2008, 26: 1--26.

[16] Zhang B, Zeng H, Prasanna V. Accelerating large scale GCN inference on FPGA. In: Proceedings of the IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), 2020. 241.

[17] Zhang B, Zeng H, Prasanna V. Hardware acceleration of large scale GCN inference. In: Proceedings of the IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2020. 61--68.

[18] Wang H, Wang K, Yang J, et al. GCN-RL circuit designer: transferable transistor sizing with graph neural networks and reinforcement learning. In: Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC), 2020. 1--6.

[19] Guo X, Xiang S, Zhang Y. Enhanced memory capacity of a neuromorphic reservoir computing system based on a VCSEL with double optical feedbacks. Sci China Inf Sci, 2020, 63: 160407.

[20] Cheng W, Cai R, Zeng L. IMCI: an efficient fingerprint retrieval approach based on 3D stacked memory. Sci China Inf Sci, 2020, 63: 179101.

[21] Xi K, Bi J, Majumdar S. Total ionizing dose effects on graphene-based charge-trapping memory. Sci China Inf Sci, 2019, 62: 222401.

[22] Zha Y, Nowak E, Li J. Liquid silicon: a nonvolatile fully programmable processing-in-memory processor with monolithically integrated ReRAM. IEEE J Solid-State Circuits, 2020, 55: 908--919.

[23] Li Z, Yan B, Li H. ReSiPE: ReRAM-based single-spiking processing-in-memory engine. In: Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC), 2020. 1--6.

[24] Zheng Q, Wang Z, Feng Z, et al. Lattice: an ADC/DAC-less ReRAM-based processing-in-memory architecture for accelerating deep convolution neural networks. In: Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC), 2020. 1--6.

[25] Gupta S, Imani M, Sim J, et al. SCRIMP: a general stochastic computing architecture using ReRAM in-memory processing. In: Proceedings of Design, Automation and Test in Europe Conference and Exhibition (DATE), 2020. 1598--1601.

[26] Yang X, Yan B, Li H, et al. ReTransformer: ReRAM-based processing-in-memory architecture for transformer acceleration. In: Proceedings of IEEE/ACM International Conference on Computer Aided Design (ICCAD), 2020. 1--9.

[27] Wang F, Shen Z, Han L, et al. ReRAM-based processing-in-memory architecture for blockchain platforms. In: Proceedings of the 24th Asia and South Pacific Design Automation Conference (ASP-DAC), 2019. 615--620.

[28] Han L, Shen Z, Liu D. A novel ReRAM-based processing-in-memory architecture for graph traversal. ACM Trans Storage, 2018, 14: 1--26.

[29] Chu C, Wang Y, Zhao Y, et al. PIM-Prune: fine-grain DCNN pruning for crossbar-based process-in-memory architecture. In: Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC), 2020. 1--6.

[30] Angizi S, He Z, Rakin A S, et al. CMP-PIM: an energy-efficient comparator-based processing-in-memory neural network accelerator. In: Proceedings of the 55th ACM/ESDA/IEEE Design Automation Conference (DAC), 2018. 1--6.

[31] Yang Y, Chen X, Han Y. Dadu-CD: fast and efficient processing-in-memory accelerator for collision detection. In: Proceedings of the 57th ACM/IEEE Design Automation Conference (DAC), 2020. 1--6.

[32] Liu Z, Ren E, Qiao F. NS-CIM: a current-mode computation-in-memory architecture enabling near-sensor processing for intelligent IoT vision nodes. IEEE Trans Circuits Syst I, 2020, 67: 2909--2922.

[33] Imani M, Pampana S, Gupta S, et al. DUAL: acceleration of clustering algorithms using digital-based processing in-memory. In: Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020. 356--371.

[34] Wan Z, Dai G, Soh Y J, et al. An order sampling processing-in-memory architecture for approximate graph pattern mining. In: Proceedings of the Great Lakes Symposium on VLSI (GLSVLSI), 2020. 357--362.

[35] Xu S, Chen X, Qian X, et al. TUPIM: a transparent and universal processing-in-memory architecture for unmodified binaries. In: Proceedings of the Great Lakes Symposium on VLSI (GLSVLSI), 2020. 199--204.

[36] Kwon Y, Lee Y, Rhu M. TensorDIMM: a practical near-memory processing architecture for embeddings and tensor operations in deep learning. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2019. 740--753.

[37] Gupta S, Imani M, Kaur H. NNPIM: a processing in-memory architecture for neural network acceleration. IEEE Trans Comput, 2019, 68: 1325--1337.

[38] Imani M, Gupta S, Sharma S. NVQuery: efficient query processing in nonvolatile memory. IEEE Trans Comput-Aided Des Integr Circuits Syst, 2019, 38: 628--639.

[39] Chen C H, Hsia T Y, Huang Y. Data prefetching and eviction mechanisms of in-memory storage systems based on scheduling for big data processing. IEEE Trans Parallel Distrib Syst, 2019, 30: 1738--1752.

[40] Dai G, Huang T, Chi Y. GraphH: a processing-in-memory architecture for large-scale graph processing. IEEE Trans Comput-Aided Des Integr Circuits Syst, 2019, 38: 640--653.

[41] Wang Y, Zhang M, Yang J. Exploiting parallelism for convolutional connections in processing-in-memory architecture. In: Proceedings of the 54th ACM/EDAC/IEEE Design Automation Conference (DAC), 2017. 1--6.

[42] Wang Y, Chen W, Yang J. Towards memory-efficient allocation of CNNs on processing-in-memory architecture. IEEE Trans Parallel Distrib Syst, 2018, 29: 1428--1441.

[43] Wang Y, Chen W, Yang J. Exploiting parallelism for CNN applications on 3D stacked processing-in-memory architecture. IEEE Trans Parallel Distrib Syst, 2019, 30: 589--600.

[44] Sun H, Zhu Z, Cai Y, et al. An energy-efficient quantized and regularized training framework for processing-in-memory accelerators. In: Proceedings of the 25th Asia and South Pacific Design Automation Conference (ASP-DAC), 2020. 325--330.

[45] Zhang C, Meng T, Sun G. PM3: power modeling and power management for processing-in-memory. In: Proceedings of IEEE International Symposium on High Performance Computer Architecture (HPCA), 2018. 558--570.

[46] Geng T, Li A, Shi R, et al. AWB-GCN: a graph convolutional network accelerator with runtime workload rebalancing. In: Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), 2020. 922--936.

[47] Yan M, Deng L, Hu X, et al. HyGCN: a GCN accelerator with hybrid architecture. In: Proceedings of IEEE International Symposium on High Performance Computer Architecture (HPCA), 2020. 15--29.

[48] Liang S, Wang Y, Liu C. EnGN: a high-throughput and energy-efficient accelerator for large graph neural networks. IEEE Trans Comput, 2021.

[49] Hamilton W L, Ying R, Leskovec J. Inductive representation learning on large graphs. In: Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS), 2017. 1025--1035.

  • Figure 1

    (Color online) The typical processing procedure for GCNs.

  • Figure 2

    (Color online) The system architecture of Graph-CIM, which consists of both general-purpose CPU and CIM architecture.

  • Figure 3

    (Color online) A typical graph convolutional network DAGNN is modeled as a DAG $D$.
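
    To make the DAG modeling concrete, the toy sketch below builds such a task graph for a two-layer GCN with NetworkX; the node names and the aggregation/combination labeling are illustrative assumptions, not the paper's notation.

    import networkx as nx

    # Toy task DAG for a 2-layer GCN in the spirit of Figure 3: each layer
    # contributes an aggregation (A) and a combination (C) task, and edges
    # encode data dependences between consecutive tasks.
    d = nx.DiGraph()
    d.add_nodes_from([("A1", {"op": "AGGREGATION"}), ("C1", {"op": "COMBINATION"}),
                      ("A2", {"op": "AGGREGATION"}), ("C2", {"op": "COMBINATION"})])
    d.add_edges_from([("A1", "C1"), ("C1", "A2"), ("A2", "C2")])
    assert nx.is_directed_acyclic_graph(d)
    print(list(nx.topological_sort(d)))  # one valid execution order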

  • Figure 4

    (Color online) A motivational example for running dataset Citeseer on CPU, 3D-stacked CIM, and the hybrid architecture.

  • Figure 5

    (Color online) Community detection for the input graph to abstract the hierarchical community structures. (a) and (b) The procedure of modularity-based community detection; (c) the hierarchical community structures.
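
    As an illustration of this step, here is a minimal sketch of modularity-based hierarchical community detection using NetworkX and the python-louvain package; the paper does not specify an implementation, so the library choice below is an assumption.

    import networkx as nx
    import community as community_louvain  # pip install python-louvain

    def hierarchical_communities(g: nx.Graph) -> list:
        """Return one partition (a list of vertex sets) per hierarchy level."""
        dendrogram = community_louvain.generate_dendrogram(g)
        levels = []
        for level in range(len(dendrogram)):
            # Maps each vertex to its community id at this level of the hierarchy.
            partition = community_louvain.partition_at_level(dendrogram, level)
            communities = {}
            for vertex, cid in partition.items():
                communities.setdefault(cid, set()).add(vertex)
            levels.append(list(communities.values()))
        return levels

    # Example: higher levels have fewer, larger communities.
    g = nx.karate_club_graph()
    for i, comms in enumerate(hierarchical_communities(g)):
        print(f"level {i}: {len(comms)} communities")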

  • Figure 6

    (Color online) Task mapping and task migration of a GCN model. (a) The DAG used to represent a GCN; (b)–(e) the procedures of task mapping and task migration.

  • Figure 7

(Color online) The processing latency of GCNs on four different types of architectures. (a) GCN; (b) DAGNN; (c) GWNN; (d) SGC.

  • Figure 8

    (Color online) The processing latency of HEFT [13], PEFT [14], and Graph-CIM on the hybrid architecture. (a) GCN; (b) DAGNN; (c) GWNN; (d) SGC.

  • Figure 9

    (Color online) The utilization ratios of HEFT [13], PEFT [14], and Graph-CIM on the hybrid architecture. (a) GCN; (b) DAGNN; (c) GWNN; (d) SGC.

  • Table 1  

Table 1  The processing latency of the aggregation phase with four GCN networks on both the CPU and the 3D-stacked CIM architecture (latency normalized to the CPU)

    Network   CPU   CIM    Speed increase (CIM vs. CPU)
    DAGNN     1.0   0.05   20.35$\times$
    GCN       1.0   0.11   8.93$\times$
    GWNN      1.0   0.06   17.42$\times$
    SGC       1.0   0.09   11.10$\times$
    Average                14.45$\times$
  •   

Algorithm 1 Generate aggregation subtasks $(C, {\rm nt}_k, {\rm mv}_k)$

    Require: A list of vertex sets after community detection $C$, the number of subtasks ${\rm nt}_k$, and the maximum number of vertices that the cache can maintain ${\rm mv}_k$.

    Output: A set of aggregation subtasks St.

    Initialize St with ${\rm nt}_k$ empty subtasks;
    $C_{\rm big} \leftarrow \{u \mid u \in C~{\rm and}~u.{\rm size} > {\rm mv}_k\}$;
    $C_{\rm small} \leftarrow \{u \mid u \in C~{\rm and}~u.{\rm size} \leq {\rm mv}_k\}$;
    while $C_{\rm big} \neq \emptyset$ do
        $u \leftarrow$ DEQUEUE($C_{\rm big}$);
        $v \leftarrow$ DEQUEUE($u.{\rm children}$);
        $u.{\rm size} \leftarrow u.{\rm size} - v.{\rm size}$;
        $C_{\rm tmp} \leftarrow \{u, v\}$;
        for $w \in C_{\rm tmp}$ do
            if $w.{\rm size} \leq {\rm mv}_k$ then
                ENQUEUE($C_{\rm small}$, $w$);
            else
                ENQUEUE($C_{\rm big}$, $w$);
            end if
        end for
    end while
    Sort $C_{\rm small}$ by the size of each vertex set;
    for $u \in C_{\rm small}$ do
        ${\rm subtask} \leftarrow$ the subtask in St with the minimum number of vertices;
        ENQUEUE(${\rm subtask}$, $u$);
    end for
    return St.
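
    For concreteness, the sketch below renders Algorithm 1 in Python. The `Community` class and its `size`/`children` fields mirror the pseudocode; everything else (largest-first packing order, list-based subtasks) is an assumption, not the authors' implementation.

    from collections import deque
    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Community:
        size: int                                   # number of vertices in this set
        children: deque = field(default_factory=deque)  # sub-communities in the hierarchy

    def generate_subtasks(c: List[Community], nt_k: int, mv_k: int) -> List[List[Community]]:
        subtasks = [[] for _ in range(nt_k)]        # St: nt_k empty subtasks
        loads = [0] * nt_k                          # vertices assigned to each subtask
        c_big = deque(u for u in c if u.size > mv_k)
        c_small = deque(u for u in c if u.size <= mv_k)
        # Split oversized communities along the hierarchy until every vertex set
        # fits in the cache; assumes every oversized set still has children.
        while c_big:
            u = c_big.popleft()
            v = u.children.popleft()
            u.size -= v.size
            for w in (u, v):
                (c_small if w.size <= mv_k else c_big).append(w)
        # Greedy balancing: place each set into the least-loaded subtask
        # (largest-first here; the pseudocode only says "sort by size").
        for u in sorted(c_small, key=lambda x: x.size, reverse=True):
            k = min(range(nt_k), key=loads.__getitem__)
            subtasks[k].append(u)
            loads[k] += u.size
        return subtasks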

  •   

Algorithm 2 Schedule $(T, {\rm usage}[\,])$

    Require: A set of tasks $T=\{T_1, T_2, \ldots, T_m\}$ and the usage status of the CPU cores and CIM PEs ${\rm usage}[\,]$.

    Output: A task schedule on the hybrid architecture.

    Obtain the groups GP by performing BFS over the task set $T$;
    for each group ${\rm gp} \in {\rm GP}$ do
        Initialize ${\rm Array}_{\rm cpu}$ and ${\rm Array}_{\rm cim}$;
        for each task $T_i \in {\rm gp}$ do
            if ${\rm op}_i$ is COMBINATION then
                Add $T_i$ to ${\rm Array}_{\rm cpu}$;
            else
                Add $T_i$ to ${\rm Array}_{\rm cim}$;
            end if
        end for
        Obtain ${\rm EST}_i$ from (2) for each $T_i \in {\rm gp}$;
        Sort the tasks in ${\rm Array}_{\rm cpu}$ and ${\rm Array}_{\rm cim}$, respectively, in ascending order of EST;
        Allocate the tasks in ${\rm Array}_{\rm cpu}$ successively to the idlest CPU core;
        Allocate the tasks in ${\rm Array}_{\rm cim}$ successively to the idlest CIM PE;
        ${\rm end\_move} \leftarrow {\rm False}$;
        Update $A_{\rm busy}$ and $A_{\rm idle}$;
        while ${\rm end\_move}$ is ${\rm False}$ do
            Obtain the busiest core $P_{\rm busy}$ in $A_{\rm busy}$;
            Obtain the idlest core $P_{\rm idle}$ in $A_{\rm idle}$;
            Obtain the final task $T_{\rm tail}$ of core $P_{\rm busy}$;
            Obtain the execution time $t_{\rm tail}$ of task $T_{\rm tail}$ on $A_{\rm idle}$;
            if ${\rm usage}[P_{\rm idle}] + t_{\rm tail} < {\rm usage}[P_{\rm busy}]$ then
                Migrate $T_{\rm tail}$ from core $P_{\rm busy}$ to $P_{\rm idle}$;
            else
                ${\rm end\_move} \leftarrow {\rm True}$;
            end if
        end while
    end for
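
    The following is a condensed Python sketch of Algorithm 2's per-group mapping and migration loop. The task fields (`op`, `est`, `cpu_time`, `cim_time`) stand in for Eq. (2) and the cost model, which this excerpt does not reproduce, and migrating from the busiest CPU core to the idlest CIM PE is one plausible reading of the pseudocode; the BFS grouping itself (GP) is assumed to be computed beforehand, e.g., as level sets of the task DAG.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class Task:
        op: str          # "COMBINATION" or "AGGREGATION"
        est: float       # earliest start time, per Eq. (2) of the paper
        cpu_time: float  # execution time on a CPU core
        cim_time: float  # execution time on a CIM PE

    def schedule_group(group: List[Task],
                       cpu_usage: List[float],
                       cim_usage: List[float]) -> Tuple[list, list]:
        """Map one BFS group of the task DAG onto CPU cores and CIM PEs."""
        array_cpu = sorted((t for t in group if t.op == "COMBINATION"), key=lambda t: t.est)
        array_cim = sorted((t for t in group if t.op != "COMBINATION"), key=lambda t: t.est)

        queues_cpu = [[] for _ in cpu_usage]
        queues_cim = [[] for _ in cim_usage]
        for t in array_cpu:  # always extend the idlest CPU core
            k = min(range(len(cpu_usage)), key=cpu_usage.__getitem__)
            queues_cpu[k].append(t)
            cpu_usage[k] += t.cpu_time
        for t in array_cim:  # always extend the idlest CIM PE
            k = min(range(len(cim_usage)), key=cim_usage.__getitem__)
            queues_cim[k].append(t)
            cim_usage[k] += t.cim_time

        # Task migration: move the tail task of the busiest CPU core to the
        # idlest CIM PE while doing so still shortens the schedule.
        while True:
            b = max(range(len(cpu_usage)), key=cpu_usage.__getitem__)
            i = min(range(len(cim_usage)), key=cim_usage.__getitem__)
            if not queues_cpu[b]:
                break
            tail = queues_cpu[b][-1]
            if cim_usage[i] + tail.cim_time < cpu_usage[b]:
                queues_cpu[b].pop()
                cpu_usage[b] -= tail.cpu_time
                queues_cim[i].append(tail)
                cim_usage[i] += tail.cim_time
            else:
                break
        return queues_cpu, queues_cim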

  • Table 2  

Table 2  The processing latency of the combination phase with four GCN networks on both the CPU and the 3D-stacked CIM architecture (latency normalized to the CPU)

    Network   CPU   CIM    Speed increase (CPU vs. CIM)
    DAGNN     1.0   2.55   2.55$\times$
    GCN       1.0   2.45   2.45$\times$
    GWNN      1.0   2.52   2.52$\times$
    SGC       1.0   2.60   2.60$\times$
    Average                2.53$\times$
  • Table 3  

Table 3  GCN models and standard graph datasets

    Dataset       Vertices   Edges     Features   Classes
    Blogcatalog   5196       171743    8189       7
    Citeseer      3312       4715      3703       6
    Coauthor      18333      81894     6805       15
    DBLP          17716      52867     1639       4
    Pubmed        19717      44324     500        3
