
SCIENCE CHINA Information Sciences, Volume 63, Issue 12: 222103 (2020). https://doi.org/10.1007/s11432-020-3097-4

Jittor: a novel deep learning framework with meta-operators and unified graph execution

More info
  • Received: Aug 25, 2020
  • Accepted: Sep 23, 2020
  • Published: Nov 13, 2020

Abstract


References

[1] Collobert R, Bengio S, Mariethoz J. Torch: a modular machine learning software library. IDIAP Research Report, 2002.

[2] Al-Rfou R, Alain G, Almahairi A, et al. Theano: a Python framework for fast computation of mathematical expressions. arXiv preprint, 2016.

[3] Jia Y, Shelhamer E, Donahue J, et al. Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the ACM International Conference on Multimedia, 2014. 675--678.

[4] Abadi M, Barham P, Chen J, et al. TensorFlow: a system for large-scale machine learning. In: Proceedings of the 12th Symposium on Operating Systems Design and Implementation, 2016. 265--283.

[5] Paszke A, Gross S, Massa F, et al. PyTorch: an imperative style, high-performance deep learning library. In: Proceedings of Advances in Neural Information Processing Systems, 2019. 8024--8035.

[6] Cyphers D S, Bansal A K, Bhiwandiwalla A, et al. Intel nGraph: an intermediate representation, compiler, and executor for deep learning. arXiv preprint, 2018.

[7] Schoenholz S S, Cubuk E D. JAX, M.D.: end-to-end differentiable, hardware accelerated, molecular dynamics in pure Python. arXiv preprint, 2019.

[8] Chen T, Moreau T, Jiang Z, et al. TVM: an automated end-to-end optimizing compiler for deep learning. In: Proceedings of the 13th Symposium on Operating Systems Design and Implementation, 2018. 578--594.

[9] Nickolls J, Buck I, Garland M, et al. Scalable parallel programming with CUDA. In: Proceedings of the IEEE Hot Chips 20 Symposium (HCS), 2008.

[10] Thompson J A, Schlachter K. An introduction to the OpenCL programming model. 2012. https://cims.nyu.edu/~schlacht/OpenCLModel.pdf.

[11] Oliphant T E. Guide to NumPy. North Charleston: CreateSpace Publishing, 2015.

[12] Chen T Q, Li M, Li Y T, et al. MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. arXiv preprint, 2015.

[13] Chetlur S, Woolley C, Vandermersch P, et al. cuDNN: efficient primitives for deep learning. arXiv preprint, 2014.

[14] Lattner C, Adve V S. LLVM: a compilation framework for lifelong program analysis & transformation. In: Proceedings of the International Symposium on Code Generation and Optimization, 2004. 97--104.

[15] Rumelhart D E, Hinton G E, Williams R J. Learning representations by back-propagating errors. Nature, 1986, 323: 533--536.

[16] Gabriel E, Fagg G E, Bosilca G, et al. Open MPI: goals, concept, and design of a next generation MPI implementation. In: Proceedings of the European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting, 2004. 97--104.

[17] Tokui S, Oono K. Chainer: a next-generation open source framework for deep learning. In: Proceedings of the Workshop on Machine Learning Systems (LearningSys), 2015.

[18] Goodfellow I J, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets. In: Proceedings of Advances in Neural Information Processing Systems, 2014.

[19] Sutton R S, Barto A G. Reinforcement Learning: An Introduction. Cambridge: MIT Press, 2018.

[20] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. 770--778.

[21] Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks. In: Proceedings of Advances in Neural Information Processing Systems, 2012. 1106--1114.

[22] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint, 2015.

[23] Zagoruyko S, Komodakis N. Wide residual networks. arXiv preprint, 2016.

[24] Iandola F N, Han S, Moskewicz M W, et al. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint, 2017.

[25] Xie S N, Girshick R, Dollár P, et al. Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 5987--5995.

[26] Gao S H, Cheng M M, Zhao K, et al. Res2Net: a new multi-scale backbone architecture. IEEE Trans Pattern Anal Mach Intell, 2019.

[27] Gulrajani I, Ahmed F, Arjovsky M, et al. Improved training of Wasserstein GANs. In: Proceedings of Advances in Neural Information Processing Systems, 2017. 5767--5777.

[28] Radford A, Metz L, Chintala S. Unsupervised representation learning with deep convolutional generative adversarial networks. In: Proceedings of the 4th International Conference on Learning Representations, 2016.

[29] Mao X D, Li Q, Xie H R, et al. Least squares generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, 2017. 2813--2821.

[30] Zhu J Y, Park T, Isola P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, 2017. 2223--2232.

[31] LeCun Y, Cortes C. The MNIST database of handwritten digits. 2005.

[32] Cordts M, Omran M, Ramos S, et al. The Cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[33] Li T M. Differentiable visual computing. arXiv preprint, 2019.

[34] Kato H, Ushiku Y, Harada T. Neural 3D mesh renderer. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018. 3907--3916.

[35] Hu Y M, Anderson L, Li T M, et al. DiffTaichi: differentiable programming for physical simulation. In: Proceedings of the 8th International Conference on Learning Representations, 2020.

  • Figure 1

    Jittor in use.
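
Figure 1 of the article shows Jittor in use. As a stand-in for that screenshot, the snippet below is a minimal sketch of typical Jittor usage based on its publicly documented Python front end (a Module whose forward pass is named execute, nn.SGD, and optimizer.step(loss)); the network, layer sizes, and learning rate are illustrative assumptions, not taken from the paper.

```python
# Hedged sketch of "Jittor in use": define a model, then train it for a few
# steps. Jittor modules implement `execute` (its analogue of forward), and the
# optimizer's step(loss) call performs backward and the parameter update.
import jittor as jt
from jittor import nn, Module

class TinyNet(Module):                     # hypothetical two-layer MLP
    def __init__(self):
        self.fc1 = nn.Linear(10, 32)
        self.fc2 = nn.Linear(32, 1)

    def execute(self, x):
        return self.fc2(nn.relu(self.fc1(x)))

model = TinyNet()
optimizer = nn.SGD(model.parameters(), 0.1)

x = jt.random([64, 10])                    # random input batch
y = jt.random([64, 1])                     # random regression target

for _ in range(10):
    pred = model(x)
    loss = ((pred - y) ** 2).mean()
    optimizer.step(loss)                   # backward + update in one call
print(loss)
```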

  • Figure 2

    Building models from meta-operators. Operators from the three meta-operator classes, reindex, reindex-reduce, and element-wise, are fused to provide other common deep learning operators, which in turn are used to build the model.
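
To make the composition shown in Figure 2 concrete, the sketch below illustrates, in plain NumPy rather than Jittor's actual meta-operator API, how a 2-D convolution decomposes into a reindex step (gathering input windows into a higher-dimensional view), an element-wise multiplication against the kernel, and a reindex-reduce step (summing over the window and channel axes). The helper name conv2d_via_meta_ops and all shapes are illustrative assumptions.

```python
# Conceptual NumPy illustration (not Jittor code) of the three meta-operator
# classes from Figure 2: reindex, element-wise, and reindex-reduce.
import numpy as np

def conv2d_via_meta_ops(x, w):
    """x: (H, W, C) input, w: (Kh, Kw, C, Co) kernel; valid padding, stride 1."""
    H, W, C = x.shape
    Kh, Kw, _, Co = w.shape
    Ho, Wo = H - Kh + 1, W - Kw + 1

    # "reindex": gather input windows into a (Ho, Wo, Kh, Kw, C) view.
    windows = np.empty((Ho, Wo, Kh, Kw, C), dtype=x.dtype)
    for i in range(Kh):
        for j in range(Kw):
            windows[:, :, i, j, :] = x[i:i + Ho, j:j + Wo, :]

    # "element-wise": broadcast multiply the windows against the kernel.
    prod = windows[..., None] * w[None, None, ...]   # (Ho, Wo, Kh, Kw, C, Co)

    # "reindex-reduce": sum out the window and input-channel axes.
    return prod.sum(axis=(2, 3, 4))                  # (Ho, Wo, Co)

# Sanity check against a direct computation at one output location.
x = np.random.rand(8, 8, 3)
w = np.random.rand(3, 3, 3, 4)
y = conv2d_via_meta_ops(x, w)
assert np.allclose(y[0, 0], np.einsum('ijc,ijco->o', x[:3, :3], w))
```

In Jittor itself, the corresponding meta-operators are fused and compiled together, so expressing an operator this way need not materialize large intermediates such as the windows array above.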

  • Table 1

    Table 1  Inference speed comparison (FPS) between Jittor (JT) and PyTorch (PT)^a)

    Model                    | Batch 1  | Batch 2   | Batch 4   | Batch 8   | Batch 16  | Batch 32  | Batch 64  | Batch 128
                             | PT/JT    | PT/JT     | PT/JT     | PT/JT     | PT/JT     | PT/JT     | PT/JT     | PT/JT
    ResNet50 [20]            | 185/220  | 353/357   | 492/548   | 575/667   | 643/773   | 668/810   | 680/829   | 692/836
    ResNet152 [20]           | 64/86    | 126/132   | 209/231   | 251/287   | 273/321   | 283/335   | 288/346   | 296/354
    Wide ResNet50_2 [23]     | 134/131  | 204/202   | 275/288   | 310/335   | 353/397   | 379/426   | 391/441   | 397/437
    Wide ResNet101_2 [23]    | 73/72    | 111/112   | 162/169   | 180/194   | 203/225   | 220/243   | 228/254   | 230/253
    ResNeXt50_32x4d [25]     | 118/132  | 236/216   | 310/341   | 393/445   | 458/536   | 484/572   | 495/579   | 503/586
    ResNeXt101_32x8d [25]    | 50/53    | 78/81     | 123/136   | 149/162   | 166/185   | 178/198   | 183/204   | 185/205
    Res2Net50 [26]           | 79/149   | 152/240   | 299/355   | 399/451   | 441/507   | 460/547   | 468/553   | 468/559
    AlexNet [21]             | 865/818  | 1562/1500 | 2626/2622 | 3553/3745 | 5070/5546 | 6736/7431 | 6836/7531 | 6856/7551
    VGG11 [22]               | 303/315  | 201/208   | 322/337   | 593/665   | 741/855   | 808/873   | 818/883   | 820/885
    SqueezeNet1_1 [24]       | 404/842  | 769/1461  | 1619/2102 | 2656/2700 | 3035/3382 | 3258/3628 | 3406/3658 | 3407/3631

    a) Each cell lists PyTorch/Jittor FPS; the larger of the two values indicates the faster framework for that model and batch size.

  • Table 2

    Table 2  Training speed comparison (iterations per second) of four GAN models

    GAN model              | WGAN-GP | DCGAN | LSGAN | CycleGAN
    Dataset                | MNIST   | MNIST | MNIST | Cityscapes
    PyTorch (it/s)         | 52.35   | 37.73 | 38.61 | 10.08
    Jittor (it/s)          | 149.25  | 78.74 | 78.74 | 13.96
    Jittor relative speed  | 2.9×    | 2.1×  | 2.0×  | 1.4×

  • Table 3

    Table 3  Ablation study of the asynchronous interface, cross-iteration fusion, unified memory, and lazy execution

    Configuration                   | FPS   | Speedup ratio
    PyTorch                         | 253.1 | 0.965
    Without asynchronous interface  | 254.9 | 0.972
    Without cross-iteration fusion  | 260.4 | 0.993
    Without unified memory          | 264.7 | 1.009
    Without lazy execution          | 73.2  | 0.279
    All features on                 | 262.2 | 1.000
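
Among the features ablated in Table 3, lazy execution has by far the largest effect (73.2 FPS without it versus 262.2 FPS with all features on). As a conceptual illustration only, and not Jittor's implementation, the sketch below shows the basic idea behind lazy execution: operations record graph nodes instead of computing immediately, and evaluation is deferred until a value is actually fetched, which gives the runtime a whole subgraph to fuse and optimize before running it. The class and method names are hypothetical.

```python
# Conceptual sketch of lazy execution (illustrative only; not Jittor's code).
# Arithmetic on LazyVar builds graph nodes; nothing runs until fetch() is
# called, at which point the pending subgraph could be fused and executed.
import numpy as np

class LazyVar:
    def __init__(self, op, inputs=(), data=None):
        self.op, self.inputs, self.data = op, inputs, data   # data is set for leaves

    def __add__(self, other):
        return LazyVar('add', (self, other))

    def __mul__(self, other):
        return LazyVar('mul', (self, other))

    def fetch(self):
        """Synchronization point: evaluate the recorded graph on demand."""
        if self.data is None:
            a, b = (v.fetch() for v in self.inputs)
            self.data = a + b if self.op == 'add' else a * b
        return self.data

x = LazyVar('leaf', data=np.ones(4))
y = (x + x) * x        # only graph nodes are created here
print(y.fetch())       # computation happens now: [2. 2. 2. 2.]
```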