SCIENCE CHINA Information Sciences, Volume 63, Issue 8: 182102 (2020) https://doi.org/10.1007/s11432-019-2705-2

Collaborative deep learning across multiple data centers

More info
  • Received: Mar 21, 2019
  • Accepted: Oct 28, 2019
  • Published: Jul 14, 2020



[1] Tian L, Jayaraman B, Gu Q, et al. Aggregating private sparse learning models using multi-party computation. In: Proceedings of NIPS Workshop on Private Multi-Party Machine Learning, Barcelona, 2016.

[2] Amir-Khalili A, Kianzad S, Abugharbieh R, et al. Scalable and fault tolerant platform for distributed learning on private medical data. In: Proceedings of International Workshop on Machine Learning in Medical Imaging. Springer, 2017. 176--184.

[3] Yang Q, Liu Y, Chen T, et al. Federated machine learning: concept and applications. ACM Trans Intell Syst Tech, 2019, 10: 12.

[4] Cano I, Weimer M, Mahajan D, et al. Towards geo-distributed machine learning. 2016. arXiv

[5] Hsieh K, Harlap A, Vijaykumar N, et al. Gaia: geo-distributed machine learning approaching LAN speeds. In: Proceedings of NSDI, 2017. 629--647.

[6] McMahan H B, Moore E, Ramage D, et al. Communication-efficient learning of deep networks from decentralized data. 2016. arXiv

[7] Izmailov P, Podoprikhin D, Garipov T, et al. Averaging weights leads to wider optima and better generalization. 2018. arXiv

[8] Povey D, Zhang X, Khudanpur S. Parallel training of DNNs with natural gradient and parameter averaging. 2014. arXiv

[9] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. 2014. arXiv

[10] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. 770--778.

[11] Huang G, Liu Z, van der Maaten L, et al. Densely connected convolutional networks. In: Proceedings of CVPR, 2017.

[12] Sabour S, Frosst N, Hinton G E. Dynamic routing between capsules. In: Advances in Neural Information Processing Systems, 2017. 3856--3866.

[13] Tong Y, Chen Y, Zhou Z, et al. The simpler the better: a unified approach to predicting original taxi demands based on large-scale online platforms. In: Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017. 1653--1662.

[14] Guo L, Guo C, Li L. Two-stage local constrained sparse coding for fine-grained visual categorization. Sci China Inf Sci, 2018, 61: 018104.

[15] Chen C, Peng X, Sun J. Generative API usage code recommendation with parameter concretization. Sci China Inf Sci, 2019, 62: 192103.

[16] Zhang H, Zheng Z, Xu S, et al. Poseidon: an efficient communication architecture for distributed deep learning on GPU clusters. 2017. arXiv

[17] Shokri R, Shmatikov V. Privacy-preserving deep learning. In: Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security, 2015. 1310--1321.

[18] Zinkevich M, Weimer M, Li L, et al. Parallelized stochastic gradient descent. In: Advances in Neural Information Processing Systems, 2010. 2595--2603.

[19] Alistarh D, Grubic D, Li J, et al. QSGD: communication-efficient SGD via gradient quantization and encoding. In: Advances in Neural Information Processing Systems, 2017. 1709--1720.

[20] Lin Y, Han S, Mao H, et al. Deep gradient compression: reducing the communication bandwidth for distributed training. 2017. arXiv

[21] Anil R, Pereyra G, Passos A, et al. Large scale distributed neural network training through online distillation. 2018. arXiv

[22] Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network. 2015. arXiv

[23] Su H, Chen H. Experiments on parallel training of deep neural network using model averaging. 2015. arXiv

[24] Sun S, Chen W, Bian J, et al. Ensemble-compression: a new method for parallel training of deep neural networks. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 2017. 187--202.

[25] Goodfellow I J, Vinyals O, Saxe A M. Qualitatively characterizing neural network optimization problems. 2014. arXiv

[26] Srivastava N, Hinton G, Krizhevsky A, et al. Dropout: a simple way to prevent neural networks from overfitting. J Machine Learning Res, 2014, 15: 1929--1958.

[27] Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. 2015. arXiv

[28] Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis, 2015, 115: 211--252.

[29] Sainath T N, Parada C. Convolutional neural networks for small-footprint keyword spotting. In: Proceedings of the Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[30] Gemmeke J F, Ellis D P, Freedman D, et al. Audio Set: an ontology and human-labeled dataset for audio events. In: Proceedings of 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017. 776--780.

[31] Greff K, Srivastava R K, Koutnik J, et al. LSTM: a search space odyssey. IEEE Trans Neural Netw Learning Syst, 2017, 28: 2222--2232.

[32] Hinton G, Frosst N, Sabour S. Matrix capsules with EM routing. In: Proceedings of ICLR, 2018.

[33] Bojanowski P, Grave E, Joulin A. Enriching word vectors with subword information. Trans Association Comput Linguistics, 2017, 5: 135--146.

[34] Yu C, Barsim K S, Kong Q, et al. Multi-level attention model for weakly supervised audio classification. 2018. arXiv

  • Figure 1

    (Color online) Workflow of co-learning. Assume that the participants are different data centers. Each participant holds an amount of private data and uses this disjoint data to train a local classifier. The local model parameters are averaged by the global server to form the new shared model, which in turn serves as the starting point for the next round of local training. Besides the new shared model, the global server also updates the number of local training epochs and the learning rate.
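    The averaging step described above can be sketched in a few lines. The following is a minimal, hypothetical illustration (not the paper's implementation): `local_train` stands in for each data center's private training and is a no-op placeholder here, and model parameters are represented as lists of NumPy arrays.

    ```python
    import numpy as np

    def local_train(params, data, epochs, lr):
        """Placeholder for one data center's local training on its
        private data; a real implementation would run SGD here."""
        return [p.copy() for p in params]

    def colearning_round(global_params, participant_data, epochs, lr):
        # Each participant starts from the shared model and trains
        # on its own disjoint, private data.
        local_models = [local_train(global_params, d, epochs, lr)
                        for d in participant_data]
        # The global server averages the local parameters layer by
        # layer to form the new shared model for the next round.
        return [np.mean([m[i] for m in local_models], axis=0)
                for i in range(len(global_params))]
    ```

    In a full round the server would also adjust `epochs` and `lr` before broadcasting the averaged parameters back to the participants, as the caption notes.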

  • Table 1  

    Table 1. Stats for using CLR+ILE on different models in a communication round

    Model                 Communication interval (min / $T_0$)   Communication volume (MB)
    DenseNet-40           4.5 / 5                                13
    ResNet-152            30 / 5                                 223
    Inception-V4          60 / 20                                168
    Inception-ResNet-V2   27.5 / 5                               218
  • Table 2  

    Table 2. CIFAR-10 accuracy comparison between ensemble-learning, vanilla-learning and co-learning

                          Accuracy (%)
    Model                 Vanilla   Ensemble   Co-learning
    VGG-19                89.44     80.39      89.64
    ResNet-152            92.64     85.40      93.51
    Inception-V4          91.34     83.83      92.07
    Inception-ResNet-V2   92.86     84.70      92.83
    DenseNet-40           91.35     81.24      91.43
  • Table 3  

    Table 3. Test accuracy on ImageNet-2014 using different models

                                    Accuracy (%)
    Model            Method         Top-1    Top-5
    VGG-19           Vanilla        70.41    88.12
                     Co-learning    70.62    88.70
    Inception-V4     Vanilla        79.16    93.82
                     Co-learning    79.35    94.28
    ResNet-V2-101    Vanilla        75.66    92.28
                     Co-learning    75.85    92.39
  • Table 4  

    Table 4. Multi-class AUC on the Toxic Comment Classification Challenge dataset

               Multi-class AUC (%)
    Model      Vanilla   Co-learning
    LSTM       98.52     98.79
    Capsule    98.32     98.75
  • Table 5  

    Table 5. TensorFlow speech commands recognition

    Method        Validation accuracy (%)   Test accuracy (%)
    Vanilla       93.1                      93.3
    Co-learning   93.3                      93.2
  • Table 6  

    Table 6. Audio set classification task using a single/multi data center(s) a)

             Vanilla / co-learning
    Model    MAP b)             AUC                d-prime
    AP       *0.300 / 0.299     *0.964 / 0.962     *2.536 / 2.506
    MP       0.292 / 0.292      *0.960 / 0.959     *2.471 / 2.456
    SA       0.337 / 0.337      *0.968 / 0.966     *2.612 / 2.574
    MA       *0.357 / 0.352     0.968 / 0.968      *2.621 / 2.618

    a) AP represents the result of CRNN with average pooling, MP CRNN with max pooling, SA CRNN with single attention, and MA CRNN with multi-attention. b) MAP: mean average precision. An asterisk marks the better value of each vanilla/co-learning pair (bold in the original).