SCIENCE CHINA Information Sciences, Volume 63, Issue 1: 112102 (2020) https://doi.org/10.1007/S11432-018-9944-X

Snapshot boosting: a fast ensemble framework for deep neural networks

  • Received: Dec 18, 2018
  • Accepted: Apr 15, 2019
  • Published: Dec 24, 2019



This work was supported by the National Natural Science Foundation of China (Grant Nos. 61832001, 61702015, 61702016, 61572039), the National Key Research and Development Program of China (Grant No. 2018YFB1004403), and the PKU-Tencent Joint Research Lab.



  • Figure 1

    (Color online) The learning rate schedule of a 40-layer DenseNet on CIFAR-100 using snapshot ensemble and snapshot boosting. Note that snapshot ensemble uses SGDR, whose learning rate restarts every 50 epochs.
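The restart behaviour mentioned in the Figure 1 caption can be illustrated with a minimal sketch of a cosine schedule with warm restarts. Only the 50-epoch restart period comes from the caption. The function name `sgdr_learning_rate`, the bounds `max_lr=0.1` and `min_lr=0.0`, and the per-epoch (rather than per-iteration) update are assumptions made for illustration, not values taken from the paper.

```python
import math


def sgdr_learning_rate(epoch, max_lr=0.1, min_lr=0.0, cycle_length=50):
    """Cosine-annealed learning rate with warm restarts of fixed length.

    max_lr and min_lr are illustrative assumptions; cycle_length=50
    mirrors the "restarts every 50 epochs" note in the caption.
    """
    # Position inside the current cycle; it drops back to 0 at each restart.
    t_cur = epoch % cycle_length
    # Cosine factor goes from 1 (start of cycle) towards 0 (end of cycle).
    cosine = 0.5 * (1.0 + math.cos(math.pi * t_cur / cycle_length))
    return min_lr + (max_lr - min_lr) * cosine


# Three cycles of 50 epochs each.
schedule = [sgdr_learning_rate(epoch) for epoch in range(150)]
```

Within each cycle the rate decays smoothly, then jumps back to `max_lr` at epochs 50 and 100, which is the sawtooth-like pattern a restart schedule produces.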

  • Figure 2

    (Color online) The procedure of snapshot boosting.

  • Figure 3

    (Color online) The test accuracy of a 40-layer DenseNet on CIFAR-100 using snapshot ensemble and snapshot boosting. Note that snapshot ensemble uses SGDR, whose learning rate restarts every 50 epochs.

  • Figure 4

    (Color online) Test accuracy of ensembles on CIFAR-100 using (a) ResNet-32 and (b) DenseNet-40. For the single model, the test accuracy is measured directly on the test set at the last epoch. For the ensemble methods, the test accuracy is the ensemble accuracy computed from the base models that have already been trained.

  • Figure 5

    (Color online) Pairwise correlation of softmax outputs between base models for DenseNet-40 on CIFAR-100. (a) Snapshot ensemble; (b) snapshot boosting.
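A pairwise correlation matrix like the one plotted in Figure 5 could be computed along the following lines. This is a sketch under assumptions: the caption does not say which correlation measure is used, so Pearson correlation over the flattened softmax outputs is assumed here, and the three toy "models" and all function names are hypothetical.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


def flatten(softmax_outputs):
    """Turn an (examples x classes) list of rows into one flat list."""
    return [p for row in softmax_outputs for p in row]


def pairwise_correlation(models_outputs):
    """Correlation matrix between the softmax outputs of each model pair."""
    flat = [flatten(outputs) for outputs in models_outputs]
    return [[pearson(a, b) for b in flat] for a in flat]


# Toy softmax outputs (2 examples, 3 classes) for three hypothetical models.
model_a = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
model_b = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
model_c = [[0.2, 0.2, 0.6], [0.3, 0.3, 0.4]]
corr = pairwise_correlation([model_a, model_b, model_c])
```

Lower off-diagonal values indicate base models whose predictions differ more, which is the notion of diversity the figure compares between the two methods.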

  • Figure 6

    (Color online) The test accuracy of snapshot boosting on CIFAR-100 using ResNet-32 with different reset learning rates $r_2$ and split ratios $\alpha$.

  • Table 1   Comparisons on CV tasks$^{\rm a)}$
    Model         Method                      Number of      Best base model   Ensemble       Increased
                                              base models    accuracy (%)      accuracy (%)   accuracy (%)
    ResNet-32     Single model                1              69.66             -              -
                  Deep incremental boosting   7              67.78             71.34          3.56
                  Bagging                     8              67.86             71.88          4.02
                  Snapshot ensemble           10             69.83             72.98          3.15
                  Snapshot boosting           16             68.75             74.16          5.41
    DenseNet-40   Single model                1              72.07             -              -
                  Deep incremental boosting   7              71.22             73.45          2.23
                  Snapshot ensemble           8              71.71             73.31          1.60
                  Snapshot boosting           14             71.55             74.97          3.42
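The last column of Table 1 is the gap between the ensemble accuracy and the best base model accuracy. The short check below reproduces that arithmetic from the table values; the variable and function names are illustrative only.

```python
# (model, method, best base accuracy %, ensemble accuracy %, reported increase %)
table_1_rows = [
    ("ResNet-32", "Deep incremental boosting", 67.78, 71.34, 3.56),
    ("ResNet-32", "Bagging", 67.86, 71.88, 4.02),
    ("ResNet-32", "Snapshot ensemble", 69.83, 72.98, 3.15),
    ("ResNet-32", "Snapshot boosting", 68.75, 74.16, 5.41),
    ("DenseNet-40", "Deep incremental boosting", 71.22, 73.45, 2.23),
    ("DenseNet-40", "Snapshot ensemble", 71.71, 73.31, 1.60),
    ("DenseNet-40", "Snapshot boosting", 71.55, 74.97, 3.42),
]


def increased_accuracy(best_base, ensemble):
    """Ensemble gain over the best base model, in percentage points."""
    return ensemble - best_base


# Recompute the gain for every row; a small tolerance absorbs float rounding.
recomputed = [increased_accuracy(best, ens) for _, _, best, ens, _ in table_1_rows]
consistent = all(
    abs(gain - reported) < 0.005
    for gain, (_, _, _, _, reported) in zip(recomputed, table_1_rows)
)
```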



    Algorithm 1 Snapshot boosting

    Require: $D_0$: the original training set; $D_1$: the test set; $T$: the number of base models; $\alpha$: the split ratio; $r_1$: the initial learning rate for training the first base model; $r_2$: the initial learning rate for training other base models; $k$: the number of classes; $p_1$; $p_2$; $\delta$; $f$;

    Output: The prediction on the test set, $R$;

    $|D_2| = |D_0|(1-\alpha)$; $|D_3| = |D_0|\alpha$;

    $m = |D_2|$; $n = |D_3|$; $t = 0$;



    for $t=1$ to $T$

    Set the learning rate to $r_2$ and decay it according to the validation accuracy on $D_3$;


    Save the best model $h_{t}$ once the validation accuracy on $D_3$ no longer increases significantly;

    Get the error $\epsilon_t$ on $D_3$, $\epsilon_t=\frac{1}{n}\sum_{i=1}^{n} I(h_t(x_i)\ne y_i)$;



    Where $Z_t$ is a normalization factor;

    $V_t \gets$ use $h_t$ to predict the softmax outputs of $D_3$;

    $S_t \gets$ use $h_t$ to predict the softmax outputs of $D_1$;


    end for

    $V \gets$ merge each base model's outputs $V_t$, $t=1,2,\ldots,T$;

    $S \gets$ merge each base model's outputs $S_t$, $t=1,2,\ldots,T$;

    Use $V$ to perform model selection and feature engineering;

    $V^{*} \gets$ $V$'s remaining features;

    $S^{*} \gets$ $S$'s remaining features;

    $M \gets$ fit a meta-learner on $V^{*}$;
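Some of the bookkeeping steps in Algorithm 1 can be sketched in a few lines of Python: the split sizes $m$ and $n$, the validation error $\epsilon_t$, and the merging of per-model softmax outputs into one feature matrix. This is a sketch under assumptions, not the authors' implementation. In particular, "merge" is assumed to mean feature-wise concatenation, the rounding in `split_sizes` is a choice made here, all function names are hypothetical, and the weight update, feature selection, and meta-learner are left out because the listing above does not specify them.

```python
def split_sizes(num_examples, alpha):
    """Return (m, n) = (|D_2|, |D_3|) for split ratio alpha."""
    n = int(round(num_examples * alpha))  # size of the held-out part D_3
    m = num_examples - n                  # size of the training part D_2
    return m, n


def argmax(row):
    """Index of the largest entry, i.e. the predicted class."""
    return max(range(len(row)), key=lambda j: row[j])


def error_rate(softmax_outputs, true_labels):
    """epsilon_t: fraction of examples whose predicted class is wrong."""
    wrong = sum(
        1 for row, y in zip(softmax_outputs, true_labels) if argmax(row) != y
    )
    return wrong / len(true_labels)


def merge_outputs(per_model_outputs):
    """Concatenate T matrices of shape (n x k) into one (n x T*k) matrix.

    Assumption: "merge" is read as feature-wise concatenation.
    """
    merged = []
    for rows in zip(*per_model_outputs):  # one row per model, same example
        merged.append([p for row in rows for p in row])
    return merged


# Toy example: T = 2 base models, n = 3 validation examples, k = 2 classes.
labels = [0, 1, 0]
V_1 = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]
V_2 = [[0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]
eps_1 = error_rate(V_1, labels)
eps_2 = error_rate(V_2, labels)
V = merge_outputs([V_1, V_2])
m, n = split_sizes(50000, 0.2)
```

Each row of `V` then holds $T \cdot k$ features for one example of $D_3$, which is the input on which a meta-learner would be fitted. The same `merge_outputs` call applied to the $S_t$ matrices would give the matching test-set features.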


  • Table 2   Ensemble accuracy on IMDB dataset$^{\rm a)}$
    Model   Method                        Accuracy (%)
    LSTM    Single model                  89.02
            Snapshot ensemble             90.71
            Snapshot boosting (average)   91.17
            Snapshot boosting             91.52


  • Table 3   Ensemble accuracy on CIFAR-10 dataset$^{\rm a)}$
    Model         Method                        Accuracy (%)
    ResNet-32     Single model                  93.50
                  Snapshot ensemble             94.26
                  Snapshot boosting (average)   94.87
                  Snapshot boosting             95.12
    DenseNet-40   Single model                  93.43
                  Snapshot ensemble             93.57
                  Snapshot boosting (average)   93.96
                  Snapshot boosting             94.31
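Tables 2 and 3 list a "Snapshot boosting (average)" variant alongside the full method. Assuming that this variant combines the base models by plain averaging of their softmax outputs instead of fitting a meta-learner (the tables themselves do not spell this out), the combination step might look like the sketch below. The function names and toy outputs are hypothetical.

```python
def average_ensemble(per_model_outputs):
    """Average the softmax outputs of several models, example by example."""
    num_models = len(per_model_outputs)
    averaged = []
    for rows in zip(*per_model_outputs):  # rows for one example, one per model
        averaged.append([sum(col) / num_models for col in zip(*rows)])
    return averaged


def predict_labels(probabilities):
    """Pick the class with the highest averaged probability per example."""
    return [max(range(len(row)), key=lambda j: row[j]) for row in probabilities]


# Toy softmax outputs of three hypothetical base models on two test examples.
S_1 = [[0.6, 0.4], [0.2, 0.8]]
S_2 = [[0.3, 0.7], [0.1, 0.9]]
S_3 = [[0.8, 0.2], [0.6, 0.4]]
averaged = average_ensemble([S_1, S_2, S_3])
R = predict_labels(averaged)
```

On the first example the second model alone would predict class 1, but the averaged probabilities favour class 0, which shows how averaging can override a single base model's vote.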