
SCIENCE CHINA Information Sciences, Volume 64 , Issue 9 : 192106(2021) https://doi.org/10.1007/s11432-020-3112-8

EAT-NAS: elastic architecture transfer for accelerating large-scale neural architecture search

  • Received: Feb 28, 2020
  • Accepted: Aug 4, 2020
  • Published: Aug 6, 2021

Abstract


Acknowledgment

This work was in part supported by National Natural Science Foundation of China (NSFC) (Grant Nos. 61876212, 61976208, 61733007), Zhejiang Lab (Grant No. 2019NB0AB02), and HUST-Horizon Computer Vision Research Center. We thank Liangchen SONG and Guoli WANG for the discussion and assistance.


References

[1] Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the inception architecture for computer vision. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[2] He K M, Zhang X Y, Ren S Q, et al. Deep residual learning for image recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2016.

[3] Sandler M, Howard A, Zhu M, et al. MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[4] Chen L C, Papandreou G, Kokkinos I, et al. DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans Pattern Anal Mach Intell, 2018, 40: 834-848.

[5] Chen L C, Papandreou G, Schroff F, et al. Rethinking atrous convolution for semantic image segmentation. arXiv preprint, 2017.

[6] Huang Z L, Wang X G, Huang L C, et al. CCNet: criss-cross attention for semantic segmentation. In: Proceedings of International Conference on Computer Vision, 2019.

[7] Huang Z L, Wang X G, Wei Y C, et al. CCNet: criss-cross attention for semantic segmentation. IEEE Trans Pattern Anal Mach Intell, 2020. doi: 10.1109/TPAMI.2020.3007032.

[8] Ren S, He K, Girshick R, et al. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell, 2017, 39: 1137-1149.

[9] Liu W, Anguelov D, Erhan D, et al. SSD: single shot multibox detector. In: Proceedings of European Conference on Computer Vision, 2016.

[10] Lin T Y, Goyal P, Girshick R, et al. Focal loss for dense object detection. In: Proceedings of International Conference on Computer Vision, 2017.

[11] Yi P, Wang Z, Jiang K, et al. Multi-temporal ultra dense memory network for video super-resolution. IEEE Trans Circuits Syst Video Technol, 2020, 30: 2503-2516.

[12] Zoph B, Vasudevan V, Shlens J, et al. Learning transferable architectures for scalable image recognition. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[13] Real E, Aggarwal A, Huang Y, et al. Regularized evolution for image classifier architecture search. In: Proceedings of AAAI Conference on Artificial Intelligence, 2019.

[14] Pham H, Guan M Y, Zoph B, et al. Efficient neural architecture search via parameter sharing. In: Proceedings of International Conference on Machine Learning, 2018.

[15] Zoph B, Le Q V. Neural architecture search with reinforcement learning. In: Proceedings of International Conference on Learning Representations, 2017.

[16] Krizhevsky A, Hinton G. Learning multiple layers of features from tiny images. Technical Report, 2009.

[17] Deng J, Dong W, Socher R, et al. ImageNet: a large-scale hierarchical image database. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2009.

[18] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition. arXiv preprint, 2014.

[19] Szegedy C, Liu W, Jia Y Q, et al. Going deeper with convolutions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2015.

[20] Liu C X, Zoph B, Neumann M, et al. Progressive neural architecture search. In: Proceedings of European Conference on Computer Vision, 2018.

[21] Tommasi T, Patricia N, Caputo B, et al. A deeper look at dataset bias. In: Domain Adaptation in Computer Vision Applications, 2017.

[22] Tan M X, Chen B, Pang R M, et al. MnasNet: platform-aware neural architecture search for mobile. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[23] Zhong Z, Yan J J, Wu W, et al. Practical block-wise neural network architecture generation. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[24] Miikkulainen R, Liang J, Meyerson E, et al. Evolving deep neural networks. In: Artificial Intelligence in the Age of Neural Networks and Brain Computing, 2019.

[25] Lu Z C, Whalen I, Boddeti V, et al. NSGA-Net: a multi-objective genetic algorithm for neural architecture search. arXiv preprint, 2018.

[26] Liu H, Simonyan K, Yang Y. DARTS: differentiable architecture search. In: Proceedings of International Conference on Learning Representations, 2019.

[27] Zhang X B, Huang Z H, Wang N Y. You only search once: single shot neural architecture search via direct sparse optimization. arXiv preprint, 2018.

[28] Cai H, Zhu L G, Han S. ProxylessNAS: direct neural architecture search on target task and hardware. In: Proceedings of International Conference on Learning Representations, 2019.

[29] Fang J M, Sun Y Z, Zhang Q, et al. Densely connected search space for more flexible neural architecture search. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2020.

[30] Fang J M, Sun Y Z, Peng K, et al. Fast neural network adaptation via parameter remapping and architecture search. In: Proceedings of International Conference on Learning Representations, 2020.

[31] Dong X Y, Yang Y. Searching for a robust neural architecture in four GPU hours. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[32] Mei J R, Li Y W, Lian X C, et al. AtomNAS: fine-grained end-to-end neural architecture search. In: Proceedings of International Conference on Learning Representations, 2020.

[33] Wu B C, Dai X L, Zhang P Z, et al. FBNet: hardware-aware efficient convnet design via differentiable neural architecture search. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2019.

[34] Chang J L, Zhang X B, Guo Y W, et al. DATA: differentiable architecture approximation. In: Proceedings of Conference on Neural Information Processing Systems, 2019.

[35] Wong C, Houlsby N, Lu Y F, et al. Transfer learning with neural AutoML. In: Proceedings of Conference on Neural Information Processing Systems, 2018.

[36] Deb K. Multi-objective optimization. In: Search Methodologies, 2014.

[37] Goldberg D E, Deb K. A comparative analysis of selection schemes used in genetic algorithms. Foundations of Genetic Algorithms, 1991, 1: 69-93. doi: 10.1016/B978-0-08-050684-5.50008-2.

[38] Liu H X, Simonyan K, Vinyals O, et al. Hierarchical representations for efficient architecture search. In: Proceedings of International Conference on Learning Representations, 2018.

[39] Chollet F. Xception: deep learning with depthwise separable convolutions. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2017.

[40] Chen T, Goodfellow I J, Shlens J. Net2Net: accelerating learning via knowledge transfer. In: Proceedings of International Conference on Learning Representations, 2016.

[41] Loshchilov I, Hutter F. SGDR: stochastic gradient descent with warm restarts. arXiv preprint, 2016.

[42] DeVries T, Taylor G W. Improved regularization of convolutional neural networks with cutout. arXiv preprint, 2017.

[43] Howard A G, Zhu M L, Chen B, et al. MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint, 2017.

[44] Zhang X Y, Zhou X Y, Lin M X, et al. ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 2018.

[45] Chen X, Xie L X, Wu J, et al. Progressive differentiable architecture search: bridging the depth gap between search and evaluation. In: Proceedings of International Conference on Computer Vision, 2019.

[46] Xu Y H, Xie L X, Zhang X P, et al. PC-DARTS: partial channel connections for memory-efficient architecture search. In: Proceedings of International Conference on Learning Representations, 2020.

[47] Xie S R, Zheng H H, Liu C X, et al. SNAS: stochastic neural architecture search. In: Proceedings of International Conference on Learning Representations, 2019.

[48] Real E, Moore S, Selle A, et al. Large-scale evolution of image classifiers. In: Proceedings of International Conference on Machine Learning, 2017.

  • Figure 1

    (Color online) Framework of the elastic architecture transfer for NAS (EAT-NAS). We first search for the basic architecture on a small-scale task and then search on a large-scale task with the basic architecture as the seed of the new population initialization.

  • Figure 2

    (Color online) Search space. During the search, all the blocks are concatenated to constitute the whole network architecture. Each block comprises several layers and is represented by the following five primitives: convolutional operation type, kernel size, skip connection, width and depth.

  • Figure 3

    (Color online) The architectures searched by EAT-NAS. The upper one is the basic architecture searched on CIFAR-10; the lower one, namely EATNet-A, is the architecture searched on ImageNet, transferred from the basic architecture.

  • Figure 4

    (Color online) Comparison of the evolution processes of EAT-NAS and search from scratch on ImageNet. (a) Mean accuracy of models in the population; (b) population quality.

  • Figure 5

    (Color online) Mean accuracy and mean multiply-adds of the models during the search on ImageNet when the basic architecture performs worse on CIFAR-10.

  •   

    Algorithm 1 Evolutionary algorithm

    Require: Population size $P$, sample size $S$, dataset $\mathbb{D}$.

    Output: The best model $M_{\rm~best}$.

    $\mathbb{P}^{(0)}$ $\gets$ initialize($P$);

    for $1~\le~j~\le~P$ do

    $M_j.{\rm~acc}$ $\gets$ train-eval($M_j$, $\mathbb{D}$);

    $M_j.{\rm~score}$ $\gets$ comp-score($M_j$, $M_j.{\rm~acc}$);

    end for

    $Q^{(0)}$ $\gets$ comp-quality($\mathbb{P}^{(0)}$);

    $i~\gets~0$;

    while $Q^{(i)}$ has not converged do

    $S^{(i)}$ $\gets$ sample($\mathbb{P}^{(i)}$, $S$);

    $M_{\rm~best}$, $M_{\rm~worst}$ $\gets$ pick($S^{(i)}$);

    $M_{\rm~mut}$ $\gets$ mutate($M_{\rm~best}$);

    $M_{\rm~mut}.{\rm~acc}$ $\gets$ train-eval($M_{\rm~mut}$, $\mathbb{D}$);

    $M_{\rm~mut}.{\rm~score}$ $\gets$ comp-score($M_{\rm~mut}$, $M_{\rm~mut}.{\rm~acc}$);

    $\mathbb{P}^{(i+1)}$ $\gets$ remove $M_{\rm~worst}$ from $\mathbb{P}^{(i)}$;

    $\mathbb{P}^{(i+1)}$ $\gets$ add $M_{\rm~mut}$ to $\mathbb{P}^{(i+1)}$;

    $Q^{(i+1)}$ $\gets$ comp-quality($\mathbb{P}^{(i+1)}$);

    $i~\gets~i+1$;

    end while

    $M_{\rm~best}$ $\gets$ rerank-topk($\mathbb{P}_{\rm~best}$, $k$).
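A minimal Python sketch of Algorithm 1's tournament loop. Here `fitness` and `mutate` are toy stand-ins for the paper's train-eval/comp-score pipeline and mutation operator, and the convergence test on population quality is replaced by a fixed step count:

```python
import random

def evolve(population, fitness, mutate, sample_size, n_steps, rng=random):
    """Tournament evolution sketch: at each step, sample `sample_size`
    individuals, mutate the best of the sample, evaluate the mutant, and
    replace the worst of the sample with it."""
    scored = [(ind, fitness(ind)) for ind in population]   # initial evaluation
    for _ in range(n_steps):
        idx = rng.sample(range(len(scored)), sample_size)  # tournament sample
        best_i = max(idx, key=lambda i: scored[i][1])
        worst_i = min(idx, key=lambda i: scored[i][1])
        child = mutate(scored[best_i][0])                  # mutate the winner
        scored[worst_i] = (child, fitness(child))          # replace the loser
    return max(scored, key=lambda s: s[1])[0]              # best survivor

# Toy usage: individuals are integers, fitness is identity, mutation adds 1.
random.seed(0)
best = evolve(list(range(10)), fitness=lambda x: x,
              mutate=lambda x: x + 1, sample_size=3, n_steps=50)
```

Because the mutant always replaces the worst of the sampled set, the population's best score never decreases, which is the property the paper's quality measure tracks.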

  • Table 1

    Table 1  ImageNet classification results in the mobile setting. The results of manually designed models are provided in the top section, other NAS results in the middle section, and the results of our models in the bottom section$^{\rm~a)}$

  •   

    Algorithm 2 Elastic architecture transfer

    Require: Datasets $\mathbb{D}_1$, $\mathbb{D}_2$, population size $P$.

    Output: The target architecture ${\rm~Arch}_{\rm~target}$.

    // Initialize the population on $\mathbb{D}_1$.

    $\mathbb{P}_1$ $\gets$ initialize($P$);

    evolve($\mathbb{P}_1$, $\mathbb{D}_1$);

    ${\rm~Arch}_{\rm~basic}$ $\gets$ rerank-topk($\mathbb{P}_{1}$, $k$);

    // Initialize the population on $\mathbb{D}_2$.

    for $1~\le~i~\le~P$ do

    ${\rm~Arch}_i$ $\gets$ arch-perturbation(${\rm~Arch}_{\rm~basic}$);

    $\mathbb{P}_2$.append(${\rm~Arch}_i$);

    end for

    evolve($\mathbb{P}_2$, $\mathbb{D}_2$);

    ${\rm~Arch}_{\rm~target}$ $\gets$ rerank-topk($\mathbb{P}_{2}$, $k$).
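The two-stage structure of Algorithm 2 can be sketched as a small driver function. All of the callables below (`init_pop`, `evolve`, `perturb`, `select_best`) are placeholders for the paper's components, not its actual API:

```python
def elastic_transfer(init_pop, evolve, perturb, select_best, d1, d2, pop_size):
    """Algorithm 2 sketch: evolve a population on the small task `d1`, take
    the best ("basic") architecture, then seed the large-task population
    with perturbed copies of it before evolving on `d2`."""
    p1 = init_pop(pop_size)                          # random init on small task
    p1 = evolve(p1, d1)                              # first-stage search
    basic = select_best(p1)                          # rerank-topk in the paper
    p2 = [perturb(basic) for _ in range(pop_size)]   # elastic transfer seeding
    p2 = evolve(p2, d2)                              # second-stage search
    return select_best(p2)
```

The key point the sketch captures is that the second population is not random: every member starts as a perturbation of the transferred basic architecture.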

  • Table 2

    Table 2  Results on CIFAR-100$^{\rm~a)}$

    Model #Params (M) Top-1 Acc (%)
    ResNet [2] 1.7 72.8
    LS Evo [48] 40.4 77.0
    SS 2.2 77.4
    EATNet 1.9 78.1

    a) The comparison results are from [48]. “LS Evo”: large-scale evolution. “SS”: the model searched from scratch on CIFAR-100.

  •   

    Algorithm 3 Parameter sharing on the width-level

    Require: Kernel ${\boldsymbol~K}_l$ in layer $l$, the original kernel ${\boldsymbol~K}_o$.

    Output: Kernel ${\boldsymbol~K}_l$ in layer $l$.

    ${\rm~ch}_{\rm~in}^s$ $\gets$ min(${\rm~ch}_{\rm~in}^l$, ${\rm~ch}_{\rm~in}^o$);

    ${\rm~ch}_{\rm~out}^s$ $\gets$ min(${\rm~ch}_{\rm~out}^l$, ${\rm~ch}_{\rm~out}^o$);

    ${\boldsymbol~K}_l$ $\gets$ ${\boldsymbol~K}_o$($w^o$, $h^o$, ${\rm~ch}_{\rm~in}^s$, ${\rm~ch}_{\rm~out}^s$).
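A NumPy sketch of Algorithm 3's width-level sharing, assuming a (height, width, ch_in, ch_out) kernel layout with matching spatial sizes (the function name and layout are illustrative, not the paper's implementation). The new kernel inherits the overlapping channel slice of the original; any extra channels keep their fresh initialization:

```python
import numpy as np

def share_width(new_shape, k_orig, rng=None):
    """Copy the overlapping input/output-channel slice of `k_orig` into a
    freshly initialized kernel of shape `new_shape`."""
    assert new_shape[:2] == k_orig.shape[:2], "spatial dims assumed equal"
    rng = rng or np.random.default_rng(0)
    k_new = 0.01 * rng.standard_normal(new_shape)   # fresh init
    ci = min(new_shape[2], k_orig.shape[2])         # shared input channels
    co = min(new_shape[3], k_orig.shape[3])         # shared output channels
    k_new[:, :, :ci, :co] = k_orig[:, :, :ci, :co]  # inherit parameters
    return k_new
```

For example, growing a 3x3 kernel from 4 to 6 input channels while shrinking from 8 to 4 output channels copies the shared 4x4 channel block and leaves the two extra input channels randomly initialized.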

  • Table 3

    Table 3  Results of the contrast experiments on ImageNet$^{\rm~a)}$

    Model #Params (M) #Mult-Adds (M) Top-1/Top-5 Acc (%)
    SS 5.55 465 72.5/90.7
    Basic model 3.27 934 75.2/92.5
    Model-B 3.22 408 72.7/91.0
    EATNet-A 5.12 563 75.5/92.5
    EATNet-B 5.20 545 75.6/92.4
    EATNet-C 4.63 417 73.9/91.8

    a) “SS” denotes the model searched from scratch on ImageNet. The basic model searched on CIFAR-10 is directly applied on ImageNet without any modification. Model-B denotes the best model searched on ImageNet with a poor-performing basic architecture. EATNet-C is a small model searched by EAT-NAS.

  •   

    Algorithm 4 Architecture perturbation function

    Require: Basic architecture ${\rm~Arch}_b$, search space $\mathbb{S}$, number of blocks $N_{\text{blocks}}$, and primitives prims.

    Output: Perturbed architecture ${\rm~Arch}_p$.

    ${\rm~Arch}_p$ $\gets$ copy(${\rm~Arch}_b$);

    for $1~\le~j~\le~N_{\text{blocks}}$ do

    prim $\gets$ rand-select(prims);

    value $\gets$ rand-generate(prim, $\mathbb{S}$);

    $B^j_p$ $\gets$ get-block(${\rm~Arch}_p,~j$);

    $B^j_p[$prim$]$ $\gets$ value;

    end for
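Algorithm 4 can be sketched in a few lines of Python. The search-space values below are hypothetical placeholders (the paper's actual primitive ranges differ); a block is modeled as a dict over the five primitives named in Figure 2:

```python
import copy
import random

# Hypothetical value ranges per primitive; illustrative only.
SEARCH_SPACE = {
    "op": ["conv", "dep_sep_conv", "mbconv"],
    "kernel": [3, 5, 7],
    "skip": [True, False],
    "width": [16, 24, 32, 48],
    "depth": [1, 2, 3, 4],
}

def arch_perturbation(arch_basic, rng=random):
    """For every block of the copied architecture, pick one primitive at
    random and re-sample its value from the search space."""
    arch = copy.deepcopy(arch_basic)                  # keep the basic arch intact
    for block in arch:
        prim = rng.choice(sorted(SEARCH_SPACE))       # one primitive per block
        block[prim] = rng.choice(SEARCH_SPACE[prim])  # re-sample its value
    return arch
```

Since only one primitive per block is re-sampled (and the new value may equal the old one), each perturbed architecture stays close to the basic architecture, which is what lets the transferred population start near a good region of the large-scale search space.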

