
SCIENCE CHINA Information Sciences, Volume 64, Issue 5: 150102 (2021). https://doi.org/10.1007/s11432-020-3163-0

Learning dynamics of gradient descent optimization in deep neural networks

  • Received: Apr 26, 2020
  • Accepted: Nov 19, 2020
  • Published: Apr 8, 2021

Abstract


Acknowledgment

This work was supported by the National Natural Science Foundation of China (Grant Nos. 61933013, U1736211), the Strategic Priority Research Program of the Chinese Academy of Sciences (Grant No. XDA22030301), the Natural Science Foundation of Guangdong Province (Grant No. 2019A1515011076), and the Key Project of the Natural Science Foundation of Hubei Province (Grant No. 2018CFA024).



  • Figure 1

    (Color online) Learning from random states to the acceptable parameter space $M$ or $M^*$. The coordinate system is formed by a deep learning parameter set including nodes and weights, and $r_1$, $r_2$, $r_3$, $r_4$ are different learning routes.

  • Figure 2

    (Color online) (a) Step response and (b) root locus of SGD ($K_0 > 0$, $r > 0$).

  • Figure 3

    (Color online) (a), (c) Step response and (b), (d) root locus of SGD momentum.

  • Figure 4

    (Color online) Update trajectories of classical momentum (left) and Nesterov accelerated gradient (right).

  • Figure 5

    (Color online) (a), (c) Step response and (b), (d) root locus of NAG.

  • Figure 6

    (Color online) Identification results of different optimizers on the MNIST dataset. (a) Training loss, (b) validation loss, (c) training accuracy, and (d) validation accuracy of SGD, SGD momentum, and NAG, respectively.

  • Figure 7

    (Color online) Identification results of different optimizers on the CIFAR-10 dataset. (a) Training loss, (b) validation loss, (c) training accuracy, and (d) validation accuracy of SGD, SGD momentum, and NAG, respectively.

  • Table 1

    Table 1. Step-signal response of the optimization models SGD, SGD momentum, and NAG

    | Optimization model | Parameters | Transfer function | Order | Overshoot | Settling time |
    |---|---|---|---|---|---|
    | SGD | $K_0$, $r$ | $G_{\theta_{\rm sgd}}(s) = K_0 \cdot \frac{r}{s}$ | 1 | No | Long |
    | SGD momentum | $K_0$, $r$, $\mu$ | $G_{\theta_{\rm sgdm}}(s) = \frac{K_0 r}{\mu s^2 + s}$ | 2 | Depends on $\xi = \frac{1}{\sqrt{4 K_0 r \mu - 1}}$ | Middle-long |
    | NAG | $K_0$, $r$, $\mu$, $\alpha$ | $G_{\theta_{\rm nag}}(s) = \frac{K_0 r \mu s + K_0 r \alpha}{\mu s^2 + s}$ | 2 | Depends on $\xi = \frac{\mu K_0 r \alpha + 1}{2\sqrt{\mu K_0 r \alpha}}$ | Middle-short |

    (The standard discrete update rules of these optimizers, and the textbook second-order step-response relations used to read the Overshoot and Settling time columns, are sketched after Table 3.)
  • Table 2

    Table 2. Key performance indices of SGD, SGD momentum, and NAG on the MNIST dataset at given accuracy levels

    | Train accuracy | Index | SGD ($K_0=10$, $r=0.05$) | SGD momentum ($K_0=10$, $r=0.05$, $\mu=0.9$) | NAG ($K_0=10$, $r=0.05$, $\mu=0.9$) |
    |---|---|---|---|---|
    | $\geq 95\%$ | Epoch | 8 | 5 | 3 |
    | | Time (s) | 127.321 | 51.554 | 21.339 |
    | | Train loss | 0.164 | 0.162 | 0.115 |
    | | Validation loss | 0.157 | 0.145 | 0.103 |
    | | Train accuracy (%) | 95.39 | 95.43 | 96.69 |
    | | Validation accuracy (%) | 95.61 | 95.75 | 96.94 |
    | $\geq 98\%$ | Epoch | 19 | 11 | 5 |
    | | Time (s) | 278.489 | 111.552 | 36.887 |
    | | Train loss | 0.072 | 0.075 | 0.058 |
    | | Validation loss | 0.085 | 0.087 | 0.069 |
    | | Train accuracy (%) | 98.09 | 98.01 | 98.29 |
    | | Validation accuracy (%) | 97.46 | 97.41 | 97.74 |
  • Table 3

    Table 3. Key performance indices of SGD, SGD momentum, and NAG on the CIFAR-10 dataset at given accuracy levels

    | Train accuracy | Index | SGD ($K_0=100$, $r=0.001$) | SGD momentum ($K_0=100$, $r=0.001$, $\mu=0.5$) | NAG ($K_0=100$, $r=0.001$, $\mu=0.5$, $\alpha=0.5$) |
    |---|---|---|---|---|
    | $\geq 70\%$ | Epoch | 38 | 9 | 8 |
    | | Time (s) | 3713.554 | 1949.627 | 2274.326 |
    | | Train loss | 1.199 | 1.079 | 1.286 |
    | | Validation loss | 1.258 | 1.574 | 1.547 |
    | | Train accuracy (%) | 70.28 | 70.71 | 70.73 |
    | | Validation accuracy (%) | 69.01 | 65.01 | 54.76 |
    | $\geq 80\%$ | Epoch | 78 | 35 | 16 |
    | | Time (s) | 6578.489 | 7784.554 | 4479.583 |
    | | Train loss | 0.931 | 1.219 | 1.194 |
    | | Validation loss | 0.924 | 1.034 | 1.364 |
    | | Train accuracy (%) | 80.04 | 80.03 | 80.93 |
    | | Validation accuracy (%) | 80.74 | 79.55 | 60.38 |
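
The three optimizers compared in the figures and tables above admit a compact discrete form. The following minimal Python sketch (not the paper's code) implements the standard SGD, classical-momentum, and NAG update rules on an illustrative quadratic objective; the toy objective, the hyperparameter values, and the use of a plain learning rate lr in place of the paper's gain-rate product $K_0 r$ are assumptions made here for illustration only.

```python
import numpy as np

def grad(theta):
    """Gradient of the toy quadratic objective f(theta) = 0.5 * theta^T A theta."""
    A = np.diag([1.0, 10.0])  # ill-conditioned quadratic, assumed for illustration
    return A @ theta

def sgd(theta, lr=0.05, steps=100):
    # Plain gradient descent: theta <- theta - lr * grad(theta)
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

def momentum(theta, lr=0.05, mu=0.9, steps=100):
    # Classical (heavy-ball) momentum: accumulate a velocity, then move along it.
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = mu * v - lr * grad(theta)
        theta = theta + v
    return theta

def nag(theta, lr=0.05, mu=0.9, steps=100):
    # Nesterov accelerated gradient: evaluate the gradient at the look-ahead
    # point theta + mu * v before taking the step (cf. Figure 4).
    v = np.zeros_like(theta)
    for _ in range(steps):
        v = mu * v - lr * grad(theta + mu * v)
        theta = theta + v
    return theta

theta0 = np.array([5.0, 5.0])
for name, opt in [("SGD", sgd), ("SGD momentum", momentum), ("NAG", nag)]:
    print(f"{name}: ||theta|| = {np.linalg.norm(opt(theta0.copy())):.2e}")
```

The only difference between the last two rules is where the gradient is evaluated: classical momentum uses the current point, whereas NAG uses the look-ahead point $\theta + \mu v$, which is exactly the distinction visualized in Figure 4.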
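Similarly, the Overshoot and Settling time columns of Table 1 can be read through the textbook step-response relations of an underdamped second-order closed loop (standard control-theory results, not derivations specific to this paper). For damping ratio $0 < \xi < 1$ and natural frequency $\omega_n$:

```latex
% Textbook second-order step-response relations (2% settling-time criterion),
% quoted only as an aid for reading the Overshoot and Settling time columns of Table 1.
\[
  \text{overshoot} \;=\; \exp\!\left(\frac{-\pi \xi}{\sqrt{1-\xi^{2}}}\right) \times 100\%,
  \qquad
  t_{s} \;\approx\; \frac{4}{\xi\,\omega_{n}} .
\]
```

A first-order loop, such as the SGD model in Table 1, produces no overshoot; for the second-order SGD-momentum and NAG models, a smaller $\xi$ gives a larger overshoot, and $\xi \geq 1$ again gives none.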
