
SCIENCE CHINA Information Sciences, Volume 62, Issue 1: 012101 (2019). https://doi.org/10.1007/s11432-018-9656-y

A convergence analysis for a class of practical variance-reduction stochastic gradient MCMC

  • Received: Jun 26, 2018
  • Accepted: Oct 26, 2018
  • Published: Dec 19, 2018

Abstract


Supplement

Appendixes A–F.



  • Figure 1

    (Color online) MSE vs. wall-clock time for different computational budgets (running time) with varying minibatch sizes for standard SG-MCMC. Under a very limited computational budget, larger minibatch sizes tend to achieve a lower MSE (a), whereas under a larger computational budget, a minibatch size of 1 obtains the smallest MSE (b).

  • Figure 2

    (Color online) Number of passes through data vs. testing error ((a) and (c)) / loss ((b) and (d)) on MNIST ((a) and (b)) and CIFAR-10 ((c) and (d)).

  • Figure 3

    (Color online) Number of passes through data vs. testing error ((a) and (c)) / loss ((b) and (d)) with CNN-4 ((a) and (b)) and ResNet ((c) and (d)) on CIFAR-10.

  • Figure 4

    (Color online) Number of passes through data vs. testing perplexity on the PTB dataset (a) and WikiText-2 dataset (b).

  • Figure 5

    (Color online) Number of passes through data vs. testing negative log-likelihood on the Pima dataset for Bayesian logistic regression (a); number of passes through data vs. testing error (b) / loss (c) on the CIFAR-10 dataset, with varying $n_1$ values.

  • Algorithm 1 Practical variance-reduction SG-MCMC.

    Input: $\bar{\boldsymbol{x}} = \boldsymbol{x}_0 = (\boldsymbol{\theta}_0, \boldsymbol{\tau}_0) \in \mathbb{R}^d$, minibatch sizes $(n_1, n_2)$ with $n_1 > n_2$, update interval $m$, total number of iterations $L$, stepsizes $\{h_l\}_{l=1}^L$.

    Output: Approximate samples $\{\boldsymbol{x}_l\}_{l=1}^L$.

    for $l = 0$ to $L-1$
        if $(l \bmod m) = 0$ then
            Sample without replacement $\{\pi_i\}_{i=1}^{n_1} \subseteq \{1, \ldots, N\}$;
            $\bar{\boldsymbol{x}} = \boldsymbol{x}_l$, $\tilde{\boldsymbol{\theta}}_l \leftarrow \bar{\boldsymbol{x}}$;
            $\tilde{g} = \frac{N}{n_1}\sum_{i\in\pi}\nabla_{\boldsymbol{\theta}}\log p(\boldsymbol{d}_i \mid \tilde{\boldsymbol{\theta}}_l)$;
        end if
        $\boldsymbol{\theta}_l \leftarrow \boldsymbol{x}_l$, $\tilde{\boldsymbol{\theta}}_l \leftarrow \bar{\boldsymbol{x}}$;  {fix $\tilde{\boldsymbol{\theta}}_l$ from the outer loop as a control variate}
        Sample without replacement $\{\tilde{\pi}_i\}_{i=1}^{n_2} \subseteq \{1, \ldots, N\}$;
        $g_{l+1} = \tilde{g} + \nabla_{\boldsymbol{\theta}}\log p(\boldsymbol{\theta}_l) + \frac{N}{n_2}\sum_{i\in\tilde{\pi}}\big(\nabla_{\boldsymbol{\theta}}\log p(\boldsymbol{d}_i \mid \boldsymbol{\theta}_l) - \nabla_{\boldsymbol{\theta}}\log p(\boldsymbol{d}_i \mid \tilde{\boldsymbol{\theta}}_l)\big)$;
        $\boldsymbol{x}_{l+1} = \mathrm{NextS}(\boldsymbol{x}_l, g_{l+1}, h_{l+1})$;
    end for
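
A minimal NumPy sketch of one run of Algorithm 1, included only to make the update concrete: it plugs the variance-reduced gradient estimate $g_{l+1}$ into a plain SGLD step in place of the generic transition $\mathrm{NextS}$, and the function names (grad_log_lik, grad_log_prior), default parameter values, and data layout are illustrative assumptions rather than the paper's implementation.

import numpy as np

def vr_sgmcmc(grad_log_lik, grad_log_prior, data, theta0,
              n1=1000, n2=10, m=50, L=5000, h=1e-4, rng=None):
    # grad_log_lik(theta, batch): sum over the batch of per-datum gradients
    #     of log p(d_i | theta); grad_log_prior(theta): gradient of log p(theta).
    rng = np.random.default_rng() if rng is None else rng
    N = len(data)
    theta = np.array(theta0, dtype=float)
    samples = []
    for l in range(L):
        if l % m == 0:
            # Outer loop: refresh the anchor point and its large-minibatch
            # (size n1) gradient estimate, later used as a control variate.
            idx1 = rng.choice(N, size=n1, replace=False)
            theta_anchor = theta.copy()
            g_tilde = (N / n1) * grad_log_lik(theta_anchor, data[idx1])
        # Inner loop: small minibatch (size n2), variance-reduced estimate g_{l+1}.
        idx2 = rng.choice(N, size=n2, replace=False)
        g = (g_tilde + grad_log_prior(theta)
             + (N / n2) * (grad_log_lik(theta, data[idx2])
                           - grad_log_lik(theta_anchor, data[idx2])))
        # SGLD transition standing in for NextS(x_l, g_{l+1}, h_{l+1}).
        theta = theta + 0.5 * h * g + np.sqrt(h) * rng.standard_normal(theta.shape)
        samples.append(theta.copy())
    return samples

The large-minibatch term $\tilde{g}$ is refreshed only every $m$ iterations, so its cost is amortized while each inner step touches only $n_2 < n_1$ data points yet still uses a lower-variance gradient estimate; the effect of varying $n_1$ is what Figure 5 examines.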