
Accelerated Stochastic Variance Reduction Gradient Method with Adaptive Learning Rate

Chen Guoming1, Yu Tengteng2, Liu Xinwei1

  1. School of Science, Hebei University of Technology, Tianjin 300401, China;
  2. School of Artificial Intelligence, Hebei University of Technology, Tianjin 300401, China
  • Received: 2020-01-10  Online: 2021-09-15  Published: 2021-09-17

Chen Guoming, Yu Tengteng, Liu Xinwei. Accelerated Stochastic Variance Reduction Gradient Method with Adaptive Learning Rate[J]. 数值计算与计算机应用, 2021, 42(3): 215-225.

Chen Guoming, Yu Tengteng, Liu Xinwei. ACCELERATED STOCHASTIC VARIANCE REDUCTION GRADIENT METHOD WITH ADAPTIVE LEARNING RATE[J]. Journal on Numerical Methods and Computer Applications, 2021, 42(3): 215-225.

Owing to its excellent performance in solving empirical risk minimization (ERM) problems, the stochastic variance reduced gradient (SVRG) method has attracted extensive attention in recent years. Unlike the fixed learning rate used in SVRG, and drawing on the initialization bias correction technique, adaptive schemes are proposed to dynamically compute the learning rates of SVRG and of its accelerated version FSVRG; the resulting methods are called AdaSVRG and AdaFSVRG, respectively. Convergence analysis shows that both AdaSVRG and AdaFSVRG achieve a linear convergence rate under the strong convexity assumption. Numerical experiments on standard datasets show that, when solving ERM problems, AdaSVRG and AdaFSVRG require fewer iterations to reach the same level of optimality gap.
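For a concrete sense of how an SVRG-type update can be paired with an adaptively computed, bias-corrected learning rate, the following Python sketch applies the general recipe from the abstract to an l2-regularized logistic-regression ERM problem. It is only an illustration under assumed choices, not the paper's AdaSVRG/AdaFSVRG algorithm: the names (adasvrg, logistic_grad) and hyperparameters (base_lr, beta, eps) are assumptions of this sketch.

import numpy as np

def logistic_grad(w, X, y, lam):
    """Gradient of the l2-regularized logistic loss averaged over the rows of X (labels y in {-1, +1})."""
    z = X @ w
    p = 1.0 / (1.0 + np.exp(-y * z))
    return -(X.T @ ((1.0 - p) * y)) / X.shape[0] + lam * w

def adasvrg(X, y, lam=1e-3, epochs=10, base_lr=0.1, beta=0.999, eps=1e-8):
    """Illustrative SVRG loop whose step size is scaled by a bias-corrected RMS of the variance-reduced gradients."""
    n, d = X.shape
    w = np.zeros(d)
    v, t = np.zeros(d), 0                          # second-moment estimate and step counter
    for _ in range(epochs):
        w_snap = w.copy()
        mu = logistic_grad(w_snap, X, y, lam)      # full gradient at the snapshot point
        for _ in range(n):
            i = np.random.randint(n)
            xi, yi = X[i:i + 1], y[i:i + 1]
            # variance-reduced stochastic gradient (standard SVRG direction)
            g = logistic_grad(w, xi, yi, lam) - logistic_grad(w_snap, xi, yi, lam) + mu
            # adaptive learning rate: EMA of squared gradients with initialization bias correction
            t += 1
            v = beta * v + (1.0 - beta) * g * g
            v_hat = v / (1.0 - beta ** t)          # corrects the zero-initialized estimate
            w -= base_lr * g / (np.sqrt(v_hat) + eps)
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.standard_normal((500, 20))
    y = np.sign(X @ rng.standard_normal(20) + 0.1 * rng.standard_normal(500))
    w = adasvrg(X, y)
    print("full-gradient norm after training:", np.linalg.norm(logistic_grad(w, X, y, 1e-3)))

The division by (1 - beta^t) is the initialization bias correction mentioned in the abstract: it compensates for the second-moment estimate v being started at zero, so the effective step size is not artificially large during the first inner iterations.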

[1] Robbins H, Monro S. A stochastic approximation method[J]. Annals of Mathematical Statistics, 1951, 22(3):400-407.
[2] Nitanda A. Stochastic proximal gradient descent with acceleration techniques[C]//Advances in Neural Information Processing Systems. 2014:1574-1582.
[3] Konečný J, Liu J, Richtárik P, et al. mS2GD:Mini-batch semi-stochastic gradient descent in the proximal setting[J]. IEEE Journal of Selected Topics in Signal Processing, 2014, 10(2):242-255.
[4] Yu Y, Huang L. Exploring fast algorithms for composite optimization with serial and asynchronous realizations[J]. arXiv preprint arXiv:1805.10111v2, 2018.
[5] Johnson R, Zhang T. Accelerating stochastic gradient descent using predictive variance reduction[C]//Advances in Neural Information Processing Systems. 2013:315-323.
[6] Tan C, Ma S, Dai Y H, et al. Barzilai-Borwein step size for stochastic gradient descent[C]//Advances in Neural Information Processing Systems. 2016:685-693.
[7] Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization[J]. Journal of Machine Learning Research, 2011, 12:2121-2159.
[8] Zeiler M D. ADADELTA:an adaptive learning rate method[J]. arXiv preprint arXiv:1212.5701, 2012.
[9] Tieleman T, Hinton G E. Lecture 6.5-rmsprop:divide the gradient by a running average of its recent magnitude[J]. COURSERA:Neural Networks for Machine Learning, 2012, 4(2):26-31.
[10] Kingma D P, Ba J. Adam:A method for stochastic optimization[C]//Proceedings of the 3rd International Conference on Learning Representations. 2015:1-13.
[11] Rumelhart D E, Hinton G E, Williams R J. Learning representations by back-propagating errors[J]. Nature, 1986, 323(6088):533-536.
[12] Nesterov Y E. A method for solving the convex programming problem with convergence rate O(1/k^2)[J]. Dokl. Akad. Nauk SSSR, 1983, 269(3):543-547.
[13] Allen-Zhu Z, Orecchia L. Linear coupling:an ultimate unification of gradient and mirror descent[C]//8th Innovations in Theoretical Computer Science Conference. 2017:1-22.
[14] Allen-Zhu Z. Katyusha:the first direct acceleration of stochastic gradient methods[J]. Journal of Machine Learning Research, 2017, 18(1):8194-8244.
[15] Shang F, Liu Y, Cheng J, et al. Fast stochastic variance reduced gradient method with momentum acceleration for machine learning[J]. arXiv preprint arXiv:1703.07948, 2017.
[16] Shi Jiarong, Wang Dan, Shang Fanhua, et al. Research progress on stochastic gradient descent algorithms[J/OL]. Acta Automatica Sinica: 1-17 [2021-07-26]. https://doi.org/10.16383/j.aas.c190260.