
PROXIMAL-PROXIMAL-GRADIENT METHOD

Ernest K. Ryu, Wotao Yin   

  1. Department of Mathematics, University of California, Los Angeles, CA 90095, USA
Received: 2018-12-18  •  Online: 2019-12-15  •  Published: 2019-12-15

Ernest K. Ryu, Wotao Yin. PROXIMAL-PROXIMAL-GRADIENT METHOD[J]. Journal of Computational Mathematics, 2019, 37(6): 778-812.

In this paper, we present the proximal-proximal-gradient method (PPG), a novel optimization method that is simple to implement and simple to parallelize. PPG generalizes the proximal-gradient method and ADMM and is applicable to minimization problems written as a sum of many differentiable and many non-differentiable convex functions; the non-differentiable functions may be coupled. We furthermore present a related stochastic variant, which we call stochastic PPG (S-PPG). S-PPG can be interpreted as a generalization of Finito and MISO to sums of many coupled non-differentiable convex functions.
We present many applications that can benefit from PPG and S-PPG and prove convergence for both methods. We demonstrate the empirical effectiveness of both methods through experiments on a CUDA GPU. A key strength of PPG and S-PPG, compared to existing methods, is their ability to directly handle a large sum of non-differentiable, non-separable functions with a constant stepsize that is independent of the number of functions. Such non-diminishing stepsizes allow the methods to be fast.
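To make the structure of such a splitting concrete, below is a minimal Python sketch of a PPG-style iteration for a problem of the form minimize r(x) + (1/n) Σ_i [f_i(x) + g_i(x)], with r and each g_i proximable and each f_i differentiable. The specific update rules, stepsize value, and the toy lasso-style data are illustrative assumptions based on the abstract's description, not a verbatim transcription of the paper's algorithm; an S-PPG-style variant would sample a single random index i per iteration instead of updating all of them.

```python
# PPG-style sketch (assumed problem form):
#   minimize  r(x) + (1/n) * sum_i [ f_i(x) + g_i(x) ]
import numpy as np

def ppg(prox_r, grad_f, prox_g, x0, alpha=0.01, iters=200):
    """Run a PPG-style iteration with a constant stepsize alpha.

    prox_r : callable, proximal operator of alpha * r
    grad_f : list of callables, gradients of the smooth terms f_i
    prox_g : list of callables, proximal operators of alpha * g_i
    """
    n = len(grad_f)
    z = np.tile(x0, (n, 1))                  # one auxiliary copy z_i per term
    for _ in range(iters):
        x_half = prox_r(z.mean(axis=0))      # x^{k+1/2} = prox_{alpha r}(mean of z_i)
        for i in range(n):                   # all i updated here (parallelizable);
            x_i = prox_g[i](                 # S-PPG would sample one random i instead
                2 * x_half - z[i] - alpha * grad_f[i](x_half)
            )
            z[i] = z[i] + x_i - x_half       # z_i update
    return x_half

# Toy usage (hypothetical data): lasso split into n least-squares pieces,
# with r = lam * ||x||_1 and g_i = 0.
rng = np.random.default_rng(0)
n, d = 4, 10
A = [rng.standard_normal((20, d)) for _ in range(n)]
b = [Ai @ rng.standard_normal(d) for Ai in A]
lam, alpha = 0.1, 0.01
prox_r = lambda v: np.sign(v) * np.maximum(np.abs(v) - alpha * lam, 0.0)  # soft-threshold
grad_f = [lambda x, Ai=Ai, bi=bi: Ai.T @ (Ai @ x - bi) for Ai, bi in zip(A, b)]
prox_g = [lambda v: v for _ in range(n)]                                  # g_i = 0
x_est = ppg(prox_r, grad_f, prox_g, np.zeros(d), alpha=alpha)
```

Note that the stepsize alpha is held constant rather than diminished, reflecting the constant-stepsize property highlighted in the abstract; the value used here is only a safe choice for this toy data.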

