Reviews: Barzilai-Borwein Step Size for Stochastic Gradient Descent - NIPS
This paper considers the question of adapting the step-size in stochastic gradient descent (SGD) and some of its variants. It proposes to use the Barzilai-Borwein (BB) method to automatically compute step-sizes in SGD and stochastic variance reduced gradient (SVRG), instead of relying on predefined fixed (decreasing) schemes. For SGD, a smoothing technique is additionally used. The paper addresses an important question for SGD-type algorithms. The BB method is first implemented within SVRG. The simulations are convincing in that the optimal step-size is learned after an adaptation phase.
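To make the construction concrete, below is a minimal sketch of how a BB step size can be plugged into the SVRG outer loop, as described in these reviews: the step is recomputed once per outer iteration from the differences of successive snapshots and full gradients, scaled by the inner-loop length m, and "Option I" keeps the last inner iterate as the next snapshot. The function names, the default inner-loop length, and the positivity guard on the denominator are illustrative assumptions rather than the paper's exact algorithm.

```python
import numpy as np

def svrg_bb(grad_full, grad_i, x0, n, m=None, n_outer=20, eta0=0.1, seed=0):
    """Sketch of SVRG with a Barzilai-Borwein choice of the inner-loop step size.

    grad_full(x): full gradient (1/n) * sum_i grad f_i(x).
    grad_i(x, i): gradient of the i-th component function f_i.
    """
    rng = np.random.default_rng(seed)
    m = m if m is not None else 2 * n        # inner-loop length (a common default)
    x_tilde = np.asarray(x0, dtype=float)    # current snapshot
    g_tilde = grad_full(x_tilde)             # full gradient at the snapshot
    x_prev = g_prev = None
    eta = eta0                               # fixed step size for the first outer pass
    for _ in range(n_outer):
        if x_prev is not None:
            s = x_tilde - x_prev             # difference of successive snapshots
            y = g_tilde - g_prev             # difference of successive full gradients
            denom = m * (s @ y)
            if denom > 0:                    # guard: keep the previous eta otherwise
                eta = (s @ s) / denom        # scaled BB ("long") step size
        x = x_tilde.copy()
        for _ in range(m):
            i = rng.integers(n)
            v = grad_i(x, i) - grad_i(x_tilde, i) + g_tilde   # variance-reduced gradient
            x = x - eta * v
        x_prev, g_prev = x_tilde, g_tilde
        x_tilde, g_tilde = x, grad_full(x)   # "Option I": keep the last inner iterate
    return x_tilde
```

The "adaptation phase" mentioned in the review corresponds to the first outer iterations, during which the BB estimate has not yet settled.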
I am, however, wondering why Figure 1 shows such a strong overshoot towards much too small step-sizes in the first iterations. It looks suboptimal. For the implementation within SGD, the author(s) need to introduce a smoothing technique. My concern is that within the smoothing formula they reintroduce a deterministic, non-adaptive decrease. They explicitly reintroduce a decrease in 1/(k+1). Hence the proposed adaptation scheme seems to present the same drawbacks as a predefined scheme.
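To state the concern schematically: whatever BB-derived quantity $c_k$ the smoothing produces, a step size of the form

$$\tilde{\eta}_k = \frac{c_k}{k+1} \qquad\text{vs.}\qquad \eta_k = \frac{c}{k+1}\ \ \text{(predefined)}$$

still decays deterministically at the same $1/(k+1)$ rate as a classical predefined schedule; only the multiplicative constant is adapted. (The exact smoothing formula in the paper may differ; this display is only an illustration of the reviewer's point.)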
Other comments: In Lemma 1, the expectation should be a conditional expectation. 2-Confident (read it all; understood it all reasonably well)

The authors give a new analysis of SVRG. It allows using "Option I" (taking the final iterate of the inner iteration), as is done in practice. They also propose to use a scaled version of Barzilai-Borwein to set the step-size for SVRG (and heuristically argue that this could also be useful for classic stochastic gradient methods). Their experiments show that this adaptive step-size is competitive with fixed step-sizes.
Note that I increased my score in light of the experiments discussed in the author response. I previously reviewed this paper for ICML. Below I've included some quotes from my ICML review that are still relevant. But first, I'll comment on some of the changes and lack of changes after the previous round of reviewing: 1. The authors have removed most of the misleading statements and included extra references and discussion, which I think makes the paper much better.
2. One reviewer brought up how the quadratic dependence on some of the problem constants is inferior to existing results. I'm OK with this, as having an automatic step-size is a big advantage, but the paper should point out explicitly that the bound is looser. (This reviewer also pointed out that achieving a "bad" rate under Option I is easy to establish, although in this case I agree with the authors that this contribution is novel.) 3. The paper is *still* missing an empirical comparison with the existing heuristics that people use to tune the step-size. The SAG line-search is now discussed in the introduction (and note that this often works better than the "best tuned step size", since it can increase the step-size as it approaches the solution), but...
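For context, the sketch below shows the kind of heuristic being referred to: an instantaneous sufficient-decrease test on a randomly sampled component is used to adjust an estimate of the Lipschitz constant L, and L is gently decreased between iterations so that the resulting step size 1/L can grow again as the iterates approach the solution. The function names, the doubling factor, and the 2^(-1/n) decrease are illustrative assumptions loosely in the spirit of the SAG line-search, not the exact procedure of Le Roux et al.

```python
import numpy as np

def update_lipschitz_estimate(f_i, grad_i, x, i, L, n):
    """One adjustment of a Lipschitz-constant estimate L from a sampled component.

    f_i(x, i), grad_i(x, i): value and gradient of the i-th component function.
    Returns an updated L; the stochastic-gradient step size would then be 1/L.
    """
    g = grad_i(x, i)
    sq_norm = g @ g
    # Double L until the sampled component passes a sufficient-decrease test.
    while f_i(x - (1.0 / L) * g, i) > f_i(x, i) - sq_norm / (2.0 * L) and L < 1e10:
        L *= 2.0
    # Gently shrink L so that 1/L can increase when the test keeps passing.
    return L * 2.0 ** (-1.0 / n)
```

Comparing against something of this form (and against the line-search of Mairal mentioned below) is exactly the experiment the review is asking for.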
To me this is a strange omission: if the proposed method actually works better than these methods, then including these comparisons only makes the paper stronger. Even if the proposed method works similarly to existing methods, it still makes the paper stronger because it shows that we now have a theoretically-justified way to get the same performance. Not including these comparisons is not only incomplete scholarship, but it makes the reader think there is something to hide. (I'm not saying there is something to hide, I'm just saying there are only good reasons to include these experiments and only bad reasons not to!) 4. One reviewer pointed out a severe restriction on the condition number in the previous submission, which has been fixed.

--- Comments from old review ---

Summary: The authors give a new analysis of SVRG.
It allows using "Option I" (taking the final iterate of the inner iteration), as is done in practice. They also propose to use a scaled version of Barzilai-Borwein to set the step-sizse for SVRG (and heuristically argue that this could also be useful for classic stochastic gradient methods too). Their experiments show that this adaptive step-size is competitive with fixed step-sizes. Clarity: The paper is very clearly-written and easy to understand (though many grammar issues remain). Significance: Although several heuristic adaptive step-size strategies exist in the literature, this is the first theoretically-justified method. It sill depends on constants that we don't know in general, but I believe is a step towards black-box SG methods.
Details: Independent of the SVRG/SG results, the authors give a nice way to bound the step-size for the BB method. Normally, BB leads to a much faster rate than using a constant step-size, but in the SVRG setting your theory/experiments are just showing that it does as well as the best step-size (which is... Finally, the paper would be much stronger if it compared to the two existing strategies that are used in practice: 1. The line-search of Le Roux et al., where they increase/decrease an estimate of L.
2. The line-search of Mairal, where he empirically tries to find the best step-size. However, I don't think that the proposed approach would actually work better than both of these methods (but these older approaches don't have any theory).

The Barzilai–Borwein method[1] is an iterative gradient descent method for unconstrained optimization that uses either of two step sizes derived from the linear trend of the two most recent iterates. This method, and its modifications, are globally convergent under mild conditions[2][3] and perform competitively with conjugate gradient methods for many problems.[4] Not depending on the objective itself, it can also solve some systems of linear and non-linear equations. To minimize a convex function $f:\mathbb{R}^n \rightarrow \mathbb{R}$ with gradient vector $g$ at point $x$, let there be two prior iterates $x_{k-1}$ and $x_k$ with gradients $g_{k-1}$ and $g_k$, and write $s_{k-1} = x_k - x_{k-1}$ and $y_{k-1} = g_k - g_{k-1}$. A Barzilai–Borwein (BB) iteration is $x_{k+1} = x_k - \alpha_k g_k$, where the step size $\alpha_k$ is either the "long" step $\alpha_k = \frac{s_{k-1}^{\top} s_{k-1}}{s_{k-1}^{\top} y_{k-1}}$ or the "short" step $\alpha_k = \frac{s_{k-1}^{\top} y_{k-1}}{y_{k-1}^{\top} y_{k-1}}$.
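As a concrete illustration of the two BB step sizes above, here is a minimal deterministic sketch. The function name, the quadratic test problem, and the tolerance used to stop when the BB denominator vanishes are illustrative choices, not taken from any of the papers discussed.

```python
import numpy as np

def bb_gradient_descent(grad, x0, n_iter=50, alpha0=1e-3, long_step=True):
    """Plain Barzilai-Borwein gradient descent (deterministic sketch).

    grad: callable returning the gradient of the objective at a point.
    long_step: use the "long" step s's / s'y; otherwise the "short" step s'y / y'y.
    """
    x_prev = np.asarray(x0, dtype=float)
    g_prev = grad(x_prev)
    x = x_prev - alpha0 * g_prev              # first step uses a fixed initial step size
    for _ in range(n_iter):
        g = grad(x)
        s = x - x_prev                        # iterate difference s_{k-1}
        y = g - g_prev                        # gradient difference y_{k-1}
        denom = s @ y
        if abs(denom) < 1e-12:                # converged, or BB step undefined: stop
            break
        alpha = (s @ s) / denom if long_step else denom / (y @ y)
        x_prev, g_prev = x, g
        x = x - alpha * g                     # BB iteration x_{k+1} = x_k - alpha_k g_k
    return x

# Example on a convex quadratic f(x) = 0.5 x'Ax - b'x, whose gradient is Ax - b.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, 2.0])
x_min = bb_gradient_descent(lambda x: A @ x - b, x0=np.zeros(2))
```

On strongly convex problems the denominator $s^{\top} y$ stays positive, which is part of what keeps the SVRG variant sketched earlier well behaved; for general non-convex objectives a safeguard on the sign and size of $\alpha_k$ is typically needed.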
Barzilai–Borwein also applies to systems of equations $g(x) = 0$ for $g:\mathbb{R}^n \rightarrow \mathbb{R}^n$ in which the Jacobian...

Despite its simplicity and optimality properties, Cauchy's classical steepest-descent method[5] for unconstrained optimization often performs poorly.[6] This has motivated many to propose alternate search directions, such as the conjugate gradient method. Jonathan Barzilai and Jonathan Borwein instead proposed new step sizes for the gradient by approximating the quasi-Newton method, creating a scalar approximation of the Hessian estimated from the finite differences between two evaluation points...

The use of stochastic gradient algorithms for nonlinear optimization is of considerable interest, especially in the case of high dimensions. In this case, the choice of the step size is of key importance for the convergence rate. In this paper, we propose two new stochastic gradient algorithms that use an improved Barzilai–Borwein step size formula.
Convergence analysis shows that these algorithms enable linear convergence in probability for strongly convex objective functions. Our computational experiments confirm that the proposed algorithms have better characteristics than two-point gradient algorithms and well-known stochastic gradient methods.