Barzilai-Borwein Step Size for Stochastic Gradient Descent
This paper considers the question of adapting the step-size in Stochastic Gradient Descent (SGD) and some of its variants.
It proposes to use the Barzilai-Borwein (BB) method to automatically compute step-sizes in SGD and stochastic variance reduced gradient (SVRG), instead of relying on predefined fixed (decreasing) schemes. For SGD, a smoothing technique is additionally used. The paper addresses an important question for SGD-type algorithms. The BB method is first implemented within SVRG. The simulations are convincing in that the optimal step-size is learned after an adaptation phase. I am however wondering why in Figure 1 there is this strong overshoot towards much too small step-sizes in the first iterations.
It looks suboptimal. For the implementation within SGD, the author(s) need to introduce a smoothing technique. My concern is that within the smoothing formula they reintroduce a deterministic, non-adaptive decrease: they explicitly reintroduce a 1/(k+1) decay. Hence the proposed adaptation scheme seems to present the same drawbacks as a predefined scheme. Other comments: In Lemma 1, the expectation should be a conditional expectation.
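To make the objects discussed in this review concrete, here is a minimal sketch (Python with NumPy) of a BB-type step size computed from two successive iterates and their gradients, together with a schematic 1/(k+1) damping of the kind the review objects to. The function names and the way the damping is combined with the BB estimate are illustrative assumptions, not the paper's exact SGD-BB smoothing formula.

```python
import numpy as np

def bb_step_size(x_curr, x_prev, g_curr, g_prev, eps=1e-12):
    """Classical Barzilai-Borwein (BB1) ratio from two successive iterates
    and their (possibly stochastic) gradients."""
    s = x_curr - x_prev   # change in iterates
    y = g_curr - g_prev   # change in gradients
    return float(np.dot(s, s)) / (abs(float(np.dot(s, y))) + eps)

def damped_step_size(bb_estimate, k):
    """Schematic of the reviewer's concern: folding a deterministic 1/(k+1)
    factor into the adaptive BB estimate reintroduces a predefined decay.
    Illustrative only; not the paper's exact smoothing formula."""
    return bb_estimate / (k + 1)
```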
2-Confident (read it all; understood it all reasonably well) The authors give a new analysis of SVRG. It allows using "Option I" (taking the final iterate of the inner iteration), as is done in practice. They also propose to use a scaled version of Barzilai-Borwein to set the step-size for SVRG (and heuristically argue that this could be useful for classic stochastic gradient methods too). Their experiments show that this adaptive step-size is competitive with fixed step-sizes. Note that I increased my score in light of the experiments discussed in the author response.
I previously reviewed this paper for ICML. Below I've included some quotes from my ICML review that are still relevant. But first, I'll comment on some of the changes and lack of changes after the previous round of reviewing: 1. The authors have removed most of the misleading statements and included extra references and discussion, which I think makes the paper much better. 2. One reviewer brought up how the quadratic dependence on some of the problem constants is inferior to existing results.
I'm ok with this as having an automatic step-size is a big advantage, but the paper should point out explicitly that the bound is looser. (This reviewer also pointed out that achieving a "bad" rate under Option I is easy to establish, although in this case I agree with the authors that this contribution is novel). 3. The paper is *still* missing an empirical comparison with the existing heuristics that people use to tune the step-size. The SAG line-search is now discussed in the introduction (and note that this often works better than the "best tuned step size" since it can increase the step-size as it approaches the solution) but... To me this is a strange omission: if the proposed method actually works better than these methods then including these comparisons only makes the paper stronger.
Even if the proposed method works similarly to existing methods, it still makes the paper stronger because it shows that we now have a theoretically-justified way to get the same performance. Not including these comparisons is not only incomplete scholarship, but it makes the reader think there is something to hide. (I'm not saying there is something to hide, I'm just saying there are only good reasons to include these experiments and only bad reasons not to!) 4. One reviewer pointed out a severe restriction on the condition number in the previous submission, which has been fixed. --- Comments from old review --- Summary: The authors give a new analysis of SVRG. It allows using "Option I" (taking the final iterate of the inner iteration), as is done in practice.
They also propose to use a scaled version of Barzilai-Borwein to set the step-size for SVRG (and heuristically argue that this could be useful for classic stochastic gradient methods too). Their experiments show that this adaptive step-size is competitive with fixed step-sizes. Clarity: The paper is very clearly written and easy to understand (though many grammar issues remain). Significance: Although several heuristic adaptive step-size strategies exist in the literature, this is the first theoretically-justified method. It still depends on constants that we don't know in general, but I believe it is a step towards black-box SG methods. Details: Independent of the SVRG/SG results, the authors give a nice way to bound the step-size for the BB method.
Normally, BB leads to a much faster rate than using a constant step-size, but in the SVRG setting your theory/experiments are just showing that it does as well as the best step-size (which is... Finally, the paper would be much stronger if it compared to the two existing strategies that are used in practice: 1. The line-search of Le Roux et al. where they increase/decrease an estimate of L. 2. The line-search of Mairal where he empirically tries to find the best step-size.
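As an aside for readers unfamiliar with the first of these heuristics, the following is a rough sketch (my own reading, not the published implementation) of an increase/decrease Lipschitz-estimate line-search in the spirit of Le Roux et al.: double the estimate L while a sufficient-decrease test on the sampled function fails, and otherwise let it decay slowly so that the step size 1/L can grow again near the solution.

```python
import numpy as np

def update_lipschitz_estimate(f_i, grad_i, x, L, n_decay=100, L_max=1e12):
    """Sketch of a SAG-style line-search: f_i and grad_i evaluate the sampled
    function and its gradient. The decay rate 2**(-1/n_decay) and the cap
    L_max are illustrative choices, not the published constants."""
    g = grad_i(x)
    gnorm2 = float(np.dot(g, g))
    # Double L until the sampled sufficient-decrease condition holds.
    while f_i(x - g / L) > f_i(x) - gnorm2 / (2.0 * L) and L < L_max:
        L *= 2.0
    # A slow decrease lets the step size 1/L increase when possible.
    return L * 2.0 ** (-1.0 / n_decay)
```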
However, I don't think that the proposed approach would actually work better than both of these methods (but these older approaches don't have any theory).

The use of stochastic gradient algorithms for nonlinear optimization is of considerable interest, especially in the case of high dimensions. In this case, the choice of the step size is of key importance for the convergence rate. In this paper, we propose two new stochastic gradient algorithms that use an improved Barzilai-Borwein step size formula. Convergence analysis shows that these algorithms enable linear convergence in probability for strongly convex objective functions. Our computational experiments confirm that the proposed algorithms have better characteristics than two-point gradient algorithms and well-known stochastic gradient methods.
One of the major issues in stochastic gradient descent (SGD) methods is how to choose an appropriate step size while running the algorithm. Since the traditional line search technique does not apply to stochastic optimization algorithms, the common practice in SGD is either to use a diminishing step size, or to tune a fixed step size by... In this paper, we propose to use the Barzilai-Borwein (BB) method to automatically compute step sizes for SGD and its variant, the stochastic variance reduced gradient (SVRG) method, which leads to two algorithms: SGD-BB and... We prove that SVRG-BB converges linearly for strongly convex objective functions. As a by-product, we prove the linear convergence result of SVRG with Option I proposed in [10], whose convergence result is missing in the literature. Numerical experiments on standard data sets show that the performance of SGD-BB and SVRG-BB is comparable to and sometimes even better than SGD and...
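Based only on the description in this abstract, the following is a minimal sketch of how a BB step size might be recomputed once per SVRG outer epoch; the 1/m scaling, the absolute value in the denominator, and the variable names are my assumptions, not a verbatim transcription of the authors' SVRG-BB.

```python
import numpy as np

def svrg_bb_step_size(x_snap, x_snap_prev, g_full, g_full_prev, m, eps=1e-12):
    """BB ratio built from consecutive snapshots and their full gradients,
    scaled by the inner-loop length m (an assumption in this sketch)."""
    s = x_snap - x_snap_prev
    y = g_full - g_full_prev
    return float(np.dot(s, s)) / (m * abs(float(np.dot(s, y))) + eps)

def svrg_inner_loop(x_snap, g_full, eta, grad_i, indices):
    """Standard SVRG inner loop run at step size eta; returning the final
    inner iterate corresponds to the "Option I" mentioned above."""
    x = x_snap.copy()
    for i in indices:
        v = grad_i(x, i) - grad_i(x_snap, i) + g_full  # variance-reduced gradient
        x = x - eta * v
    return x
```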
We design step-size schemes that make stochastic gradient descent (SGD) adaptive to (i) the noise σ in the stochastic gradients and (ii) problem-dependent constants. When minimizing smooth, strongly convex functions with condition number κ, we first prove that T iterations of SGD with Nesterov acceleration and exponentially decreasing step-sizes can achieve a near-optimal Õ(exp(−T/√κ)... Under a relaxed assumption on the noise, with the same step-size scheme and knowledge of the smoothness, we prove that SGD can achieve an Õ(exp(−T/κ) + σ²/T) rate. In order to be adaptive to the smoothness, we use a stochastic line-search (SLS) and show (via upper and lower-bounds) that SGD converges at the desired rate, but only to a neighbourhood of the... Next, we use SGD with an offline estimate of the smoothness, and prove convergence to the minimizer. However, its convergence is slowed down proportional to the estimation error and we prove a lower-boun...
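To make the stochastic line-search (SLS) ingredient above concrete, here is a minimal sketch of a backtracking Armijo test evaluated on the current mini-batch; the constants c and beta, the reset rule, and the function names are illustrative assumptions rather than the exact scheme analysed in that work.

```python
import numpy as np

def stochastic_armijo_step(f_batch, grad_batch, x, eta_max=1.0, c=0.1,
                           beta=0.5, max_backtracks=30):
    """Backtrack eta on the sampled mini-batch until the Armijo condition
    f_b(x - eta*g) <= f_b(x) - c*eta*||g||^2 holds, then take the step."""
    g = grad_batch(x)
    gnorm2 = float(np.dot(g, g))
    fx = f_batch(x)
    eta = eta_max
    for _ in range(max_backtracks):
        if f_batch(x - eta * g) <= fx - c * eta * gnorm2:
            break
        eta *= beta  # shrink the trial step and test again
    return x - eta * g, eta
```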
Aiming at a direct and simple improvement of vanilla SGD, this paper presents a fine-tuning of its step-sizes in the mini-batch case. To do so, curvature is estimated from a local quadratic model using only noisy gradient approximations. This yields a new stochastic first-order method (Step-Tuned SGD), which can be seen as a stochastic version of the classical Barzilai-Borwein method. Our theoretical results ensure almost sure convergence to the critical set, and we provide convergence rates. Experiments on deep residual network training illustrate the favorable properties of our approach. For such networks we observe, during training, both a sudden drop of the loss and an improvement of test accuracy at intermediate stages, yielding better results than SGD, RMSprop, or Adam.
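One plausible reading of the curvature estimation described above (not the exact Step-Tuned SGD rule) is to difference the gradient of the same mini-batch at two consecutive points, so that the estimate reflects curvature rather than sampling noise; a hedged sketch:

```python
import numpy as np

def minibatch_curvature(grad_batch, x_curr, x_prev, eps=1e-12):
    """Curvature along the last step from a local quadratic model, using the
    same mini-batch gradient at both points. Its reciprocal can serve as a
    BB-like trial step size. Illustrative interpretation only."""
    s = x_curr - x_prev
    y = grad_batch(x_curr) - grad_batch(x_prev)  # same batch at both points
    curvature = float(np.dot(s, y)) / (float(np.dot(s, s)) + eps)
    return max(curvature, eps)  # guard against non-positive estimates

# Hypothetical usage: eta_trial = 1.0 / minibatch_curvature(grad_batch, x1, x0)
```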
Scientific Reports, volume 15, article number 40389 (2025). In modern machine learning, optimization algorithms are crucial: they steer the training process by skillfully navigating complex, high-dimensional loss landscapes. Among these, stochastic gradient descent with momentum (SGDM) is widely adopted for its ability to accelerate convergence in shallow regions. However, SGDM struggles in challenging optimization landscapes, where narrow, curved valleys can lead to oscillations and slow progress. This paper introduces dual enhanced SGD (DESGD), which addresses these limitations by dynamically adapting both momentum and step size within the same update rules as SGDM. On two optimization test functions, the Rosenbrock and Sum Square functions, the proposed optimizer typically performs better than SGDM and Adam.
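For reference, here is a minimal sketch of the fixed-hyperparameter SGDM baseline that DESGD modifies, applied to the Rosenbrock test function mentioned above; the learning rate, momentum value, and iteration count are illustrative choices, and nothing here reproduces DESGD's own adaptation rules.

```python
import numpy as np

def rosenbrock_grad(z, a=1.0, b=100.0):
    """Gradient of the 2-D Rosenbrock function f(x, y) = (a - x)^2 + b*(y - x^2)^2."""
    x, y = z
    return np.array([-2.0 * (a - x) - 4.0 * b * x * (y - x * x),
                     2.0 * b * (y - x * x)])

def sgdm(grad, z0, lr=1e-4, momentum=0.9, steps=20000):
    """Classical SGD-with-momentum update with fixed lr and momentum."""
    z = np.array(z0, dtype=float)
    v = np.zeros_like(z)
    for _ in range(steps):
        v = momentum * v - lr * grad(z)
        z = z + v
    return z

# With these settings the iterate drifts toward the minimizer (1, 1); how fast
# depends strongly on lr and momentum, the tuning burden adaptive methods target.
z_final = sgdm(rosenbrock_grad, [-1.5, 1.5])
```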