Convergence of the RMSProp Deep Learning Method with Penalty for ...
RMSProp (Root Mean Square Propagation) is an adaptive learning rate optimization algorithm designed to improve the performance and speed of training deep learning models.
RMSProp was developed to address the limitations of earlier optimization methods such as SGD (Stochastic Gradient Descent) and AdaGrad: SGD uses a constant learning rate, which can be inefficient, while AdaGrad accumulates all past squared gradients and therefore shrinks the learning rate until progress stalls. RMSProp strikes a balance by adapting the learning rates based on a moving average of squared gradients. This keeps convergence efficient while preserving stability during training, which is why RMSProp is widely used in modern deep learning. Concretely, RMSProp maintains a moving average of the squared gradients and uses it to normalize the gradient updates. This prevents the learning rate from becoming too small, which was a drawback of AdaGrad, and ensures that the updates are appropriately scaled for each parameter. It also allows RMSProp to perform well in the presence of non-stationary objectives, making it well suited to training deep learning models.
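The mathematical formulation is as follows, written here in the commonly used form (a sketch with assumed notation: \(g_t\) is the stochastic gradient at step \(t\), \(\beta\) the decay rate, \(\eta\) the step size, and \(\epsilon\) a small stabilizing constant):

\[
v_t = \beta\, v_{t-1} + (1-\beta)\, g_t^{2}, \qquad
x_{t+1} = x_t - \frac{\eta}{\sqrt{v_t} + \epsilon}\, g_t,
\]

where the square, square root, and division are applied element-wise, so each coordinate receives its own effective learning rate.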
RMSProp is one of the most popular stochastic optimization algorithms in deep learning applications. However, recent work has pointed out that the method may fail to converge to the optimal solution even in simple convex settings. To address this, we propose a time-varying version of RMSProp to fix the non-convergence issue: the hyperparameter \(\beta_t\) is treated as a time-varying sequence rather than a fine-tuned constant. We also give a rigorous proof that this time-varying RMSProp converges to critical points even for smooth, non-convex objectives, with a convergence rate of order \(\mathcal{O}(\log T/\sqrt{T})\).
This provides a new understanding of RMSProp divergence, a common issue in practical applications. Finally, numerical experiments show that time-varying RMSProp exhibits advantages over standard RMSProp on benchmark datasets, supporting the theoretical results. The datasets analysed during the study are available in the following public domain resources: http://yann.lecun.com/exdb/mnist/; http://www.cs.toronto.edu/~kriz/cifar.html; https://github.com/kuangliu/pytorch-cifar
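To make the time-varying idea concrete, below is a minimal sketch of RMSProp in which the second-moment decay is a sequence \(\beta_t\) rather than a constant. The schedule used here, \(\beta_t = 1 - 1/(t+1)\), is an illustrative assumption, not necessarily the schedule analysed in the paper; the point is only that the decay changes with the iteration counter.

```python
import numpy as np

def beta_schedule(t):
    """Illustrative time-varying decay; an assumption, not the paper's schedule."""
    return 1.0 - 1.0 / (t + 1)

def time_varying_rmsprop(grad_fn, x0, lr=1e-3, eps=1e-8, num_steps=1000):
    """Run RMSProp where the squared-gradient decay beta_t changes over time.

    grad_fn(x) should return a (possibly stochastic) gradient of the objective at x.
    """
    x = np.asarray(x0, dtype=float).copy()
    v = np.zeros_like(x)               # running average of squared gradients
    for t in range(1, num_steps + 1):
        g = grad_fn(x)
        beta_t = beta_schedule(t)      # time-varying instead of a fixed constant
        v = beta_t * v + (1.0 - beta_t) * g ** 2
        x -= lr * g / (np.sqrt(v) + eps)   # element-wise adaptive step
    return x

# Usage example on a simple quadratic, f(x) = 0.5 * ||x||^2, so grad_fn(x) = x.
x_star = time_varying_rmsprop(lambda x: x, x0=np.ones(5), lr=0.1, num_steps=500)
print(x_star)  # should be close to the origin
```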
Huan Li, Yiming Dong, Zhouchen Lin; JMLR 26(131):1−25, 2025. Although adaptive gradient methods have been extensively used in deep learning, the convergence rates proved for them in the literature are all slower than that of SGD, particularly with respect to their dependence on the dimension. This paper considers the classical RMSProp and its momentum extension and establishes a convergence rate of $\frac{1}{T}\sum_{k=1}^T E\left[\|\nabla f(\mathbf{x}^k)\|_1\right]\leq O\!\left(\frac{\sqrt{d}C}{T^{1/4}}\right)$ measured in the $\ell_1$ norm, without the bounded-gradient assumption, where $d$ is the dimension of the problem. The rate matches the lower bound with respect to all the coefficients except the dimension $d$. Since $\|\mathbf{x}\|_2\ll \|\mathbf{x}\|_1\leq\sqrt{d}\,\|\mathbf{x}\|_2$ for problems with extremely large $d$, this convergence rate can be considered analogous to the $\frac{1}{T}\sum_{k=1}^T E\left[\|\nabla f(\mathbf{x}^k)\|_2\right]\leq O\!\left(\frac{C}{T^{1/4}}\right)$ rate of SGD in the ideal case of $\|\nabla f(\mathbf{x})\|_1=\varTheta(\sqrt{d})\,\|\nabla f(\mathbf{x})\|_2$.
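As an illustrative check of the norm relationship above (not taken from the paper), the sketch below compares the $\ell_1$ and $\ell_2$ norms of a dense random vector, where the ratio grows like $\varTheta(\sqrt{d})$, against a sparse vector, where it does not. The dense case is exactly the "ideal case" in which the $\ell_1$-norm bound becomes comparable to the $\ell_2$-norm rate of SGD.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 10_000

# Dense Gaussian vector: ||x||_1 / ||x||_2 concentrates around sqrt(2*d/pi), i.e. Theta(sqrt(d)).
dense = rng.standard_normal(d)
ratio_dense = np.linalg.norm(dense, 1) / np.linalg.norm(dense, 2)

# Sparse vector (only 10 nonzeros): the ratio is at most sqrt(10), independent of d.
sparse = np.zeros(d)
sparse[:10] = rng.standard_normal(10)
ratio_sparse = np.linalg.norm(sparse, 1) / np.linalg.norm(sparse, 2)

print(f"sqrt(d)              = {np.sqrt(d):.1f}")
print(f"dense  ||x||1/||x||2 = {ratio_dense:.1f}")   # close to sqrt(2*d/pi) ~ 79.8
print(f"sparse ||x||1/||x||2 = {ratio_sparse:.1f}")  # small, does not scale with d
```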