Learning an Adaptive Learning Rate Schedule
PyTorch implementation of the "Learning an Adaptive Learning Rate Schedule" paper found here: https://arxiv.org/abs/1909.09712.
Work in progress! A controller is optimized with PPO to generate adaptive learning rate schedules. Both the actor and the critic are MLPs with 2 hidden layers of size 32. Three distinct child network architectures are used: 1) an MLP with 3 hidden layers, 2) LeNet-5 and 3) ResNet-18. Learning rate schedules are evaluated on three different datasets: 1) MNIST, 2) Fashion-MNIST and 3) CIFAR-10. The original paper experiments only with combinations of Fashion-MNIST, CIFAR-10, LeNet-5 and ResNet-18.
In each of the three settings, child networks are optimized using Adam with an initial learning rate in {1e-2, 1e-3, 1e-4} and are trained for 1000 steps on the full training set (40-50k samples)... 20-25 epochs. Learning rate schedules are evaluated based on validation loss over the course of training. Test loss and test accuracy are in the pipeline. Experiments are run in both a discrete and a continuous setting. In the discrete setting, the controller adjusts the learning rate by proposing one of the following actions every 10 steps: 1) increase the learning rate, 2) decrease the learning rate, or 3) do nothing.
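The discrete action space above can be sketched as follows. The multiplicative factor used for the increase/decrease actions is an assumption for illustration; the text does not state the exact step size:

```python
# Sketch of the discrete setting: every 10 child-network steps the controller
# picks one of three actions. The factor 1.05 is an assumed value.
def apply_discrete_action(lr, action, factor=1.05):
    """Map a controller action to a learning rate update.

    action: 0 = increase, 1 = decrease, 2 = do nothing.
    """
    if action == 0:
        return lr * factor
    elif action == 1:
        return lr / factor
    return lr

# The controller acts once every 10 child-network training steps:
lr = 1e-3
for step in range(30):
    if step % 10 == 0:
        action = 1  # in practice, sampled from the controller's policy
        lr = apply_discrete_action(lr, action)
```

Keeping the action multiplicative (rather than additive) means the controller operates on the learning rate's order of magnitude, which is the quantity that usually matters for optimization.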
In the continuous setting, the controller instead proposes a real-valued scaling factor, which allows the controller to modify learning rates with finer granularity. Maximum change per LR update has been set to 5% for simplicity (action space is not stated in the paper). In both the discrete and the continuous setting, Gaussian noise is optionally applied to learning rate updates. Observations for the controller contain information about current training loss, validation loss, variance of predictions, variance of prediction changes, mean and variance of the weights of the output layer as well as the previous... To make credit assignment easier, the validation loss at each step is used as reward signal rather than the final validation loss. Both observations and rewards are normalized by a running mean.
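A minimal sketch of the continuous setting and the running-mean normalization described above. The clipping form and the way noise is injected are assumptions made for illustration, not taken from the paper or repository:

```python
import random

def apply_continuous_action(lr, scale, max_change=0.05, noise_std=0.0):
    """Apply a real-valued scaling factor to the learning rate.

    The factor is clipped so a single update changes the LR by at most
    max_change (5% here, matching the simplification in the text); optional
    Gaussian noise can be added to the clipped factor.
    """
    scale = max(1.0 - max_change, min(1.0 + max_change, scale))
    if noise_std > 0:
        scale += random.gauss(0.0, noise_std)
    return lr * scale

class RunningMeanNormalizer:
    """Normalize a stream of observations or rewards by a running mean."""
    def __init__(self):
        self.mean, self.count = 0.0, 0

    def update(self, x):
        self.count += 1
        self.mean += (x - self.mean) / self.count  # incremental mean
        return x / self.mean if self.mean != 0 else x
```

Normalizing observations and rewards this way keeps the controller's inputs on a comparable scale across child networks and datasets whose raw losses differ by orders of magnitude.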
When it comes to training deep neural networks, one of the crucial factors that significantly influences model performance is the learning rate. The learning rate determines the size of the steps taken during the optimization process and plays a pivotal role in determining how quickly or slowly a model converges to the optimal solution. In recent years, adaptive learning rate scheduling techniques have gained prominence for their effectiveness in optimizing the training process and improving model performance. Before delving into adaptive learning rate scheduling, let’s first understand why the learning rate is so important in training deep neural networks. In essence, the learning rate controls the amount by which we update the parameters of the model during each iteration of the optimization algorithm, such as stochastic gradient descent (SGD) or its variants. Adaptive learning rate schedules are critical in optimizing the training process of NLP models.
They allow models to adjust the learning rate dynamically, enhancing convergence speed and ensuring stable training, especially with large datasets and deep architectures. Here’s a breakdown of the concept, types, and benefits: An adaptive learning rate schedule changes the learning rate during training, often based on performance metrics like loss. Rather than maintaining a fixed learning rate throughout the training process, adaptive schedules adjust it in response to various factors, such as the number of epochs, the model’s progress, or the validation loss. The key idea is to start with a relatively high learning rate for faster convergence and reduce it gradually to fine-tune the model at the later stages. There are several methods used to adapt the learning rate in NLP model training:
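A minimal plateau-based schedule in the spirit described above: reduce the learning rate when the validation loss stops improving. The factor and patience values here are illustrative assumptions, not from any specific library:

```python
class PlateauScheduler:
    """Halve the learning rate when validation loss plateaus."""
    def __init__(self, lr, factor=0.5, patience=3):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best, self.bad_epochs = float("inf"), 0

    def step(self, val_loss):
        if val_loss < self.best:
            self.best, self.bad_epochs = val_loss, 0  # improvement: reset counter
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                self.lr *= self.factor  # plateau detected: decay the LR
                self.bad_epochs = 0
        return self.lr
```

This captures the key idea stated above: a higher rate early for fast convergence, with reductions triggered by the model's own progress rather than a fixed timetable.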
Step Decay: The learning rate decreases by a fixed factor after a set number of steps or epochs. For example, after every 10 epochs, the learning rate might drop by a factor of 0.1. Formula: lr_t = lr_0 × γ^⌊t / step_size⌋
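The step decay formula above translates directly into code:

```python
import math

def step_decay(lr0, gamma, step_size, t):
    """Step decay: lr_t = lr0 * gamma ** floor(t / step_size)."""
    return lr0 * gamma ** math.floor(t / step_size)

# With lr0 = 0.1, gamma = 0.1 and step_size = 10 epochs:
# epochs 0-9 -> 0.1, epochs 10-19 -> 0.01, epochs 20-29 -> 0.001, ...
```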