The Best Learning Rate Schedules
Anybody who has trained a neural network knows that properly setting the learning rate during training is a pivotal aspect of getting the network to perform well. Additionally, the learning rate is typically varied along the training trajectory according to some learning rate schedule, and the choice of this schedule also has a large impact on the quality of training. Most practitioners adopt a few widely-used strategies for the learning rate schedule during training; e.g., step decay or cosine annealing. Many of these schedules are curated for a particular benchmark, where they have been determined empirically to maximize test accuracy after years of research. But these strategies often fail to generalize to other experimental settings, raising an important question: what are the most consistent and useful learning rate schedules for training neural networks?
Within this overview, we will look at recent research into various learning rate schedules that can be used to train neural networks. Such research has discovered numerous strategies for the learning rate that are both highly effective and easy to use; e.g., cyclical or triangular learning rate schedules. By studying these methods, we will arrive at several practical takeaways, providing simple tricks that can be immediately applied to improving neural network training.

A Gentle Introduction to Learning Rate Schedulers

Ever wondered why your neural network seems to get stuck during training, or why it starts strong but fails to reach its full potential? The culprit might be your learning rate – arguably one of the most important hyperparameters in machine learning.
While a fixed learning rate can work, it often leads to suboptimal results. Learning rate schedulers offer a more dynamic approach by automatically adjusting the learning rate during training. In this article, you’ll discover five popular learning rate schedulers through clear visualizations and hands-on examples. You’ll learn when to use each scheduler, see their behavior patterns, and understand how they can improve your model’s performance. We’ll start with the basics, explore sklearn’s approach versus deep learning requirements, then move to practical implementation using the MNIST dataset. By the end, you’ll have both the theoretical understanding and practical code to start using learning rate schedulers in your own projects.
Imagine you’re hiking down a mountain in thick fog, trying to reach the valley. The learning rate is like your step size – take steps too large, and you might overshoot the valley or bounce between mountainsides. Take steps too small, and you’ll move painfully slowly, possibly getting stuck on a ledge before reaching the bottom.
Take your machine learning models to the next level with our comprehensive guide to learning rate scheduling, covering advanced techniques and best practices. Learning rate scheduling is a crucial aspect of training machine learning models. It involves adjusting the learning rate during the training process to optimize the model's performance. In this section, we'll explore some advanced learning rate scheduling techniques that can help improve your model's performance. Cyclic learning rate scheduling involves oscillating the learning rate between a minimum and maximum value.
This technique is based on the idea that the optimal learning rate is not a fixed value but rather a range of values. By cycling through this range, the model can explore different parts of the loss landscape and converge to a better optimum. The cyclic learning rate schedule can be implemented with a simple formula that maps the training iteration to a rate between these two bounds; one common choice is the triangular policy, sketched below.
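As an illustration, here is a minimal sketch of one common cyclic form, the triangular policy, in which the rate ramps linearly from the lower bound up to the upper bound and back over a fixed cycle length; the bounds and step size below are placeholder values, not taken from any particular experiment.

```python
import math

def triangular_lr(iteration, lr_min=1e-4, lr_max=1e-2, step_size=2000):
    """Triangular cyclic learning rate: ramp linearly from lr_min up to lr_max
    and back down over one cycle of 2 * step_size iterations."""
    cycle = math.floor(1 + iteration / (2 * step_size))
    x = abs(iteration / step_size - 2 * cycle + 1)
    return lr_min + (lr_max - lr_min) * max(0.0, 1.0 - x)

# The rate rises over the first 2000 iterations, then falls back toward lr_min.
for it in (0, 1000, 2000, 3000, 4000):
    print(it, round(triangular_lr(it), 5))
```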
Learning rate is one of the most important hyperparameters in the training of neural networks, impacting the speed and effectiveness of the learning process. A learning rate that is too high can cause the model to oscillate around the minimum, while a learning rate that is too low can cause the training process to be very slow or get stuck in a suboptimal solution. This article provides a visual introduction to learning rate schedulers, which are techniques used to adapt the learning rate during training.

In the context of machine learning, the learning rate is a hyperparameter that determines the step size at which an optimization algorithm (like gradient descent) proceeds while attempting to minimize the loss function. Now, let’s move on to learning rate schedulers. A learning rate scheduler is a method that adjusts the learning rate during the training process, often lowering it as training progresses. This helps the model make large updates at the beginning of training, when the parameters are far from their optimal values, and smaller updates later, when the parameters are closer to their optimal values. Several learning rate schedulers are widely used in practice; this article focuses on three popular ones.
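To make the idea of larger updates early and smaller updates later concrete, here is a minimal sketch of a simple step-decay rule; the initial rate, decay factor, and drop interval are illustrative placeholders.

```python
def step_decay_lr(epoch, lr0=0.1, drop=0.5, epochs_per_drop=10):
    """Halve the learning rate every `epochs_per_drop` epochs."""
    return lr0 * (drop ** (epoch // epochs_per_drop))

# Early epochs use the full rate; later epochs use progressively smaller ones.
print(step_decay_lr(0))   # 0.1
print(step_decay_lr(10))  # 0.05
print(step_decay_lr(30))  # 0.0125
```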
“Training a neural network is like steering a ship; too fast, and you might miss the mark; too slow, and you’ll drift away.” When training neural networks, one of the most critical hyperparameters is the learning rate (η).
It controls how much the model updates its parameters in response to the computed gradient during optimization. Choosing the right learning rate is crucial for achieving optimal model performance, as it directly affects convergence speed, stability, and the generalization ability of the network. The learning rate determines how quickly or slowly a neural network learns from data and plays a key role in finding the set of weights that minimizes the loss function. A well-chosen learning rate ensures fast, stable convergence and good generalization, while an inappropriate one causes problems in either direction: too high, and training oscillates or diverges; too low, and training crawls or settles into a poor solution.
The learning rate (η) is a fundamental hyperparameter in gradient-based optimization methods like Stochastic Gradient Descent (SGD) and its variants. It determines the step size taken when updating the model parameters (θ) during training. The standard gradient descent algorithm updates the model parameters using the following rule: θ ← θ − η · ∇θ L(θ), where ∇θ L(θ) is the gradient of the loss function with respect to the parameters.
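As a quick illustration of this update rule, the following sketch performs a single gradient descent step on a toy quadratic loss; the loss function and values are illustrative only.

```python
import numpy as np

# One gradient descent step on a toy quadratic loss L(theta) = ||theta||^2,
# whose gradient is 2 * theta.
eta = 0.1                       # learning rate
theta = np.array([1.0, -2.0])   # current parameters

grad = 2 * theta                # gradient of the loss at theta
theta = theta - eta * grad      # theta <- theta - eta * grad_theta L(theta)

print(theta)  # [ 0.8 -1.6]: each step moves theta toward the minimum at 0
```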
When training deep neural networks, it is often useful to reduce the learning rate as training progresses. This can be done by using pre-defined learning rate schedules or adaptive learning rate methods. In this article, I train a convolutional neural network on CIFAR-10 using different learning rate schedules and adaptive learning rate methods to compare their performance.

Learning rate schedules seek to adjust the learning rate during training by reducing it according to a pre-defined schedule. Common learning rate schedules include time-based decay, step decay, and exponential decay. For illustrative purposes, I construct a convolutional neural network trained on CIFAR-10, using the stochastic gradient descent (SGD) optimization algorithm with different learning rate schedules to compare their performance. A constant learning rate is the default schedule in the SGD optimizer in Keras; momentum and decay rate are both set to zero by default. It is tricky to choose the right learning rate.
By experimenting with a range of learning rates in our example, lr=0.1 shows relatively good performance to start with, so it can serve as a baseline for experimenting with different learning rate strategies. The mathematical form of time-based decay is lr = lr0 / (1 + k·t), where lr0 and k are hyperparameters and t is the iteration number. Looking into the source code of Keras, the SGD optimizer takes decay and lr arguments and updates the learning rate by a decreasing factor at each update step. Momentum is another argument in the SGD optimizer that we can tweak to obtain faster convergence. Unlike classical SGD, the momentum method helps the parameter vector build up velocity along directions of consistent gradient, which damps oscillations. A typical choice of momentum is between 0.5 and 0.9.
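Below is a minimal sketch of time-based decay in practice, assuming the tf.keras API and a LearningRateScheduler callback (older standalone Keras instead exposed lr and decay arguments directly on the SGD optimizer, as described above); the model, data, and constants are placeholders rather than the CIFAR-10 setup from the article.

```python
import numpy as np
import tensorflow as tf

lr0, k = 0.1, 0.01  # initial learning rate and decay constant (illustrative)

def time_based_decay(epoch, lr=None):
    """lr = lr0 / (1 + k * t), applied once per epoch here for simplicity."""
    return lr0 / (1.0 + k * epoch)

# Tiny stand-in model and data so the sketch runs end to end.
x = np.random.rand(256, 20).astype("float32")
y = np.random.randint(0, 2, size=(256,))

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),
])
model.compile(
    optimizer=tf.keras.optimizers.SGD(learning_rate=lr0, momentum=0.9),
    loss="sparse_categorical_crossentropy",
)
model.fit(
    x, y, epochs=5, verbose=0,
    callbacks=[tf.keras.callbacks.LearningRateScheduler(time_based_decay, verbose=1)],
)
```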
When it comes to training deep neural networks, one of the crucial factors that significantly influences model performance is the learning rate. The learning rate determines the size of the steps taken during the optimization process and plays a pivotal role in how quickly or slowly a model converges to the optimal solution. In recent years, adaptive learning rate scheduling techniques have gained prominence for their effectiveness in optimizing the training process and improving model performance. Before delving into adaptive learning rate scheduling, let’s first understand why the learning rate is so important in training deep neural networks. In essence, the learning rate controls the amount by which we update the parameters of the model during each iteration of the optimization algorithm, such as stochastic gradient descent (SGD) or its variants.
When training neural networks, one of the most critical hyperparameters to tune is the learning rate (LR). The learning rate determines how much the model weights are updated in response to the gradient of the loss function during backpropagation. While a high learning rate might cause the training process to overshoot the optimal parameters, a low learning rate can make the process frustratingly slow or get the model stuck in suboptimal local minima. A learning rate scheduler dynamically adjusts the learning rate during training, offering a systematic way to balance the trade-off between convergence speed and stability. Instead of manually tuning the learning rate, schedulers automate its adjustment based on a predefined strategy or the model’s performance metrics, enhancing the efficiency and performance of the training process.
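As a minimal PyTorch sketch of a predefined strategy, the example below halves the learning rate every 10 epochs with StepLR; the model, data, and schedule constants are placeholders.

```python
import torch
import torch.nn as nn

# Toy model, data, and optimizer so the sketch runs end to end.
model = nn.Linear(10, 1)
data = torch.randn(64, 10)
target = torch.randn(64, 1)
loss_fn = nn.MSELoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
# Predefined strategy: halve the learning rate every 10 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    optimizer.zero_grad()
    loss = loss_fn(model(data), target)
    loss.backward()
    optimizer.step()
    scheduler.step()  # advance the schedule once per epoch

    if epoch % 10 == 0:
        print(epoch, scheduler.get_last_lr())
```

For metric-driven adjustment, torch.optim.lr_scheduler.ReduceLROnPlateau plays the same role but is stepped with a validation metric, e.g. scheduler.step(val_loss).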