Cosine Learning Rate Decay - Minibatch AI
In this post we will introduce the key hyperparameters involved in cosine decay and take a look at how the decay part can be achieved in TensorFlow and PyTorch. In a subsequent blog we will look at how to add restarts. A cosine learning rate decay schedule drops the learning rate so that it follows the shape of a sinusoid. Typically it is used with “restarts”: once the learning rate reaches a minimum value, it is increased back to a maximum value (which might be different from the original maximum) and the decay begins again. The equation for the decay, as stated in SGDR: Stochastic Gradient Descent with Warm Restarts, is as follows, where $i$ denotes the $i$-th run of the decay:

$$\eta_t = \eta^{i}_{\min} + \frac{1}{2}\left(\eta^{i}_{\max} - \eta^{i}_{\min}\right)\left(1 + \cos\!\left(\frac{T_\text{cur}}{T_i}\pi\right)\right)$$
Here we will consider a single such run. Dropping the $i$ superscript and denoting $T_\text{cur}$ as $t$, the equation can be expanded as the sum of a constant and a term that decays over the period $T$:

$$\eta_t = \underbrace{\frac{\eta_{\max} + \eta_{\min}}{2}}_{\text{constant}} + \underbrace{\frac{\eta_{\max} - \eta_{\min}}{2}\cos\!\left(\frac{\pi t}{T}\right)}_{\text{decays over } T}$$
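As a quick illustration, the decay above can be written as a small Python function. This is a sketch of the formula, not code taken from the post; the names `cosine_decay`, `eta_max`, `eta_min`, and `period` are my own:

```python
import math

def cosine_decay(t, period, eta_max, eta_min=0.0):
    """Closed-form cosine decay for a single run: eta_max at t=0, eta_min at t=period."""
    t = min(t, period)  # hold at eta_min once the run is over
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t / period))

# Example: decay from 1e-3 down to 1e-5 over 100 steps
for step in (0, 25, 50, 75, 100):
    print(step, cosine_decay(step, period=100, eta_max=1e-3, eta_min=1e-5))
```

Equivalent built-in schedules exist in both frameworks mentioned above, e.g. `tf.keras.optimizers.schedules.CosineDecay` in TensorFlow and `torch.optim.lr_scheduler.CosineAnnealingLR` in PyTorch.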
The ever-growing availability of unlabeled data presents both opportunities and challenges for training artificial intelligence systems. While self-supervised learning (SSL) has emerged as a powerful paradigm for extracting meaningful representations from vast amounts of unlabeled data, existing methods still struggle to adapt to the non-stationary, non-IID nature of real-world data... Recent works have adopted a repeated cosine annealing schedule for large-scale continual pre-training; however, these schedules (1) inherently cause forgetting during the re-warming phase and (2) have not been systematically compared to existing continual... In this work, we systematically compare the widely used cosine schedule with the recently proposed infinite learning rate schedule and empirically find the latter to be a more effective alternative.

Our extensive empirical evaluation across diverse image and language datasets demonstrates that the infinite learning rate schedule consistently enhances continual pre-training performance compared to a repeated cosine decay without being restricted to a fixed... For instance, in a small-scale MAE pre-training setup, it outperforms several strong baselines from the literature. We then scale up our experiments to larger MAE pre-training and autoregressive language model pre-training. Our results show that the infinite learning rate schedule remains effective at scale, surpassing repeated cosine decay for both MAE pre-training and zero-shot LM benchmarks.

Self-supervised (Balestriero et al., 2023) pre-training has emerged as a transformative paradigm in machine learning (He et al., 2022; Radford, 2018; Devlin et al., 2019), catalyzing the development of foundational models in vision (Radford... These models are known for their massive parameter counts and extensive training on vast amounts of data, often developing impressive general-purpose capabilities unexpectedly during pre-training (Brown et al., 2020; Wei et al., 2022).
While foundation models have demonstrated remarkable success on static tasks, adapting them to evolving data—such as the continuous influx of new textual information (Soldaini et al., 2024; Li et al., 2024; Abadji et al.,... This is primarily due to the high costs of retraining and the risk of catastrophic forgetting (McCloskey & Cohen, 1989) induced by significant distributional shifts. While recent studies (Ke et al., 2023; Qiao & Mahdavi, 2024; Yıldız et al., 2024; Parmar et al., 2024) provide guidelines for continual pre-training in language modeling, systematic approaches that seamlessly integrate into existing... In the context of computer vision, conventional CL approaches such as regularization techniques (Kirkpatrick et al., 2017; Li & Hoiem, 2017; Aljundi et al., 2018), and architectural modifications (Douillard et al., 2022; Yan et... These challenges stem from two core limitations: (1) their inability to generalize to self-supervised learning objectives and large-scale datasets, and (2) the architectural constraints they impose, which may not align with the diverse model... Most approaches for continually pre-training foundation models typically utilize a repeated cosine annealing schedule (Loshchilov & Hutter, 2017) with fixed duration (Gupta et al., 2023; Defazio et al., 2023; Ibrahim et al., 2024; Parmar...
Firstly, this implicitly assumes a terminal point in the training process, which severely limits future pre-training on new datasets without significant forgetting. This fundamental limitation inhibits true continuous adaptation, as traditional learning rate schedules inevitably decay to near-zero values, effectively preventing further meaningful updates to the model. Secondly, re-warming the learning rate from its minimum value causes instability and exacerbates forgetting (Ibrahim et al., 2024). To overcome these constraints, recent works have explored more flexible infinite learning rate schedules that accommodate varying training durations (Zhai et al., 2022b; Defazio et al., 2024; Hu et al., 2024; Shen et al.,... While these innovations emerged primarily from data-scaling research, their applications have begun to extend into CL, as demonstrated in (Garg et al., 2024; Ibrahim et al., 2024). However, these works fail to answer a critical open question: How do these scheduling approaches behave under distribution shifts, i.e., non-IID data distributions? (Some previous works exploring infinite LR schedules (Ibrahim et al., 2024; Garg et al., 2024) considered different datasets stemming from splitting a single original dataset, leading to substantially weaker shifts than those...) This scenario is particularly relevant for practical applications where models must continuously adapt to data from diverse domains. For instance, consider the challenge of continually pre-training an English language model to incorporate German. In such scenarios, catastrophic forgetting severely impacts model performance.

Set the learning rate of each parameter group using a cosine annealing schedule. The learning rate is updated recursively using:

$$\eta_{t+1} = \eta_{\min} + \left(\eta_{t} - \eta_{\min}\right)\cdot\frac{1 + \cos\!\left(\frac{(T_\text{cur}+1)\pi}{T_\text{max}}\right)}{1 + \cos\!\left(\frac{T_\text{cur}\,\pi}{T_\text{max}}\right)}$$
This implements a recursive approximation of the closed-form schedule proposed in SGDR: Stochastic Gradient Descent with Warm Restarts:

$$\eta_{t} = \eta_{\min} + \frac{1}{2}\left(\eta_{\max} - \eta_{\min}\right)\left(1 + \cos\!\left(\frac{T_\text{cur}}{T_\text{max}}\pi\right)\right)$$

where $\eta_t$ is the learning rate at step $t$ and $T_\text{cur}$ is the number of epochs since the last restart.
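In code, attaching this scheduler to an optimizer looks roughly like the following sketch (the model, optimizer, `T_max`, and `eta_min` values are illustrative placeholders, not taken from the documentation page):

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingLR

model = nn.Linear(10, 1)  # placeholder model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-5)

for epoch in range(100):
    # ... forward/backward passes and optimizer.step() for each batch go here ...
    scheduler.step()  # anneal the learning rate once per epoch
    print(epoch, scheduler.get_last_lr()[0])
```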
So far we have primarily focused on optimization algorithms, i.e., on how to update the weight vectors, rather than on the rate at which they are updated. Nonetheless, adjusting the learning rate is often just as important as the actual algorithm. There are a number of aspects to consider. Most obviously, the magnitude of the learning rate matters: if it is too large, optimization diverges; if it is too small, training takes too long or we end up with a suboptimal result. We saw previously that the condition number of the problem matters (see e.g., Section 12.6 for details); intuitively, it is the ratio of the amount of change in the least sensitive direction vs. the most sensitive one. Secondly, the rate of decay is just as important.
If the learning rate remains large we may simply end up bouncing around the minimum and thus not reach optimality. Section 12.5 discussed this in some detail and we analyzed performance guarantees in Section 12.4. In short, we want the rate to decay, but probably more slowly than \(\mathcal{O}(t^{-\frac{1}{2}})\), which would be a good choice for convex problems. Another aspect that is equally important is initialization. This pertains both to how the parameters are set initially (review Section 5.4 for details) and also to how they evolve initially. This goes under the moniker of warmup, i.e., how rapidly we start moving towards the solution initially.
Large steps in the beginning might not be beneficial, in particular since the initial set of parameters is random. The initial update directions might be quite meaningless, too. Lastly, there are a number of optimization variants that perform cyclical learning rate adjustment. This is beyond the scope of the current chapter. We recommend that the reader review the details in Izmailov et al. (2018), e.g., how to obtain better solutions by averaging over an entire path of parameters.
Cosine decay is a type of learning rate scheduling technique used during the training of deep learning models. Learning rate scheduling involves adjusting the learning rate (LR) over the course of training to help the model converge faster and potentially achieve better performance. The cosine decay learning rate schedule is based on the cosine function. It starts with an initial learning rate and gradually decreases it following a cosine curve. The learning rate decreases from the initial value towards zero, which can help the model fine-tune its parameters more accurately as training progresses. The formula for cosine decay can be expressed as:

LR = initial_LR * 0.5 * (1 + cos(pi * epoch / epochs))
Here, LR is the learning rate at a particular epoch, initial_LR is the initial learning rate, epoch is the current epoch number, and epochs is the maximum number of epochs.

The warmup period is a technique used in learning rate scheduling to gradually increase the learning rate from an initial value to its target value over a certain number of initial epochs. This technique is commonly used in conjunction with other learning rate schedules, such as cosine decay, to improve the training stability and convergence of deep learning models. The idea behind the warmup period is to allow the model to initially explore the parameter space with a smaller learning rate before transitioning to the main learning rate schedule. This can help prevent large fluctuations in the loss function early in training and provide a smoother optimization process.
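A minimal sketch of how warmup and cosine decay might be combined, assuming a linear warmup over `warmup_epochs` epochs followed by the cosine formula above (the function and parameter names are my own, not from the article):

```python
import math

def warmup_cosine_lr(epoch, epochs, initial_lr, warmup_epochs=5):
    """Linear warmup to initial_lr, then cosine decay towards zero."""
    if epoch < warmup_epochs:
        # warmup: scale the learning rate linearly from near zero up to initial_lr
        return initial_lr * (epoch + 1) / warmup_epochs
    # cosine decay over the remaining epochs
    progress = (epoch - warmup_epochs) / max(1, epochs - warmup_epochs)
    return initial_lr * 0.5 * (1 + math.cos(math.pi * progress))

# Example: 100 epochs with a base learning rate of 3e-4
print([round(warmup_cosine_lr(e, 100, 3e-4), 6) for e in (0, 4, 5, 50, 99)])
```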
In machine learning, particularly in deep learning, optimizing model performance requires not only selecting the right architecture but also fine-tuning the learning process. One of the essential aspects of training models effectively is managing the learning rate — a parameter that determines how much a model’s weights are adjusted with respect to the loss gradient during each... Too high a learning rate can lead to unstable training, while too low a rate may result in slow convergence or getting stuck in local minima. Here’s where learning rate schedulers come in. Learning rate schedulers are tools that dynamically adjust the learning rate as training progresses, helping models converge more efficiently and often to a better solution. These schedulers work by modifying the learning rate over time based on predefined rules or performance metrics. For instance, a learning rate scheduler might decrease the rate over time to allow the model to take smaller, more refined steps as it nears optimal solutions.
Others might increase the learning rate at strategic points to help the model escape plateaus in the loss landscape. The goal is to balance stability and speed, helping models reach an optimal solution faster and more reliably. In PyTorch, learning rate schedulers are built directly into the library, making it easy for users to experiment with different scheduling strategies and tailor them to their specific needs. PyTorch offers a range of scheduling options — from basic, predefined schedules like StepLR, which decreases the learning rate by a factor at regular intervals, to more sophisticated ones like ReduceLROnPlateau, which reduces the... These schedulers are flexible, allowing us to customize parameters like learning rate decay rates, milestones, and conditions, making them a powerful tool in fine-tuning model performance. With PyTorch’s straightforward approach, integrating a learning rate scheduler into our model’s training loop becomes almost seamless, giving us the advantage of dynamically managing learning rates without needing extensive code modifications.
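As a rough usage sketch (with a placeholder model and hyperparameters of my own choosing; in practice you would normally pick one scheduler per optimizer):

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import StepLR, ReduceLROnPlateau

model = nn.Linear(20, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# StepLR: multiply the learning rate by gamma every step_size epochs
scheduler = StepLR(optimizer, step_size=10, gamma=0.5)
# ReduceLROnPlateau would instead watch a metric, e.g.:
# scheduler = ReduceLROnPlateau(optimizer, mode="min", factor=0.1, patience=3)

for epoch in range(30):
    # ... training and validation for one epoch go here ...
    scheduler.step()  # for ReduceLROnPlateau, pass the metric: scheduler.step(val_loss)
    print(epoch, optimizer.param_groups[0]["lr"])
```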
In this guide, I’ll dive deeper into one specific type of learning rate scheduler: the Cosine Annealing learning rate scheduler. Cosine annealing schedulers adjust the learning rate following a cosine curve, gradually reducing the rate over each cycle. This smooth decay pattern can help stabilize training, especially for models that may otherwise oscillate around suboptimal solutions. The cosine learning rate scheduler is particularly useful for scenarios where we want to fine-tune the model more carefully as it approaches convergence. It’s designed to lower the learning rate more gradually than step or exponential decay schedulers, and it often includes a restart mechanism, where the learning rate resets to its initial value at regular intervals... This restart helps the model escape from potential local minima by periodically taking larger steps, enabling it to search more thoroughly across the loss landscape.
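In PyTorch this restart behaviour is available as `torch.optim.lr_scheduler.CosineAnnealingWarmRestarts`; a minimal sketch (with illustrative hyperparameters, not values from this guide) might look like this:

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import CosineAnnealingWarmRestarts

model = nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# First cycle lasts T_0 epochs; each subsequent cycle is T_mult times longer.
scheduler = CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=1e-5)

for epoch in range(70):
    # ... one epoch of training ...
    scheduler.step()  # LR falls along a cosine curve, then jumps back up at each restart
    print(epoch, scheduler.get_last_lr()[0])
```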
Neural network training often struggles when the learning rate stays constant throughout all epochs: a static learning rate can cause slow convergence or unstable training dynamics. Learning rate scheduling solves this problem by adjusting the rate during training. This guide compares three popular scheduling methods: cosine decay, linear decay, and exponential decay. You'll learn implementation details, performance characteristics, and selection criteria for each approach. Learning rate scheduling dynamically adjusts the learning rate during neural network training.
The scheduler reduces learning rates as training progresses, allowing models to converge more effectively. Key benefits of learning rate scheduling include faster, more efficient convergence, more stable training dynamics, and often a better final solution.

Cosine decay follows a cosine curve pattern, starting high and gradually decreasing to zero. This smooth transition provides excellent convergence properties for deep learning models.
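To make the comparison concrete, here is a small illustrative sketch of the three schedules discussed in this guide (the function names and hyperparameters are my own, not from the guide):

```python
import math

def cosine_decay(step, total_steps, lr0):
    return lr0 * 0.5 * (1 + math.cos(math.pi * step / total_steps))

def linear_decay(step, total_steps, lr0):
    return lr0 * (1 - step / total_steps)

def exponential_decay(step, lr0, gamma=0.97):
    return lr0 * gamma ** step

# Compare the three schedules over 100 steps with a base learning rate of 0.1
for step in (0, 25, 50, 75, 100):
    print(step,
          round(cosine_decay(step, 100, 0.1), 4),
          round(linear_decay(step, 100, 0.1), 4),
          round(exponential_decay(step, 0.1), 4))
```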