Learning Rate Warmup with Cosine Decay in Keras/TensorFlow
The learning rate is an important hyperparameter in deep learning networks - it directly dictates the magnitude of the weight updates that are computed to minimize a given loss function. In SGD:

$$ weight_{t+1} = weight_t - lr \cdot \frac{d\,error}{d\,weight_t} $$

With a learning rate of 0, the updated weight is just back to itself - $weight_t$. The learning rate is effectively a knob we can turn to enable or disable learning, and it has a major influence over how much learning happens, since it directly controls the degree of the weight updates. Different optimizers utilize learning rates differently - but the underlying concept stays the same.
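As a quick numeric illustration of this update rule (the values below are made up, not taken from any real training run):

```python
# One SGD step on a single weight, with made-up numbers.
weight = 0.8      # current weight, weight_t
gradient = 0.25   # d(error) / d(weight_t)
lr = 0.1          # learning rate

weight_next = weight - lr * gradient
print(weight_next)              # 0.775 -> the weight moved against the gradient

# With lr = 0, the "update" leaves the weight exactly where it was.
print(weight - 0.0 * gradient)  # 0.8
```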
Needless to say, learning rates have been the object of many studies, papers and practitioners' benchmarks. Generally speaking, pretty much everyone agrees that a static learning rate won't cut it, and some type of learning rate reduction happens in most techniques that tune the learning rate during training.

Keras provides such a schedule out of the box: tf.keras.optimizers.schedules.CosineDecay is a LearningRateSchedule that uses a cosine decay with an optional warmup. See Loshchilov & Hutter, ICLR2016, SGDR: Stochastic Gradient Descent with Warm Restarts, for the decay, and Goyal et al. for the idea of a linear warmup of the learning rate. When we begin training a model, we often want an initial increase in our learning rate followed by a decay.
If warmup_target is set, this schedule applies a linear increase per optimizer step to our learning rate, from initial_learning_rate to warmup_target, for a duration of warmup_steps. Afterwards, it applies a cosine decay function taking our learning rate from warmup_target to alpha for a duration of decay_steps. If warmup_target is None, warmup is skipped and the decay takes our learning rate from initial_learning_rate to alpha. The schedule requires a step value to compute the learning rate; you can just pass a backend variable that you increment at each training step. The schedule is a 1-arg callable that produces a warmup followed by a decayed learning rate when passed the current optimizer step.
This can be useful for changing the learning rate value across different invocations of optimizer functions.
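As a minimal sketch of how this schedule might be wired into an optimizer (assuming a recent TensorFlow/Keras release where CosineDecay accepts the warmup_target and warmup_steps arguments; the step counts and learning-rate values below are arbitrary placeholders):

```python
import tensorflow as tf

# Hypothetical values: linear warmup from 0 to 1e-3 over the first 1,000
# optimizer steps, then cosine decay over the next 10,000 steps down to
# alpha * warmup_target (here 0, since alpha=0.0).
schedule = tf.keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=0.0,  # LR at step 0, where the warmup starts
    decay_steps=10_000,         # length of the cosine-decay phase
    alpha=0.0,                  # final LR as a fraction of the peak rate
    warmup_target=1e-3,         # peak LR reached at the end of warmup
    warmup_steps=1_000,         # length of the linear warmup phase
)

# The schedule object can be passed wherever a float learning rate is expected.
optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)

# It is also a 1-arg callable: passing an optimizer step returns the LR there.
print(float(schedule(0)), float(schedule(1_000)), float(schedule(11_000)))
```

When the schedule is attached to an optimizer used via model.compile() and model.fit(), the optimizer passes its own iteration count to the schedule, so in that common workflow there is no need to manage the step variable manually.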
When we're training neural networks, choosing the learning rate (LR) is a crucial step. This value defines how each pass on the gradient changes the weights in each layer. In this tutorial, we'll show how different strategies for defining the LR affect the accuracy of a model.
We'll consider the warm-up scenario, which only includes a few initial iterations. For a more theoretical treatment, we refer to another article of ours. Here, we'll focus on the implementation aspects and performance comparison of different approaches. To keep things simple, we use the well-known Fashion MNIST dataset. Let's start by loading the required libraries and this computer vision dataset with labels:
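A minimal sketch of that loading step, assuming the standard tf.keras.datasets.fashion_mnist API and only the most basic preprocessing (scaling pixel values to [0, 1]):

```python
import tensorflow as tf

# Load Fashion MNIST: 60,000 training and 10,000 test images of 28x28 pixels,
# each labelled with one of 10 clothing classes.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.fashion_mnist.load_data()

# Scale the uint8 pixel values from [0, 255] to [0, 1].
x_train = x_train.astype("float32") / 255.0
x_test = x_test.astype("float32") / 255.0

print(x_train.shape, y_train.shape)  # (60000, 28, 28) (60000,)
```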