Adam vs. SGD: What Are the Optimizers in Neural Networks?

Leo Migdal

Welcome back to the Advanced Neural Tuning course. In the last lesson, you learned how adjusting the learning rate during training can help your neural network learn more efficiently. Now, we will focus on another key part of the training process: the optimizer. The optimizer is the algorithm that updates the weights of your neural network based on the gradients calculated during backpropagation. Choosing the right optimizer can make a big difference in how quickly your model learns and how well it performs. Just as with learning rate scheduling, the optimizer you select can help your model achieve better results, sometimes with less effort.

In this lesson, you will learn how to set up and compare two of the most popular optimizers in PyTorch: SGD and Adam. Before we look at the code, let’s briefly discuss what makes SGD and Adam different. SGD stands for Stochastic Gradient Descent. It is one of the simplest and most widely used optimizers. With SGD, the model’s weights are updated in the direction that reduces the loss, using a fixed learning rate. While it is simple and effective, it can sometimes be slow to converge, especially if the learning rate is not set well.

Adam, which stands for Adaptive Moment Estimation, is a more advanced optimizer. It keeps track of both the average of the gradients and the average of the squared gradients for each parameter. This allows Adam to adapt the learning rate for each parameter individually, often leading to faster and more stable training. In practice, Adam is a good default choice for many deep learning problems, but it is still important to understand and compare it with simpler methods like SGD. Optimization algorithms lie at the heart of training deep neural networks. As models grew deeper and datasets became larger, researchers realized that simple optimization techniques were no longer sufficient.
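For instance, a minimal PyTorch sketch along these lines (the toy model, synthetic data, and learning rates here are illustrative assumptions, not code from the course) shows how the two optimizers are constructed and used in an otherwise identical training loop:

```python
import torch
import torch.nn as nn

# A small toy model and synthetic data, purely for illustration.
model_sgd = nn.Linear(10, 1)
model_adam = nn.Linear(10, 1)
x = torch.randn(64, 10)
y = torch.randn(64, 1)
loss_fn = nn.MSELoss()

# SGD uses a single fixed learning rate for every parameter.
opt_sgd = torch.optim.SGD(model_sgd.parameters(), lr=0.01)
# Adam adapts the step size per parameter from running moment estimates.
opt_adam = torch.optim.Adam(model_adam.parameters(), lr=0.001)

for opt, model, name in [(opt_sgd, model_sgd, "SGD"), (opt_adam, model_adam, "Adam")]:
    for step in range(100):
        opt.zero_grad()                # clear old gradients
        loss = loss_fn(model(x), y)    # forward pass
        loss.backward()                # backpropagation computes gradients
        opt.step()                     # optimizer updates the weights
    print(f"{name}: final loss = {loss.item():.4f}")
```

The only line that changes between the two runs is the optimizer's construction; everything else in the loop stays the same, which is what makes swapping optimizers so easy to experiment with.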

Over time, more advanced methods were introduced to overcome the limitations of earlier approaches. Here’s a walkthrough of how optimization techniques evolved, the motivations behind each new method, and when to use them. This diagram illustrates the evolution of optimization algorithms in deep learning, starting from the foundational Stochastic Gradient Descent (SGD), with later methods built on it to address its limitations. Here is a more detailed walkthrough of the gradient descent algorithms. Stochastic Gradient Descent was originally proposed by Robbins and Monro in 1951 as a stochastic approximation method.

It was later widely adopted in neural network training, notably by LeCun et al. (1998) in their influential work “Efficient BackProp”. Stochastic Gradient Descent is the most fundamental optimization algorithm in deep learning. At each iteration, it updates parameters by moving them in the direction opposite to the gradient of the loss function. While SGD is simple and widely applicable, it often suffers from slow convergence and is highly sensitive to the choice of learning rate. Moreover, in scenarios involving ravines or highly curved surfaces, it can oscillate or take inefficient zig-zag paths toward the optimum.

The SGD update rule is $\theta \leftarrow \theta - \eta \nabla J(\theta)$, where $\theta$ represents the model parameters, $\eta$ is the learning rate, and $\nabla J(\theta)$ is the gradient of the loss function. Ever wondered why some machine learning models perform better than others? The secret might lie in the optimization algorithms they use. Two of these power players are Stochastic Gradient Descent (SGD) and Adam, each with its unique strengths. Let’s dig deeper into the world of machine learning algorithms, focusing specifically on Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam).

These two optimization techniques each offer unique advantages when it comes to enhancing model performance. Stochastic Gradient Descent, often referred to by its acronym "SGD", is an iterative method for optimizing objective functions. It is particularly effective on large-scale datasets because each iteration uses only a single instance to compute the gradient. This provides significant computational benefits but can lead to noisy convergence paths, since updates are based on individual examples. For example, consider training a linear regression model on a massive dataset containing millions of entries. If you use traditional batch gradient descent, which processes all examples before making a parameter adjustment, your system could quickly become overwhelmed by memory constraints alone!
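To make that concrete, here is a minimal sketch (with made-up synthetic data and an illustrative learning rate) of per-sample SGD for linear regression, where each update touches only one example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 1_000_000, 5          # stand-in for a "massive" dataset
X = rng.normal(size=(n_samples, n_features))
true_w = rng.normal(size=n_features)
y = X @ true_w + 0.1 * rng.normal(size=n_samples)

w = np.zeros(n_features)
lr = 0.01

# Per-sample SGD: each update uses a single (x_i, y_i) pair,
# so the cost of one step is independent of the dataset size.
for i in rng.permutation(n_samples)[:10_000]:  # a subset of steps for brevity
    x_i, y_i = X[i], y[i]
    error = x_i @ w - y_i                      # prediction error on one example
    grad = 2.0 * error * x_i                   # gradient of the squared error w.r.t. w
    w -= lr * grad                             # SGD step

print("learned weights:", np.round(w, 2))
print("true weights:   ", np.round(true_w, 2))
```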

With SGD, however, gradients are calculated iteratively after processing just one example, keeping computations manageable even when datasets expand dramatically. On another front lies the Adam optimizer, an algorithm praised for combining favorable properties from other optimization strategies while adding some improvements of its own. Optimizers determine how neural networks learn by updating parameters to minimize loss, and the choice of optimizer significantly affects training speed and final performance. The original article includes an interactive demo of how different optimizers (SGD, momentum, and Adam) navigate toward the minimum of a simple quadratic function, with all three starting from a loss of 9.0000.
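That demo can be approximated in a few lines of PyTorch. The sketch below (learning rates are illustrative guesses, not the demo's exact settings) minimizes f(θ) = θ² starting from θ = 3, so all three optimizers begin with a loss of 9:

```python
import torch

def run(optimizer_cls, steps=100, **kwargs):
    theta = torch.tensor([3.0], requires_grad=True)   # start at θ = 3, loss = 9
    opt = optimizer_cls([theta], **kwargs)
    for _ in range(steps):
        opt.zero_grad()
        loss = (theta ** 2).sum()    # simple quadratic bowl
        loss.backward()
        opt.step()
    return loss.item()

print("SGD      final loss:", run(torch.optim.SGD, lr=0.1))
print("Momentum final loss:", run(torch.optim.SGD, lr=0.1, momentum=0.9))
print("Adam     final loss:", run(torch.optim.Adam, lr=0.1))
```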

SGD is the simplest of these optimizers: it updates parameters in direct proportion to the gradient, $\theta \leftarrow \theta - \alpha \nabla f(\theta)$, where $\alpha$ is the learning rate and $\nabla f(\theta)$ is the gradient.

As part of my undergraduate project, I was first given the task of benchmarking a couple of standard CNN models, (i) a custom model with 4 conv layers and (ii) ResNet18, on FashionMNIST and CIFAR-10... After that (as my project is based on a segmentation task), I implemented and ran a UNet network on a labelled segmentation SONAR dataset.

Choosing the Best Optimizer for Your Deep Learning Model
When training deep learning models, choosing the right optimization algorithm can significantly impact your model’s performance, convergence speed, and generalization ability. Below, we will explore some of the most popular optimization algorithms, their strengths, the reasons they were invented, and the types of problems they are best suited for.

1. Stochastic Gradient Descent (SGD)
**Why It Was Invented:** SGD is one of the earliest and most fundamental optimization algorithms used in machine learning and deep learning. It was invented to handle the challenge of minimizing cost functions efficiently, particularly when dealing with large datasets where traditional gradient descent methods would be computationally expensive.
**Inventor:** The concept of SGD is rooted in statistical learning, but its application in neural networks is often attributed to Yann LeCun and others in the 1990s.
**Formula:** The update rule for SGD is $\theta \leftarrow \theta - \eta \nabla J(\theta)$, where $\eta$ is the learning rate and $\nabla J(\theta)$ is the gradient of the loss function with respect to the model parameters $\theta$.
**Strengths:** SGD is particularly effective in cases where the model is simple and the dataset is large, making it a robust choice for problems where generalization is important.

The simplicity of the algorithm ensures that it does not overfit easily.
**Limitations:** The main limitation of SGD is its slow convergence, especially in complex landscapes with ravines (areas where the gradient changes sharply). The formula does not account for momentum, so it can get stuck in local minima or take a long time to converge. This is particularly problematic for models with highly non-convex loss surfaces, which is common in deep learning models. The slow convergence is mathematically evident when analyzing the eigenvalues of the Hessian matrix of the loss function, where high condition numbers lead to slow progress in optimizing the parameters.
**Best For:** Simple, small-scale models, or when strong generalization is needed.
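In PyTorch, momentum is exposed as an optional argument of the SGD optimizer; a minimal sketch (the model and hyperparameters are placeholders) contrasts the two configurations:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))

# Vanilla SGD: θ ← θ - η ∇J(θ), no memory of past gradients.
plain_sgd = torch.optim.SGD(model.parameters(), lr=0.01)

# SGD with momentum keeps a running "velocity" of past gradients,
# which damps the zig-zagging on ravine-like loss surfaces.
momentum_sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# In a real training run you would pick one of these and call
# .zero_grad(), loss.backward(), and .step() as usual.
```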

2. AdaGrad
**Why It Was Invented:** AdaGrad was developed to address the issue of SGD’s sensitivity to learning rate selection. It adapts the learning rate for each parameter based on its historical gradient, allowing for more robust training in scenarios with sparse data and features.
**Inventor:** AdaGrad was introduced by John Duchi, Elad Hazan, and Yoram Singer in 2011.
**Formula:** The update rule for AdaGrad is $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \odot g_t$, where $G_t$ is the sum of the squares of past gradients, $g_t$ is the current gradient, and $\epsilon$ is a small constant added for numerical stability.
**Strengths:** AdaGrad’s strength lies in its ability to adapt the learning rate for each parameter based on the historical gradients.

This makes it particularly suitable for sparse data, where some features occur infrequently and require larger updates. By dynamically adjusting the learning rate, AdaGrad ensures that these infrequent features are learned effectively.
**Limitations:** The primary limitation is the decaying learning rate. As $G_t$ accumulates, the learning rate decreases, often to the point where the updates become too small to make further progress. This is particularly problematic for deep networks, where later layers require sustained learning rates to fine-tune the model. Mathematically, as $G_t$ grows larger, the denominator in the update rule increases, causing the overall step size to diminish.
**Best For:** Sparse datasets and problems with infrequent features.
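A small NumPy sketch of the AdaGrad update on a toy quadratic (all values illustrative) makes the decaying step size visible: as the accumulator $G_t$ grows, the effective learning rate $\eta / \sqrt{G_t + \epsilon}$ shrinks.

```python
import numpy as np

eta, eps = 0.1, 1e-8
theta = np.array([1.0, 1.0])
G = np.zeros_like(theta)         # running sum of squared gradients

def grad(theta):
    return 2 * theta             # gradient of f(θ) = θ₁² + θ₂²

for t in range(1, 6):
    g = grad(theta)
    G += g ** 2                            # accumulate squared gradients
    theta -= eta / np.sqrt(G + eps) * g    # per-parameter adaptive step
    print(f"step {t}: effective lr = {eta / np.sqrt(G + eps)}")
```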

3. RMSprop
**Why It Was Invented:** RMSprop was developed to fix AdaGrad’s diminishing learning rate issue by introducing a moving average of the squared gradients, which allows the learning rate to remain effective throughout training.
**Inventor:** RMSprop was introduced by Geoffrey Hinton in his Coursera lecture on neural networks.
**Formula:** The update rule for RMSprop is $E[g^2]_t = \beta E[g^2]_{t-1} + (1 - \beta) g_t^2$ and $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t$, where $E[g^2]_t$ is the moving average of the squared gradients.
**Strengths:** RMSprop addresses AdaGrad’s limitation by maintaining a moving average of the squared gradients, ensuring that the learning rate does not diminish too quickly.

This makes it particularly effective for training recurrent neural networks (RNNs), where maintaining a consistent learning rate is crucial for long-term dependencies.
**Limitations:** While RMSprop effectively mitigates the learning rate decay issue, it may lead to suboptimal generalization. Since the algorithm adjusts the learning rate for each parameter individually, it can overfit certain parameters, particularly in complex models with many features. This overfitting occurs because RMSprop does not account for the correlations between parameters, which can lead to inconsistent updates that do not generalize well across different datasets.
**Best For:** Non-stationary problems or models with fluctuating gradients; particularly suitable for RNNs.
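For comparison, here is a minimal NumPy sketch of the RMSprop update on the same toy quadratic (the decay factor and learning rate are assumed, typical defaults): the squared-gradient statistic is an exponential moving average, so it does not grow without bound the way AdaGrad's accumulator does.

```python
import numpy as np

eta, beta, eps = 0.01, 0.9, 1e-8
theta = np.array([1.0, 1.0])
Eg2 = np.zeros_like(theta)       # moving average of squared gradients

def grad(theta):
    return 2 * theta             # gradient of f(θ) = θ₁² + θ₂²

for t in range(100):
    g = grad(theta)
    Eg2 = beta * Eg2 + (1 - beta) * g ** 2       # exponential moving average
    theta -= eta / np.sqrt(Eg2 + eps) * g        # adaptive per-parameter step

print("theta after 100 steps:", theta)
```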

4. Adam (Adaptive Moment Estimation)
**Why It Was Invented:** Adam was designed to combine the benefits of both AdaGrad and RMSprop by using both first and second moments of the gradients to adapt the learning rate.
**Inventor:** Adam was introduced by Diederik P. Kingma and Jimmy Ba in 2015.
**Formula:** The update rule for Adam is $m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$, $v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$, $\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$, $\hat{v}_t = \frac{v_t}{1 - \beta_2^t}$, and $\theta_{t+1} = \theta_t - \frac{\eta\, \hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}$.
**Strengths:** Adam is well-regarded for its efficiency and adaptability. By using estimates of both the first (mean) and second (variance) moments of the gradients, it provides a robust and stable learning rate throughout training, making it particularly effective in problems with noisy or sparse gradients. The adaptability of the learning rate helps with fast convergence, especially in deep networks.
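A minimal NumPy sketch of the Adam update on the same toy quadratic (using the commonly cited default hyperparameters as an assumption) shows both moment estimates and the bias correction:

```python
import numpy as np

eta, beta1, beta2, eps = 0.001, 0.9, 0.999, 1e-8
theta = np.array([1.0, 1.0])
m = np.zeros_like(theta)         # first moment (mean of gradients)
v = np.zeros_like(theta)         # second moment (uncentered variance)

def grad(theta):
    return 2 * theta             # gradient of f(θ) = θ₁² + θ₂²

for t in range(1, 1001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g            # update biased first moment
    v = beta2 * v + (1 - beta2) * g ** 2       # update biased second moment
    m_hat = m / (1 - beta1 ** t)               # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)               # bias-corrected second moment
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)

print("theta after 1000 steps:", theta)
```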
