3.7. Weight Decay - Dive into Deep Learning 1.0.3 Documentation (D2L)

Leo Migdal

Now that we have characterized the problem of overfitting, we can introduce our first regularization technique. Recall that we can always mitigate overfitting by collecting more training data. However, that can be costly, time-consuming, or entirely out of our control, making it impossible in the short run. For now, we can assume that we already have as much high-quality data as our resources permit and focus on the tools at our disposal when the dataset is taken as a given. Recall that in our polynomial regression example (Section 3.6.2.1) we could limit our model’s capacity by tweaking the degree of the fitted polynomial. Indeed, limiting the number of features is a popular technique for mitigating overfitting.

However, simply tossing aside features can be too blunt an instrument. Sticking with the polynomial regression example, consider what might happen with high-dimensional inputs. The natural extensions of polynomials to multivariate data are called monomials, which are simply products of powers of variables. The degree of a monomial is the sum of the powers. For example, \(x_1^2 x_2\) and \(x_3 x_5^2\) are both monomials of degree 3. Note that the number of terms with degree \(d\) blows up rapidly as \(d\) grows larger.

Given \(k\) variables, the number of monomials of degree \(d\) is \(\binom{k - 1 + d}{k - 1}\). Even small changes in degree, say from \(2\) to \(3\), dramatically increase the complexity of our model. Thus we often need a more fine-grained tool for adjusting function complexity. Rather than directly manipulating the number of parameters, weight decay operates by restricting the values that the parameters can take. More commonly called \(\ell_2\) regularization outside of deep learning circles when optimized by minibatch stochastic gradient descent, weight decay might be the most widely used technique for regularizing parametric machine learning models. The technique is motivated by the basic intuition that among all functions \(f\), the function \(f = 0\) (assigning the value \(0\) to all inputs) is in some sense the simplest, and that we can measure the complexity of a function by its distance from zero.
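To get a feel for how quickly this count grows, the binomial coefficient above can be evaluated directly. The snippet below is a small illustration; the helper name count_monomials is ours, not something defined in the book.

```python
from math import comb

def count_monomials(k: int, d: int) -> int:
    """Number of monomials of degree d in k variables: C(k - 1 + d, k - 1)."""
    return comb(k - 1 + d, k - 1)

# With k = 20 variables, the count explodes as the degree d grows.
for d in range(1, 6):
    print(d, count_monomials(20, d))
# Prints: 1 20, 2 210, 3 1540, 4 8855, 5 42504
```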

But how precisely should we measure the distance between a function and zero? There is no single right answer. In fact, entire branches of mathematics, including parts of functional analysis and the theory of Banach spaces, are devoted to addressing such issues. One simple interpretation might be to measure the complexity of a linear function \(f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}\) by some norm of its weight vector, e.g., \(\| \mathbf{w} \|^2\). Recall that we introduced the \(\ell_2\) norm and \(\ell_1\) norm, which are special cases of the more general \(\ell_p\) norm, in Section 2.3.11. The most common method for ensuring a small weight vector is to add its norm as a penalty term to the problem of minimizing the loss.
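As a quick reminder, for a weight vector \(\mathbf{w} = (w_1, \ldots, w_k)\) these norms are given by the standard definitions

\[
\| \mathbf{w} \|_2 = \Bigl(\sum_{i} w_i^2\Bigr)^{1/2}, \qquad
\| \mathbf{w} \|_1 = \sum_{i} |w_i|, \qquad
\| \mathbf{w} \|_p = \Bigl(\sum_{i} |w_i|^p\Bigr)^{1/p},
\]

and the penalty used by weight decay is the squared \(\ell_2\) norm \(\| \mathbf{w} \|^2 = \sum_i w_i^2\).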

Thus we replace our original objective, minimizing the prediction loss on the training labels, with a new objective, minimizing the sum of the prediction loss and the penalty term. Now, if our weight vector grows too large, our learning algorithm might focus on minimizing the weight norm \(\| \mathbf{w} \|^2\) rather than minimizing the training error. That is exactly what we want. To illustrate things in code, we revive our previous example from Section 3.1 for linear regression, where our loss was the average squared error over the training examples; a sketch of that loss and the penalized objective is given below.
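Since the original equation did not survive the page conversion, here is a sketch of the usual formulation, following the linear regression setup of Section 3.1 (the notation in the book may differ slightly). The training loss is

\[
L(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{2} \bigl( \mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)} \bigr)^2,
\]

and weight decay replaces it with the penalized objective

\[
L(\mathbf{w}, b) + \frac{\lambda}{2} \| \mathbf{w} \|^2,
\]

where the non-negative hyperparameter \(\lambda\) trades off fitting the training data against keeping the weights small; setting \(\lambda = 0\) recovers the original loss. In code, the same idea amounts to adding the squared norm of the weights to the loss before backpropagating. The following is a minimal PyTorch sketch, not the book's exact implementation; the names l2_penalty and lambd are ours, and in practice most optimizers expose an equivalent weight_decay argument.

```python
import torch

def l2_penalty(w: torch.Tensor) -> torch.Tensor:
    """Half the squared L2 norm of the weights."""
    return (w ** 2).sum() / 2

# Toy linear regression data and parameters.
X = torch.randn(8, 3)
y = torch.randn(8, 1)
w = torch.zeros(3, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
lambd, lr = 3.0, 0.1  # regularization strength and learning rate

for _ in range(10):
    y_hat = X @ w + b
    # Squared loss plus the weight-decay penalty (the bias is typically not penalized).
    loss = ((y_hat - y) ** 2 / 2).mean() + lambd * l2_penalty(w)
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
        b -= lr * b.grad
        w.grad.zero_()
        b.grad.zero_()
```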

