An Alchemist's Notes on Deep Learning
I have recently had the opportunity to spend lots of time learning, with the excuse of pursuing a PhD. These Alchemist’s Notes are a byproduct of that process. Each page contains notes and ideas related broadly to deep learning, generative modelling, and practical engineering. I’ve actually been writing these things since 2016 on my personal website, but this site should be a more put-together version. Who is this site for? For you, I hope.
The ideal reader has at least an undergraduate-level understanding of machine learning and is comfortable with Python. The rest you can figure out as you go. What do the contents look like? The main goal of these notes is to provide definitions and examples. I have found these to be the most critical bits to convey when introducing new concepts. Each page gives a brief overview and an example implementation, then mostly answers common questions.
Wherever possible, we will utilize concrete code examples. This website is compiled from a set of Jupyter notebooks, so you can go and play through the code on every page. (Click the ‘Colab’ button on the top-right.) What’s with ‘Alchemy’ in the name? In deep learning, we have not arrived at a unifying theory. What we do have are snippets of evidence and intuitions, insights from mathematical foundations, and a rich body of literature and open-source code.
Yet, it is an open question how all these ideas should come together. Deep learning is still in the alchemical age, and even well-tested techniques should be seen as a reference guide and not a ground-truth solution. Perhaps you will come to your own conclusions. These notes are not finished. I am planning to update them continuously, but no guarantees that the content will stay on track. If you see any issues or have suggestions, send me a message at kvfrans@berkeley.edu, or submit an issue to the repo.
Modern Optimizers
The backbone of modern learning is gradient descent. We all know the pain of waiting for a model to train, so you can imagine that a classic rite of passage for researchers is to think about ways to improve optimization.
The current champion is Adam; however, a family of work has been building that claims to outperform Adam at the Pareto frontier of compute. In this post, we will explore the flavors of such optimizers, which we will refer to as spectral-whitening methods. Do such methods reliably outperform Adam? If so, in which ways do the various flavors have pros and cons? When we calculate a gradient, we get a direction to adjust model parameters to reduce loss. But this gradient is only accurate in a local neighborhood.
So we typically take a small step in that direction, then re-calculate before moving again. This notion can be formalized by framing each step of gradient descent as solving the following distance-penalized problem:
\[
\theta_{t+1} = \arg\min_{\theta} \; g^\top (\theta - \theta_t) + \frac{1}{2\alpha} \|\theta - \theta_t\|_2^2 = \theta_t - \alpha g,
\]
where \(g = \nabla_\theta L(\theta,x)\). Traditionally, we assume a Euclidean distance over parameters, in which case the solution (as shown above) is simply the gradient scaled by a constant learning-rate factor \(\alpha\). However, the Euclidean distance is an assumption, and is often suboptimal. Certain parameters may be more sensitive to higher-order changes than others, and thus should be assigned a larger penalty.
We can generally represent second-order distances using a metric matrix \(M\), under which the distance of an update \(\Delta\theta = \theta - \theta_t\) can be expressed as the matrix product:
\[
\|\Delta\theta\|_M^2 = \Delta\theta^\top M \, \Delta\theta.
\]
When we use \(M\) as the distance metric, the solution then becomes:
\[
\theta_{t+1} = \theta_t - \alpha \, M^{-1} g.
\]
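As a concrete illustration, here is a minimal jax.numpy sketch of both solutions: the plain Euclidean gradient step and the preconditioned step \(M^{-1} g\). The quadratic loss and the diagonal metric \(M\) below are toy choices made up purely for illustration, not anything prescribed by a particular optimizer.

```python
import jax
import jax.numpy as jnp

# Toy quadratic loss with very different curvature per parameter (illustrative only).
def loss(theta):
    return jnp.sum(jnp.array([1.0, 100.0]) * theta ** 2)

theta = jnp.array([1.0, 1.0])
g = jax.grad(loss)(theta)  # g = grad_theta L(theta)
alpha = 0.1

# Euclidean metric: the solution is a plain gradient step.
theta_euclid = theta - alpha * g

# General metric M: the solution preconditions the gradient by M^{-1}.
# Here M is a hand-picked diagonal matrix, standing in for a real curvature estimate.
M = jnp.diag(jnp.array([2.0, 200.0]))
theta_metric = theta - alpha * jnp.linalg.solve(M, g)

print(theta_euclid)  # takes a huge step along the high-curvature parameter
print(theta_metric)  # step is rescaled per-parameter by M^{-1}
```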
The Transformer
The Transformer is the main architecture of choice today, combining residual connections and attention. We will implement it in 20 lines of code. Transformers are domain-agnostic and can be applied to text, images, video, etc. The Transformer architecture is closely intertwined with the attention operator. Researchers working on natural language translation found that augmenting a traditional recurrent network with attention layers could increase accuracy. Later, it was found that attention was so effective that the recurrent connections could be dropped entirely – hence the title “Attention is all you need” in the original Transformer paper. Today, transformers are used not only in language, but across the board in image, video, robotics, and so on. The core of a transformer is a residual network, where each intermediate activation is a set of feature tokens. The residual blocks consist of a self-attention layer, in which information can be shared within the set of tokens, as well as dense layers that operate independently on each token in the set. The specific details of residual blocks vary between kinds of transformer models.
We will describe the GPT-2 architecture here. In GPT-2, each residual block consists of:
- A layer norm on the residual stream vectors, followed by a self-attention layer, whose output is added back into the residual stream.
- A second layer norm, followed by a two-layer dense (MLP) block, whose output is also added back into the residual stream.
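Below is a rough jax.numpy sketch of that block structure, with parameters passed in explicitly. It is a simplification rather than the exact GPT-2 code: attention is single-head, biases and the attention output projection are omitted, and the layer sizes are arbitrary.

```python
import jax
import jax.numpy as jnp

def layer_norm(x, eps=1e-5):
    # Normalize each token's features to zero mean and unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / jnp.sqrt(var + eps)

def self_attention(x, wq, wk, wv):
    # Single-head scaled dot-product attention (simplified stand-in).
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / jnp.sqrt(q.shape[-1])
    weights = jax.nn.softmax(scores, axis=-1)
    return weights @ v

def gpt2_block(x, params):
    # Pre-layer-norm residual block: attention sub-block, then MLP sub-block.
    x = x + self_attention(layer_norm(x), params['wq'], params['wk'], params['wv'])
    h = jax.nn.gelu(layer_norm(x) @ params['w1'])
    x = x + h @ params['w2']
    return x

# Toy shapes and random parameters, purely illustrative.
d, tokens = 16, 8
keys = jax.random.split(jax.random.PRNGKey(0), 6)
params = {
    'wq': jax.random.normal(keys[0], (d, d)) * 0.02,
    'wk': jax.random.normal(keys[1], (d, d)) * 0.02,
    'wv': jax.random.normal(keys[2], (d, d)) * 0.02,
    'w1': jax.random.normal(keys[3], (d, 4 * d)) * 0.02,
    'w2': jax.random.normal(keys[4], (4 * d, d)) * 0.02,
}
x = jax.random.normal(keys[5], (tokens, d))
print(gpt2_block(x, params).shape)  # (8, 16)
```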
Attention
Attention is an operator which communicates information between a set of feature tokens. In many cases, an object is best represented as a set of features. For example, a sentence is a set of words, and an image is a set of visual patches. The attention operator gives us a way to condition over these features, which we will refer to as tokens. A typical intermediate layer for an attention-based neural network has the shape [batch, num_tokens, num_features], as opposed to the typical [batch, num_features]. By structuring our computation in terms of tokens, we can use the same parameter-sharing philosophy as in convolutional and recurrent layers. Attention shares parameters across a set of tokens in parallel. The attention operator produces a new token for every input token, and each output token is a function of all other tokens. A naive approach would be to learn a single dense layer, apply it to every token, then sum up those results.
But we run into an issue – some of the other tokens are relevant, but most are not. We would like a way to selectively condition on only the relevant tokens. Instead, we accomplish this selective conditioning by using a learned function to decide how much ‘attention’ to pay to each token. We use a dense layer to generate a key vector for each token, and similarly generate a query vector for each token. The attention weighting can now be calculated as a dot product between the keys and queries.
Each token is then summed according to the attention weighting to produce the final output vector.
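Here is a minimal jax.numpy sketch of that recipe. Following the simplified description above, the weighted sum is taken over the input tokens themselves (the full operator also projects each token into a separate value vector), and the weights are normalized with a softmax and scaled by the feature dimension, which is the standard scaled dot-product form.

```python
import jax
import jax.numpy as jnp

def simple_attention(x, w_key, w_query):
    # x: [num_tokens, num_features]; one output token per input token.
    keys = x @ w_key        # key vector for each token
    queries = x @ w_query   # query vector for each token
    # Attention weighting: dot product between every query and every key,
    # scaled by sqrt(d) and normalized with a softmax.
    scores = queries @ keys.T / jnp.sqrt(keys.shape[-1])
    weights = jax.nn.softmax(scores, axis=-1)  # [num_tokens, num_tokens]
    # Each output token is a weighted sum over all input tokens.
    return weights @ x

k1, k2, k3 = jax.random.split(jax.random.PRNGKey(0), 3)
x = jax.random.normal(k1, (8, 16))                 # 8 tokens, 16 features
w_key = jax.random.normal(k2, (16, 16)) * 0.02
w_query = jax.random.normal(k3, (16, 16)) * 0.02
print(simple_attention(x, w_key, w_query).shape)   # (8, 16)
```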
Activation Functions
Activation functions are elementwise functions which allow networks to learn nonlinear behavior. Between dense layers, we place an activation function to transform the features elementwise. They are also known as nonlinearities, as the purpose of an activation function is to break the linear relationship defined by the dense layers. The first activation functions were proposed from a biologically inspired perspective, and aimed to model the ‘spiking’ behavior of biological neurons. The sigmoid function squashes inputs to a range between (0, 1), and tanh (hyperbolic tangent) squashes between (-1, 1). Squashing activations are good for ensuring numerical stability, since we know the magnitude of the outputs will always be constrained. However, when the inputs are too large in magnitude, the gradient of a squashing function will approach zero. This creates the vanishing gradient problem, where a neural network will have a hard time improving when features become too large. In an effort to combat the vanishing gradient problem, the next class of activations instead models a piecewise nonlinearity. The simplest of these is the rectified linear unit (ReLU), which takes the form:
\[
\mathrm{ReLU}(x) = \max(0, x).
\]
The ReLU lets positive values through without change, and clips negative values to zero.
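For reference, here are the three activations above written out in jax.numpy (jax.nn also provides them directly as jax.nn.sigmoid, jnp.tanh, and jax.nn.relu):

```python
import jax.numpy as jnp

def sigmoid(x):
    # Squashes inputs into (0, 1).
    return 1.0 / (1.0 + jnp.exp(-x))

def tanh(x):
    # Squashes inputs into (-1, 1).
    return jnp.tanh(x)

def relu(x):
    # Passes positive values unchanged, clips negatives to zero.
    return jnp.maximum(x, 0.0)

x = jnp.linspace(-5.0, 5.0, 5)
print(sigmoid(x))
print(tanh(x))
print(relu(x))
```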