Attention: An Alchemist's Notes on Deep Learning
Attention is an operator which communicates information between a set of feature tokens. In many cases, an object is best represented as a set of features. For example, a sentence is a set of words, and an image is a set of visual patches. The attention operator gives us a way to condition over these features, which we will refer to as tokens. A typical intermediate layer for an attention-based neural network has the shape [batch, num_tokens, num_features], as opposed to the typical [batch, num_features]. By structuring our computation in terms of tokens, we can use the same parameter-sharing philosophy from convolutional and recurrent layers.
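As a concrete illustration of this shape convention, here is a minimal NumPy sketch (the sizes are arbitrary, not from the original notes) of a token tensor and a dense layer whose parameters are shared across token positions:

```python
import numpy as np

# A hypothetical batch: 8 sentences, each split into 16 tokens,
# where every token is a 64-dimensional feature vector.
batch, num_tokens, num_features = 8, 16, 64
x = np.random.randn(batch, num_tokens, num_features)
print(x.shape)                       # (8, 16, 64)

# A dense layer shared across tokens: the same weight matrix is
# applied independently at every token position.
W = np.random.randn(num_features, num_features) * 0.02
h = x @ W                            # (8, 16, 64); one W for all 16 positions
```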
Attention shares parameters across a set of tokens in parallel. The attention operator produces a new token for every input token, and each output token is a function of all other tokens. A naive thing we could do is learn a single dense layer, apply it to every token, then sum up those results. But we run into an issue – some of the other tokens are relevant, but most are not. We would like a way to selectively condition on only the relevant tokens. Instead of the naive sum, we will accomplish this selective conditioning by using a learned function to decide how much ‘attention’ to pay to each token.
We use a dense layer to generate a key vector for each token, and we learn a query vector for each token in the same way. The attention weighting can then be calculated as a dot product between the keys and queries. The final output for each token is a sum over all tokens, weighted by this attention.
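Below is a minimal NumPy sketch of this computation, assuming learned projection matrices `W_q` and `W_k` (the names are illustrative, and full transformers also project each token into a value vector, which is omitted here to match the description above):

```python
import numpy as np

def simple_attention(x, W_q, W_k):
    """Each output token is a weighted sum of all input tokens,
    with weights given by query-key dot products."""
    q = x @ W_q                                 # queries, (num_tokens, d)
    k = x @ W_k                                 # keys,    (num_tokens, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])     # (num_tokens, num_tokens)
    # Softmax over each row so the weights for every token sum to one.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                          # weighted sum of the tokens

num_tokens, d = 16, 64
x = np.random.randn(num_tokens, d)
W_q = np.random.randn(d, d) * 0.02
W_k = np.random.randn(d, d) * 0.02
print(simple_attention(x, W_q, W_k).shape)      # (16, 64)
```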
The attention mechanism has revolutionized deep learning, particularly in natural language processing (NLP) and computer vision. By mimicking the way humans pay selective attention to certain parts of information, it allows neural networks to dynamically focus on relevant features while processing complex data. First introduced in 2014 for machine translation, attention has now become the foundation of transformer-based models that power applications like ChatGPT, BERT, and Vision Transformers. Traditional sequence models like RNNs and LSTMs process data step by step and struggle with long-range dependencies, since information from early tokens must survive many sequential updates, and with parallelization, since each step depends on the previous one. Attention solves these problems by allowing the model to selectively weigh different parts of the input during output generation. At its core, attention computes a weighted sum of input features, where the weights reflect how important each input token is to a specific output token. It assigns different weights to different elements, helping the model prioritize relevant information instead of treating all inputs equally.
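In the standard transformer formulation, this weighted sum is written as scaled dot-product attention, where $Q$, $K$, and $V$ are matrices of query, key, and value vectors and $d_k$ is the key dimension:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d_k}}\right) V
$$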
It forms the foundation of advanced models like Transformers and BERT and is widely used in Natural Language Processing (NLP) and Computer Vision. To learn more about the types of attention mechanism, read: Types of Attention Mechanism. The working of the attention mechanism can be broken down into several key steps. Step 1, input encoding: the input sequence is first encoded using an encoder like an RNN, LSTM, GRU, or Transformer to generate hidden states representing the input context. Step 2, query, key, and value vectors: each encoded input is transformed into three vectors – a query, a key, and a value – via learned linear projections, as sketched below.
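A minimal sketch of Step 2, assuming simple learned linear projections (the weight names and sizes are illustrative):

```python
import numpy as np

num_tokens, d_model, d_head = 16, 64, 32
h = np.random.randn(num_tokens, d_model)   # encoder hidden states (Step 1)

# Step 2: project each hidden state into query, key, and value vectors.
W_q = np.random.randn(d_model, d_head) * 0.02
W_k = np.random.randn(d_model, d_head) * 0.02
W_v = np.random.randn(d_model, d_head) * 0.02
Q, K, V = h @ W_q, h @ W_k, h @ W_v        # each of shape (num_tokens, d_head)
```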
In the last chapter, you and I started to step through the internal workings of a transformer, the key piece of technology inside large language models. Transformers first hit the scene in a (now-famous) paper called Attention is All You Need, and in this chapter you and I will dig into what this attention mechanism is, by visualizing how it... Many people find the attention mechanism confusing, so before we dive into all the computational details and matrix multiplications, it's worth thinking about a couple of examples of the kind of behavior that we... You and I both know that the word mole in each of these sentences has a different meaning, based on the context. However, after the first step of a transformer – the embedding step that associates each token with a vector – the vector associated with the word mole would be the same in all... It's only in the next step of the transformer, the attention block, that the surrounding embeddings have the chance to pass information into the mole embedding and update its values.

The attention mechanism is a fundamental invention in artificial intelligence and machine learning, redefining the capabilities of deep learning models.
This mechanism, inspired by the human mental process of selective focus, has emerged as a pillar in a variety of applications, accelerating developments in natural language processing, computer vision, and beyond. Imagine if machines could pay attention selectively, the way we do, focusing on critical features in a vast amount of data. This is the essence of the attention mechanism, a critical component of today’s deep learning models. This article walks through how attention mechanisms work, from the fundamentals to their game-changing impact in several fields. The attention mechanism is a technique used in deep learning models that allows the model to selectively focus on specific areas of the input data when making predictions.
This is especially helpful when working with long sequences, as in natural language processing or computer vision tasks. In many deep learning contexts (machine translation, text summarization, sequence processing), models must handle variable-length inputs and focus on certain parts more than others. The attention mechanism allows the model to give more weight to certain elements of a sequence when computing an output, depending on their relevance. For example, consider translating “An apple that had been on the tree in the garden for weeks had finally been picked up.” into French as “Une pomme qui était sur l’arbre du jardin depuis des semaines avait finalement été ramassée.” Here, to correctly inflect the word ramassée, one must be aware that it refers to the noun une pomme, which is feminine.
I have recently had the opportunity to spend lots of time learning, with the excuse of pursuing a PhD. These Alchemist’s Notes are a byproduct of that process. Each page contains notes and ideas related broadly to deep learning, generative modelling, and practical engineering. I’ve actually been writing these things since 2016 on my personal website, but this site should be a more put-together version. Who is this site for? For you, I hope.
The ideal reader has at least an undergraduate-level understanding of machine learning, and is comfortable with Python. The rest you can figure out as you go. What do the contents look like? The main goal of these notes is to provide definitions and examples. I have found these to be the most critical bits to convey when introducing new concepts. Each page will give a brief overview and example implementation, then we will mostly be answering common questions.
Wherever possible, we will use concrete code examples. This website is compiled from a set of Jupyter notebooks, so you can go and play through the code on every page. (Click the ‘Colab’ button on the top-right.) What’s with ‘Alchemy’ in the name? In deep learning, we have not arrived at a unifying theory. What we do have are snippets of evidence and intuitions, insights from mathematical foundations, and a rich body of literature and open-source code.
Yet, it is an open question how all these ideas should come together. Deep learning is still in its alchemical age, and even well-tested techniques should be seen as a reference guide rather than a ground-truth solution. Perhaps you will come to your own conclusions. These notes are not finished. I plan to update them continuously, but no guarantees that the content will stay on track. If you see any issues or have suggestions, send me a message at kvfrans@berkeley.edu, or submit an issue to the repo.
The attention mechanism has revolutionized the field of deep learning, enabling models to focus on the most relevant parts of the input data and improving their performance on a wide range of tasks. In this article, we will provide a comprehensive overview of the attention mechanism, its types, and its applications in deep learning. We will also discuss how to implement attention in different deep learning models, advanced topics in attention research, and future directions. The attention mechanism is a technique used in deep learning models to selectively concentrate on specific parts of the input data. There are several types of attention mechanisms, each with its strengths and weaknesses.
Soft attention and hard attention are two fundamental types of attention mechanisms. Soft attention computes a weighted average of the input elements, where the weights are learned during training. Hard attention, on the other hand, selects a subset of the input elements and ignores the rest. The key difference is that soft attention is differentiable, making it easy to train with backpropagation, while hard attention is non-differentiable and requires more complex training methods, such as reinforcement learning.
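As an illustrative NumPy sketch of the distinction (not from the original text), soft attention keeps every element with a differentiable weight, while hard attention commits to a single element:

```python
import numpy as np

scores = np.array([2.0, 0.5, -1.0])              # relevance of 3 inputs
values = np.array([[1., 0.], [0., 1.], [1., 1.]])

# Soft attention: a differentiable weighted average over all inputs.
weights = np.exp(scores) / np.exp(scores).sum()  # softmax weights
soft_out = weights @ values                      # gradients flow through weights

# Hard attention: select a single input (here greedily via argmax).
# The argmax is non-differentiable, which is why hard attention is
# usually trained with estimators such as reinforcement learning.
hard_out = values[np.argmax(scores)]
```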
The Transformer is the main architecture of choice today, combining residual connections and attention. We will implement it in 20 lines of code. Transformers are domain-agnostic and can be applied to text, images, video, etc. The Transformer architecture is closely intertwined with the attention operator. Researchers working on natural language translation found that augmenting a traditional recurrent network with attention layers could increase accuracy. Later, it was found that attention was so effective that the recurrent connections could be dropped entirely – hence the title “Attention is all you need” of the original Transformer paper. Today, transformers are used not only in language, but across the board in image, video, robotics, and so on.
The core of a transformer is a residual network, where each intermediate activation is a set of feature tokens. The residual blocks comprise a self-attention layer, in which information can be shared within the set of tokens, as well as dense layers that operate independently on each token in the set. The specific details of residual blocks vary between kinds of transformer models. We will describe the GPT-2 architecture here. In GPT-2, each residual block consists of: a layer norm on the residual stream vectors, followed by a self-attention layer whose output is added back into the residual stream; then a second layer norm, followed by an MLP applied independently to each token, whose output is also added back into the stream. A sketch of one such block follows.
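Here is a minimal NumPy sketch of one such pre-norm block, using single-head attention, ReLU in place of GPT-2's GELU, and no causal masking or dropout; the parameter names are illustrative:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def residual_block(x, p):
    """One pre-norm transformer block; x has shape (num_tokens, d)."""
    # Self-attention sub-block: tokens exchange information.
    h = layer_norm(x)
    q, k, v = h @ p['W_q'], h @ p['W_k'], h @ p['W_v']
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v
    x = x + attn @ p['W_o']                            # residual add
    # MLP sub-block: applied independently to each token.
    h = layer_norm(x)
    return x + np.maximum(h @ p['W_1'], 0) @ p['W_2']  # residual add

d = 64
shapes = {'W_q': (d, d), 'W_k': (d, d), 'W_v': (d, d), 'W_o': (d, d),
          'W_1': (d, 4 * d), 'W_2': (4 * d, d)}
p = {name: np.random.randn(*s) * 0.02 for name, s in shapes.items()}
x = np.random.randn(16, d)
print(residual_block(x, p).shape)                      # (16, 64)
```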
This article provides a structured mathematical explanation of attention mechanisms in deep learning, focusing on their application in transformer architectures. We’ll explore how sequences are processed through attention layers and understand the mathematical foundations of these powerful neural network components. Before diving into the mathematics, let’s establish our key concepts. The transformation of text into numerical representations involves several steps: the raw text is converted into token IDs using subword tokenization, and an embedding step then transforms these discrete tokens into continuous vectors.
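A minimal sketch of this pipeline, assuming a toy embedding table (the token IDs and dimensions are illustrative):

```python
import numpy as np

vocab_size, d_model = 50257, 64           # GPT-2-sized vocabulary, toy width
token_ids = np.array([464, 3290, 318])    # illustrative subword token IDs

# The embedding table maps each discrete token ID to a continuous vector.
embedding_table = np.random.randn(vocab_size, d_model) * 0.02
embeddings = embedding_table[token_ids]   # shape (3, 64)
print(embeddings.shape)
```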