Exploring Beyond Regular Transformers
By Eren Gölge
Delving deep into Transformer variants for Language Models, assessing their efficiency and performance. https://lnkd.in/eNYY9jVu
In this article, I will explore various alternatives to transformers, considering their architectural improvements, computational efficiency, and performance results across different benchmarks.
I intend to continually update this post with new models in the future. If you believe there are models or important points that should be included, or corrections that need to be made, please feel free to reach out.

Traditional sequential models, like recurrent neural networks (RNNs) and long short-term memory networks (LSTMs), struggle to capture long-range dependencies and to parallelize computation. The Transformer architecture addresses these issues with its core building block, the self-attention mechanism. Unlike traditional approaches, where each element in a sequence is processed one at a time, self-attention lets the model weigh the importance of every element relative to every other.
This enables capturing relationships between distant words in a sentence.

However, the Transformer has limitations in terms of computation and storage. It is based on scaled dot-product attention, which computes softmax(QKᵀ/√d_k)·V; the QKᵀ product is computationally heavy, and at inference the model must also keep a KV cache, which is heavy in memory. This is a limiting factor, especially in problems with extended context sizes.
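To make this concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention (the shapes, names, and toy sizes are my own illustrative choices, not taken from any specific library): the intermediate attention matrix has one row and one column per token, which is exactly where the quadratic cost comes from.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K: arrays of shape (seq_len, d_k); V: (seq_len, d_v).
    The intermediate `scores` matrix is (seq_len, seq_len), which is
    where the quadratic cost in context length comes from.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # (seq_len, d_v)

# Toy example: 1,024 tokens with 64-dim heads already means a
# 1,024 x 1,024 attention matrix per head, per layer.
seq_len, d_k = 1024, 64
rng = np.random.default_rng(0)
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))
print(scaled_dot_product_attention(Q, K, V).shape)  # (1024, 64)
```

At 1,024 tokens this is manageable, but at 100k tokens the per-head attention matrix alone has 10^10 entries, which is why so much long-context work focuses on shrinking or avoiding this term.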
Transformers' space complexity increases quadratically with context size.

RWKV is an RNN that meets the Transformer: it delivers performance comparable to SOTA Transformer models with subquadratic runtime. It is also open-sourced with pretrained models. I have been following it since its emergence on Reddit ML, and it is great to see the progress. https://t.co/Odte55iKPU
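For contrast, here is a toy sketch of the general idea behind recurrent alternatives like RWKV. To be clear, this is a generic exponential-moving-average recurrence chosen for illustration, not RWKV's actual time-mixing equations: the point is only that a fixed-size state updated once per token keeps memory constant and per-token cost linear in context length.

```python
import numpy as np

def recurrent_mix(x, decay=0.9):
    """Generic linear recurrence over a token sequence (illustrative only).

    x: (seq_len, d_model). The state has shape (d_model,) no matter how
    long the sequence is, unlike a KV cache that grows with every token.
    """
    state = np.zeros(x.shape[-1])
    outputs = []
    for token in x:                        # one cheap update per token
        state = decay * state + (1.0 - decay) * token
        outputs.append(state.copy())
    return np.stack(outputs)               # (seq_len, d_model)

x = np.random.default_rng(0).standard_normal((1024, 64))
y = recurrent_mix(x)
print(y.shape)  # (1024, 64) -- but the carried state is just (64,)
```

The trade-off, of course, is that all past information has to be compressed into that fixed-size state.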
🤖 AI has won two Nobel Prizes. Hopfield Networks got the Physics prize, while AlphaFold won in Chemistry. AlphaFold was only released in 2018, so it took just six years to receive the prize. Let's start with a short intro to both concepts.

A Hopfield Network is like a web of connected dots, where every dot links to all the others.
You show it some patterns during training, and it remembers them. Later, if you give it part of a pattern, it can figure out the whole thing. It's as if the network says, "Oh, I've seen something like this before!" This ability is called Associative Memory. Hopfield Networks learn differently from today's AI models. They use Hebbian Learning, which is about strengthening connections between nodes that fire together. It's simpler than the complex math modern AI uses to learn.
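Here is a tiny NumPy sketch of that idea, a classical binary Hopfield network trained with the Hebbian outer-product rule (the pattern size, noise level, and synchronous update schedule are arbitrary choices for illustration): patterns are stored by strengthening the weights between units that are active together, and a corrupted pattern is pulled back toward the stored one.

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian learning: W += p p^T for each stored +1/-1 pattern."""
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for p in patterns:
        W += np.outer(p, p)
    np.fill_diagonal(W, 0)            # no self-connections
    return W / len(patterns)

def recall(W, probe, steps=10):
    """Recover a full pattern from a noisy probe (associative memory).

    Uses synchronous updates for brevity; the classic formulation
    updates one unit at a time.
    """
    state = probe.copy()
    for _ in range(steps):
        state = np.sign(W @ state)
        state[state == 0] = 1
    return state

# Store one 100-unit pattern, flip 20 units, and recover the original.
rng = np.random.default_rng(0)
pattern = rng.choice([-1, 1], size=100)
W = train_hopfield(pattern[None, :])
noisy = pattern.copy()
noisy[rng.choice(100, size=20, replace=False)] *= -1
print(np.array_equal(recall(W, noisy), pattern))  # True
```

One known limitation of this classical formulation is its capacity: it can reliably store only roughly 0.14·N patterns for N units, which is part of what motivated modern continuous Hopfield variants.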
AlphaFold is like a super-smart puzzle solver for proteins. Proteins are tiny machines in our bodies, and their shape is crucial for how they work. AlphaFold tries to predict these shapes.