Why ChatGPT is Possible: How Transformers Rebuilt the Modern World
Sequence transduction, such as language translation, relies heavily on mapping a sequence of inputs $(x_1, \dots, x_n)$ to an output sequence $(y_1, \dots, y_m)$. For years, the dominant models for these tasks were based on complex Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) [1].
However, in 2017, the paper "Attention Is All You Need" proposed a radically different approach: dispensing with recurrence and convolutions entirely, and relying solely on Attention Mechanisms.
In this article, we will explore the limitations of the Transformer's predecessors, and why self-attention became the breakthrough that would go on to shape modern AI.
We will trace a single running example through each architecture to see how it is processed: the input sequence "The cat sat."
1. The Sequential Bottleneck of RNNs and LSTMs
Recurrent Neural Networks, including Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, were long established as the state of the art in sequence modeling.
RNNs intrinsically process tokens sequentially. For an input sequence like "The cat sat.", the network generates a sequence of hidden states $h_1, \dots, h_t$, each computed as a function of the previous hidden state $h_{t-1}$ and the current input $x_t$.
While LSTMs excel at retaining historical context, their sequential nature creates a major bottleneck: they cannot be parallelized. To compute $h_3$ for "sat", the model must wait for the completion of $h_1$ for "The" and $h_2$ for "cat".
Let's visualize this dependency using our running example. We assume four token embeddings $x_1, x_2, x_3, x_4$ for ["The", "cat", "sat", "."].
*Figure 1: Sequential processing of an RNN. Note the green dependencies: each step mathematically requires the completion of the previous step's hidden state $h_{t-1}$.*
Let's look at the mathematical steps for the token "sat" ($t = 3$):
- Input: $x_3$, the embedding for "sat", arrives.
- Formula: $h_3 = \tanh(W_h h_2 + W_x x_3)$
- Worked Example: Assume $W_h$ and $W_x$ are identity weight matrices for simplicity, so $h_3 = \tanh(h_2 + x_3)$.
- Output: We get $h_3$, which mixes "sat" with the summary of "The cat" carried in $h_2$.
As we can clearly see in the formula, calculating $h_3$ is strictly dependent on completing the operations for $h_2$. We cannot start calculating "sat" before "cat" is completely finished. For very long sequences, this strict sequential pipeline prohibits computation from running in parallel, making training agonizingly slow. The long chain of dependencies also gives rise to the notorious Vanishing Gradient problem, where information from distant earlier tokens ("The") dissipates.
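To make the dependency concrete, here is a minimal sketch (in NumPy, with made-up toy dimensions and identity weights as in the worked example above) of a vanilla RNN cell processing our four tokens. The loop cannot be parallelized because each iteration consumes the previous hidden state.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2                                   # toy embedding/hidden size
x = rng.normal(size=(4, d))             # embeddings for ["The", "cat", "sat", "."]
W_h = np.eye(d)                         # identity weights, as in the worked example
W_x = np.eye(d)

h = np.zeros(d)                         # h_0
for t in range(4):                      # strictly sequential: h_t needs h_{t-1}
    h = np.tanh(W_h @ h + W_x @ x[t])
    print(f"h_{t+1} =", h.round(3))
```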
2. Convolutional Routes: Parallel, but Local
To overcome this lack of parallelization, models like Convolutional Sequence to Sequence (ConvS2S) and ByteNet turned to Convolutional Neural Networks (CNNs).
CNNs can process all tokens at once. Using a kernel of fixed width $k$, they slide over the input sequence embeddings in parallel.
*Figure 2: 1D Convolution over the sequence. Each output only looks at local tokens. Computation happens simultaneously.*
Let's see the math for the CNN's output $z_2$:
- Input: A kernel of size $k = 2$ looks at $x_1$ ("The") and $x_2$ ("cat") simultaneously.
- Formula: $z_2 = f(W \cdot [x_1; x_2] + b)$, where $[x_1; x_2]$ is the concatenated local window.
- Worked Example: The kernel weights are applied to the window $[x_1; x_2]$ and passed through the non-linearity $f$.
- Output: We get $z_2$, a representation built only from "The" and "cat".
Because the computation of $z_t$ doesn't rely on the output of any other $z_j$, we can compute $z_2$ and $z_4$ at the exact same time on GPU hardware.
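As a rough sketch (NumPy, toy values, not the article's numbers), here is a width-2 convolution producing each $z_t$ from only its local window. Every window is independent, so the loop below could run in parallel.

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 2, 2                              # toy embedding size, kernel width
x = rng.normal(size=(4, d))              # ["The", "cat", "sat", "."]
W = rng.normal(size=(k * d, d))          # kernel weights: flattened window -> d outputs
b = np.zeros(d)

# z_t depends only on the local window [x_{t-1}, x_t]; windows are independent.
z = [np.tanh(np.concatenate([x[t - 1], x[t]]) @ W + b) for t in range(1, 4)]
for t, z_t in enumerate(z, start=2):
    print(f"z_{t} =", z_t.round(3))
```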
While this solved the parallelization issue of RNNs, it introduced a severe architectural weakness: restricted receptive fields.
The representation $z_2$ only contains information from "The" and "cat". If we want to capture distant dependencies (e.g., matching a pronoun at token 50 with its subject at token 1), we have to stack many convolutional layers: the number of layers needed to connect distant positions grows linearly with distance, or logarithmically when using dilated convolutions. This growing path between distant positions makes it harder to learn long-range dependencies [1].
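A back-of-the-envelope sketch of that growth (illustrative arithmetic, not figures from the paper): with a width-3 kernel, ordinary stacked convolutions widen the receptive field by only two tokens per layer, while dilated stacks need roughly $\log_k$ of the distance.

```python
import math

distance = 50          # tokens separating the pronoun and its subject
k = 3                  # kernel width

# Ordinary stacked convolutions: receptive field after L layers is L*(k-1) + 1.
stacked_layers = math.ceil((distance - 1) / (k - 1))

# Dilated convolutions: roughly log_k(distance) layers suffice (the ByteNet approach).
dilated_layers = math.ceil(math.log(distance, k))

print(stacked_layers, dilated_layers)   # ~25 layers vs ~4 layers
```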
3. The Bridge: Bahdanau Attention Mechanisms
Before the entire architecture was overhauled, the Attention Mechanism was born as an addition to RNN models for machine translation [2].
In a traditional Sequence-to-Sequence (Seq2Seq) model, an encoder compresses the entire sentence into a single fixed-length context vector, and the decoder generates the output from it. However, squashing a 20-word sentence into one vector loses massive amounts of detail.
Bahdanau Attention solved this by letting the decoder look back at all the encoder's hidden states dynamically, weighing how relevant each input word is for the current token being decoded.
The decoder builds a context vector as a weighted sum of the encoder's hidden states, $c_t = \sum_i \alpha_{t,i} h_i$, where the attention weights $\alpha_{t,i}$ are computed via an alignment model (often a small feed-forward network).
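A minimal sketch of that idea (NumPy, toy shapes; the alignment network below uses the standard additive form as an assumption, not a detail taken from the article):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 4
enc_states = rng.normal(size=(6, d))     # encoder hidden states h_1..h_6
dec_state = rng.normal(size=d)           # current decoder state s_t

# Additive (Bahdanau-style) alignment model: a tiny feed-forward network.
W_s, W_h, v = rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=d)
scores = np.tanh(enc_states @ W_h + dec_state @ W_s) @ v   # one score per input word

weights = np.exp(scores) / np.exp(scores).sum()            # softmax -> attention weights
context = weights @ enc_states                             # weighted sum of encoder states
print(weights.round(2), context.round(2))
```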
This dynamic weighting solved the information bottleneck. But the base model was still an RNN, suffering from sequential slowdowns!
4. The Origin of Transformers: Self-Attention
In "Attention Is All You Need", researchers realized something profound: If Attention is so powerful at bridging distant words between encoder and decoder, why not use it to connect words within the same sequence, entirely removing the RNN/CNN?
This gave birth to Self-Attention.
In Self-Attention, every token looks at every other token in the sequence simultaneously to compute its own representation. The computational distance between any two tokens is reduced to $O(1)$.
How does it work? By creating three vectors for each token: a Query ($q$), a Key ($k$), and a Value ($v$). Let's examine this step-by-step with our running example.
First, linear projections generate three different vectors from the same input embedding for every token:
- Query ($q$): "What am I looking for?"
- Key ($k$): "What do I contain?"
- Value ($v$): "If you match my Key, here is the actual information I provide."
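Concretely, the three vectors come from three learned weight matrices applied to the same embedding. A tiny sketch with made-up weights and dimensions (not the article's numbers):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 2
x = rng.normal(size=(4, d))                  # embeddings for ["The", "cat", "sat", "."]

W_q, W_k, W_v = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = x @ W_q, x @ W_k, x @ W_v          # one query, key, value per token
print(Q.shape, K.shape, V.shape)             # (4, 2) each
```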
The famous Scaled Dot-Product Attention equation calculates the interaction: $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$
Let's assume simple 2-dimensional embeddings ($d_k = 2$). When token 2 ("cat") generates its Query $q_2$, we dot product it against all Keys $k_1, \dots, k_4$ to see who matches: $\text{score}_{2,j} = \frac{q_2 \cdot k_j}{\sqrt{d_k}}$.
Applying softmax normalizes these scores into probabilities (attention weights): $\alpha_2 = [0.21, 0.28, 0.27, 0.24]$.
*Figure 3: Self-Attention weights matrix for "The cat sat.". Each row represents the token acting as the Query trying to find relevant Keys across the entire sequence.*
As visualised in Figure 3, the token "cat" focuses its attention highest on itself ($\alpha_{2,2} = 0.28$) and its verb "sat" ($\alpha_{2,3} = 0.27$).
Finally, let's look at how the final representation for "cat" is computed using the Values ($V$).
- Input: The attention weights for "cat": $\alpha_2 = [0.21, 0.28, 0.27, 0.24]$, and the sequence's Value vectors (assume they equal the input embeddings for simplicity: $v_1 = x_1$, etc.).
- Formula: $z_2 = \sum_{j=1}^{4} \alpha_{2,j} v_j$
- Worked Example: $z_2 = 0.21\,v_1 + 0.28\,v_2 + 0.27\,v_3 + 0.24\,v_4$
- Output: After calculation, $z_2$ is a new vector for "cat" that blends information from every token in the sentence.
By summing these weighted vectors, the model produces a contextualized representation ($z_2$) that incorporates the presence of "sat" instantaneously, with no sequential dependency (unlike RNNs) and no local window restriction (unlike CNNs). The distance from any word to any other word is exactly one step. This is the key advantage of the Transformer.
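Putting the whole pipeline together, here is a compact sketch of scaled dot-product self-attention over our four tokens. The embeddings and weights are made up, so the resulting weights will not match the article's 0.21/0.28/0.27/0.24 exactly, but the computation is the same.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of embeddings X."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # every token scores every token
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V, weights                         # contextualized vectors, attention map

rng = np.random.default_rng(4)
d = 2
X = rng.normal(size=(4, d))                             # ["The", "cat", "sat", "."]
Ws = [rng.normal(size=(d, d)) for _ in range(3)]
Z, A = self_attention(X, *Ws)
print(A.round(2))        # 4x4 attention matrix, like Figure 3
print(Z.round(2))        # new representation for each token, including z_2 for "cat"
```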
5. Multi-Head Attention: Looking from Multiple Angles
A single attention operation might overly focus on one type of relationship (e.g., adjacent verbs). To combat this, Transformers deploy Multi-Head Attention. Instead of calculating attention once, the model calculates it $h$ times in parallel across different subspaces (by using different projection weights for $Q$, $K$, and $V$).
This parallel calculation acts similarly to multiple feature channels in CNNs but without the local window restriction. Each head looks at different grammatical/semantic structures dynamically.
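A rough sketch of the multi-head mechanism (head count, sizes, and weights are illustrative assumptions): each head runs attention with its own projections, and the heads' outputs are concatenated and projected back to the model dimension.

```python
import numpy as np

def attention(Q, K, V):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(5)
d_model, n_heads = 8, 2
d_head = d_model // n_heads
X = rng.normal(size=(4, d_model))                       # ["The", "cat", "sat", "."]

heads = []
for _ in range(n_heads):                                # each head has its own projections
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(X @ W_q, X @ W_k, X @ W_v))

W_o = rng.normal(size=(d_model, d_model))
output = np.concatenate(heads, axis=-1) @ W_o           # concatenate heads, project back
print(output.shape)                                     # (4, 8): one vector per token
```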
6. Architecture of the Transformer
Because Self-Attention operates universally and instantaneously over an entire sequence, the model natively has no idea about token order. The word "cat" at position 2 or position 100 looks identical to the attention mechanism.
To fix this, the original paper introduced Positional Encodings: sine and cosine signals of varying frequencies added to the input embeddings, so that identical words have distinct mathematical fingerprints depending on their exact position.
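A short sketch of those sinusoidal encodings, following the paper's formula $PE_{(pos, 2i)} = \sin\!\left(pos / 10000^{2i/d_{model}}\right)$ and its cosine counterpart for odd dimensions (the sequence length and dimensionality below are illustrative):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings from "Attention Is All You Need"."""
    pos = np.arange(seq_len)[:, None]                       # positions 0..seq_len-1
    i = np.arange(d_model // 2)[None, :]                    # dimension-pair index
    angles = pos / np.power(10000, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                            # even dimensions
    pe[:, 1::2] = np.cos(angles)                            # odd dimensions
    return pe

pe = positional_encoding(seq_len=4, d_model=8)              # one row per token position
# Added to the embeddings, so "cat" at position 2 differs from "cat" at position 100.
print(pe.round(2))
```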
With self-attention replacing recurrence and convolution, Vaswani et al. [1] unified it into the full Sequence-to-Sequence architecture:
- Encoder Layer: A stack of Multi-Head Self-Attention layers followed by point-wise Feed-Forward Networks.
- Decoder Layer: Similar to the encoder, but uses Masked Self-Attention (so it cannot see future target words while generating the output) and a second Encoder-Decoder Attention mechanism, where Queries come from the decoder and Keys/Values come from the encoder's final output.
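To illustrate the decoder's masked self-attention, here is a small sketch that sets future positions to $-\infty$ before the softmax, which is a standard way to implement the mask (the projections and values are toy assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 2
Q = K = V = rng.normal(size=(4, d))                     # toy decoder-side projections

scores = Q @ K.T / np.sqrt(d)
mask = np.triu(np.ones((4, 4), dtype=bool), k=1)        # True above the diagonal = "future"
scores[mask] = -np.inf                                  # future tokens become unreachable

weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)          # each row attends only to positions <= t
print(weights.round(2))                                 # lower-triangular attention matrix
```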
7. A Permanent Shift in AI
By discarding sequential dependency (RNNs) and local receptive-field limits (CNNs), the Transformer achieved two massive wins:
- O(1) global distance: Every word is exactly one mathematical step away from every other word.
- Massive parallelization: Matrix operations can process an entire sequence simultaneously. The resulting reduction in training time allowed researchers to scale networks to millions, then billions of parameters, unlocking the dawn of Large Language Models (LLMs) like GPT [3,4], BERT, and PaLM.
"Attention Is All You Need" did exactly what its title implied: it stripped away the convolution and recurrence wrappers, proving that self-attention alone was a vastly superior engine.