Azhary Arliansyah


Why ChatGPT is Possible: How Transformers Rebuilt the Modern World

Tags: nlp, transformer, machine learning, history

Sequence transduction, such as language translation, relies heavily on mapping a sequence of inputs, $(x_1, \dots, x_n)$, to an output sequence $(y_1, \dots, y_m)$. For years, the dominant models for these tasks were based on complex Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) [1].

However, in 2017, the paper "Attention Is All You Need" proposed a radically different approach: dispensing with recurrence and convolutions entirely, and relying solely on Attention Mechanisms.

In this article, we will explore the limitations of the Transformer's predecessors, and why self-attention became the breakthrough that would go on to shape modern AI.

Throughout this article, we will trace a single running example, the input sequence "The cat sat.", to see how each architecture processes it.

1. The Sequential Bottleneck of RNNs and LSTMs

Recurrent Neural Networks, including Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks, were long established as the state of the art in sequence modeling.

RNNs intrinsically process tokens sequentially. For an input sequence like "The cat sat.", the network generates a sequence of hidden states $h_t$, each a function of the previous hidden state $h_{t-1}$ and the current input $x_t$:

$$h_t = f(h_{t-1}, x_t)$$

While LSTMs excel at retaining historical context, their sequential nature creates a major bottleneck: they cannot be parallelized. To compute $h_3$ for "sat", the model must wait for the completion of $h_1$ for "The" and $h_2$ for "cat".

Let's visualize this dependency using our running example, assuming 2-dimensional embeddings for the four tokens ["The", "cat", "sat", "."].

  • "The": embedding $x_1 = [0.4, -0.1]$, hidden state $h_1 = [0.2, 0.5]$
  • "cat": embedding $x_2 = [0.8, 0.3]$, hidden state $h_2 = [-0.1, 0.9]$
  • "sat": embedding $x_3 = [0.3, 0.7]$, hidden state $h_3 = [0.6, -0.2]$
  • ".": embedding $x_4 = [-0.5, 0.2]$, hidden state $h_4 = [0.1, 0.8]$

Each $h_t$ depends on $h_{t-1}$, and the final hidden state is (optionally) passed on to a decoder.

*Figure 1: Sequential processing of an RNN. Each step mathematically requires the completion of the previous step's hidden state $h_{t-1}$.*

Let's look at the mathematical steps for the token "sat" ($x_3$):

  1. Input: $x_3 = [0.3, 0.7]$ arrives.
  2. Formula: $h_3 = \tanh(W_x x_3 + W_h h_2 + b)$
  3. Worked example: assume $W_x$ and $W_h$ are identity matrices for simplicity, so $h_3 \approx \tanh([0.3, 0.7] + [-0.1, 0.9])$.
  4. Output: we get $h_3 = [0.6, -0.2]$ (the values here are illustrative).

As step 3 makes clear, calculating $h_3 = [0.6, -0.2]$ is strictly dependent on completing the operations for $h_2 = [-0.1, 0.9]$. We cannot start computing "sat" before "cat" is completely finished. For very long sequences, this strict sequential pipeline prevents computation from running in parallel, making training agonizingly slow, and it aggravates the notorious vanishing gradient problem, where information from distant earlier tokens ("The") dissipates.
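To make this dependency concrete, here is a minimal NumPy sketch of the recurrence over our toy embeddings. The identity weights mirror the simplification in the worked example above; a trained RNN would have learned weights, so the resulting numbers are purely illustrative.

```python
import numpy as np

# Toy 2-dimensional embeddings for "The", "cat", "sat", "." (values from Figure 1)
x = np.array([[0.4, -0.1],
              [0.8,  0.3],
              [0.3,  0.7],
              [-0.5, 0.2]])

W_x, W_h, b = np.eye(2), np.eye(2), np.zeros(2)   # identity weights, for simplicity

h = np.zeros(2)            # h_0
hidden_states = []
for x_t in x:              # strictly sequential: step t cannot start before h_{t-1} exists
    h = np.tanh(W_x @ x_t + W_h @ h + b)          # h_t = tanh(W_x x_t + W_h h_{t-1} + b)
    hidden_states.append(h)
```

The `for` loop is exactly the bottleneck: no matter how many GPU cores are available, the four iterations must run one after another.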

2. Convolutional Routes: Parallel, but Local

To overcome this lack of parallelism, models like Convolutional Sequence to Sequence (ConvS2S) and ByteNet turned to Convolutional Neural Networks (CNNs).

CNNs can process all tokens at once. Using a kernel (e.g., of size $k=2$), they slide over the input embeddings in parallel.

With a kernel of size $k=2$, each output looks at a window of two adjacent tokens:

  • $c_1$ from ("The", "cat"): $[0.3, 0.4]$
  • $c_2$ from ("cat", "sat"): $[-0.2, 0.6]$

Both outputs can be computed in parallel.

*Figure 2: 1D convolution over the sequence. Each output only looks at $k$ local tokens, and all outputs are computed simultaneously.*

Let's see the math for the CNN's output $c_1$:

  1. Input: a kernel of size $k=2$ looks at $x_1$ ("The") and $x_2$ ("cat") simultaneously.
  2. Formula: $c_i = \text{activation}(W \cdot [x_i ; x_{i+1}] + b)$
  3. Worked example: $c_1 \approx \text{activation}(W \cdot [[0.4, -0.1] ; [0.8, 0.3]] + b)$
  4. Output: we get $c_1 = [0.3, 0.4]$.

Because the computation of $c_1$ doesn't rely on the output of any other $c_i$, we can compute $c_1$ and $c_2$ at the exact same time on GPU hardware.
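As a rough sketch of this independence, the snippet below builds every size-2 window over the toy embeddings and processes all of them in a single batched matrix multiplication. The weight matrix `W` is an arbitrary placeholder, not a trained kernel.

```python
import numpy as np

# Same toy embeddings for "The cat sat ."
x = np.array([[0.4, -0.1], [0.8, 0.3], [0.3, 0.7], [-0.5, 0.2]])

k = 2                              # kernel size: each output sees only k adjacent tokens
W = np.eye(2, 2 * k)               # placeholder 2x4 weight matrix (illustrative, not trained)
b = np.zeros(2)

# Every window [x_i ; x_{i+1}] is independent of the others...
windows = np.stack([np.concatenate([x[i], x[i + 1]]) for i in range(len(x) - k + 1)])

# ...so all outputs c_1, c_2, c_3 come out of one parallel matrix multiplication
c = np.tanh(windows @ W.T + b)     # shape (3, 2)
```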

While this solved the parallelization issue of RNNs, it introduced a severe architectural weakness: restricted receptive fields.

The representation $c_1$ only contains information from "The" and "cat". If we want to capture distant dependencies (e.g., matching a pronoun at token 50 with its subject at token 1), we have to stack many convolutional layers: linearly many with standard convolutions, or logarithmically many with dilated convolutions. The number of operations needed to connect distant positions grows with their distance, making it harder to learn long-range dependencies [1].

3. The Bridge: Bahdanau Attention Mechanisms

Before the entire architecture was overhauled, the attention mechanism was introduced as an addition to RNN models for machine translation [2].

In a traditional Sequence-to-Sequence (Seq2Seq) model, an encoder compresses the entire sentence into a single fixed-length context vector, and the decoder generates the output from it. However, squashing a 20-word sentence into one vector loses massive amounts of detail.

Bahdanau Attention solved this by letting the decoder look back at all of the encoder's hidden states $h_j$ dynamically, weighing how relevant each input word is to the current token being decoded:

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$

where the attention weights $\alpha_{ij}$ are computed via an alignment model (often a small feed-forward network).
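A minimal sketch of this idea, reusing the encoder states from Figure 1 and using a made-up decoder state and illustrative alignment weights, looks like this:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Encoder hidden states h_j (one row per source token, values from Figure 1)
H = np.array([[0.2, 0.5], [-0.1, 0.9], [0.6, -0.2], [0.1, 0.8]])
s = np.array([0.3, 0.3])                     # hypothetical current decoder state s_i

# Additive alignment model e_ij = v^T tanh(W_s s_i + W_h h_j); all weights are illustrative
W_s, W_h, v = np.eye(2), np.eye(2), np.ones(2)
e = np.array([v @ np.tanh(W_s @ s + W_h @ h_j) for h_j in H])

alpha = softmax(e)                           # attention weights alpha_ij over the source tokens
c = alpha @ H                                # context vector c_i = sum_j alpha_ij h_j
```

The decoder then conditions on this per-step context vector instead of a single fixed bottleneck vector.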

This dynamic weighting solved the information bottleneck. But the base model was still an RNN, suffering from sequential slowdowns!

4. The Origin of Transformers: Self-Attention

In "Attention Is All You Need", researchers realized something profound: If Attention is so powerful at bridging distant words between encoder and decoder, why not use it to connect words within the same sequence, entirely removing the RNN/CNN?

This gave birth to Self-Attention.

In Self-Attention, every token looks at every other token in the sequence simultaneously to compute its own representation. The computational distance between any two tokens is reduced to $\mathcal{O}(1)$.

How does it work? By creating three vectors for each token: a Query ($Q$), a Key ($K$), and a Value ($V$). Let's examine this step by step with our running example.

First, linear projections generate three different vectors from the same input embedding for every token:

  • Query ($Q$): "What am I looking for?"
  • Key ($K$): "What do I contain?"
  • Value ($V$): "If you match my Key, here is the actual information I provide."

The famous Scaled Dot-Product Attention equation calculates the interaction:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Let's assume simple 2-dimensional embeddings ($d_k = 2$). When token 2 ("cat") generates its Query $Q_{cat} = [0.0, 0.5]$, we take its dot product with every Key to see who matches:

  1. $Q_{cat} \cdot K_{the} = [0.0, 0.5] \cdot [0.5, -0.1] = -0.05$
  2. $Q_{cat} \cdot K_{cat} = [0.0, 0.5] \cdot [-0.1, 0.5] = 0.25$
  3. $Q_{cat} \cdot K_{sat} = [0.0, 0.5] \cdot [0.1, 0.4] = 0.20$
  4. $Q_{cat} \cdot K_{.} = [0.0, 0.5] \cdot [0.2, 0.2] = 0.10$

Applying softmax normalizes these scores into probabilities (attention weights); for simplicity, we skip the $\sqrt{d_k}$ scaling from the full formula in this toy calculation:

| Query \ Key | The | cat | sat | . |
| --- | --- | --- | --- | --- |
| The | 0.29 | 0.22 | 0.24 | 0.25 |
| cat | 0.21 | 0.28 | 0.27 | 0.24 |
| sat | 0.21 | 0.28 | 0.27 | 0.24 |
| . | 0.25 | 0.25 | 0.25 | 0.25 |

*Figure 3: Self-Attention weights matrix for "The cat sat.". Each row represents the token acting as the Query trying to find relevant Keys across the entire sequence.*

As visualised in Figure 3, the token "cat" focuses its attention highest on itself (0.28) and its predicate verb "sat" (0.27).

Finally, let's look at how the final representation for "cat" is produced using the Values ($V$).

  1. Input: the attention weights for "cat", $[0.21, 0.28, 0.27, 0.24]$, and the sequence's Value vectors (assume they equal the input embeddings for simplicity: $V_{the} = [0.4, -0.1]$, etc.).
  2. Formula: $z_{cat} = \sum_j \alpha_{cat,j} V_j$
  3. Worked example: $z_{cat} = 0.21 \times [0.4, -0.1] + 0.28 \times [0.8, 0.3] + 0.27 \times [0.3, 0.7] + 0.24 \times [-0.5, 0.2]$
  4. Output: after calculation, $z_{cat} = [0.269, 0.3]$.

By summing these weighted vectors, the model produces a contextualized representation $z_{cat}$ that has absorbed the presence of "sat" instantaneously, with no sequential dependency (unlike RNNs) and no local window restriction (unlike CNNs). The distance from any word to any other word is exactly one step. That is why the Transformer is vastly superior.
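Here is a compact NumPy sketch of the full scaled dot-product attention over the toy sequence. The Keys and Values match the numbers used above, the Queries for the other three tokens are invented for illustration, and because this version keeps the $\sqrt{d_k}$ scaling, its weights differ slightly from the rounded values in Figure 3.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, computed for all tokens at once."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V, weights

# Toy projections for "The cat sat ." (Q rows other than "cat" are hypothetical)
Q = np.array([[0.5, 0.1], [0.0, 0.5], [0.1, 0.4], [0.2, 0.2]])
K = np.array([[0.5, -0.1], [-0.1, 0.5], [0.1, 0.4], [0.2, 0.2]])
V = np.array([[0.4, -0.1], [0.8, 0.3], [0.3, 0.7], [-0.5, 0.2]])  # V = input embeddings

Z, A = scaled_dot_product_attention(Q, K, V)   # every z_t is produced at once, no recurrence
```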

5. Multi-Head Attention: Looking from Multiple Angles

A single attention operation might overly focus on one type of relationship (e.g., adjacent verbs). To combat this, Transformers deploy Multi-Head Attention. Instead of calculating attention once, the model calculates it $h$ times in parallel across different subspaces (using different projection weights for $Q$, $K$, and $V$).

This parallel calculation acts similarly to multiple feature channels in CNNs but without the local window restriction. Each head looks at different grammatical/semantic structures dynamically.
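A rough sketch of the idea, with made-up dimensions ($d_\text{model} = 4$, $h = 2$ heads) and random projection weights standing in for learned ones:

```python
import numpy as np

def attention(Q, K, V):
    s = Q @ K.T / np.sqrt(K.shape[-1])
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

rng = np.random.default_rng(0)
d_model, n_heads = 4, 2
d_k = d_model // n_heads
X = rng.normal(size=(4, d_model))            # 4 tokens, hypothetical d_model = 4 embeddings

heads = []
for _ in range(n_heads):
    # Each head gets its own Q/K/V projections, so it attends over a different subspace
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
    heads.append(attention(X @ W_q, X @ W_k, X @ W_v))

W_o = rng.normal(size=(d_model, d_model))
Z = np.concatenate(heads, axis=-1) @ W_o     # concatenate the heads, project back to d_model
```

In the actual model the projections are learned; the original paper uses $h = 8$ heads with $d_k = d_\text{model}/h = 64$.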

6. Architecture of the Transformer

Because Self-Attention operates universally and instantaneously over an entire sequence, the model natively has no idea about token order. The word "cat" at position 2 or position 100 looks identical to the attention mechanism.

To fix this, the original paper introduced Positional Encodings: sine and cosine waves of different frequencies added to the input embeddings, so that the same word receives a distinct mathematical fingerprint depending on its position.
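A short sketch of the sinusoidal encoding from the original paper, where dimension $2i$ uses $\sin(pos / 10000^{2i/d_\text{model}})$ and dimension $2i+1$ uses the matching cosine:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

# "cat" at position 2 and at position 100 now receives a different additive fingerprint
pe = positional_encoding(seq_len=128, d_model=512)
```

These encodings are simply added to the token embeddings before the first attention layer.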

With self-attention replacing recurrence and convolution, Vaswani et al. [1] unified it into the full Sequence-to-Sequence architecture:

  1. Encoder: a stack of layers, each combining Multi-Head Self-Attention with a position-wise Feed-Forward Network.
  2. Decoder: similar to the encoder, but it uses Masked Self-Attention (so it cannot see future target words during generation; see the sketch below) and a second Encoder-Decoder Attention mechanism, where Queries come from the decoder and Keys/Values come from the encoder's final output.
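To illustrate the masking step, here is a small sketch that zeroes out attention to future positions; the scores are placeholders standing in for $QK^T/\sqrt{d_k}$:

```python
import numpy as np

seq_len = 4
scores = np.zeros((seq_len, seq_len))                   # placeholder attention scores
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf                                # future tokens are masked out

weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)          # row-wise softmax
# Row t of `weights` now distributes attention only over tokens 1..t,
# so the decoder cannot peek at words it has not generated yet.
```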

7. A Permanent Shift in AI

By discarding sequential dependency (RNNs) and local receptive-field limits (CNNs), the Transformer achieved two massive wins:

  1. $\mathcal{O}(1)$ global distance: every word is exactly one mathematical step away from every other word.
  2. Massive parallelization: matrix operations process an entire sequence simultaneously. The resulting reduction in training time allowed researchers to scale networks to millions, then billions of parameters, unlocking the dawn of Large Language Models (LLMs) like GPT [3,4], BERT, and PaLM.

"Attention Is All You Need" did exactly what its title implied: it stripped away the convolution and recurrence wrappers, proving that self-attention alone was a vastly superior engine.

References

1. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems.
2. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
3. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems.
4. OpenAI (2023). GPT-4 technical report. arXiv preprint arXiv:2303.08774.