ai · ml · intermediate · 12m · 2026-06-09

Transformers Part 2 — Positional Encoding, RoPE, MLP, LayerNorm

Session 6 of the 48-session learning series.

Date: Sun, 2026-06-14 · Time: 09:00–11:00 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 01 · Est. read: 2 h

Why this session matters

This is Session 06 of 48, in the LLMs & Agents track. Like the other sessions, it covers one focused topic at a pace that leaves time to absorb it rather than rush.

Agenda

Why position has to be injected at all (attention is permutation-equivariant: it treats its input as a set)
Three eras of positional encoding — sinusoidal, learned, RoPE
RoPE math — rotate Q/K pairs, relative position emerges from QᵀK
The MLP — 4× expansion, GELU/SwiGLU, where the facts live
KV cache anatomy — what actually grows at inference time

Pre-read (skim before the session)

Self-attention is a set operation. Permute the inputs and the outputs permute the same way — but the content of each output doesn't change. A transformer without position info would see "dog bites man" and "man bites dog" as identical bags of words.
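The set behaviour is easy to check numerically. The sketch below is a toy single-head attention with identity projections (Q = K = V = the inputs) and made-up 2-d embeddings; `self_attention`, `emb` and `close` are illustrative names, not part of any library. It feeds the two word orders through and compares the per-word outputs.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(xs):
    """Single-head attention with identity projections and no position info."""
    d = len(xs[0])
    out = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in xs]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, xs)) for j in range(d)])
    return out

# Hypothetical embeddings, chosen only for illustration.
emb = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}
a = self_attention([emb[w] for w in ["dog", "bites", "man"]])
b = self_attention([emb[w] for w in ["man", "bites", "dog"]])

def close(u, v, tol=1e-12):
    return all(abs(p - q) < tol for p, q in zip(u, v))

# Each word gets the same output vector in both sentences; only the row order moves.
same_dog = close(a[0], b[2])
same_bites = close(a[1], b[1])
same_man = close(a[2], b[0])
```

All three flags come out true: the model has no way to tell which word came first.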

We need to inject position. Three eras of how:

2. Era 1 — sinusoidal (original "Attention Is All You Need")

Each position gets a deterministic vector built from sines and cosines at log-spaced frequencies:

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Added (not concatenated) to the token embedding. Properties:

Deterministic — no parameters to train.
Periodic — each frequency repeats; combinations give unique position vectors up to a long horizon.
Extrapolates poorly to lengths beyond what was trained on.
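The two formulas translate almost line for line into code. This is a minimal sketch assuming an even `d_model`; the function name `sinusoidal_pe` is illustrative.

```python
import math

def sinusoidal_pe(pos, d_model, base=10000.0):
    """Return the d_model-dimensional sinusoidal encoding for one position."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        pe[2 * i] = math.sin(angle)      # even index: sine
        pe[2 * i + 1] = math.cos(angle)  # odd index: cosine
    return pe

pe0 = sinusoidal_pe(0, 8)   # position 0: every sine is 0, every cosine is 1
pe1 = sinusoidal_pe(1, 8)
```

Low values of i oscillate quickly with position and high values of i oscillate slowly, which is what makes the combined vector distinct per position.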

3. Era 2 — learned absolute (GPT-2)

A (max_pos, d_model) embedding table. Looked up by position, added to token embeddings.

✅ Trains end-to-end.
❌ Hard cap at max_pos (GPT-2: 1024). Can't extend without retraining.
❌ No structural inductive bias for "tokens close in position should attend more".
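The hard cap is a direct consequence of the table shape. A toy sketch, with randomly initialised rows standing in for trained ones and a hypothetical class name `LearnedPositionTable`:

```python
import random

class LearnedPositionTable:
    """Toy (max_pos, d_model) table; real rows would be trained, not random."""

    def __init__(self, max_pos, d_model, seed=0):
        rng = random.Random(seed)
        self.max_pos = max_pos
        self.rows = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                     for _ in range(max_pos)]

    def add_to(self, token_embeddings):
        # There is simply no row for positions >= max_pos.
        if len(token_embeddings) > self.max_pos:
            raise ValueError("sequence longer than max_pos")
        return [[t + p for t, p in zip(tok, self.rows[pos])]
                for pos, tok in enumerate(token_embeddings)]

table = LearnedPositionTable(max_pos=4, d_model=2)
ok = table.add_to([[0.0, 0.0]] * 3)      # fits within the table
try:
    table.add_to([[0.0, 0.0]] * 5)       # one position too many
    overflowed = False
except ValueError:
    overflowed = True
```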

4. Era 3 — Rotary (RoPE) — used by Llama, Mistral, Qwen, DeepSeek

Instead of adding a position vector to the embedding, RoPE rotates Q and K pairs by an angle proportional to position. For position m and a pair (q_{2i}, q_{2i+1}):

[ cos(mθ_i)  −sin(mθ_i) ] [ q_{2i}   ]
[ sin(mθ_i)   cos(mθ_i) ] [ q_{2i+1} ]

with frequencies θ_i = 10000^(−2i/d).

The magic: (R_m q)ᵀ (R_n k) depends only on m − n — the relative offset — even though we encoded absolute positions. So attention scores naturally know "how far apart are these two tokens?" without any extra mechanism.
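The property follows in two steps from the fact that 2-D rotations compose by adding their angles. For a single pair with frequency θ_i, writing R_m for the rotation matrix above:

```latex
\begin{aligned}
R_m &= \begin{pmatrix} \cos m\theta_i & -\sin m\theta_i \\ \sin m\theta_i & \cos m\theta_i \end{pmatrix},
\qquad R_m^{\top} = R_{-m}, \qquad R_a R_b = R_{a+b} \\[4pt]
(R_m q)^{\top} (R_n k) &= q^{\top} R_m^{\top} R_n \, k
 = q^{\top} R_{-m} R_n \, k
 = q^{\top} R_{n-m} \, k
\end{aligned}
```

The absolute positions m and n cancel and only the offset n − m survives. The full score is the sum of this expression over all pairs i, so the same holds for the whole vector.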

Why everyone moved to RoPE:

✅ Relative information for free.
✅ Extends to longer contexts via base-frequency scaling (NTK-aware, YaRN).
✅ Cheap: two multiplies and one add per element, with no extra parameters.
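A minimal sketch of the rotation, assuming an even-length vector and the base 10000 used in the frequency formula; `rope` and `dot` are illustrative names rather than a library API. It checks numerically that the score depends on the offset between positions and not on the positions themselves.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each (2i, 2i+1) pair of vec by the angle pos * theta_i."""
    d = len(vec)
    out = [0.0] * d
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = c * x0 - s * x1
        out[2 * i + 1] = s * x0 + c * x1
    return out

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# Arbitrary example vectors.
q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

score_near = dot(rope(q, 5), rope(k, 2))       # positions 5 and 2, offset 3
score_far = dot(rope(q, 105), rope(k, 102))    # positions 105 and 102, offset 3
score_other = dot(rope(q, 5), rope(k, 3))      # offset 2
```

`score_near` and `score_far` agree up to floating-point error, while `score_other` differs because the offset changed.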

5. The MLP — where most of the parameters and facts live

After attention, every block applies a position-wise MLP:

MLP(x) = W_2 · activation(W_1 · x + b_1) + b_2

with W_1: d → 4d, W_2: 4d → d. So the hidden dim is 4× the model dim.

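For Llama-3-8B: d=4096 → hidden=14336 (≈3.5×; SwiGLU uses three weight matrices instead of two, which changes the constant). Each block has ~176M MLP params (3 × 4096 × 14336); attention has ~42M with grouped-query attention (it would be ~67M with full multi-head attention). MLPs are roughly 80% of each block's parameters.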
The activation has evolved: ReLU → GELU (BERT, GPT-2) → SwiGLU (PaLM, Llama).
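The per-block counts are plain multiplication and easy to check. The sketch below assumes the published Llama-3-8B shape (32 query heads, 8 KV heads, no biases); the function names are illustrative. It computes the two-matrix and three-matrix MLP sizes side by side, since SwiGLU adds a gate matrix on top of the up and down projections, and does the same for attention with and without KV sharing.

```python
def mlp_params(d_model, d_hidden, n_matrices):
    """Weights in an MLP made of n_matrices d_model x d_hidden matrices (no biases)."""
    return n_matrices * d_model * d_hidden

def attention_params(d_model, n_heads, n_kv_heads):
    """Weights in W_Q, W_K, W_V, W_O; K and V shrink when KV heads are shared."""
    d_head = d_model // n_heads
    q_and_o = 2 * d_model * d_model
    k_and_v = 2 * d_model * n_kv_heads * d_head
    return q_and_o + k_and_v

d, hidden = 4096, 14336
two_matrix_mlp = mlp_params(d, hidden, 2)   # classic W_1, W_2 layout
swiglu_mlp = mlp_params(d, hidden, 3)       # gate, up, down
full_mha = attention_params(d, 32, 32)      # every head keeps its own K and V
gqa = attention_params(d, 32, 8)            # 8 shared KV heads

mlp_share = swiglu_mlp / (swiglu_mlp + gqa)
```

With three matrices and shared KV heads, the MLP holds about four fifths of the block's weights.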

Mechanistic interpretability work (from Anthropic, OpenAI, and academic groups) suggests that MLPs behave like key-value lookups: certain neurons fire on "Eiffel Tower" and the layer writes "Paris" into the residual stream. Much of the model's factual knowledge appears to sit in the MLP weights. Attention chooses what to mix; the MLP decides what to write.

6. LayerNorm — pre-norm vs post-norm

Every sub-layer is wrapped in either:

Pre-norm (Llama, GPT-NeoX, most modern): x = x + sublayer(LN(x))
Post-norm (original Transformer): x = LN(x + sublayer(x))

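Pre-norm trains more stably at depth because the residual path stays an unnormalised identity, so the signal and its gradient reach early layers directly; you can stack 80 blocks without warmup tricks. Post-norm can give slightly better final perplexity but needs careful warmup. Default to pre-norm.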
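One way to see the difference is to plug in a sub-layer that outputs zeros. The sketch below uses a parameter-free layer norm (scale 1, shift 0) and illustrative function names; it is a toy on a single vector, not a training experiment.

```python
import math

def layer_norm(x, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def pre_norm(x, sublayer):
    # x = x + sublayer(LN(x)): the residual path is never normalised.
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def post_norm(x, sublayer):
    # x = LN(x + sublayer(x)): the residual path itself goes through LN.
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def zero_sublayer(x):
    return [0.0] * len(x)

x = [10.0, -4.0, 2.5, 0.5]
pre_out = pre_norm(x, zero_sublayer)    # x passes through unchanged
post_out = post_norm(x, zero_sublayer)  # x is rescaled even though the sub-layer did nothing
```

In the pre-norm wiring a block that contributes nothing is an exact identity, which is the property that keeps very deep stacks well behaved.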

LN itself: per-token, per-layer, normalise to zero mean and unit variance, then apply learnable scale γ and shift β:

LN(x) = γ * (x − mean(x)) / √(var(x) + ε) + β

RMSNorm (Llama variant) drops the mean centring and the shift β. It works about as well in practice and is slightly cheaper:

RMSNorm(x) = γ * x / √(mean(x²) + ε)
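Both formulas fit in a few lines each. A minimal sketch on a single token vector, assuming ε = 1e-5 and plain Python lists; the function names are illustrative.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    mean_sq = sum(v * v for v in x) / len(x)
    return [g * v / math.sqrt(mean_sq + eps) for v, g in zip(x, gamma)]

x = [1.0, 2.0, 3.0, 4.0]
ones, zeros = [1.0] * 4, [0.0] * 4

ln_out = layer_norm(x, ones, zeros)
rms_out = rms_norm(x, ones)

ln_mean = sum(ln_out) / 4                        # centred at zero
rms_mean = sum(rms_out) / 4                      # not centred
rms_mean_sq = sum(v * v for v in rms_out) / 4    # unit mean square
```

LayerNorm fixes both the mean and the variance of the output; RMSNorm fixes only the scale, so an input with a non-zero mean keeps one.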

7. KV cache — what actually grows in memory at inference

When you generate token by token, for every new token you compute Q from the latest position only, but you need K and V from every previous position. Standard impl: store K/V tensors per layer, per head, per position — the KV cache.
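A toy version of that bookkeeping, with placeholder vectors where a real model would store the projected keys and values. `KVCache` here is a hypothetical class for illustration, not the structure any particular framework uses.

```python
class KVCache:
    """Toy cache: one K vector and one V vector per layer per generated position."""

    def __init__(self, n_layers):
        self.keys = [[] for _ in range(n_layers)]
        self.values = [[] for _ in range(n_layers)]

    def append(self, layer, k, v):
        self.keys[layer].append(k)
        self.values[layer].append(v)

    def num_floats(self):
        k_count = sum(len(vec) for layer in self.keys for vec in layer)
        v_count = sum(len(vec) for layer in self.values for vec in layer)
        return k_count + v_count

n_layers, d_model = 2, 4
cache = KVCache(n_layers)
sizes = []
for step in range(5):                     # generate 5 tokens, one at a time
    for layer in range(n_layers):
        k_new = [float(step)] * d_model   # placeholder for this token's key
        v_new = [float(step)] * d_model   # placeholder for this token's value
        cache.append(layer, k_new, v_new)
    sizes.append(cache.num_floats())
```

`sizes` grows by 2 · n_layers · d_model floats per generated token, which is the linear growth the size formula captures.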

Size in bytes:

KV_bytes = 2 · B · T · L · d_model · dtype_bytes

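(The leading 2 is for K + V. The formula assumes full multi-head attention, where the K and V widths both equal d_model.)

Concrete numbers for a model with Llama-3-70B's depth and width at fp16, before any KV sharing:

L = 80, d = 8192, dtype = 2 bytes
Per token, per request: 2 · 80 · 8192 · 2 = 2,621,440 bytes ≈ 2.5 MB/token
A 4096-token conversation: ~10 GB just for KV cache for one request.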

Two big consequences:

GQA (Grouped-Query Attention, used in Llama-2-70B and across Llama-3) shares K and V across groups of heads. Llama-3-70B uses 8 KV heads for 64 query heads, which cuts the KV cache by 8×: about 1.25 GB instead of ~10 GB for a 4096-token request.
PagedAttention (vLLM, Session 30) treats KV cache like virtual memory pages. Lets you batch many requests without pre-allocating worst-case memory.
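To put numbers on the size formula and on the GQA saving, here is a small calculator. It writes the K/V width as n_kv_heads · d_head so the same function covers both cases; d_head = 128 is an assumption derived from 8192 / 64, and the function name is illustrative.

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, d_head, dtype_bytes=2):
    """2 (K and V) x batch x positions x layers x KV width x bytes per value."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * d_head * dtype_bytes

GIB = 1024 ** 3

# Full multi-head attention: 64 KV heads of width 128, so the KV width is 8192.
mha_per_token = kv_cache_bytes(1, 1, 80, 64, 128)
mha_4k = kv_cache_bytes(1, 4096, 80, 64, 128)

# Grouped-query attention: 8 KV heads shared by 64 query heads.
gqa_4k = kv_cache_bytes(1, 4096, 80, 8, 128)

mha_gib = mha_4k / GIB
gqa_gib = gqa_4k / GIB
```

The 4096-token request needs 10 GiB with one KV head per query head and 1.25 GiB with 8 shared KV heads, all at fp16 for a single sequence.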

8. Encoder-only / decoder-only / encoder-decoder

Family	Examples	Use case
Encoder-only	BERT, RoBERTa	Classification, embeddings, retrieval
Decoder-only	GPT, Llama, Claude	Generation (this is the default now)
Encoder-decoder	T5, Flan-T5, Whisper	Translation, summarisation, ASR

Decoder-only won because:

Same architecture handles any task with prompting (instruction tuning).
Causal mask is conceptually simpler than encoder→decoder cross-attention.
Scales — pretty much every frontier model since 2022 is decoder-only.

9. Concrete numbers — Llama-3-70B by the slice

Layer slice	Shape / count	Params (% of total)
Embedding + unembedding (vocab×d, untied)	2 × 128k × 8192	~2.1B (~3%)
Attention W_Q/W_K/W_V/W_O (GQA, 8 KV heads)	80 × ~151M	~12.1B (~17%)
MLP (SwiGLU, hidden 28672)	80 × ~705M	~56.4B (~80%)
Norm params	small	<1%
Total	all slices	~70.6B (100%)

MLPs are where the weights are; attention is where the routing happens.
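The slices can be recomputed from the model shape. The sketch below assumes the published Llama-3-70B configuration (vocab 128,256, d = 8192, MLP hidden 28,672, 80 layers, 64 query heads, 8 KV heads, untied input and output embeddings, RMSNorm gains only, no biases), so its per-slice figures differ from a count that assumes full multi-head attention or a two-matrix MLP. The function name is illustrative.

```python
def llama_style_param_count(vocab, d_model, d_hidden, n_layers, n_heads, n_kv_heads):
    d_head = d_model // n_heads
    attn = 2 * d_model * d_model + 2 * d_model * n_kv_heads * d_head  # Q, O + K, V
    mlp = 3 * d_model * d_hidden                                      # gate, up, down
    norms = 2 * d_model                                               # two RMSNorm gains per block
    return {
        "embedding": 2 * vocab * d_model,          # input table + output projection
        "attention": n_layers * attn,
        "mlp": n_layers * mlp,
        "norm": n_layers * norms + d_model,        # plus the final norm
    }

counts = llama_style_param_count(128_256, 8192, 28_672, 80, 64, 8)
total = sum(counts.values())
shares = {name: n / total for name, n in counts.items()}
```

Under these assumptions the total comes to 70,553,706,496, with the MLPs at roughly 80% and attention at roughly 17%.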

10. Hands-on (30 min)

Extend last session's CausalSelfAttention into a full block:

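import torch.nn as nn

# CausalSelfAttention is the module built in the previous session.
class TransformerBlock(nn.Module):
    """Pre-norm block: x + attn(LN(x)), then x + mlp(LN(x))."""

    def __init__(self, d_model, n_heads, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        # Position-wise MLP: expand to mlp_ratio * d_model, apply GELU, project back.
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
        )

    def forward(self, x):
        # x: (batch, seq_len, d_model); both sub-layers preserve this shape.
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x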

Stack 6 of them, add embedding + final LN + unembed projection. Train on tiny-shakespeare (~1M chars) for 5k steps. You'll get vaguely Shakespeare-like output and a working mental model of every shape that flows through.

11. What's next (Session 9 — RAG Part 1)

Why RAG exists (the limits of context windows and finetuning)
Chunking strategies
Embeddings overview (deep treatment in Session 17)
Vector stores (Faiss, pgvector, Pinecone, Weaviate)

Link: https://leetcode.com/problems/longest-substring-without-repeating-characters/
Difficulty: Medium
Why this problem: Sliding window with a hash-set; shrink left when you see a repeat.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

State why attention needs position info (set vs sequence).
Derive that RoPE makes attention depend on relative offset.
Compute KV cache size for Llama-3-70B at T=4096.
Explain why MLPs hold most parameters and most facts.
Choose pre-norm vs post-norm and defend the choice for a 60-layer model.
Solve longest-substring-without-repeating-characters with sliding window.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.
