Transformers Part 2 — Positional Encoding, RoPE, MLP, LayerNorm
Session 6 of the 48-session learning series.
Date: Sun, 2026-06-14 · Time: 09:00–11:00 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 01 · Est. read: 2 h
Why this session matters
This is Session 06 of 48 in the LLMs & Agents track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.
Agenda
- Why position has to be injected at all (attention is permutation-equivariant, so it ignores order)
- Three eras of positional encoding — sinusoidal, learned, RoPE
- RoPE math — rotate Q/K pairs, relative position emerges from QᵀK
- The MLP — 4× expansion, GELU/SwiGLU, where the facts live
- KV cache anatomy — what actually grows at inference time
Pre-read (skim before the session)
- Andrej Karpathy — Let's build GPT from scratch (video)
- RoFormer: Enhanced Transformer with Rotary Position Embedding
- EleutherAI — Rotary Embeddings: A Comprehensive Guide
- FlashAttention paper
Deep dive
1. Why position has to be injected at all
Self-attention is a set operation. Permute the inputs and the outputs permute the same way — but the content of each output doesn't change. A transformer without position info would see "dog bites man" and "man bites dog" as identical bags of words.
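The claim can be checked numerically. The following is a minimal sketch in plain Python, assuming a single head with Q = K = V = the raw token vectors, no learned weights and no causal mask; the 2-d embeddings are made-up numbers, not values from any real model.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head attention with Q = K = V = tokens (illustrative only)."""
    scale = math.sqrt(len(tokens[0]))
    outputs = []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output per query token.
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, tokens))
            for d in range(len(query))
        ])
    return outputs

# Toy embeddings for "dog", "bites", "man" (hypothetical values).
dog, bites, man = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]

out_a = self_attention([dog, bites, man])   # "dog bites man"
out_b = self_attention([man, bites, dog])   # "man bites dog"

# The output for "dog" is the same vector in both orderings; only its slot moved.
same = all(
    math.isclose(x, y, rel_tol=1e-9, abs_tol=1e-12)
    for x, y in zip(out_a[0], out_b[2])
)
print(same)
```

Because each token's output vector is unchanged by the reordering, nothing downstream can tell the two sentences apart unless position is added to the inputs.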
We need to inject position. Three eras of how:
2. Era 1 — sinusoidal (original "Attention Is All You Need")
Each position gets a deterministic vector built from sines and cosines at log-spaced frequencies:
PE(pos, 2i) = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))
Added (not concatenated) to the token embedding. Properties:
- Deterministic — no parameters to train.
- Periodic — each frequency repeats; combinations give unique position vectors up to a long horizon.
- Extrapolates poorly to lengths beyond what was trained on.
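The two formulas above translate almost line for line into code. This is a minimal sketch for a single position, assuming an even d_model; the function name `sinusoidal_pe` is made up for illustration.

```python
import math

def sinusoidal_pe(pos, d_model, base=10000.0):
    """Return the d_model-length encoding for one position (d_model must be even)."""
    pe = []
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        pe.append(math.sin(angle))  # dimension 2i
        pe.append(math.cos(angle))  # dimension 2i + 1
    return pe

pe0 = sinusoidal_pe(0, 8)
pe5 = sinusoidal_pe(5, 8)
print(pe0)  # position 0: every sine is 0.0 and every cosine is 1.0
print(pe5)
```

Low values of i oscillate quickly with position and high values of i change slowly, which is what makes the combined vector distinct across a long range of positions.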
3. Era 2 — learned absolute (GPT-2)
A (max_pos, d_model) embedding table. Looked up by position, added to token embeddings.
- ✅ Trains end-to-end.
- ❌ Hard cap at max_pos (GPT-2: 1024). Can't extend without retraining.
- ❌ No structural inductive bias for "tokens close in position should attend more".
4. Era 3 — Rotary (RoPE) — used by Llama, Mistral, Qwen, DeepSeek
Instead of adding a position vector to the embedding, RoPE rotates Q and K pairs by an angle proportional to position. For position m and a pair (q_{2i}, q_{2i+1}):
[ cos(mθ_i) −sin(mθ_i) ] [ q_{2i} ]
[ sin(mθ_i) cos(mθ_i) ] [ q_{2i+1} ]
with frequencies θ_i = 10000^(−2i/d).
The magic: (R_m q)ᵀ (R_n k) depends only on m − n — the relative offset — even though we encoded absolute positions. So attention scores naturally know "how far apart are these two tokens?" without any extra mechanism.
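The relative-offset property is easy to verify numerically. This is a minimal sketch in plain Python, assuming the pairwise rotation and frequencies θ_i = 10000^(−2i/d) given above; the vectors `q` and `k` are made-up numbers.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each (2i, 2i+1) pair of vec by the angle pos * theta_i."""
    d = len(vec)
    out = []
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x, y = vec[2 * i], vec[2 * i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.7, 0.5]   # hypothetical query vector
k = [1.1, 0.4, -0.6, 0.9]   # hypothetical key vector

# Same offset m - n = 3 at two very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
# A different offset (m - n = 4) gives a different score.
score_other = dot(rope(q, 6), rope(k, 2))

print(math.isclose(score_near, score_far, abs_tol=1e-9))
print(math.isclose(score_near, score_other, abs_tol=1e-6))
```

The first comparison holds because rotating q by m and k by n is equivalent to rotating k alone by n − m, so the absolute positions cancel out of the dot product.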
Why everyone moved to RoPE:
- ✅ Relative information for free.
- ✅ Extends to longer contexts via base-frequency scaling (NTK-aware, YaRN).
- ✅ Cheap — just two multiplies per element.
5. The MLP — where most of the parameters and facts live
After attention, every block applies a position-wise MLP:
MLP(x) = W_2 · activation(W_1 · x + b_1) + b_2
with W_1: d → 4d, W_2: 4d → d. So the hidden dim is 4× the model dim.
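- For Llama-3-8B: d=4096 → hidden=14336 (≈3.5×; SwiGLU adds a third, gating matrix, which changes the constant). Each block has ~176M MLP params (3 × 4096 × 14336); attention has ~42M, because grouped-query attention shrinks W_K and W_V to 8 KV heads. MLPs are roughly 80% of each block's parameters.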
- The activation has evolved: ReLU → GELU (BERT, GPT-2) → SwiGLU (PaLM, Llama).
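To make the difference between the classic MLP and the gated variant concrete, here is a minimal sketch in plain Python on toy sizes (d=2, hidden=3). The helper names and weight values are made up for illustration; the SwiGLU form shown, with three bias-free matrices, is the common Llama-style layout and is assumed rather than taken from the text above.

```python
import math

def matvec(matrix, vec):
    """Multiply a matrix, stored as a list of rows, by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def silu(x):
    """SiLU (swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def gelu_mlp(x, w1, b1, w2, b2):
    """Classic block: W_2 . gelu(W_1 . x + b_1) + b_2 (two matrices)."""
    hidden = [gelu(h + b) for h, b in zip(matvec(w1, x), b1)]
    return [o + b for o, b in zip(matvec(w2, hidden), b2)]

def swiglu_mlp(x, w_gate, w_up, w_down):
    """Gated block: W_down . (silu(W_gate . x) * (W_up . x)) (three matrices)."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    return matvec(w_down, [g * u for g, u in zip(gate, up)])

# Toy weights: d = 2, hidden = 3 (hypothetical values).
w_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
w_down = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.5]]

x = [1.0, -2.0]
y_swiglu = swiglu_mlp(x, w_gate, w_up, w_down)
y_gelu = gelu_mlp(x, w_gate, [0.0] * 3, w_down, [0.0] * 2)
print(y_swiglu, y_gelu)
```

Both variants map a d-dimensional vector back to d dimensions. The gated one carries an extra d × hidden matrix, which is why models that use it tend to pick a hidden size below 4d.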
Mechanistic interpretability work (Anthropic, OpenAI) suggests that MLPs implement something like key-value lookups: certain neurons fire on "Eiffel Tower" and write "Paris" into the residual stream. Much of the model's factual knowledge appears to be stored in the MLP weights. Attention chooses what to mix; the MLP decides what to write.
6. LayerNorm — pre-norm vs post-norm
Every sub-layer is wrapped in either:
- Pre-norm (Llama, GPT-NeoX, most modern): x = x + sublayer(LN(x))
- Post-norm (original Transformer): x = LN(x + sublayer(x))
Pre-norm trains more stably at depth: the residual path is never normalised, so gradients reach early layers directly and you can stack 80 blocks without fragile warmup schedules. Post-norm can give slightly better final perplexity but needs careful warmup. Default to pre-norm.
LN itself: per-token, per-layer, normalise to zero mean and unit variance, then apply learnable scale γ and shift β:
LN(x) = γ * (x − mean(x)) / √(var(x) + ε) + β
RMSNorm (the variant Llama uses) drops the mean centring and the shift β and rescales by the root mean square only. Quality is similar in practice and it is slightly cheaper:
RMSNorm(x) = γ * x / √(mean(x²) + ε)
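Both formulas can be written out directly. This is a minimal sketch in plain Python for a single token vector, assuming ε = 1e-5 (a common default, not stated above) and the biased variance that the LN formula implies.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LN(x) = gamma * (x - mean) / sqrt(var + eps) + beta, over one token's features."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)  # biased variance
    return [g * (v - mean) / math.sqrt(var + eps) + b for v, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm(x) = gamma * x / sqrt(mean(x^2) + eps); no centring, no shift."""
    mean_sq = sum(v * v for v in x) / len(x)
    return [g * v / math.sqrt(mean_sq + eps) for v, g in zip(x, gamma)]

x = [2.0, 4.0, 6.0, 8.0]
ones, zeros = [1.0] * 4, [0.0] * 4

ln_out = layer_norm(x, ones, zeros)
rms_out = rms_norm(x, ones)

print(sum(ln_out) / 4)    # close to 0: LayerNorm centres the features
print(sum(rms_out) / 4)   # not 0: RMSNorm only rescales
```

The two outputs differ whenever the input has a non-zero mean, so RMSNorm is a cheaper approximation rather than a drop-in identical operation.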
7. KV cache — what actually grows in memory at inference
When you generate token by token, for every new token you compute Q from the latest position only, but you need K and V from every previous position. Standard impl: store K/V tensors per layer, per head, per position — the KV cache.
Size in bytes:
KV_bytes = 2 · B · T · L · d_model · dtype_bytes
(The leading 2 is for K + V.)
Concrete numbers for Llama-3-70B at fp16, first assuming plain multi-head attention (one K/V head per query head):
- L = 80, d = 8192, dtype = 2 bytes
- Per token, per request: 2 · 80 · 8192 · 2 = 2,621,440 bytes ≈ 2.5 MB/token
- A 4096-token conversation: ~10 GB just for KV cache for one request.
Two big consequences:
- GQA (Grouped-Query Attention, Llama-2-70B onward) shares K and V across groups of heads. Llama-3-70B uses 8 KV heads for 64 query heads, which cuts the KV cache by 8×, to roughly 320 KB/token or ~1.25 GB at 4096 tokens.
- PagedAttention (vLLM, Session 30) treats KV cache like virtual memory pages. Lets you batch many requests without pre-allocating worst-case memory.
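The size formula is simple enough to turn into a calculator. This is a minimal sketch, assuming d_model is split as n_kv_heads × head_dim so the same function covers both full multi-head attention and GQA; the function name and the head_dim of 128 (8192 / 64) are illustrative assumptions.

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes for K and V across all layers; the leading 2 covers K plus V."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes

# Llama-3-70B shapes at fp16: 80 layers, 64 query heads of dim 128 (64 * 128 = 8192).
mha_per_token = kv_cache_bytes(1, 1, 80, n_kv_heads=64, head_dim=128)
mha_4k = kv_cache_bytes(1, 4096, 80, n_kv_heads=64, head_dim=128)
gqa_4k = kv_cache_bytes(1, 4096, 80, n_kv_heads=8, head_dim=128)

print(mha_per_token)     # 2621440 bytes, about 2.5 MiB per token
print(mha_4k / 2**30)    # 10.0 GiB with one KV head per query head
print(gqa_4k / 2**30)    # 1.25 GiB with 8 shared KV heads
```

Multiplying by the batch size shows why serving many concurrent requests is usually limited by KV memory rather than by the weights.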
8. Encoder-only / decoder-only / encoder-decoder
| Family | Examples | Use case |
|---|---|---|
| Encoder-only | BERT, RoBERTa | Classification, embeddings, retrieval |
| Decoder-only | GPT, Llama, Claude | Generation (this is the default now) |
| Encoder-decoder | T5, Whisper, Flan-T5 | Translation, summarisation, ASR |
Decoder-only won because:
- Same architecture handles any task with prompting (instruction tuning).
- Causal mask is conceptually simpler than encoder→decoder cross-attention.
- Scales — pretty much every frontier model since 2022 is decoder-only.
9. Concrete numbers — Llama-3-70B by the slice
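| Layer slice | Shape / per-layer count | Total params (% of model) |
|---|---|---|
| Embedding (vocab × d) | 128k × 8192 | ~1.05B (~1.5%) |
| Output head (d × vocab, untied) | 8192 × 128k | ~1.05B (~1.5%) |
| Attention W_Q/W_K/W_V/W_O (GQA, 8 KV heads) | 80 × ~151M | ~12.1B (~17%) |
| MLP (SwiGLU, 3 matrices) | 80 × ~705M | ~56.4B (~80%) |
| Norm params | small | <1% |
| Total | | ~70.6B (100%) |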
MLPs are where the weights are; attention is where the routing happens.
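A per-slice count can be reproduced from the layer shapes. This is a rough sketch under stated assumptions: the hidden size 28672, vocabulary 128256, untied output head, bias-free linear layers and RMSNorm scales are taken from the publicly released Llama-3-70B configuration as I understand it, not from this session's notes, and the function name is made up.

```python
def llama_style_params(d_model, n_layers, n_heads, n_kv_heads, d_ff, vocab, tied=False):
    """Count weights per slice for a Llama-style decoder (no biases assumed)."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim
    attn = 2 * d_model * d_model + 2 * d_model * kv_dim  # W_Q and W_O, plus W_K and W_V
    mlp = 3 * d_model * d_ff                             # gate, up, down
    norms = 2 * d_model                                  # two norm scales per block
    embed = vocab * d_model
    return {
        "embedding": embed,
        "lm_head": 0 if tied else embed,
        "attention": n_layers * attn,
        "mlp": n_layers * mlp,
        "norms": n_layers * norms + d_model,             # plus the final norm
    }

# Assumed Llama-3-70B shapes (see the note above).
counts = llama_style_params(
    d_model=8192, n_layers=80, n_heads=64, n_kv_heads=8, d_ff=28672, vocab=128256
)
total = sum(counts.values())

print(total)                          # about 70.55 billion
print(counts["mlp"] / total)          # about 0.80
print(counts["attention"] / total)    # about 0.17
```

Setting n_kv_heads equal to n_heads recovers the plain multi-head case, which is a quick way to see how much GQA shrinks the attention slice.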
10. Hands-on (30 min)
Extend last session's CausalSelfAttention into a full block:
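import torch.nn as nn

# CausalSelfAttention is the module built in last session; it must be in scope.

class TransformerBlock(nn.Module):
    """Pre-norm block: x + attn(LN(x)), then x + mlp(LN(x))."""

    def __init__(self, d_model, n_heads, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),  # expand d -> 4d
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),  # project 4d -> d
        )

    def forward(self, x):
        # x: (batch, seq_len, d_model); both sub-layers keep that shape.
        x = x + self.attn(self.ln1(x))  # mix information across positions
        x = x + self.mlp(self.ln2(x))   # transform each position independently
        return x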
Stack 6 of them, add embedding + final LN + unembed projection. Train on tiny-shakespeare (~1M chars) for 5k steps. You'll get vaguely Shakespeare-like output and a working mental model of every shape that flows through.
11. What's next (Session 9 — RAG Part 1)
- Why RAG exists (the limits of context windows and finetuning)
- Chunking strategies
- Embeddings overview (deep treatment in Session 17)
- Vector stores (Faiss, pgvector, Pinecone, Weaviate)
Reading material
In-depth research material
- EleutherAI — Rotary Embeddings explainer
- YaRN — Efficient Context Window Extension
- Anthropic — Mathematical Framework for Transformer Circuits
- RMSNorm paper
Video reference
▶︎ Andrej Karpathy — Let's build GPT from scratch
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Longest Substring Without Repeating Characters
- Link: https://leetcode.com/problems/longest-substring-without-repeating-characters/
- Difficulty: Medium
- Why this problem: Sliding window with a hash-set; shrink left when you see a repeat.
- Time-box: 30 minutes. Look up the editorial only after.
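Once the time-box is up, the hint above can be compared against a reference attempt. This is one possible sketch of the sliding-window idea with a hash-set, not the official editorial; the function name is chosen for illustration.

```python
def length_of_longest_substring(s: str) -> int:
    """Length of the longest substring of s with no repeated character."""
    seen = set()   # characters currently inside the window s[left:right + 1]
    left = 0
    best = 0
    for right, ch in enumerate(s):
        # Shrink from the left until ch is no longer a repeat.
        while ch in seen:
            seen.remove(s[left])
            left += 1
        seen.add(ch)
        best = max(best, right - left + 1)
    return best

print(length_of_longest_substring("abcabcbb"))  # 3 ("abc")
print(length_of_longest_substring("pwwkew"))    # 3 ("wke")
```

Each character enters and leaves the set at most once, so the loop runs in O(n) time despite the nested while.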
Post-session checklist
By the end of this session you should be able to:
- State why attention needs position info (set vs sequence).
- Derive that RoPE makes attention depend on relative offset.
- Compute KV cache size for Llama-3-70B at T=4096.
- Explain why MLPs hold most parameters and most facts.
- Choose pre-norm vs post-norm and defend the choice for a 60-layer model.
- Solve longest-substring-without-repeating-characters with sliding window.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.