ai ml · intermediate · 12m · 2026-06-09

Transformers Part 2 — Positional Encoding, RoPE, MLP, LayerNorm

Session 6 of the 48-session learning series.

Date: Sun, 2026-06-14 · Time: 09:00–11:00 IST · Track: 🧠 LLMs & Agents (LLM) · Parent 28-day topic: Day 01 · Est. read: 2 h

Why this session matters

This is Session 06 of 48 in the LLMs & Agents track. It sticks to one focused topic, paced so you have time to actually absorb it rather than rush.

Agenda

  • Why position has to be injected at all (attention is permutation-equivariant)
  • Three eras of positional encoding — sinusoidal, learned, RoPE
  • RoPE math — rotate Q/K pairs, relative position emerges from QᵀK
  • The MLP — 4× expansion, GELU/SwiGLU, where the facts live
  • KV cache anatomy — what actually grows at inference time

Deep dive

1. Why position has to be injected at all

Self-attention is a set operation: it is permutation-equivariant. Permute the inputs and the outputs permute the same way, but the content of each output doesn't change. Without position info (and without a causal mask), a transformer would see "dog bites man" and "man bites dog" as identical bags of words.
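To make the claim concrete, here is a minimal sketch in plain Python. It assumes unmasked single-head attention with Q = K = V equal to the raw token vectors, and the 2-d embeddings are made up for illustration. The output for "dog" is the same whether it is the first or the last word.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Unmasked single-head attention with Q = K = V = the token vectors."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)])
    return out

# Hypothetical 2-d embeddings, no position information added.
emb = {"dog": [1.0, 0.0], "bites": [0.0, 1.0], "man": [1.0, 1.0]}

a = self_attention([emb[w] for w in ["dog", "bites", "man"]])
b = self_attention([emb[w] for w in ["man", "bites", "dog"]])

# "dog" is at index 0 in the first sentence and index 2 in the second.
dog_same = all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(a[0], b[2]))
```

`dog_same` comes out true: swapping the word order only shuffles the rows of the output, which is exactly the problem position encodings solve.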

We need to inject position. The field has gone through three eras of how to do it:

2. Era 1 — sinusoidal (original "Attention Is All You Need")

Each position gets a deterministic vector built from sines and cosines at log-spaced frequencies:

PE(pos, 2i)   = sin(pos / 10000^(2i/d))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d))

Added (not concatenated) to the token embedding. Properties:

  • Deterministic — no parameters to train.
  • Periodic — each frequency repeats; combinations give unique position vectors up to a long horizon.
  • Extrapolates poorly to lengths beyond what was trained on.
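The formula above can be sketched in a few lines of plain Python. The function names are illustrative, and d_model is assumed to be even.

```python
import math

def sinusoidal_pe(pos, d_model, base=10000.0):
    """Encoding for one position: sin on even indices, cos on odd ones."""
    pe = [0.0] * d_model
    for i in range(d_model // 2):
        angle = pos / base ** (2 * i / d_model)
        pe[2 * i] = math.sin(angle)
        pe[2 * i + 1] = math.cos(angle)
    return pe

def add_position(token_embeddings):
    """Add (not concatenate) the encoding to each token embedding."""
    d_model = len(token_embeddings[0])
    return [
        [x + p for x, p in zip(vec, sinusoidal_pe(pos, d_model))]
        for pos, vec in enumerate(token_embeddings)
    ]
```

Position 0 always maps to the pattern 0, 1, 0, 1, ... because sin(0) = 0 and cos(0) = 1. Higher pair indices i use lower frequencies, so they change more slowly with position.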

3. Era 2 — learned absolute (GPT-2)

A (max_pos, d_model) embedding table. Looked up by position, added to token embeddings.

  • ✅ Trains end-to-end.
  • ❌ Hard cap at max_pos (GPT-2: 1024). Can't extend without retraining.
  • ❌ No structural inductive bias for "tokens close in position should attend more".
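A toy version of the lookup shows where the hard cap comes from. This is only a sketch: `LearnedPositionTable` is a hypothetical name, and the random init stands in for rows that would really be trained.

```python
import random

class LearnedPositionTable:
    """A (max_pos, d_model) table with one row per absolute position."""

    def __init__(self, max_pos, d_model, seed=0):
        rng = random.Random(seed)
        # In a real model these rows are parameters updated by the optimiser.
        self.weight = [
            [rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(max_pos)
        ]

    def __call__(self, pos):
        if not 0 <= pos < len(self.weight):
            raise IndexError(f"position {pos} has no row (max_pos={len(self.weight)})")
        return self.weight[pos]

table = LearnedPositionTable(max_pos=1024, d_model=8)
last_row = table(1023)  # fine: the last position the table knows about
```

Asking for `table(1024)` raises, because no row for that position was ever created or trained.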

4. Era 3 — Rotary (RoPE) — used by Llama, Mistral, Qwen, DeepSeek

Instead of adding a position vector to the embedding, RoPE rotates Q and K pairs by an angle proportional to position. For position m and a pair (q_{2i}, q_{2i+1}):

[ cos(mθ_i)  −sin(mθ_i) ] [ q_{2i}   ]
[ sin(mθ_i)   cos(mθ_i) ] [ q_{2i+1} ]

with frequencies θ_i = 10000^(−2i/d).

The magic: rotations compose by adding angles, so (R_m q)ᵀ (R_n k) = qᵀ R_{n−m} k. The score depends only on the relative offset n − m, even though we encoded absolute positions. So attention scores naturally know "how far apart are these two tokens?" without any extra mechanism.
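You can check the relative-offset property numerically. The sketch below uses the interleaved (2i, 2i+1) pairing from the matrix above; the vectors q and k are arbitrary made-up values.

```python
import math

def rope(vec, pos, base=10000.0):
    """Rotate each (2i, 2i+1) pair of vec by the angle pos * theta_i."""
    d = len(vec)
    out = [0.0] * d
    for i in range(d // 2):
        theta = base ** (-2 * i / d)
        c, s = math.cos(pos * theta), math.sin(pos * theta)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]

score_near = dot(rope(q, 5), rope(k, 2))      # positions 5 and 2, offset 3
score_far = dot(rope(q, 103), rope(k, 100))   # positions 103 and 100, offset 3
score_other = dot(rope(q, 5), rope(k, 4))     # positions 5 and 4, offset 1
```

`score_near` and `score_far` agree up to floating-point error, while `score_other` differs: only the offset matters, not where in the sequence the pair sits.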

Why everyone moved to RoPE:

  • ✅ Relative information for free.
  • ✅ Extends to longer contexts via base-frequency scaling (NTK-aware, YaRN).
  • ✅ Cheap — just two multiplies per element.

5. The MLP — where most of the parameters and facts live

After attention, every block applies a position-wise MLP:

MLP(x) = W_2 · activation(W_1 · x + b_1) + b_2

with W_1: d → 4d, W_2: 4d → d. So the hidden dim is 4× the model dim.

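  • For Llama-3-8B: d=4096 → hidden=14336 (≈3.5×; they use SwiGLU, which has three weight matrices instead of two). Each block has ~176M MLP params (three 4096×14336 matrices); attention has ~42M with GQA (~67M if K and V were full width). MLPs are roughly 70% of the model's parameters.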
  • The activation has evolved: ReLU → GELU (BERT, GPT-2) → SwiGLU (PaLM, Llama).
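Here is a sketch of the two MLP styles side by side, in plain Python on single vectors. The SwiGLU form shown is the bias-free three-matrix layout used in Llama-style models; the names `W_gate`, `W_up` and `W_down` are illustrative, and the GELU uses the tanh approximation.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def gelu(z):
    # tanh approximation of GELU
    return 0.5 * z * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)))

def silu(z):
    # SiLU / swish: z * sigmoid(z)
    return z / (1.0 + math.exp(-z))

def mlp_gelu(x, W1, b1, W2, b2):
    """MLP(x) = W2 . gelu(W1 . x + b1) + b2"""
    h = [gelu(z + b) for z, b in zip(matvec(W1, x), b1)]
    return [y + b for y, b in zip(matvec(W2, h), b2)]

def mlp_swiglu(x, W_gate, W_up, W_down):
    """SwiGLU: down( silu(gate(x)) * up(x) ), three matrices and no biases."""
    gate = [silu(z) for z in matvec(W_gate, x)]
    up = matvec(W_up, x)
    return matvec(W_down, [g * u for g, u in zip(gate, up)])
```

The extra gate matrix is why SwiGLU models pick a hidden size below 4×: three matrices at 3.5× cost more parameters than two at 4×.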

Mechanistic interpretability work (Anthropic, OpenAI) suggests that MLPs act as key-value lookups: certain neurons fire on "Eiffel Tower" and write "Paris" into the residual stream. Much of the model's factual knowledge appears to live in the MLP weights. Attention chooses what to mix; the MLP decides what to write.

6. LayerNorm — pre-norm vs post-norm

Every sub-layer is wrapped in either:

  • Pre-norm (Llama, GPT-NeoX, most modern): x = x + sublayer(LN(x))
  • Post-norm (original Transformer): x = LN(x + sublayer(x))

Pre-norm trains more stably at depth: the residual path stays an identity, so you can stack 80 blocks without warmup tricks. Post-norm can reach slightly better final perplexity but needs careful warmup. Default to pre-norm.
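The two wirings differ only in where the norm sits, which is easy to see in a sketch. The helper names below are hypothetical, and `simple_norm` is a stand-in normalisation without learnable parameters.

```python
import math

def simple_norm(x, eps=1e-5):
    """Zero-mean, unit-variance normalisation (no learnable scale or shift)."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def pre_norm_step(x, sublayer, norm=simple_norm):
    """x + sublayer(norm(x)): the residual path itself is never normalised."""
    update = sublayer(norm(x))
    return [xi + ui for xi, ui in zip(x, update)]

def post_norm_step(x, sublayer, norm=simple_norm):
    """norm(x + sublayer(x)): the norm sits on the residual path."""
    summed = [xi + si for xi, si in zip(x, sublayer(x))]
    return norm(summed)

def silent(x):
    """A sublayer that contributes nothing, to isolate the residual path."""
    return [0.0] * len(x)

x = [3.0, -1.0, 2.0, 8.0]
pre_out = pre_norm_step(x, silent)    # x passes through untouched
post_out = post_norm_step(x, silent)  # x is rescaled even though the sublayer did nothing
```

With a silent sublayer, pre-norm returns x unchanged while post-norm still rescales it. Stack many blocks and that difference is the intuition for why pre-norm keeps a clean path from input to output.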

LN itself: per-token, per-layer, normalise to zero mean and unit variance, then apply learnable scale γ and shift β:

LN(x) = γ * (x − mean(x)) / √(var(x) + ε) + β

RMSNorm (used by Llama) drops the mean centring and the shift β. It performs comparably in practice and is slightly cheaper:

RMSNorm(x) = γ * x / √(mean(x²) + ε)
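Both formulas translate directly to code. This is a plain-Python sketch on a single token vector; the default `eps` value is an assumption, and real implementations vectorise over the batch.

```python
import math

def layer_norm(x, gamma, beta, eps=1e-5):
    """LN(x) = gamma * (x - mean) / sqrt(var + eps) + beta, per token."""
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    inv = 1.0 / math.sqrt(var + eps)
    return [g * (xi - mean) * inv + b for xi, g, b in zip(x, gamma, beta)]

def rms_norm(x, gamma, eps=1e-5):
    """RMSNorm(x) = gamma * x / sqrt(mean(x^2) + eps): no centring, no shift."""
    mean_sq = sum(xi * xi for xi in x) / len(x)
    inv = 1.0 / math.sqrt(mean_sq + eps)
    return [g * xi * inv for xi, g in zip(x, gamma)]

ones = [1.0] * 4
zeros = [0.0] * 4
ln_out = layer_norm([1.0, 2.0, 3.0, 4.0], ones, zeros)
rms_out = rms_norm([1.0, 2.0, 3.0, 4.0], ones)
```

`ln_out` has zero mean, while `rms_out` keeps the sign and relative size of every input. When the input already has zero mean, the two give the same result, since var(x) and mean(x²) then coincide.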

7. KV cache — what actually grows in memory at inference

When you generate token by token, for every new token you compute Q from the latest position only, but you need K and V from every previous position. Standard impl: store K/V tensors per layer, per head, per position — the KV cache.
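A toy single-head decode loop shows the mechanics. This is a sketch, not how any particular library does it: `KVCache` and `decode_step` are hypothetical names, and the projection functions are passed in so that the example stays self-contained.

```python
import math

def attend(q, keys, values):
    """Softmax attention of one query over all cached keys and values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [sum((e / z) * v[j] for e, v in zip(exps, values)) for j in range(len(values[0]))]

class KVCache:
    """Keys and values for one layer and one head, one entry per position."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def __len__(self):
        return len(self.keys)

def decode_step(x, cache, project_q, project_k, project_v):
    """Process only the newest token: extend the cache, then attend over it."""
    cache.append(project_k(x), project_v(x))
    return attend(project_q(x), cache.keys, cache.values)

def identity(v):
    return v

cache = KVCache()
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = [decode_step(t, cache, identity, identity, identity) for t in tokens]
```

Each step computes one Q, one K and one V, but the cache grows by one entry per token. A real model keeps one such cache per layer and per KV head.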

Size in bytes:

KV_bytes = 2 · B · T · L · d_model · dtype_bytes

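(The leading 2 is for K + V. This form assumes full multi-head attention, where K and V are each d_model wide per token.)

Concrete numbers for a Llama-3-70B-sized model at fp16, before any KV-head sharing:

  • L = 80, d = 8192, dtype = 2 bytes
  • Per token, per request: 2 · 80 · 8192 · 2 = 2,621,440 bytes ≈ 2.5 MiB/token
  • A 4096-token conversation: ~10 GiB just for KV cache for one request.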

Two big consequences:

  1. GQA (Grouped-Query Attention, used from Llama-2-70B onward) shares K and V across groups of query heads. Llama-3-70B uses 8 KV heads for 64 query heads, which cuts the KV cache by 8×, to about 320 KiB per token and ~1.25 GiB for a 4096-token request.
  2. PagedAttention (vLLM, Session 30) treats KV cache like virtual memory pages. Lets you batch many requests without pre-allocating worst-case memory.
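The size formula is easy to turn into a calculator. The sketch below generalises d_model to n_kv_heads · head_dim so that the same function covers GQA; the head dimension of 128 is taken as 8192 / 64.

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of K and V held for `batch` requests of `seq_len` tokens each."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes

# Full multi-head attention: 64 KV heads of width 128, i.e. d_model = 8192.
mha_per_token = kv_cache_bytes(1, 1, 80, 64, 128)    # 2,621,440 bytes
mha_4k = kv_cache_bytes(1, 4096, 80, 64, 128)

# Grouped-query attention: 8 KV heads shared by 64 query heads.
gqa_per_token = kv_cache_bytes(1, 1, 80, 8, 128)     # 327,680 bytes
gqa_4k = kv_cache_bytes(1, 4096, 80, 8, 128)

GIB = 2 ** 30
mha_4k_gib = mha_4k / GIB
gqa_4k_gib = gqa_4k / GIB
```

For one 4096-token request this gives 10 GiB without KV-head sharing and 1.25 GiB with 8 KV heads. Multiply by the batch size to see why serving many requests at once is memory-bound.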

8. Encoder-only / decoder-only / encoder-decoder

| Family | Examples | Use case |
|---|---|---|
| Encoder-only | BERT, RoBERTa | Classification, embeddings, retrieval |
| Decoder-only | GPT, Llama, Claude | Generation (this is the default now) |
| Encoder-decoder | T5, Whisper, Flan-T5 | Translation, summarisation, ASR |

Decoder-only won because:

  • Same architecture handles any task with prompting (instruction tuning).
  • Causal mask is conceptually simpler than encoder→decoder cross-attention.
  • Scales — pretty much every frontier model since 2022 is decoder-only.

9. Concrete numbers — Llama-3-70B by the slice

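| Layer slice | Shape / count | Params | % of total |
|---|---|---|---|
| Embedding + unembedding (vocab×d each) | 2 × 128k × 8192 | ~2.1B | ~3% |
| Attention W_Q/W_K/W_V/W_O (GQA, 8 KV heads) | 80 × ~151M | ~12.1B | ~17% |
| MLP (SwiGLU, three matrices) | 80 × ~705M | ~56.4B | ~80% |
| LN params | small | ~1.3M | <1% |
| Total | | ~70.6B | 100% |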

MLPs are where the weights are; attention is where the routing happens.
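If you want to recompute a breakdown like this yourself, a small counter is enough. The sketch assumes a Llama-style block (bias-free projections, GQA, a three-matrix SwiGLU MLP, two norms per block plus a final norm) and untied input and output embeddings. The configuration values are assumptions taken from the public Llama-3-70B model config.

```python
def transformer_param_counts(d_model, n_layers, n_heads, n_kv_heads, d_ff, vocab,
                             tied_embeddings=False):
    """Rough parameter count per slice for a Llama-style decoder."""
    head_dim = d_model // n_heads
    kv_dim = n_kv_heads * head_dim
    # W_Q and W_O are d_model x d_model; W_K and W_V are d_model x kv_dim.
    attention = n_layers * (2 * d_model * d_model + 2 * d_model * kv_dim)
    # SwiGLU: gate, up and down projections.
    mlp = n_layers * 3 * d_model * d_ff
    embedding = vocab * d_model * (1 if tied_embeddings else 2)
    # One scale vector per norm: two per block plus the final norm.
    norm = (2 * n_layers + 1) * d_model
    return {"embedding": embedding, "attention": attention, "mlp": mlp, "norm": norm}

llama3_70b = transformer_param_counts(
    d_model=8192, n_layers=80, n_heads=64, n_kv_heads=8, d_ff=28672, vocab=128256
)
total = sum(llama3_70b.values())
shares = {name: count / total for name, count in llama3_70b.items()}
```

Inspecting `shares` shows the MLP slice as the largest by a wide margin. Changing `n_kv_heads` to 64 shows how much of the attention budget GQA saves.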

10. Hands-on (30 min)

Extend last session's CausalSelfAttention into a full block:

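import torch.nn as nn

# CausalSelfAttention is the module built in the previous session;
# define or import it before instantiating this block.

class TransformerBlock(nn.Module):
    """Pre-norm block: attention and MLP, each wrapped in a residual connection."""

    def __init__(self, d_model, n_heads, mlp_ratio=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = CausalSelfAttention(d_model, n_heads)
        self.ln2 = nn.LayerNorm(d_model)
        # Position-wise MLP: expand to mlp_ratio * d_model, then project back.
        self.mlp = nn.Sequential(
            nn.Linear(d_model, mlp_ratio * d_model),
            nn.GELU(),
            nn.Linear(mlp_ratio * d_model, d_model),
        )

    def forward(self, x):
        # x: (batch, seq_len, d_model); the shape is unchanged by the block.
        x = x + self.attn(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x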

Stack 6 of them, add embedding + final LN + unembed projection. Train on tiny-shakespeare (~1M chars) for 5k steps. You'll get vaguely Shakespeare-like output and a working mental model of every shape that flows through.
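For the data side of that exercise, a character-level tokenizer and batch sampler are all you need. The sketch below is plain Python with hypothetical helper names. It assumes the corpus is already loaded into a string and leaves the conversion to tensors to your training loop.

```python
import random

def build_vocab(text):
    """Map every distinct character to an integer id and back."""
    chars = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(chars)}
    itos = {i: ch for ch, i in stoi.items()}
    return stoi, itos

def encode(text, stoi):
    return [stoi[ch] for ch in text]

def decode(ids, itos):
    return "".join(itos[i] for i in ids)

def get_batch(ids, block_size, batch_size, rng):
    """Sample windows; the target is the input shifted one step to the left."""
    xs, ys = [], []
    for _ in range(batch_size):
        start = rng.randrange(len(ids) - block_size)
        xs.append(ids[start:start + block_size])
        ys.append(ids[start + 1:start + block_size + 1])
    return xs, ys

sample_text = "To be, or not to be, that is the question."
stoi, itos = build_vocab(sample_text)
ids = encode(sample_text, stoi)
xs, ys = get_batch(ids, block_size=8, batch_size=4, rng=random.Random(0))
```

Each row of `ys` is the matching row of `xs` shifted by one character, which is the next-token prediction target the stacked blocks are trained on.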

11. What's next (Session 9 — RAG Part 1)

  • Why RAG exists (the limits of context windows and finetuning)
  • Chunking strategies
  • Embeddings overview (deep treatment in Session 17)
  • Vector stores (Faiss, pgvector, Pinecone, Weaviate)

Video reference

▶︎ Andrej Karpathy — Let's build GPT from scratch

Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.

LeetCode — Longest Substring Without Repeating Characters
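One way to approach it with a sliding window is sketched below. Treat it as a reference to compare against after you have tried the problem yourself; the function name is arbitrary.

```python
def length_of_longest_substring(s):
    """Length of the longest substring of s with no repeated character."""
    last_seen = {}  # character -> index of its most recent occurrence
    left = 0        # left edge of the current window
    best = 0
    for right, ch in enumerate(s):
        # If ch already occurs inside the window, move the left edge past it.
        if ch in last_seen and last_seen[ch] >= left:
            left = last_seen[ch] + 1
        last_seen[ch] = right
        best = max(best, right - left + 1)
    return best
```

Each character is visited once and the left edge only moves forward, so the run time is linear in the length of the string.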

Post-session checklist

By the end of this session you should be able to:

  • State why attention needs position info (set vs sequence).
  • Derive that RoPE makes attention depend on relative offset.
  • Compute KV cache size for Llama-3-70B at T=4096.
  • Explain why MLPs hold most parameters and most facts.
  • Choose pre-norm vs post-norm and defend the choice for a 60-layer model.
  • Solve longest-substring-without-repeating-characters with sliding window.
