Day 01 — Transformer Internals — Attention, Embeddings, Positional Encoding
Every modern LLM, agent and RAG stack rests on the transformer. Knowing how Q/K/V flow through multi-head attention with residual streams is the unlock for prom…
The transformer replaced recurrence with attention and a small set of repeated primitives. To use modern LLMs well you need to feel the data flow, not just the formulas.
🧠 Concept
Why it matters & the mental model.
1. The residual stream as a bus
Think of a decoder-only transformer as a fixed-width "residual stream" of shape (B, T, d_model). Every block reads the stream, writes a small update back in, and the stream carries the running interpretation of every token position. This is why deep transformers can be stacked: each block is a refinement, not a replacement.
2. Self-attention as soft routing
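Inside a block, attention is a content-addressable router. For each token we project the residual stream to Q, K, V of width d_head (so d_k = d_head). With the causal mask applied, each row of softmax(QKᵀ / √d_k) is a probability distribution over the current and previous positions; the output is the matching weighted sum of values. The √d_k scaling keeps the pre-softmax logits in a regime where gradients flow. If the entries of Q and K have roughly unit variance, their dot products have standard deviation √d_k: about 11 at d_head=128, and 64 if d_k were as large as 4096. Without the scaling, logits of that size push softmax toward a near one-hot output, where its gradients are close to zero and learning stalls.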
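The routing described above can be sketched for a single head in NumPy. This is a minimal illustration and not any library's implementation: the function names, the random inputs, and the sizes T=8, d_k=128 are assumptions chosen for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability; exp(-inf) becomes 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v, scale=True):
    """Single-head attention. q, k, v have shape (T, d_k)."""
    T, d_k = q.shape
    logits = q @ k.T                                   # (T, T) similarity scores
    if scale:
        logits = logits / np.sqrt(d_k)                 # the sqrt(d_k) scaling
    future = np.triu(np.ones((T, T), dtype=bool), k=1) # True above the diagonal
    logits = np.where(future, -np.inf, logits)         # hide future positions
    weights = softmax(logits, axis=-1)                 # each row sums to 1
    return weights @ v, weights

# Illustrative random inputs (assumed sizes, unit-variance entries).
rng = np.random.default_rng(0)
T, d_k = 8, 128
q, k, v = (rng.standard_normal((T, d_k)) for _ in range(3))

out, w = causal_attention(q, k, v)
_, w_unscaled = causal_attention(q, k, v, scale=False)

# Average of the largest weight per row: higher means closer to one-hot.
print("scaled  :", w.max(axis=-1).mean())
print("unscaled:", w_unscaled.max(axis=-1).mean())
```

With unit-variance inputs the unscaled rows come out noticeably peakier, which is the saturation effect the scaling is meant to prevent.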
Multi-head attention (MHA) repeats this h times in parallel with smaller d_head = d_model/h, then concatenates and projects. Heads specialise: some track syntax (subject-verb), some carry positional offsets, some implement induction-head copying patterns that drive in-context learning.
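A rough sketch of the split, attend, concatenate, project sequence, again in NumPy. The weight names (w_q, w_k, w_v, w_o), the random initialisation, and the sizes are assumptions for illustration; real implementations add biases, dropout, and batching.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads):
    """Causal MHA. x: (T, d_model); every weight: (d_model, d_model)."""
    T, d_model = x.shape
    d_head = d_model // n_heads

    def split(t):
        # (T, d_model) -> (n_heads, T, d_head)
        return t.reshape(T, n_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (n_heads, T, T)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    logits = np.where(future, -np.inf, logits)            # mask broadcasts over heads
    weights = softmax(logits, axis=-1)
    heads = weights @ v                                   # (n_heads, T, d_head)
    concat = heads.transpose(1, 0, 2).reshape(T, d_model) # heads side by side again
    return concat @ w_o, weights

# Illustrative sizes and random weights (assumptions, not a trained model).
rng = np.random.default_rng(0)
T, d_model, n_heads = 6, 32, 4
x = rng.standard_normal((T, d_model))
w_q, w_k, w_v, w_o = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4))

out, weights = multi_head_attention(x, w_q, w_k, w_v, w_o, n_heads)
print(out.shape, weights.shape)
```

Each head attends in its own d_head-wide subspace, so the per-head weight matrices in `weights` are what you would plot when looking for specialised heads.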
🛠 Deep Dive
Internals, code, architecture.
3. Positional encoding — fixed vs learned vs RoPE
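Unmasked self-attention is permutation-equivariant: shuffle the input tokens and the outputs are shuffled the same way, so word order is invisible unless position is injected. Three eras: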
- Sinusoidal (original): PE(pos, 2i) = sin(pos / 10000^(2i/d)), deterministic, extrapolates poorly.
- Learned absolute (GPT-2): row in an embedding table, hard cap on context.
- Rotary (RoPE, used in Llama, Mistral, Qwen): rotates Q and K pairs in 2D subspaces by an angle ∝ position. Relative offsets fall out naturally from QᵀK, and you can scale the base frequency (NTK / YaRN) to extend context length without retraining from scratch.
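The relative-offset property of RoPE can be checked numerically. The sketch below assumes the interleaved pairing (dimensions 2i and 2i+1 form one 2D subspace) and a base of 10000; some implementations pair dimensions differently, and the helper names here are illustrative.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate consecutive pairs of x by position-dependent angles.

    x: (T, d) with even d; positions: (T,) token positions.
    """
    T, d = x.shape
    inv_freq = base ** (-np.arange(0, d, 2) / d)      # one frequency per pair, (d/2,)
    angles = positions[:, None] * inv_freq[None, :]   # (T, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin         # standard 2D rotation
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((1, 64))
k = rng.standard_normal((1, 64))

def score(m, n):
    # Attention logit between a query at position m and a key at position n.
    return (rope(q, np.array([m])) @ rope(k, np.array([n])).T).item()

near = score(5, 2)      # offset of 3 near the start of the sequence
far = score(105, 102)   # the same offset of 3, one hundred tokens later
print(near, far)        # the two logits agree up to floating-point error
```

Because each rotation preserves vector length and only the angle difference survives the dot product, the logit depends on the offset between positions and not on where the pair sits in the sequence.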
4. The MLP — where knowledge lives
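After attention each block has a position-wise MLP: Linear(d, 4d) → GELU/SwiGLU → Linear(4d, d). (SwiGLU variants add a gated third projection and usually shrink the hidden width to compensate.) Roughly two thirds of a block's parameters sit here, and much of the model's factual knowledge appears to as well; attention chooses what to mix, the MLP decides what to write back into the stream. Interpretability research, for example Geva et al.'s "Transformer Feed-Forward Layers Are Key-Value Memories", suggests MLPs behave like key-value lookups: certain neurons fire on "Eiffel Tower" and write "Paris" into the residual stream.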
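A minimal sketch of the plain two-matrix MLP and the parameter arithmetic behind the "most of the parameters" claim. It assumes the GELU variant with a 4x hidden width, the tanh approximation of GELU, and random placeholder weights; the names are illustrative.

```python
import numpy as np

def gelu(x):
    # Widely used tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp(x, w_up, b_up, w_down, b_down):
    """Position-wise MLP: Linear(d, 4d) -> GELU -> Linear(4d, d)."""
    return gelu(x @ w_up + b_up) @ w_down + b_down

d = 64                                   # assumed toy width
rng = np.random.default_rng(0)
w_up = rng.standard_normal((d, 4 * d)) * 0.02
b_up = np.zeros(4 * d)
w_down = rng.standard_normal((4 * d, d)) * 0.02
b_down = np.zeros(d)

x = rng.standard_normal((10, d))         # 10 token positions
y = mlp(x, w_up, b_up, w_down, b_down)

# Weight-matrix counts per block, ignoring biases and norms.
attn_params = 4 * d * d                  # W_q, W_k, W_v, W_o
mlp_params = w_up.size + w_down.size     # 8 * d * d
print(mlp_params / (attn_params + mlp_params))   # fraction of block weights in the MLP
```

The same matrices are applied to every row independently, which is what "position-wise" means: the MLP never mixes information across tokens.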
5. LayerNorm and residual connections
Every sub-layer is wrapped: x = x + sublayer(LN(x)) (pre-norm) or x = LN(x + sublayer(x)) (post-norm). Pre-norm trains more stably for very deep stacks (Llama, GPT-NeoX). The residual is essential — it lets gradients flow directly from logits back to the embedding even through 96 layers.
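The two wrappings can be compared side by side. This sketch omits LayerNorm's learned gain and bias and uses a do-nothing sublayer as a stand-in, both assumptions made to keep the contrast visible.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each position to zero mean and unit variance (no learned gain/bias).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, sublayer):
    # The residual path is untouched: x passes straight through the addition.
    return x + sublayer(layer_norm(x))

def post_norm_block(x, sublayer):
    # The residual path itself is renormalised on every block.
    return layer_norm(x + sublayer(x))

def zero_sublayer(h):
    # Stand-in for an attention or MLP sublayer that writes nothing.
    return np.zeros_like(h)

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16)) * 3.0 + 1.0

pre = pre_norm_block(x, zero_sublayer)
post = post_norm_block(x, zero_sublayer)

print(np.allclose(pre, x))    # True: pre-norm keeps an exact identity path
print(np.allclose(post, x))   # False: post-norm rescales the stream itself
```

The identity path in the pre-norm form is the direct route that gradients take from the logits back to the embeddings.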
🚀 In Practice
Trade-offs, exercises, what to ship today.
6. Encoder-only / decoder-only / encoder-decoder
- Encoder-only (BERT): bidirectional attention, masked LM, used for classification, retrieval embeddings.
- Decoder-only (GPT, Llama, Claude): causal mask, next-token prediction, the default for generation.
- Encoder-decoder (T5, Flan-T5, Whisper): cross-attention from decoder into encoder output; great for translation, summarisation, ASR.
7. Why this matters for agents
Agents repeatedly stuff long context (tools, memory, traces) into the model. The two pain points are O(T²) attention and the KV cache that grows linearly with T. Choosing GQA models, using prefix caching, and trimming context aggressively are the practical levers you'll pull tomorrow.
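A back-of-the-envelope calculation shows why the KV cache and GQA matter. The configuration below (32 layers, d_head of 128, fp16, 32 versus 8 KV heads, 32,768 tokens) is a hypothetical example and not the spec of any particular model.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_elem=2):
    """Memory for cached keys and values; grows linearly with seq_len."""
    # Leading 2: one K tensor and one V tensor per layer.
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem

GIB = 2**30

# Hypothetical model with full multi-head attention: 32 KV heads.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=32_768)

# Same model with grouped-query attention: 8 KV heads shared by the query heads.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, seq_len=32_768)

print(mha / GIB, gqa / GIB)   # 16.0 4.0
```

Under these assumptions a single 32k-token sequence needs 16 GiB of cache with full MHA and 4 GiB with GQA, and halving the context halves either number.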
8. Hands-on checklist
- Implement a 1-block GPT and overfit on tiny-shakespeare.
- Print the attention weights for a single head on a 30-token input and look for syntactic structure.
- Compare loss with and without √d_k scaling at d_head=128.
- Read Anthropic's "A Mathematical Framework for Transformer Circuits" once the implementation makes sense.
Resources
- 🎥 3Blue1Brown — Attention in transformers, step by step
- 📖 The Illustrated Transformer — Jay Alammar
- 📖 Andrej Karpathy — Let's build GPT from scratch
- 📖 Attention Is All You Need (paper)
LeetCode: Two Sum (Easy)
Part of the 28-day 2026-05-30 prep program.