ai ml · advanced · 12m · 2026-05-30

Day 01 — Transformer Internals — Attention, Embeddings, Positional Encoding

Every modern LLM, agent and RAG stack rests on the transformer. Knowing how Q/K/V flow through multi-head attention with residual streams is the unlock for prom…

The transformer replaced recurrence with attention and a small set of repeated primitives. To use modern LLMs well you need to feel the data flow, not just the formulas.

🧠 Concept

Why it matters & the mental model.

1. The residual stream as a bus

Think of a decoder-only transformer as a fixed-width "residual stream" of shape (B, T, d_model). Every block reads the stream, writes a small update back in, and the stream carries the running interpretation of every token position. This is why deep transformers can be stacked: each block is a refinement, not a replacement.

2. Self-attention as soft routing

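Inside a block, attention is a content-addressable router. For each token we project the residual stream to Q, K, V of width d_head. The score softmax(QKᵀ / √d_k) is, under the causal mask, a probability distribution over the current and previous positions; we then take a weighted sum of values. The √d_k scaling keeps the pre-softmax logits in a regime where gradients flow. The dot product of two vectors with roughly unit-variance components has standard deviation about √d_k, so without scaling the logits spread over roughly ±11 at d_k=128, and roughly ±64 in the extreme case of a single head with d_k=4096. Softmax then saturates into a near one-hot distribution whose gradients are close to zero, which stalls learning.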
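The scaling argument above can be checked numerically. The following is a minimal NumPy sketch of single-head causal attention, not a production implementation; the function names `softmax` and `causal_attention` and the toy sizes are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum first so exp() cannot overflow.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_attention(q, k, v, scale=True):
    """Single-head attention. q, k, v have shape (T, d_k)."""
    T, d_k = q.shape
    scores = q @ k.T                      # (T, T) raw dot products
    if scale:
        scores = scores / np.sqrt(d_k)    # the sqrt(d_k) scaling from the text
    # Causal mask: position t may not look at positions > t.
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    weights = softmax(scores, axis=-1)    # each row is a distribution
    return weights @ v, weights

rng = np.random.default_rng(0)
T, d_k = 16, 128                          # illustrative sizes
q, k, v = (rng.standard_normal((T, d_k)) for _ in range(3))

out, w_scaled = causal_attention(q, k, v, scale=True)
_, w_raw = causal_attention(q, k, v, scale=False)

# Average of the largest weight per row: close to 1.0 means near one-hot.
print("scaled:", w_scaled.max(axis=-1).mean())
print("unscaled:", w_raw.max(axis=-1).mean())
```

With random unit-variance inputs, the unscaled rows should come out much closer to one-hot than the scaled ones, which is the saturation effect described above.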

Multi-head attention (MHA) repeats this h times in parallel with smaller d_head = d_model/h, then concatenates the head outputs and applies an output projection. Heads often specialise: some track syntax (subject-verb), some carry positional offsets, some implement induction-head copying patterns that drive in-context learning.
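To make the split, attend, concatenate, project sequence concrete, here is a shape-focused NumPy sketch. The weight names (`W_q`, `W_k`, `W_v`, `W_o`), the helper `split_heads`, and the sizes are illustrative assumptions; the weights are random, so only the shapes and the data flow are meaningful.

```python
import numpy as np

B, T, d_model, h = 2, 5, 64, 8            # illustrative sizes
d_head = d_model // h

rng = np.random.default_rng(0)
x = rng.standard_normal((B, T, d_model))  # the residual stream
W_q, W_k, W_v, W_o = (
    rng.standard_normal((d_model, d_model)) / np.sqrt(d_model) for _ in range(4)
)

def split_heads(t):
    # (B, T, d_model) -> (B, h, T, d_head)
    return t.reshape(B, T, h, d_head).transpose(0, 2, 1, 3)

q, k, v = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

# Scaled scores per head, with a causal mask broadcast over batch and heads.
scores = q @ k.transpose(0, 1, 3, 2) / np.sqrt(d_head)   # (B, h, T, T)
future = np.triu(np.ones((T, T), dtype=bool), k=1)
scores = np.where(future, -np.inf, scores)
scores = scores - scores.max(axis=-1, keepdims=True)     # stable softmax
weights = np.exp(scores)
weights /= weights.sum(axis=-1, keepdims=True)

heads = weights @ v                                      # (B, h, T, d_head)
concat = heads.transpose(0, 2, 1, 3).reshape(B, T, d_model)
out = concat @ W_o                                       # back to (B, T, d_model)

print(weights.shape, out.shape)
```

The output has the same shape as the input stream, which is what allows it to be added back as a residual update.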

🛠 Deep Dive

Internals, code, architecture.

3. Positional encoding — fixed vs learned vs RoPE

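Attention by itself has no notion of order (it is permutation-equivariant: shuffle the input tokens and the outputs shuffle the same way), so position must be injected. Three eras: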

  • Sinusoidal (original): PE(pos, 2i) = sin(pos / 10000^(2i/d)), PE(pos, 2i+1) = cos(pos / 10000^(2i/d)); deterministic with no learned parameters, but extrapolates poorly beyond the lengths seen in training.
  • Learned absolute (GPT-2): one row per position in an embedding table, so the table size is a hard cap on context.
  • Rotary (RoPE, used in Llama, Mistral, Qwen): rotates Q and K pairs in 2D subspaces by an angle ∝ position. The dot product q·k then depends only on the relative offset between the two positions, and you can scale the base frequency (NTK / YaRN) to extend context length without retraining from scratch.
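The relative-offset property of RoPE can be verified in a few lines. This is a minimal sketch assuming the interleaved pairing of dimensions (0,1), (2,3), ... and the common base of 10000; real implementations differ in how they pair dimensions, and the function name `rope` is hypothetical.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate a 1-D vector x (even length d) as if it sat at position pos."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2D pair
    angle = pos * freqs
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]                   # the two coordinates of each pair
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin             # standard 2D rotation
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# Same offset (3) at very different absolute positions.
near = rope(q, 10) @ rope(k, 7)
far = rope(q, 103) @ rope(k, 100)
print(near, far)
```

The two scores should agree up to floating-point error, because rotating both vectors by the same extra angle leaves their dot product unchanged. Rotation also preserves vector norms, so RoPE changes directions but not magnitudes.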

4. The MLP — where knowledge lives

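After attention each block has a position-wise MLP: Linear(d, 4d) → GELU → Linear(4d, d), or a gated SwiGLU variant in Llama-style models. With the 4× expansion the MLP holds about two thirds of each block's weights (8d² versus 4d² for attention), and much of the model's factual recall appears to live here; attention chooses what to mix, the MLP decides what to write back into the stream. Mechanistic interpretability work (Geva et al.'s "Transformer Feed-Forward Layers Are Key-Value Memories", alongside research at Anthropic and OpenAI) suggests MLPs act as key-value lookups: in the simplified picture, certain neurons fire on "Eiffel Tower" and write "Paris" into the residual stream.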
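The claim that most parameters sit in the MLP follows from simple arithmetic. The sketch below counts weight-matrix entries per block, assuming standard multi-head attention with four d×d projections and an ungated MLP with a 4× expansion; biases and norm parameters are ignored, and `block_params` is a hypothetical helper.

```python
def block_params(d_model, mlp_ratio=4):
    # Attention: W_q, W_k, W_v, W_o, each d_model x d_model.
    attn = 4 * d_model * d_model
    # MLP: Linear(d, ratio*d) plus Linear(ratio*d, d).
    mlp = 2 * mlp_ratio * d_model * d_model
    return attn, mlp

attn, mlp = block_params(4096)
print(attn, mlp)                 # 67108864 134217728
print(mlp / (attn + mlp))        # MLP share of the block is 2/3
```

Gated variants such as SwiGLU use three matrices with a smaller hidden width, so the exact split varies by model.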

5. LayerNorm and residual connections

Every sub-layer is wrapped: x = x + sublayer(LN(x)) (pre-norm) or x = LN(x + sublayer(x)) (post-norm). Pre-norm trains more stably for very deep stacks and is what Llama (with RMSNorm) and GPT-NeoX use. The residual is essential: it gives gradients a direct path from the logits back to the embedding, even through 96 layers.
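The difference between the two wrappers is easiest to see when the sublayer contributes nothing. This NumPy sketch assumes a LayerNorm without learnable gain or bias; the helper names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each row to zero mean and unit variance (no learned gain/bias).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def pre_norm(x, sublayer):
    return x + sublayer(layer_norm(x))    # the stream itself is never normalised

def post_norm(x, sublayer):
    return layer_norm(x + sublayer(x))    # the stream is re-normalised every time

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 32)) * 5.0    # a stream with large activations
silent = lambda h: np.zeros_like(h)       # a sublayer that writes nothing

y_pre = pre_norm(x, silent)
y_post = post_norm(x, silent)

print(np.array_equal(y_pre, x))           # pre-norm: identity path is untouched
print(np.allclose(y_post, x))             # post-norm: the stream was rescaled
```

In pre-norm the skip path is a pure identity, so the gradient with respect to x always contains an unmodified term. In post-norm every layer's normalisation sits on that path.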

🚀 In Practice

Trade-offs, exercises, what to ship today.

6. Encoder-only / decoder-only / encoder-decoder

  • Encoder-only (BERT): bidirectional attention, masked LM, used for classification, retrieval embeddings.
  • Decoder-only (GPT, Llama, Claude): causal mask, next-token prediction, the default for generation.
  • Encoder-decoder (T5, Flan-T5, Whisper): cross-attention from decoder into encoder output; great for translation, summarisation, ASR.
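At the attention level, the three families above differ mainly in which positions are allowed to see which. A small sketch of the corresponding boolean masks (True means "may attend"), with toy lengths chosen for illustration:

```python
import numpy as np

T_enc, T_dec = 5, 4   # illustrative sequence lengths

# Encoder-only (BERT-style): every position sees every position.
encoder_mask = np.ones((T_enc, T_enc), dtype=bool)

# Decoder-only (GPT-style): position t sees positions 0..t only.
causal_mask = np.tril(np.ones((T_dec, T_dec), dtype=bool))

# Encoder-decoder: decoder self-attention uses the causal mask, while
# cross-attention lets each decoder position see the full encoder output.
cross_mask = np.ones((T_dec, T_enc), dtype=bool)

print(causal_mask.astype(int))
```

Padding masks, which real batches also need, are left out of this sketch.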

7. Why this matters for agents

Agents repeatedly stuff long context (tools, memory, traces) into the model. The two pain points are attention compute that scales as O(T²) and the KV cache that grows linearly with T. Choosing models with grouped-query attention (GQA), which shrinks the KV cache by sharing K/V heads across query heads, using prefix caching, and trimming context aggressively are the practical levers you'll pull tomorrow.
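A back-of-the-envelope estimate shows why the KV cache matters and how much GQA helps. The configuration below (32 layers, d_head=128, 16-bit cache, 32k tokens) is a hypothetical example rather than the spec of any particular model, and `kv_cache_bytes` is an illustrative helper.

```python
def kv_cache_bytes(n_layers, n_kv_heads, d_head, seq_len, batch=1, bytes_per_elem=2):
    # Two cached tensors (K and V) per layer, each of shape
    # (batch, n_kv_heads, seq_len, d_head).
    return 2 * n_layers * n_kv_heads * d_head * seq_len * batch * bytes_per_elem

# Full multi-head attention: one K/V head per query head.
mha = kv_cache_bytes(n_layers=32, n_kv_heads=32, d_head=128, seq_len=32768)
# Grouped-query attention: 8 K/V heads shared across the 32 query heads.
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, d_head=128, seq_len=32768)

print(mha / 2**30, gqa / 2**30)   # 16.0 4.0 (GiB per sequence)
```

The cache is linear in seq_len, so halving the context an agent carries halves this cost, on top of whatever GQA saves.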

8. Hands-on checklist

  • Implement a 1-block GPT and overfit on tiny-shakespeare.
  • Print the attention weights for a single head on a 30-token input and look for syntactic structure.
  • Compare loss with and without √d_k scaling at d_head=128.
  • Read Anthropic's "A Mathematical Framework for Transformer Circuits" once the implementation makes sense.

    Resources

    LeetCode: Two Sum (Easy)


    Part of the 28-day 2026-05-30 prep program.

    Related concepts

    Attention · RoPE · KV cache · Mixture of Experts · Two-tower ranking