ai · ml · intermediate · 8m · 2026-05-28

How Transformers actually attend

Beyond the textbook diagrams — what a single attention head is really computing, how multi-head splits the world, and why scaling laws keep rewarding bigger context.

The first time I drew an attention diagram on a whiteboard, I lied a little. Every box looked the same. Every arrow looked the same. The diagram said attention but the math was hidden behind opaque names like query, key, value.

Let me redraw it the way I wish someone had drawn it for me.

A token is a vector, and that's the whole game

Before attention runs, every token in your input is just a learned dense vector, typically 768 or 1024 numbers in BERT-sized models and several thousand in the largest ones. That vector carries:

some idea of what the token means (semantic content)
some idea of where it sits in the sequence (positional encoding)

Everything attention does is transform these vectors into better vectors — vectors that have looked sideways at their neighbors and picked up useful context.
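Here is a minimal sketch of "meaning plus position" in one vector. It assumes the additive sinusoidal scheme from the original Transformer; RoPE and ALiBi inject position differently. The embedding table, its values, and the names `sinusoidal_position` and `encode` are made up for illustration; in a real model the table is learned.

```python
import math

def sinusoidal_position(pos, d_model):
    """Sinusoidal position vector: sin on even indices, cos on odd ones."""
    vec = []
    for i in range(d_model):
        angle = pos / (10000 ** (2 * (i // 2) / d_model))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec

# Hypothetical toy embedding table (d_model = 4); real tables are learned.
embedding = {
    "the": [0.1, 0.3, -0.2, 0.5],
    "cat": [0.7, -0.1, 0.4, 0.0],
    "sat": [-0.3, 0.2, 0.6, 0.1],
}

def encode(tokens, d_model=4):
    """Return one input vector per token: semantic content plus position."""
    vectors = []
    for pos, tok in enumerate(tokens):
        position = sinusoidal_position(pos, d_model)
        vectors.append([e + p for e, p in zip(embedding[tok], position)])
    return vectors

vectors = encode(["the", "cat", "the"])
# The same word at positions 0 and 2 gets two different input vectors.
print(vectors[0] != vectors[2])  # True
```

The point of the toy: attention never sees "the word" on its own, only the vector, so two occurrences of the same word are already distinguishable before any head runs.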

Attention, in one sentence

For each token, look at every token in the sequence (itself included), decide how much each one matters, and blend their contents accordingly.

That's it. The Q/K/V trick is just a clean way to do that with matrices.

What Q, K, V actually mean

For each token vector x:

# x: one token vector; W_q, W_k, W_v: learned projection matrices (NumPy-style arrays)
q = x @ W_q  # "what am I looking for?"
k = x @ W_k  # "what do I offer?"
v = x @ W_v  # "what do I carry?"

Q is a search query the token sends out.
K is a label each token wears so it can be found.
V is the actual content a token contributes if matched.
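To make the three projections concrete, here is a small stdlib-only sketch. The weights are invented numbers and `vecmat` is a hypothetical helper; a real model learns W_q, W_k and W_v, and would use an array library rather than nested lists.

```python
def vecmat(x, W):
    """Row vector x (length d_in) times matrix W (d_in rows, d_out columns)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

x = [1.0, 0.0, 2.0]  # toy token vector, d_model = 3

# Hypothetical weights mapping d_model = 3 down to d_k = 2.
W_q = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_k = [[0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
W_v = [[2.0, 0.0], [0.0, 2.0], [0.0, 1.0]]

q = vecmat(x, W_q)  # what this token is looking for
k = vecmat(x, W_k)  # the label this token wears
v = vecmat(x, W_v)  # the content this token contributes

print(q, k, v)  # [2.0, 1.0] [2.0, 1.0] [2.0, 2.0]
```

Same input vector, three different matrices, three different roles. Nothing else distinguishes Q from K from V.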

Attention then computes, for every pair (i, j):

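score_ij = softmax_j( q_i · k_j  /  sqrt(d_k) )
output_i = sum_j score_ij * v_j

The softmax runs over j, so for each token i the scores sum to 1. d_k is the length of the key vectors; dividing by sqrt(d_k) keeps the dot products from growing with dimension and saturating the softmax.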
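The two formulas can be traced by hand on a tiny case. This is a sketch of plain (unmasked, single-head) scaled dot-product attention in stdlib Python; the Q, K, V values are invented so the result is easy to eyeball, and the function names are mine.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Return (outputs, weights) for lists of query, key and value vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # One scaled dot product per key: how well does q match each k_j?
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)  # normalised over j
        weights.append(w)
        # Blend the value vectors using those weights.
        outputs.append([sum(w[j] * V[j][c] for j in range(len(V)))
                        for c in range(len(V[0]))])
    return outputs, weights

# Toy case: token 0's query matches key 0, token 1's query matches key 1.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]

out, w = attention(Q, K, V)
for i, row in enumerate(w):
    print(i, [round(p, 3) for p in row])
```

Each row of `w` sums to 1, and each token leans toward the value whose key matched its query. The leaning is soft: the non-matching value still leaks in, which is exactly the "blend" in the one-sentence description.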

Multi-head: many specialized viewers

A single attention head learns one relational view. Real systems typically run 8 to 64 or more heads in parallel in every layer. Each head gets its own W_q, W_k, W_v, usually projecting into a smaller subspace (d_model divided by the number of heads), so head 3 might learn "subject → verb agreement" while head 11 might learn "match closing brackets across long distances".

You then concatenate every head's output and project once more with an output matrix (often written W_o). Multi-head is literally "run attention h times in parallel with different lenses, then mix the answers."
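As a rough sketch of that recipe, the snippet below runs two heads over three tokens and mixes them with an output projection. Everything here is an illustrative assumption: the weights are random rather than learned, the helper names are mine, and production code typically fuses the per-head matrices into one large projection and reshapes instead of looping.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def head_attention(X, W_q, W_k, W_v):
    """One head: project, score every pair, softmax each row, blend values."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def multi_head(X, heads, W_o):
    """Run every head, concatenate per token, then apply the output projection."""
    per_head = [head_attention(X, W_q, W_k, W_v) for W_q, W_k, W_v in heads]
    concat = [[x for out in per_head for x in out[i]] for i in range(len(X))]
    return matmul(concat, W_o)

rng = random.Random(0)  # fixed seed so the toy run is repeatable

def rand_matrix(rows, cols):
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]

n_tokens, d_model, n_heads = 3, 4, 2
d_head = d_model // n_heads  # each head works in a smaller subspace

X = rand_matrix(n_tokens, d_model)
heads = [(rand_matrix(d_model, d_head),
          rand_matrix(d_model, d_head),
          rand_matrix(d_model, d_head)) for _ in range(n_heads)]
W_o = rand_matrix(n_heads * d_head, d_model)

Y = multi_head(X, heads, W_o)
print(len(Y), len(Y[0]))  # 3 4
```

Note the shapes: three vectors of size 4 go in, three vectors of size 4 come out. That is what lets you stack the same block layer after layer.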

Why scaling keeps working

Each additional attention head, layer, and context slot expands the graph of relations the model can express. With standard attention the cost is quadratic in context length, because every token scores every other token, but the representational gain compounds. That's why we keep paying for longer windows: more context is what makes behaviors like long retrieval, multi-document reasoning, and code refactors across files practical.
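The quadratic part is easy to check with arithmetic. The sketch below just counts query-key scores for standard full attention; `score_entries` is a hypothetical helper, and it ignores everything else a layer does (projections, feed-forward, memory-saving kernels).

```python
def score_entries(n_tokens, n_heads=1):
    """Query-key scores computed by one layer of standard full attention."""
    return n_heads * n_tokens * n_tokens

for n in (1024, 2048, 4096):
    print(n, score_entries(n))
# 1024 1048576
# 2048 4194304
# 4096 16777216
```

Doubling the context quadruples the number of scores, which is why longer windows are something you pay for rather than get for free.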

Where to go next

Pair this with the Recommendation Systems journey — attention is the same trick we use to rank candidate items by relevance.
Read the next post on positional embeddings (sinusoidal vs RoPE vs ALiBi) for the part this post glossed over.
Try implementing 1 head, 1 layer, 1 batch in 40 lines of Python. The intuition lands when you watch the dot products with your own eyes.
