How Transformers actually attend
Beyond the textbook diagrams — what a single attention head is really computing, how multi-head splits the world, and why scaling laws keep rewarding bigger context.
The first time I drew an attention diagram on a whiteboard, I lied a little. Every box looked the same. Every arrow looked the same. The diagram said attention but the math was hidden behind opaque names like query, key, value.
Let me redraw it the way I wish someone had drawn it for me.
A token is a vector, and that's the whole game
Before attention runs, every token in your input is just a learned dense vector, typically 768 or 1,024 numbers in mid-sized models and several thousand in the largest ones. That vector carries:
- some idea of what the token means (semantic content)
- some idea of where it sits in the sequence (positional encoding)
Everything attention does is transform these vectors into better vectors — vectors that have looked sideways at their neighbors and picked up useful context.
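To make "a token is a vector" concrete, here is a rough sketch of how such a vector might be assembled. It assumes a tiny made-up vocabulary, a random table standing in for learned embeddings, and the sinusoidal scheme for position (one of several options). The names `vocab`, `embedding_table`, and `embed` are mine, purely for illustration.

```python
import math
import random


def sinusoidal_position(pos, d_model):
    """Sinusoidal positional encoding for a single position."""
    enc = []
    for i in range(d_model):
        # Even/odd pairs share one frequency: sin on even slots, cos on odd.
        angle = pos / (10000 ** ((i - i % 2) / d_model))
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


random.seed(0)
d_model = 8  # toy size; real models use hundreds or thousands
vocab = {"the": 0, "cat": 1, "sat": 2}  # hypothetical toy vocabulary

# Stand-in for a learned embedding table: one random vector per token id.
embedding_table = [[random.gauss(0, 1) for _ in range(d_model)] for _ in vocab]


def embed(tokens):
    """Token vector = semantic embedding + positional encoding."""
    vectors = []
    for pos, tok in enumerate(tokens):
        semantic = embedding_table[vocab[tok]]
        positional = sinusoidal_position(pos, d_model)
        vectors.append([s + p for s, p in zip(semantic, positional)])
    return vectors


x = embed(["the", "cat", "sat", "the"])
# The two "the" tokens share a semantic vector but sit at different positions,
# so their final vectors differ.
print(x[0] != x[3])
```

Adding the two parts is only one way to combine them; other schemes inject position inside the attention computation instead.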
Attention, in one sentence
For each token, look at every token in the sequence (itself included), decide how much each one matters, and blend them.
That's it. The Q/K/V trick is just a clean way to do that with matrices.
What Q, K, V actually mean
For each token vector x:
q = x @ W_q  # "what am I looking for?"
k = x @ W_k  # "what do I offer?"
v = x @ W_v  # "what do I carry?"
- Q is a search query the token sends out.
- K is a label each token wears so it can be found.
- V is the actual content a token contributes if matched.
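Attention then computes a weight for every pair (i, j), with the softmax taken over j so that each token's weights sum to 1:
score_ij = softmax_j( q_i · k_j / sqrt(d_k) )
output_i = sum_j score_ij * v_j
Here d_k is the length of the key vectors; dividing by sqrt(d_k) keeps the dot products from growing with vector size and saturating the softmax.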
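To watch those formulas run, here is a minimal single-head sketch in plain Python lists, so every dot product is visible. It assumes identity matrices for W_q, W_k, and W_v (a real head learns them), and the helper names `matvec`, `softmax`, and `attention_head` are mine. In practice you would reach for NumPy or PyTorch.

```python
import math


def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_head(X, W_q, W_k, W_v):
    """One attention head over a list of token vectors X."""
    Q = [matvec(x, W_q) for x in X]
    K = [matvec(x, W_k) for x in X]
    V = [matvec(x, W_v) for x in X]
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Scaled dot product of this query against every key.
        raw = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(raw)
        weights.append(w)
        # Blend the value vectors using the attention weights.
        outputs.append(
            [sum(w[j] * V[j][c] for j in range(len(V))) for c in range(len(V[0]))]
        )
    return outputs, weights


# Toy input: 3 tokens, 2 dimensions; identity projections are an assumption.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
out, w = attention_head(X, I2, I2, I2)
for row in w:
    print([round(p, 3) for p in row])
```

With identity projections, token 0 attends equally to itself and token 2 (both share its first dimension) and less to token 1, which is orthogonal to it.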
Multi-head: many specialized viewers
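A single attention head learns one relational view. Real systems run many heads in parallel in every layer, commonly from 8 up to around 100. Each head gets its own W_q, W_k, W_v, usually projecting into a smaller subspace of d_model / h dimensions (h being the number of heads), so the total cost stays close to that of one full-width head. Head 3 might learn "subject → verb agreement" while head 11 might learn "match closing brackets across long distances".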
You then concatenate every head's output and project once more with an output matrix. Multi-head is literally "run attention h times in parallel with different lenses, then mix the answers," where h is the number of heads.
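Here is a rough sketch of that concatenate-and-project step, again in plain Python. It assumes random matrices in place of learned weights, an output matrix I call `W_o`, and heads of size d_model / n_heads; all helper names are mine.

```python
import math
import random


def matmul(A, B):
    """Plain-Python matrix product of A (n x m) and B (m x p)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def head(X, W_q, W_k, W_v):
    """One attention head: project, score, normalize, blend."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, V)


def multi_head(X, heads, W_o):
    """Run every head, concatenate per token, then apply the output projection."""
    per_head = [head(X, *weights) for weights in heads]
    concat = [sum((h[i] for h in per_head), []) for i in range(len(X))]
    return matmul(concat, W_o)


def rand_matrix(rows, cols, rng):
    """Random stand-in for a learned weight matrix."""
    return [[rng.gauss(0, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n_tokens, d_model, n_heads = 4, 8, 2
d_head = d_model // n_heads  # each head works in a smaller subspace

X = rand_matrix(n_tokens, d_model, rng)
heads = [
    tuple(rand_matrix(d_model, d_head, rng) for _ in range(3))  # (W_q, W_k, W_v)
    for _ in range(n_heads)
]
W_o = rand_matrix(d_model, d_model, rng)

Y = multi_head(X, heads, W_o)
print(len(Y), len(Y[0]))  # one d_model-sized vector per token
```

Real implementations typically compute all heads in one batched matrix multiply rather than a Python loop; the loop here is only to keep each head visible.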
Why scaling keeps working
Each additional attention head, layer, and context slot expands the set of relations the model can express. The attention cost grows quadratically with context length (doubling the window quadruples the number of token pairs to score), but the representational gain compounds. That's why we keep paying for longer windows: more context makes qualitatively new behaviors practical (long-range retrieval, multi-document reasoning, code refactors across files).
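The quadratic cost is easy to see with a little arithmetic. This sketch counts only the (i, j) score entries of full, dense self-attention, ignoring everything else in the model; the function name is mine.

```python
def attention_score_entries(context_len, n_heads=1, n_layers=1):
    """Number of (i, j) score entries in full (dense) self-attention."""
    return context_len * context_len * n_heads * n_layers


for n in (1_024, 2_048, 4_096, 8_192):
    print(f"{n:>5} tokens -> {attention_score_entries(n):>12,} scores per head per layer")

# Doubling the context quadruples the number of pairs to score.
print(attention_score_entries(2_048) // attention_score_entries(1_024))  # 4
```

Sparse or windowed attention variants change this count, so treat it as the baseline for the dense case only.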
Where to go next
- Pair this with the Recommendation Systems journey — attention is the same trick we use to rank candidate items by relevance.
- Read the next post on positional embeddings (sinusoidal vs RoPE vs ALiBi) for the part this post glossed over.
- Try implementing 1 head, 1 layer, 1 batch in 40 lines of Python. The intuition lands when you watch the dot products with your own eyes.