CAP, PACELC, Quorums — How Distributed Systems Actually Trade Off
Session 12 of the 48-session learning series.
Why this session matters
This is Session 12 of 48 in the System Design track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.
Agenda
- CAP — what it actually says (and what it doesn’t)
- PACELC — the 'else' Brewer left out and why it matters in practice
- Consensus 101 — Raft in one diagram + the role of quorums
- Real systems on the CAP map — Dynamo, Cassandra, Spanner, Postgres
- Practical guidance — pick consistency at the operation level, not the system level
Pre-read (skim before the session)
- Martin Kleppmann — Distributed Systems lectures
- Brewer 2012 — CAP twelve years later
- Abadi 2012 — Consistency Tradeoffs in Modern Distributed DB Design
- Raft paper — In Search of an Understandable Consensus Algorithm
Deep dive
1. CAP — what it actually says
Brewer's CAP theorem: in the presence of a network partition, you must choose between consistency (every read sees the latest write) and availability (every request gets a non-error response).
What people get wrong:
- CAP is not a system-wide property. It applies per operation, per partition event.
- "AP" doesn’t mean inconsistent forever; it means you'll eventually converge.
- "CP" doesn’t mean infinitely consistent; it means you reject writes when you can't guarantee it.
During normal operation (no partition), modern systems are both consistent AND available. CAP only fires during a partition.
2. PACELC — the 'else' clause
Abadi 2012 extended CAP:
If there is a Partition, trade Availability vs Consistency. Else (during normal operation), trade Latency vs Consistency.
This is closer to how engineers actually feel the trade-off. Linearisable systems pay latency for every write (cross-region round-trips, consensus rounds); eventually-consistent systems are fast all the time but converge later.
Partition? Choose
---------- ------
yes ———▶ A or C
no ———▶ L or C
3. Quorums
A quorum is the minimum number of replicas that must respond for a read or write. For N replicas, common settings:
W= writes acked by W replicasR= reads gather R replicas- Strong consistency if
W + R > N(writes and reads always overlap).
Examples:
- N=3, W=3, R=1 — strong, slow writes, fast reads. Postgres replication w/ sync replica.
- N=3, W=2, R=2 — strong, balanced. Cassandra QUORUM.
- N=3, W=1, R=1 — fast, eventually consistent. Cassandra ONE.
4. Raft in one diagram
Raft is the most popular consensus algorithm in production (etcd, Consul, TiKV, CockroachDB, KRaft, MongoDB internals).
All nodes hold a replicated log.
One leader at a time.
Client → Leader: "append entry X"
Leader → Followers: AppendEntries(X)
Followers respond OK
Once majority OK, leader commits, applies to state machine,
responds to client.
Elections:
Followers heartbeat-timeout → candidate → request votes →
if majority votes, becomes leader.
Properties:
- Safety — at most one leader per term; committed entries never lost.
- Liveness — makes progress as long as a majority is up and can talk.
- Linearisability — the log defines a single global order of operations.
Failure handling: leader dies → election (~150–300 ms typical) → new leader; clients retry.
5. Real systems on the CAP/PACELC map
| System | Default consistency model | Notes |
|---|---|---|
| Postgres (single) | Strong + ACID | Not distributed in this mode |
| Postgres + sync replica | Strong (CP, sync writes) | Pays latency to one replica |
| Cassandra | Tunable (per-op consistency level) | AP at QUORUM with N=3, R=1, W=1 — eventual |
| DynamoDB | Eventual by default; consistent reads opt-in | AP/CP per-op |
| MongoDB (replica set) | Linearisable reads if you opt in | Defaults to majority reads |
| etcd / Consul / Zookeeper | Strong, linearisable (Raft / Zab) | CP |
| Spanner / CockroachDB | Strong, externally consistent (TrueTime / HLC + Raft) | CP, low-latency multi-region |
| Kafka | Per-partition ordered + replicated | Strong within a partition; configurable on losses |
The one big shift in the last decade: Spanner / CockroachDB make CP across regions practical thanks to synchronised clocks and Raft. You no longer have to give up consistency to scale geographically — you just pay a few extra ms.
6. Choosing consistency per operation, not per system
A mature system picks consistency per call:
- Payments capture → linearisable (consensus).
- User profile read → strong leader-read.
- Friend count → eventual (off by 1 doesn’t matter).
- Notifications fanout → eventual + idempotent.
- Ledger writes → strong + transactional.
- Activity feed → eventual + bounded staleness.
DynamoDB, Cosmos, Cassandra all expose per-call consistency knobs precisely because one number for the whole system is wrong.
7. Sequence — a CP write with quorum
client ── write X ──▶ leader
leader ──── X ──▶ follower 1
leader ──── X ──▶ follower 2
leader ──── X ──▶ follower 3 (slow / partitioned)
leader waits for majority (2 of 3 followers) + self = 3 of 4
leader commits, replies OK
If two followers can't be reached: the leader stalls writes (CP). Cassandra at LOCAL_QUORUM would still serve if local DC quorum is up — same trade made differently.
8. Failures you should be ready to talk about
- Split brain — two nodes both think they're leader. Cured by Raft’s majority + term number.
- Stale reads after failover — if you read from a follower right after a leader change. Use read-index / leader-lease.
- Network partition — minority side stops serving writes (CP) or keeps serving and reconciles later (AP).
- Clock skew — breaks pseudo-orderings; Spanner uses TrueTime, CockroachDB uses hybrid logical clocks.
- Slow replica — raises tail latency in W=all setups. Mitigate with majority quorum + read-repair.
9. Hands-on (30 min)
Spin up a 3-node etcd cluster in Docker and play with it:
docker network create etcd
for i in 1 2 3; do
docker run -d --network etcd --name etcd$i quay.io/coreos/etcd:v3.5 \
/usr/local/bin/etcd \
--name etcd$i \
--initial-cluster etcd1=http://etcd1:2380,etcd2=http://etcd2:2380,etcd3=http://etcd3:2380 \
--initial-advertise-peer-urls http://etcd$i:2380 \
--listen-peer-urls http://0.0.0.0:2380 \
--advertise-client-urls http://etcd$i:2379 \
--listen-client-urls http://0.0.0.0:2379
done
Now:
etcdctl put /k vand read from any node — linearisable.docker pause etcd1(kill the leader) and watch failover; writes resume in <1 s.- Pause 2 of 3; writes stall (CP).
Feel the trade-off in real time. It’ll change how you talk about it in interviews.
10. What's next
This session is foundational — every later SYS session (chat, news feed, search, distributed job queue) will lean on these terms. From here you can read papers like Raft, Paxos Made Simple, Spanner, and Dynamo without bouncing off the abstract.
Reading material
Books:
- Designing Data-Intensive Applications — Martin Kleppmann (ch. 5: Replication, ch. 8: The Trouble with Distributed Systems, ch. 9: Consistency and Consensus — the entire core of this session)
- Database Internals — Alex Petrov (Part II on distributed systems)
- Distributed Systems: Principles and Paradigms — Tanenbaum, Van Steen (the textbook)
Papers:
- Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services (Gilbert & Lynch, 2002) — the proof of CAP.
- Problems with CAP, and Yahoo's Little Known NoSQL System (Daniel Abadi) — the original PACELC proposal.
- CAP Twelve Years Later: How the 'Rules' Have Changed (Eric Brewer, 2012) — Brewer's own retrospective.
- Spanner: Google's Globally-Distributed Database (OSDI 2012) — TrueTime, the case for CP at planet scale.
Official docs:
Blog posts:
- Jepsen — distributed system safety analyses — Kyle Kingsbury's tests of every major system; the empirical view.
- You Can't Sacrifice Partition Tolerance — codahale.com — the classic essay arguing CAP is misunderstood.
In-depth research material
- jepsen — github.com/jepsen-io/jepsen — the test framework behind the Jepsen analyses.
- Distributed Systems Reading List — github.com/theanalyst/awesome-distributed-systems — ~10k ★, curated papers.
- CMU 15-440 Distributed Systems — course page — assignments + lecture notes.
- MIT 6.824 Distributed Systems — course page (Robert Morris) — the gold-standard university course; lectures on YouTube.
- Software Engineering Daily — Kleppmann episode
- Latent Space — Spanner deep dive
Videos
- Distributed Systems 1.1: Introduction — Martin Kleppmann (Cambridge) · 15 min — the start of the best distributed-systems lecture series on the internet, free.
- Distributed Systems Course — full 6-hour lecture series — Scientific Programming School · 6 h 23 min — Kleppmann's Cambridge series concatenated; jump to the consistency chapters.
- CAP Theorem Simplified — ByteByteGo · 6 min — the 5-min C/A/P refresher; Alex Xu's signature animation style.
- Designing Data-intensive Applications with Martin Kleppmann — The Pragmatic Engineer · 1 h 26 min — the man himself on the book; covers CAP, consensus, replication.
- The Two Generals' Problem — Tom Scott · 8 min — the only watchable explanation of why "reliable messaging over an unreliable network" is provably impossible. 7.8M views; mandatory.
LeetCode — Verifying An Alien Dictionary
- Link: https://leetcode.com/problems/verifying-an-alien-dictionary/
- Difficulty: Medium
- Why this problem: Build char→rank map from order; check each adjacent word pair.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- State CAP precisely — including the partition precondition.
- Explain why PACELC is a more useful framing day-to-day.
- Compute quorum requirements for given N, R, W.
- Draw the Raft leader-election + AppendEntries flow.
- Place 5 real systems on the CAP/PACELC map.
- Solve
verifying-an-alien-dictionary— partial-order reasoning, same shape as CRDT merge ordering.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.