system designintermediate 12m2026-06-09

CAP, PACELC, Quorums — How Distributed Systems Actually Trade Off

Session 12 of the 48-session learning series.

Why this session matters

This is Session 12 of 48 in the System Design track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.

Agenda

CAP — what it actually says (and what it doesn’t)
PACELC — the 'else' Brewer left out and why it matters in practice
Consensus 101 — Raft in one diagram + the role of quorums
Real systems on the CAP map — Dynamo, Cassandra, Spanner, Postgres
Practical guidance — pick consistency at the operation level, not the system level

Pre-read (skim before the session)

Brewer's CAP theorem: in the presence of a network partition, you must choose between consistency (every read sees the latest write) and availability (every request gets a non-error response).

What people get wrong:

CAP is not a system-wide property. It applies per operation, per partition event.
"AP" doesn’t mean inconsistent forever; it means you'll eventually converge.
"CP" doesn’t mean infinitely consistent; it means you reject writes when you can't guarantee it.

During normal operation (no partition), modern systems are both consistent AND available. CAP only fires during a partition.

2. PACELC — the 'else' clause

Abadi 2012 extended CAP:

If there is a Partition, trade Availability vs Consistency. Else (during normal operation), trade Latency vs Consistency.

This is closer to how engineers actually feel the trade-off. Linearisable systems pay latency for every write (cross-region round-trips, consensus rounds); eventually-consistent systems are fast all the time but converge later.

  Partition?       Choose
  ----------       ------
     yes    ———▶   A  or  C
      no    ———▶   L  or  C

3. Quorums

A quorum is the minimum number of replicas that must respond for a read or write. For N replicas, common settings:

W = writes acked by W replicas
R = reads gather R replicas
Strong consistency if W + R > N (writes and reads always overlap).

Examples:

N=3, W=3, R=1 — strong, slow writes, fast reads. Postgres replication w/ sync replica.
N=3, W=2, R=2 — strong, balanced. Cassandra QUORUM.
N=3, W=1, R=1 — fast, eventually consistent. Cassandra ONE.

4. Raft in one diagram

Raft is the most popular consensus algorithm in production (etcd, Consul, TiKV, CockroachDB, KRaft, MongoDB internals).

All nodes hold a replicated log.
One leader at a time.

  Client → Leader: "append entry X"
  Leader → Followers: AppendEntries(X)
  Followers respond OK
  Once majority OK, leader commits, applies to state machine,
  responds to client.

Elections:
  Followers heartbeat-timeout → candidate → request votes →
  if majority votes, becomes leader.

Properties:

Safety — at most one leader per term; committed entries never lost.
Liveness — makes progress as long as a majority is up and can talk.
Linearisability — the log defines a single global order of operations.

Failure handling: leader dies → election (~150–300 ms typical) → new leader; clients retry.

5. Real systems on the CAP/PACELC map

System	Default consistency model	Notes
Postgres (single)	Strong + ACID	Not distributed in this mode
Postgres + sync replica	Strong (CP, sync writes)	Pays latency to one replica
Cassandra	Tunable (per-op consistency level)	AP at QUORUM with N=3, R=1, W=1 — eventual
DynamoDB	Eventual by default; consistent reads opt-in	AP/CP per-op
MongoDB (replica set)	Linearisable reads if you opt in	Defaults to majority reads
etcd / Consul / Zookeeper	Strong, linearisable (Raft / Zab)	CP
Spanner / CockroachDB	Strong, externally consistent (TrueTime / HLC + Raft)	CP, low-latency multi-region
Kafka	Per-partition ordered + replicated	Strong within a partition; configurable on losses

The one big shift in the last decade: Spanner / CockroachDB make CP across regions practical thanks to synchronised clocks and Raft. You no longer have to give up consistency to scale geographically — you just pay a few extra ms.

6. Choosing consistency per operation, not per system

A mature system picks consistency per call:

Payments capture → linearisable (consensus).
User profile read → strong leader-read.
Friend count → eventual (off by 1 doesn’t matter).
Notifications fanout → eventual + idempotent.
Ledger writes → strong + transactional.
Activity feed → eventual + bounded staleness.

DynamoDB, Cosmos, Cassandra all expose per-call consistency knobs precisely because one number for the whole system is wrong.

7. Sequence — a CP write with quorum

client ── write X ──▶ leader
                     leader ──── X ──▶ follower 1
                     leader ──── X ──▶ follower 2
                     leader ──── X ──▶ follower 3 (slow / partitioned)
                     leader waits for majority (2 of 3 followers) + self = 3 of 4
                     leader commits, replies OK

If two followers can't be reached: the leader stalls writes (CP). Cassandra at LOCAL_QUORUM would still serve if local DC quorum is up — same trade made differently.

8. Failures you should be ready to talk about

Split brain — two nodes both think they're leader. Cured by Raft’s majority + term number.
Stale reads after failover — if you read from a follower right after a leader change. Use read-index / leader-lease.
Network partition — minority side stops serving writes (CP) or keeps serving and reconciles later (AP).
Clock skew — breaks pseudo-orderings; Spanner uses TrueTime, CockroachDB uses hybrid logical clocks.
Slow replica — raises tail latency in W=all setups. Mitigate with majority quorum + read-repair.

9. Hands-on (30 min)

Spin up a 3-node etcd cluster in Docker and play with it:

docker network create etcd
for i in 1 2 3; do
  docker run -d --network etcd --name etcd$i quay.io/coreos/etcd:v3.5 \
    /usr/local/bin/etcd \
    --name etcd$i \
    --initial-cluster etcd1=http://etcd1:2380,etcd2=http://etcd2:2380,etcd3=http://etcd3:2380 \
    --initial-advertise-peer-urls http://etcd$i:2380 \
    --listen-peer-urls http://0.0.0.0:2380 \
    --advertise-client-urls http://etcd$i:2379 \
    --listen-client-urls http://0.0.0.0:2379
done

Now:

etcdctl put /k v and read from any node — linearisable.
docker pause etcd1 (kill the leader) and watch failover; writes resume in <1 s.
Pause 2 of 3; writes stall (CP).

Feel the trade-off in real time. It’ll change how you talk about it in interviews.

10. What's next

This session is foundational — every later SYS session (chat, news feed, search, distributed job queue) will lean on these terms. From here you can read papers like Raft, Paxos Made Simple, Spanner, and Dynamo without bouncing off the abstract.

Reading material

Books:

Designing Data-Intensive Applications — Martin Kleppmann (ch. 5: Replication, ch. 8: The Trouble with Distributed Systems, ch. 9: Consistency and Consensus — the entire core of this session)
Database Internals — Alex Petrov (Part II on distributed systems)
Distributed Systems: Principles and Paradigms — Tanenbaum, Van Steen (the textbook)

Papers:

Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services (Gilbert & Lynch, 2002) — the proof of CAP.
Problems with CAP, and Yahoo's Little Known NoSQL System (Daniel Abadi) — the original PACELC proposal.
CAP Twelve Years Later: How the 'Rules' Have Changed (Eric Brewer, 2012) — Brewer's own retrospective.
Spanner: Google's Globally-Distributed Database (OSDI 2012) — TrueTime, the case for CP at planet scale.

Official docs:

Blog posts:

Jepsen — distributed system safety analyses — Kyle Kingsbury's tests of every major system; the empirical view.
You Can't Sacrifice Partition Tolerance — codahale.com — the classic essay arguing CAP is misunderstood.

In-depth research material

jepsen — github.com/jepsen-io/jepsen — the test framework behind the Jepsen analyses.
Distributed Systems Reading List — github.com/theanalyst/awesome-distributed-systems — ~10k ★, curated papers.
CMU 15-440 Distributed Systems — course page — assignments + lecture notes.
MIT 6.824 Distributed Systems — course page (Robert Morris) — the gold-standard university course; lectures on YouTube.
Software Engineering Daily — Kleppmann episode
Latent Space — Spanner deep dive

Videos

Distributed Systems 1.1: Introduction — Martin Kleppmann (Cambridge) · 15 min — the start of the best distributed-systems lecture series on the internet, free.
Distributed Systems Course — full 6-hour lecture series — Scientific Programming School · 6 h 23 min — Kleppmann's Cambridge series concatenated; jump to the consistency chapters.
CAP Theorem Simplified — ByteByteGo · 6 min — the 5-min C/A/P refresher; Alex Xu's signature animation style.
Designing Data-intensive Applications with Martin Kleppmann — The Pragmatic Engineer · 1 h 26 min — the man himself on the book; covers CAP, consensus, replication.
The Two Generals' Problem — Tom Scott · 8 min — the only watchable explanation of why "reliable messaging over an unreliable network" is provably impossible. 7.8M views; mandatory.

LeetCode — Verifying An Alien Dictionary

Link: https://leetcode.com/problems/verifying-an-alien-dictionary/
Difficulty: Medium
Why this problem: Build char→rank map from order; check each adjacent word pair.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

State CAP precisely — including the partition precondition.
Explain why PACELC is a more useful framing day-to-day.
Compute quorum requirements for given N, R, W.
Draw the Raft leader-election + AppendEntries flow.
Place 5 real systems on the CAP/PACELC map.
Solve verifying-an-alien-dictionary — partial-order reasoning, same shape as CRDT merge ordering.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.

← previous

SOLID Part 2 — ISP, DIP, and Design Patterns (Strategy, Factory, Observer)

RAG Part 2 — Retrieval, Re-Ranking, Generation, Evaluation