Search Tech Journey

Find topics, journeys and posts

back to blog
system designintermediate 12m2026-06-09

CAP, PACELC, Quorums — How Distributed Systems Actually Trade Off

Session 12 of the 48-session learning series.

Why this session matters

This is Session 12 of 48 in the System Design track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.

Agenda

  • CAP — what it actually says (and what it doesn’t)
  • PACELC — the 'else' Brewer left out and why it matters in practice
  • Consensus 101 — Raft in one diagram + the role of quorums
  • Real systems on the CAP map — Dynamo, Cassandra, Spanner, Postgres
  • Practical guidance — pick consistency at the operation level, not the system level

Pre-read (skim before the session)

Deep dive

1. CAP — what it actually says

Brewer's CAP theorem: in the presence of a network partition, you must choose between consistency (every read sees the latest write) and availability (every request gets a non-error response).

What people get wrong:

  • CAP is not a system-wide property. It applies per operation, per partition event.
  • "AP" doesn’t mean inconsistent forever; it means you'll eventually converge.
  • "CP" doesn’t mean infinitely consistent; it means you reject writes when you can't guarantee it.

During normal operation (no partition), modern systems are both consistent AND available. CAP only fires during a partition.

2. PACELC — the 'else' clause

Abadi 2012 extended CAP:

If there is a Partition, trade Availability vs Consistency. Else (during normal operation), trade Latency vs Consistency.

This is closer to how engineers actually feel the trade-off. Linearisable systems pay latency for every write (cross-region round-trips, consensus rounds); eventually-consistent systems are fast all the time but converge later.

  Partition?       Choose
  ----------       ------
     yes    ———▶   A  or  C
      no    ———▶   L  or  C

3. Quorums

A quorum is the minimum number of replicas that must respond for a read or write. For N replicas, common settings:

  • W = writes acked by W replicas
  • R = reads gather R replicas
  • Strong consistency if W + R > N (writes and reads always overlap).

Examples:

  • N=3, W=3, R=1 — strong, slow writes, fast reads. Postgres replication w/ sync replica.
  • N=3, W=2, R=2 — strong, balanced. Cassandra QUORUM.
  • N=3, W=1, R=1 — fast, eventually consistent. Cassandra ONE.

4. Raft in one diagram

Raft is the most popular consensus algorithm in production (etcd, Consul, TiKV, CockroachDB, KRaft, MongoDB internals).

All nodes hold a replicated log.
One leader at a time.

  Client → Leader: "append entry X"
  Leader → Followers: AppendEntries(X)
  Followers respond OK
  Once majority OK, leader commits, applies to state machine,
  responds to client.

Elections:
  Followers heartbeat-timeout → candidate → request votes →
  if majority votes, becomes leader.

Properties:

  • Safety — at most one leader per term; committed entries never lost.
  • Liveness — makes progress as long as a majority is up and can talk.
  • Linearisability — the log defines a single global order of operations.

Failure handling: leader dies → election (~150–300 ms typical) → new leader; clients retry.

5. Real systems on the CAP/PACELC map

SystemDefault consistency modelNotes
Postgres (single)Strong + ACIDNot distributed in this mode
Postgres + sync replicaStrong (CP, sync writes)Pays latency to one replica
CassandraTunable (per-op consistency level)AP at QUORUM with N=3, R=1, W=1 — eventual
DynamoDBEventual by default; consistent reads opt-inAP/CP per-op
MongoDB (replica set)Linearisable reads if you opt inDefaults to majority reads
etcd / Consul / ZookeeperStrong, linearisable (Raft / Zab)CP
Spanner / CockroachDBStrong, externally consistent (TrueTime / HLC + Raft)CP, low-latency multi-region
KafkaPer-partition ordered + replicatedStrong within a partition; configurable on losses

The one big shift in the last decade: Spanner / CockroachDB make CP across regions practical thanks to synchronised clocks and Raft. You no longer have to give up consistency to scale geographically — you just pay a few extra ms.

6. Choosing consistency per operation, not per system

A mature system picks consistency per call:

  • Payments capture → linearisable (consensus).
  • User profile read → strong leader-read.
  • Friend count → eventual (off by 1 doesn’t matter).
  • Notifications fanout → eventual + idempotent.
  • Ledger writes → strong + transactional.
  • Activity feed → eventual + bounded staleness.

DynamoDB, Cosmos, Cassandra all expose per-call consistency knobs precisely because one number for the whole system is wrong.

7. Sequence — a CP write with quorum

client ── write X ──▶ leader
                     leader ──── X ──▶ follower 1
                     leader ──── X ──▶ follower 2
                     leader ──── X ──▶ follower 3 (slow / partitioned)
                     leader waits for majority (2 of 3 followers) + self = 3 of 4
                     leader commits, replies OK

If two followers can't be reached: the leader stalls writes (CP). Cassandra at LOCAL_QUORUM would still serve if local DC quorum is up — same trade made differently.

8. Failures you should be ready to talk about

  • Split brain — two nodes both think they're leader. Cured by Raft’s majority + term number.
  • Stale reads after failover — if you read from a follower right after a leader change. Use read-index / leader-lease.
  • Network partition — minority side stops serving writes (CP) or keeps serving and reconciles later (AP).
  • Clock skew — breaks pseudo-orderings; Spanner uses TrueTime, CockroachDB uses hybrid logical clocks.
  • Slow replica — raises tail latency in W=all setups. Mitigate with majority quorum + read-repair.

9. Hands-on (30 min)

Spin up a 3-node etcd cluster in Docker and play with it:

docker network create etcd
for i in 1 2 3; do
  docker run -d --network etcd --name etcd$i quay.io/coreos/etcd:v3.5 \
    /usr/local/bin/etcd \
    --name etcd$i \
    --initial-cluster etcd1=http://etcd1:2380,etcd2=http://etcd2:2380,etcd3=http://etcd3:2380 \
    --initial-advertise-peer-urls http://etcd$i:2380 \
    --listen-peer-urls http://0.0.0.0:2380 \
    --advertise-client-urls http://etcd$i:2379 \
    --listen-client-urls http://0.0.0.0:2379
done

Now:

  • etcdctl put /k v and read from any node — linearisable.
  • docker pause etcd1 (kill the leader) and watch failover; writes resume in <1 s.
  • Pause 2 of 3; writes stall (CP).

Feel the trade-off in real time. It’ll change how you talk about it in interviews.

10. What's next

This session is foundational — every later SYS session (chat, news feed, search, distributed job queue) will lean on these terms. From here you can read papers like Raft, Paxos Made Simple, Spanner, and Dynamo without bouncing off the abstract.

Reading material

Books:

  • Designing Data-Intensive Applications — Martin Kleppmann (ch. 5: Replication, ch. 8: The Trouble with Distributed Systems, ch. 9: Consistency and Consensus — the entire core of this session)
  • Database Internals — Alex Petrov (Part II on distributed systems)
  • Distributed Systems: Principles and Paradigms — Tanenbaum, Van Steen (the textbook)

Papers:

Official docs:

Blog posts:

In-depth research material

Videos

LeetCode — Verifying An Alien Dictionary

Post-session checklist

By the end of this session you should be able to:

  • State CAP precisely — including the partition precondition.
  • Explain why PACELC is a more useful framing day-to-day.
  • Compute quorum requirements for given N, R, W.
  • Draw the Raft leader-election + AppendEntries flow.
  • Place 5 real systems on the CAP/PACELC map.
  • Solve verifying-an-alien-dictionary — partial-order reasoning, same shape as CRDT merge ordering.

Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.