Search Tech Journey

Find topics, journeys and posts

back to blog
system designadvanced 12m2026-06-07

Day 09 — CAP, PACELC, Consensus — Raft, Quorums, and Realistic Trade-offs

Every distributed system has to make a CAP-style call. Understanding Raft / Paxos and quorum reads/writes is what lets engineers reason precisely instead of wav…

CAP gets all the airtime but is the least useful version of the trade-off. PACELC is the one you actually need at design time: if Partition then Availability vs Consistency, Else Latency vs Consistency. Most production systems live in the "Else" branch 99.99% of the time.

🧠 Concept

Why it matters & the mental model.

1. Definitions, precisely

  • Consistency (CAP-C, linearizability): every read sees the most recent write — as if there were a single copy.
  • Availability: every request gets a non-error response (no waiting for a quorum that's gone).
  • Partition tolerance: system keeps functioning despite network splits. You always have P in practice → so under partition you must pick A or C.

2. PACELC examples

The choice between consistency and availability shows up in real databases as concrete defaults.

3. Raft — consensus you can actually understand

Raft elects a single leader; all writes go through it. Leader appends to its log and replicates to followers; an entry is committed once a majority have stored it.

Key invariants:

  • Election safety: at most one leader per term.
  • Log matching: if two logs share an entry at the same index+term, all prior entries match.
  • State machine safety: once a log entry is applied, no other entry will ever be applied at that index.

4. Leader election

Each node has a term (monotonic). On timeout (~150-300 ms randomised), a follower becomes candidate, increments term, requests votes. A node votes once per term, only if the candidate's log is at least as up-to-date. Candidate with majority votes → leader. Randomised timeouts kill split votes.

🛠 Deep Dive

Internals, code, architecture.

5. Quorums and the magic of N/2+1

Why majority? Two quorums of size > N/2 must intersect → at least one node has the latest committed write. R+W>N (Dynamo-style) is the generalisation: pick R, W so any read and write quorum overlap.

  • R=W=N/2+1: linearisable, balanced.
  • W=N, R=1: fast reads, fragile writes.
  • W=1, R=N: opposite.

6. Paxos, Multi-Paxos, EPaxos, Zab

  • Basic Paxos: prove one value chosen; correct but legendarily hard to read.
  • Multi-Paxos: stream of values, with a stable leader (essentially Raft minus pedagogy).
  • EPaxos: leaderless, commutative writes commit in 1 RTT — used in research / CockroachDB-like systems.
  • Zab: ZooKeeper's protocol, very similar to Raft.

7. Snapshots & log compaction

Logs grow forever. Periodically take a state-machine snapshot, truncate log behind it, ship snapshot to lagging followers. Trade-off: snapshot cost vs replay cost on restart.

8. Read paths

Three options with different cost/consistency:

  • Leader read with leadership lease: linearisable, no quorum round trip per read.
  • Read index: leader asks majority "am I still leader?" then serves; linearisable, one RTT.
  • Follower read with bounded staleness: cheaper, returns "x as of timestamp T".

🚀 In Practice

Trade-offs, exercises, what to ship today.

9. Real-world wrinkles

  • Stop-the-world GC can make a leader miss heartbeats → spurious election. Tune GC or use languages without it for critical paths.
  • Network blips (BGP flaps, k8s overlay) trigger elections; watch leader_changes_total.
  • Cross-region majority quorums add inter-region RTT to every write — sometimes you accept it (Spanner), sometimes you partition by region (single-region writes, async replication).

10. Beyond consensus — when you don't need it

Not every write needs linearisability. CRDTs (counters, sets, maps with merge-commutative ops) give eventual consistency without consensus, used in Redis Enterprise, Riak, collaborative editors (Figma, Notion). Pick CRDTs when offline + multi-writer is required.

11. Picking C or A — decision rubric

  • Money / inventory / auth tokens → C (Spanner, etcd, RDBMS with sync replication).
  • Likes / view counts / session tokens → A (Dynamo / Cassandra with eventual reads).
  • Mixed → split your data model; small consistent core + large eventually-consistent periphery.

12. What to take away

"How do you handle a split brain?" Strong answers name the protocol (Raft/Paxos), explain quorum intersection, mention leases for fast reads, and acknowledge GC / clock skew as real-world traps. Bonus: differentiate CAP from PACELC.

Key points

    Resources

    Practice Problem: Number of Islands (Medium)