Search Tech Journey

Find topics, journeys and posts

back to blog
system designadvanced 12m2026-06-12

Day 14 — Sharding, Replication & Multi-Region Databases

The moment one database can't hold your data, you shard. The moment one region can't serve your users, you go multi-region. Both decisions cascade into every ot…

Sharding distributes data across nodes by partition key. Replication keeps copies across nodes/regions. Together they buy you scale and availability — but every choice locks in trade-offs that are painful to reverse.

🧠 Concept

Why it matters & the mental model.

1. Sharding strategies

  • Range: partition by sorted key range (e.g. user_id 0-999, 1000-1999). Great for range scans, hot spots if data is skewed.
  • Hash: partition by hash(key) % N. Even distribution, but range scans become scatter-gather.
  • Directory / consistent hashing: hash to a virtual node, virtual nodes mapped to physical nodes by a lookup table → can rebalance without rehashing the world (Dynamo, Cassandra, Riak).
  • Geographic: shard by region (data lives near users) → low latency, complex cross-region queries.

2. The shard key — your single biggest decision

A good shard key:

  • High cardinality (millions+ of distinct values).
  • Even distribution of access (no celebrity keys).
  • Common in queries (so most queries hit one shard).
  • Stable (doesn't change for a row, else moves between shards).

User_id is great for B2C. For B2B, tenant_id often beats user_id because cross-user queries within a tenant are common.

3. Replication topologies

  • Single leader: simple, linearisable writes via leader. Most RDBMS, MongoDB.
  • Multi-leader: writes anywhere, async replication, conflicts to resolve. Used in multi-region setups when sync replication latency is unacceptable.
  • Leaderless (Dynamo, Cassandra, Riak): any replica accepts writes, R+W>N quorums, anti-entropy + read repair to converge.

🛠 Deep Dive

Internals, code, architecture.

4. Sync vs async, region-by-region

  • Sync within region (sub-ms): fine — Spanner does this with Paxos in each region.
  • Sync across regions (50-200 ms RTT): every write pays that latency. Only acceptable for low-write workloads (configuration, leases).
  • Async cross-region: cheap but means RPO > 0 (you can lose recent writes if the region dies).

5. Multi-region patterns

  • Active-passive: one write region, others read-only / standby. Simple, failover is a deliberate event.
  • Active-active region-pinned: each user's data has a "home region" (sharded by region). Writes always local. Cross-region writes need 2-phase coordination.
  • Active-active globally consistent (Spanner, CockroachDB, FaunaDB): synchronous global writes via Paxos + TrueTime / HLC. Expensive, beautiful.

6. Resharding — the dragon

Adding shards when you outgrow N is painful with hash-mod sharding (almost every key moves). Mitigations:

  • Consistent hashing + virtual nodes: only 1/N keys move.
  • Logical → physical indirection: many logical shards mapped to few physical; split a physical when hot.
  • Vitess: online split with double-write + cutover.

🚀 In Practice

Trade-offs, exercises, what to ship today.

7. Cross-shard queries

The dirty secret: most apps eventually need joins across shards. Options:

  • Denormalise at write time (Cassandra style).
  • Scatter-gather from app: slow, but explicit.
  • Fan-out via async pipeline (CDC → search index) for analytic queries.
  • Distributed SQL (CockroachDB, Yugabyte, TiDB): join across shards transparently with cost.

8. Consistency knobs per query

Per-query consistency is a feature: SELECT … READ COMMITTED vs SELECT … FOR UPDATE, Cassandra consistency_level=ONE/QUORUM/ALL, Dynamo ConsistentRead=True. Use linearisable reads only where required (balance, auth); use cheaper reads everywhere else.

9. Disaster recovery

Define RPO (data loss budget, e.g. 30 s) and RTO (downtime budget, e.g. 5 min). They drive replication mode and rehearsal frequency. A DR plan you've never tested doesn't work.

10. What to take away

Always state shard key, replication mode, write region routing, and consistency per use case. The senior tell: name a real failure mode ("if region A is split-brained from B, leader election may make A think it's still leader for 30s — we accept that RPO").

Key points

    Resources

    Practice Problem: Design Twitter (Medium)