Search Tech Journey

Find topics, journeys and posts

back to blog
system designadvanced 12m2026-06-02

Day 04 — Designing a URL Shortener at Scale — IDs, Storage, Cache, CDN

The classic warm-up for every system design loop. It exercises ID generation, key-value modelling, caching, hot-key handling, analytics and CDN edge — all trans…

TinyURL looks trivial until you put numbers on it. Let's design for 100M new URLs/day, 10B reads/day, 5-year retention and see how each layer falls out.

🧠 Concept

Why it matters & the mental model.

1. Back-of-envelope

  • Writes: 100M/day ÷ 86400 ≈ 1.2k QPS, peak 5×: 6k QPS.
  • Reads: 10B/day ÷ 86400 ≈ 115k QPS, peak 5×: 600k QPS. Read:write ≈ 100:1.
  • Storage: 100M × 365 × 5 ≈ 180B records. Each record ≈ 500B (short, long, owner, created_at, expires_at, click_count) → 90 TB. Add 2-3× replication.

2. ID / short code scheme

Three real options:

  1. Hash(long_url)[:7] in base62: deterministic, idempotent (same long → same short), but collisions need check-and-retry against an index.
  2. Auto-increment counter + base62: clean (no collisions), but a single counter is a bottleneck. Mitigate with batch allocation (each app server reserves 10k IDs at once) or per-shard counters.
  3. Snowflake-style 64-bit IDs (timestamp | machine_id | seq): coordination-free, sortable by time, 64 bits → 11 chars base62 (too long). Truncate to 48 bits → 8 chars.

For 100M/day I'd pick counter with batched allocation: it's the simplest, gives the shortest codes (6-7 chars), and the bottleneck is solved with batching. Hashing is nice for idempotency but you'd still need a uniqueness check on collision and that's a round trip.

3. Storage

Access pattern is point-lookup by short code. This is the textbook KV use-case:

  • DynamoDB / Cassandra / Bigtable: hash-partition by short_code → O(1) lookup, horizontal scale, fine for 90 TB.
  • Postgres: works up to a point, but at 600k read QPS one will need read replicas + Citus-style sharding. Operationally heavier.

Schema:

short_code (pk) | long_url | owner_id | created_at | expires_at | meta

🛠 Deep Dive

Internals, code, architecture.

4. Cache layer — the workhorse

600k read QPS hitting a database directly is painful and expensive. With a 90/10 Zipf distribution, a Redis cluster in front gives ~95% hit rate.

  • TTL: 24h, with stale-while-revalidate.
  • Hot key problem: a viral link can hit 50k QPS on one Redis shard. Mitigate with client-side caching (HTTP Cache-Control: public, max-age=86400) plus CDN at the edge doing the 302.

5. Write path

  1. Validate URL (length, scheme, blocklist).
  2. Idempotency: if hash(long_url, owner) already exists for this owner, return existing.
  3. Get next short_code from local batch (refill from counter shard via INCRBY 10000).
  4. Insert into KV store (conditional put, in case of race).
  5. Async publish to Kafka for analytics.

6. Analytics — never block the redirect

On every resolve, fire an async event to Kafka (\{short_code, ts, ua, ip→geo, referer\}). A Flink / Spark Structured Streaming job aggregates per-minute counts into a time-series store (ClickHouse / Druid) for dashboards. Never write to the analytics DB on the hot path — it'll add 50ms and cost you 50% of capacity.

🚀 In Practice

Trade-offs, exercises, what to ship today.

7. Abuse, security, privacy

  • Phishing / malware: integrate with Google Safe Browsing on write; periodic re-scan.
  • PII in URLs: hash for analytics, drop query strings with tokens.
  • Rate limiting: per-IP and per-API-key on writes (e.g. 100/min).
  • Custom aliases: extra uniqueness check, ban reserved words.

8. Multi-region

Reads are dominantly edge-cached, so a single write region with global read replicas works to ~1B URLs. Beyond that, partition by short_code prefix across regions and route at the edge. Conflict-free because short_codes are unique by construction.

9. Failure modes

  • Counter shard down → batched IDs in app caches keep writes alive for minutes; alert and re-elect.
  • Redis cluster split → DB absorbs load briefly; circuit-breaker to degrade gracefully (return cached 302 with stale flag).
  • CDN poisoning → sign 302 responses with short HMAC.

10. What to take away

Order matters: numbers → API → data model → write path → read path → cache/CDN → analytics → failure → trade-offs. State assumptions before each decision. Pick one option, defend it, then mention the alternative.

Key points

    Resources

    Practice Problem: Encode and Decode TinyURL (Medium)