system designadvanced 12m2026-06-02

Day 04 — Designing a URL Shortener at Scale — IDs, Storage, Cache, CDN

The classic warm-up for every system design loop. It exercises ID generation, key-value modelling, caching, hot-key handling, analytics and CDN edge — all trans…

TinyURL looks trivial until you put numbers on it. Let's design for 100M new URLs/day, 10B reads/day, 5-year retention and see how each layer falls out.

1. Back-of-envelope

Writes: 100M/day ÷ 86400 ≈ 1.2k QPS, peak 5×: 6k QPS.
Reads: 10B/day ÷ 86400 ≈ 115k QPS, peak 5×: 600k QPS. Read:write ≈ 100:1.
Storage: 100M × 365 × 5 ≈ 180B records. Each record ≈ 500B (short, long, owner, created_at, expires_at, click_count) → 90 TB. Add 2-3× replication.

2. ID / short code scheme

Three real options:

Hash(long_url)[:7] in base62: deterministic, idempotent (same long → same short), but collisions need check-and-retry against an index.
Auto-increment counter + base62: clean (no collisions), but a single counter is a bottleneck. Mitigate with batch allocation (each app server reserves 10k IDs at once) or per-shard counters.
Snowflake-style 64-bit IDs (timestamp | machine_id | seq): coordination-free, sortable by time, 64 bits → 11 chars base62 (too long). Truncate to 48 bits → 8 chars.

For 100M/day I'd pick counter with batched allocation: it's the simplest, gives the shortest codes (6-7 chars), and the bottleneck is solved with batching. Hashing is nice for idempotency but you'd still need a uniqueness check on collision and that's a round trip.

3. Storage

Access pattern is point-lookup by short code. This is the textbook KV use-case:

DynamoDB / Cassandra / Bigtable: hash-partition by short_code → O(1) lookup, horizontal scale, fine for 90 TB.
Postgres: works up to a point, but at 600k read QPS one will need read replicas + Citus-style sharding. Operationally heavier.

Schema:

short_code (pk) | long_url | owner_id | created_at | expires_at | meta

🛠 Deep Dive

Internals, code, architecture.

4. Cache layer — the workhorse

600k read QPS hitting a database directly is painful and expensive. With a 90/10 Zipf distribution, a Redis cluster in front gives ~95% hit rate.

TTL: 24h, with stale-while-revalidate.
Hot key problem: a viral link can hit 50k QPS on one Redis shard. Mitigate with client-side caching (HTTP Cache-Control: public, max-age=86400) plus CDN at the edge doing the 302.

5. Write path

Validate URL (length, scheme, blocklist).
Idempotency: if hash(long_url, owner) already exists for this owner, return existing.
Get next short_code from local batch (refill from counter shard via INCRBY 10000).
Insert into KV store (conditional put, in case of race).
Async publish to Kafka for analytics.

6. Analytics — never block the redirect

On every resolve, fire an async event to Kafka (\{short_code, ts, ua, ip→geo, referer\}). A Flink / Spark Structured Streaming job aggregates per-minute counts into a time-series store (ClickHouse / Druid) for dashboards. Never write to the analytics DB on the hot path — it'll add 50ms and cost you 50% of capacity.

🚀 In Practice

Trade-offs, exercises, what to ship today.

7. Abuse, security, privacy

Phishing / malware: integrate with Google Safe Browsing on write; periodic re-scan.
PII in URLs: hash for analytics, drop query strings with tokens.
Rate limiting: per-IP and per-API-key on writes (e.g. 100/min).
Custom aliases: extra uniqueness check, ban reserved words.

8. Multi-region

Reads are dominantly edge-cached, so a single write region with global read replicas works to ~1B URLs. Beyond that, partition by short_code prefix across regions and route at the edge. Conflict-free because short_codes are unique by construction.

9. Failure modes

Counter shard down → batched IDs in app caches keep writes alive for minutes; alert and re-elect.
Redis cluster split → DB absorbs load briefly; circuit-breaker to degrade gracefully (return cached 302 with stale flag).
CDN poisoning → sign 302 responses with short HMAC.

10. What to take away

Order matters: numbers → API → data model → write path → read path → cache/CDN → analytics → failure → trade-offs. State assumptions before each decision. Pick one option, defend it, then mention the alternative.

Key points

Resources

Practice Problem: Encode and Decode TinyURL (Medium)

← previous

Day 03 — Gradient Boosted Trees — XGBoost / LightGBM, Loss, Regularisation

Day 05 — SOLID Principles + Strategy / Factory / Observer Patterns in Python