URL Shortener Part 1 — Numbers, IDs, Storage
Session 3 of the 48-session learning series.
Why this session matters
This is Session 03 of 48 in the System Design track. It builds on the rhythm of one focused topic, paced so you have time to actually absorb it rather than rush.
Agenda
- Back-of-envelope at 100M writes/day, 10B reads/day
- Three ID schemes — hash, counter+base62, Snowflake — trade-offs
- Choosing the storage engine — KV store vs SQL vs NoSQL
- Schema design and idempotency on the write path
- Why the cache + CDN discussion is its own session (Part 2)
Pre-read (skim before the session)
- ByteByteGo — URL Shortener architecture
- Designing Data-Intensive Applications, Ch. 6 (Partitioning)
- HighScalability — Bit.ly architecture notes
- Snowflake ID — Twitter blog (mirrored)
Deep dive
1. The shape of the problem
TinyURL looks trivial until you put numbers on it. Design for 100M new URLs/day, 10B reads/day, 5-year retention and let the layers fall out.
2. Back-of-envelope (always do this first)
- Writes: 100M / 86,400 s ≈ 1.2k QPS average, 6k QPS at 5× peak.
- Reads: 10B / 86,400 s ≈ 115k QPS average, 600k QPS at peak. R:W ≈ 100:1.
- Storage: 100M × 365 × 5 ≈ 180 B records. Each record ≈ 500 B (short, long, owner, created_at, expires_at, click_count) → 90 TB. Plus 2–3× replication.
State these numbers out loud before you talk about anything else. They drive every subsequent decision.
3. ID / short-code scheme — three real options
Option A — Hash(long_url)[:7] in base62
- ✅ Deterministic, idempotent (same long → same short).
- ❌ Collisions need a check-and-retry against an index. Extra round trip.
- ❌ Doesn't help if two users want different shorts for the same URL.
Option B — Auto-increment counter + base62
- ✅ No collisions ever. Shortest codes (6–7 chars for 100B records: log₆₂(1e11) ≈ 6.1).
- ❌ Single counter = single point of contention.
- ✅ Mitigate with batch allocation: each app server reserves 10k IDs at once via
INCRBY counter 10000on Redis. Local cache, lock-free use, refill when low.
Option C — Snowflake (64-bit)
- 41 bits timestamp | 10 bits machine_id | 12 bits sequence → coordination-free, sortable by time.
- Base62 of 64 bits = 11 chars (too long). Truncate to 48 bits → 8 chars.
- ✅ No coordination required across machines.
- ❌ Codes leak creation time and machine_id (mild info leak).
Pick: For 100M/day I'd choose counter with batched allocation (Option B). Simplest, shortest codes, contention solved by batching. State the alternative and why you rejected it.
4. Storage — the access pattern is point-lookup by short code
This is textbook KV: O(1) lookup by single key, no scans.
| Choice | Fit |
|---|---|
| DynamoDB / Cassandra | Hash-partitioned by short_code. Horizontal scale. 90 TB is fine. |
| Bigtable | Same model, GCP-flavoured. |
| Postgres (sharded) | Works to a point. At 600k read QPS you need read replicas + Citus. Operationally heavier. |
| Redis (sole store) | Too expensive at 90 TB and not durable enough for source of truth. |
Pick: DynamoDB (or Cassandra if you self-host). Hash-partition by short_code.
Schema:
short_code STRING PK
long_url STRING
owner_id STRING
created_at TIMESTAMP
expires_at TIMESTAMP
meta MAP<STRING,STRING> -- UTM, custom tags
Add a secondary index on (owner_id, created_at) only if you need owner-scoped listings — they're not on the hot path.
5. Write path
1. Validate URL (length ≤ 2048, scheme http/https, blocklist).
2. Idempotency check: if hash(long_url, owner) already exists → return it.
3. Pop next short_code from local batch (refill via INCRBY 10000 on Redis).
4. Conditional put to DynamoDB (so we catch the very rare race).
5. Async publish to Kafka — analytics consumer aggregates later.
6. Return 201 with short URL.
Two failure modes to think about now:
- Counter shard dies. Batched IDs in app caches keep writes alive for minutes. Alert, re-elect.
- Conditional put fails. Hash collision (rare with 64-bit codes). Increment local counter and retry.
6. Owner / multi-tenant data model
If you want custom aliases (https://tinyurl.com/dinesh-blog) you need:
- Uniqueness across the global namespace.
- A reserved-words blocklist (
admin,api,login, etc.). - Per-owner quota tracking (count + storage).
- A second logical table
aliases (alias_str PK, short_code, owner_id)so resolution still goes through the same KV path.
7. Putting it on the map
┌───────┐ ┌───────────┐
│ User │────▶│ Load bal. │ HTTPS
└───────┘ └────┬──────┘
│
┌──────▼──────┐
│ Shorten │ POST /api/shorten
│ service │ (validates, picks id)
└──────┬──────┘
│ put
┌──────▼──────┐ INCRBY 10000
│ DynamoDB │◀───── Redis counter shard
└──────┬──────┘
│ async event
┌──────▼──────┐
│ Kafka │ → analytics consumer
└─────────────┘
Read path lives in Part 2 (Session 8) — that's where cache, CDN, hot-key handling, and stale-while-revalidate go. Putting both in one session is what made the 28-day plan feel rushed.
8. Numbers you should be able to defend
- Why 6–7 char codes? log₆₂(180B / collision-safety-factor) ≈ 6.5 — pick 7 for headroom.
- Why DynamoDB over Postgres? At 600k read QPS, sharding Postgres is a job in itself. DynamoDB charges by request and storage — predictable.
- Why batched IDs? At 6k peak write QPS with a 10k batch, you refill once every ≈ 1.7 s per app server. Counter QPS =
app_servers / 1.7. - Why conditional put? Race window is microseconds, but at 100M/day even microsecond races happen daily.
9. What we're saving for Part 2 (Session 8)
- Cache layer (Redis cluster), 95% hit-rate math at Zipf 0.9
- CDN edge — doing the 302 at the edge for read-heavy keys
- Hot-key problem — viral short codes
- Abuse / safe-browsing integration
- Analytics — Kafka → Flink → ClickHouse
- Multi-region trade-offs
Reading material
Books:
- System Design Interview Vol. 1 — Alex Xu (ch. 8: Design a URL Shortener)
- Designing Data-Intensive Applications — Martin Kleppmann (chs. 2, 3, 6 for storage & indexing)
- Database Internals — Alex Petrov (ch. on B-trees vs LSM trees — why Postgres vs Cassandra for the mapping table)
Papers:
- Dynamo: Amazon's Highly Available Key-value Store (SOSP 2007) — the lookup pattern at the heart of any KV-backed shortener.
- Bigtable: A Distributed Storage System for Structured Data (OSDI 2006) — wide-column design alternative.
Official docs:
- PostgreSQL —
SERIAL&BIGSERIALreference - Snowflake ID generator (Twitter) — README
- Redis — INCR / INCRBY for counter-based IDs
Blog posts:
- How Bitly works (engineering) — bitly's own writeups.
- Hashids: short, unique, non-sequential IDs — the encoding scheme used by half the industry.
- Designing a URL Shortener — Code Karle — clean writeup with capacity math.
In-depth research material
- Twitter Snowflake — github.com/twitter-archive/snowflake — the 64-bit ID generator everyone copies.
- Sony's sonyflake — github.com/sony/sonyflake — Snowflake-style with longer machine lifetime.
- Instagram Engineering — Sharding & IDs — how Instagram generates IDs across shards.
- Discord — How Discord Stores Billions of Messages — same KV-lookup pattern at scale.
- system-design-primer — github.com/donnemartin/system-design-primer — ~270k ★, includes URL shortener exercises.
- Stripe Engineering — Designing APIs for humans (idempotency keys) — relevant for write idempotency on POST /shorten.
Videos
- Beginner System Design Interview: Design Bitly w/ a Ex-Meta Staff Engineer — Hello Interview · 59 min — the most thorough modern walkthrough; covers numbers, capacity, ID schemes, KV vs SQL.
- Design a URL Shortener (Bitly) — System Design Interview — NeetCodeIO · 48 min — clean, interview-paced, with whiteboard diagrams.
- Tiny URL — System Design Interview Question — TechPrep · 9 min — quick capacity-math refresher; useful as a primer.
- How Does a URL Shortener Work? — ByteByteGo · 6 min — animated overview from Alex Xu's channel.
- TinyURL System Design — codeKarle — codeKarle · 24 min — different framing (capacity → API → DB schema → cache); a useful second perspective.
LeetCode — Encode And Decode Tinyurl
- Link: https://leetcode.com/problems/encode-and-decode-tinyurl/
- Difficulty: Medium
- Why this problem: Counter + base62 keeps codes short; hash-based adds idempotency at cost of collisions.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- State the back-of-envelope numbers (QPS write, QPS read, storage) for the spec.
- Pick an ID scheme and defend the choice; name what you sacrificed.
- Defend the choice of KV store over RDBMS at this scale.
- Sketch the write path including idempotency and async analytics.
- Solve the LeetCode
encode-and-decode-tinyurlproblem using your chosen scheme.
Generated from sessions_data.py + content_part*.py. To edit a video / leetcode / title, edit the data file and re-run write_sessions.py.