Designing a Distributed Job Queue — Reliability, Backoff, Idempotency
Session 39 of the 48-session learning series.
Date: Fri, 2026-07-10 · Time: 18:00–20:00 IST · Track: 🏗️ System Design (SYS) · Parent 28-day topic: Day 07 · Est. read: 2 h
Why this session matters
This is Session 39 of 48 in the System Design track. Every backend eventually needs "do this work later, do it reliably, don't lose it". The patterns — visibility timeout, exponential backoff with jitter, dead letter queue, idempotency — are universal. Knowing them by name is the difference between architecting a queue and reinventing one badly.
Agenda
- The job queue problem — at-least-once vs exactly-once
- Visibility timeout, ack, nack, retry, DLQ — the lifecycle
- Backoff strategies — exponential, jittered, capped; the herd-prevention story
- Idempotency — request key, dedupe table, side-effect quarantine
- Scaling — sharding, partitioning, priority, fairness
Pre-read (skim before the session)
- AWS — Amazon SQS architecture
- Sidekiq — Best practices
- Resque & Sidekiq postmortems — Mike Perham blog
- Marc Brooker — Exponential backoff and jitter
Deep dive
1. What a job queue must do
- Enqueue — accept job; durably persist before ack.
- Dispatch — hand the job to exactly one worker (most of the time).
- Track — keep state (queued, processing, succeeded, failed).
- Retry — re-dispatch failed jobs with backoff.
- Dead-letter — give up after N failures; preserve for diagnosis.
- Observe — depth, age, throughput, failure rate.
Sounds simple. The devil is in the delivery semantics.
2. Delivery semantics
- At-most-once — fire and forget. Lost on crash. Easy. Almost never what you want.
- At-least-once — guaranteed delivery; duplicates possible. The pragmatic default.
- Exactly-once — guaranteed delivery, no duplicates. Real cost: 2-phase commit, transactions across systems. Rarely worth it.
Most production queues are at-least-once, so you make the consumer idempotent (section 7).
3. The job lifecycle
[ enqueued ] → [ in-flight ] → [ succeeded ]
[ in-flight ] → (failure or timeout) → [ retry (delayed) ] → [ in-flight ]
[ failed N times ] → [ DLQ ]
Each transition is a state change in the queue's storage. Crashes mid-step are the interesting bugs.
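The lifecycle can be sketched as a small state machine, which makes the legal transitions explicit. This is only an illustration: the names `JobState`, `ALLOWED`, and `transition` are made up for this sketch and do not come from any particular queue.

```python
from enum import Enum


class JobState(Enum):
    ENQUEUED = "enqueued"
    IN_FLIGHT = "in-flight"
    RETRY = "retry"
    SUCCEEDED = "succeeded"
    DLQ = "dlq"


# Legal transitions, mirroring the lifecycle diagram.
# A failed attempt goes to RETRY, or to DLQ once attempts are exhausted.
ALLOWED = {
    JobState.ENQUEUED: {JobState.IN_FLIGHT},
    JobState.IN_FLIGHT: {JobState.SUCCEEDED, JobState.RETRY, JobState.DLQ},
    JobState.RETRY: {JobState.IN_FLIGHT},
    JobState.SUCCEEDED: set(),
    JobState.DLQ: set(),
}


def transition(current: JobState, new: JobState) -> JobState:
    """Return the new state, or raise if the move is not in the diagram."""
    if new not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {new.value}")
    return new
```

In a real queue each call to `transition` would be a durable write, which is exactly where a crash mid-step leaves a job in an ambiguous state.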
4. Visibility timeout
When a worker picks up a job, the queue hides it from other workers for a period X (the visibility timeout). If the worker acks within X, the queue deletes the job. If not (crash, slow worker), the job becomes visible again and another worker picks it up.
Pick X ≈ 2× the worker's worst-case execution time. Too short and slow jobs get executed twice; too long and a crashed worker's job waits a long time before redelivery.
Workers can extend their lease (heartbeat) for long jobs. Treat this as mandatory for jobs that run longer than a few minutes.
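To make the mechanics concrete, here is a toy in-memory queue with a visibility timeout, ack, and lease extension. It is a minimal sketch under simplifying assumptions (single process, no persistence, caller-supplied clock); `VisibilityQueue` and its method names are hypothetical and not the API of SQS or any other product.

```python
import time
from dataclasses import dataclass, field


@dataclass
class VisibilityQueue:
    """Toy in-memory queue; a real one would persist every state change."""
    timeout: float
    _jobs: dict = field(default_factory=dict)             # job_id -> payload
    _invisible_until: dict = field(default_factory=dict)  # job_id -> deadline

    def enqueue(self, job_id, payload):
        self._jobs[job_id] = payload
        self._invisible_until[job_id] = 0.0  # visible immediately

    def receive(self, now=None):
        """Return (job_id, payload) for one visible job and hide it, or None."""
        now = time.monotonic() if now is None else now
        for job_id, deadline in self._invisible_until.items():
            if deadline <= now:
                self._invisible_until[job_id] = now + self.timeout
                return job_id, self._jobs[job_id]
        return None

    def extend(self, job_id, now=None):
        """Heartbeat: push the deadline out by another full timeout."""
        now = time.monotonic() if now is None else now
        if job_id in self._jobs:
            self._invisible_until[job_id] = now + self.timeout

    def ack(self, job_id):
        """Delete the job so it is never delivered again."""
        self._jobs.pop(job_id, None)
        self._invisible_until.pop(job_id, None)
```

A job that is received but never acked reappears once its deadline passes, which is the at-least-once behaviour described above; calling `extend` periodically is the heartbeat that keeps a long job from being redelivered.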
5. Retries and backoff
Naive retry: re-enqueue immediately. This is bad: if the cause is external (DB down, downstream timeout), thousands of workers retry in a tight loop and hammer the dependency, which then never gets a chance to recover.
Exponential backoff:
delay = base * 2 ** attempt
Better — exponential backoff with jitter (Marc Brooker, AWS):
delay = random.uniform(0, base * 2 ** attempt)
Full jitter prevents synchronised retry storms. Always jitter.
Cap the delay (e.g. min(delay, 30min)). Cap the attempts (e.g. 5–10). Then DLQ.
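Putting the formula, the delay cap, and the attempt cap together, a minimal sketch might look like this. The defaults (1 s base, 30 min cap, 5 attempts) are illustrative assumptions taken from the ranges mentioned above, and the function names are hypothetical.

```python
import random


def backoff_delay(attempt, base=1.0, cap=1800.0, rng=random):
    """Full-jitter delay in seconds for a zero-based attempt number."""
    ceiling = min(cap, base * 2 ** attempt)  # capped exponential ceiling
    return rng.uniform(0, ceiling)           # full jitter: anywhere in [0, ceiling]


def should_dead_letter(attempt, max_attempts=5):
    """True once the job has used up its retry budget."""
    return attempt >= max_attempts
```

Passing `rng` explicitly (for example `random.Random(42)`) makes the delays reproducible in tests while production code keeps the module-level generator.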
6. Dead-letter queue (DLQ)
After max retries, move the job to a DLQ. Critical because:
- Removes poison messages that would otherwise loop forever.
- Preserves them for diagnosis.
- Lets you replay (after fix) without re-instrumenting.
DLQ hygiene:
- Alert when DLQ depth > 0.
- Tag each message with last-error.
- Have a replay tool.
- Periodically purge after archival.
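The retry-then-DLQ flow and a replay tool can be sketched in a few lines. This is an in-memory illustration only: `Job`, `Queues`, `handle`, and `replay` are hypothetical names, and a real queue would delay the requeue by the backoff rather than appending immediately.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    id: str
    payload: dict
    attempts: int = 0
    last_error: str = ""


@dataclass
class Queues:
    main: list = field(default_factory=list)
    dlq: list = field(default_factory=list)


def handle(job, handler, queues, max_attempts=5):
    """Run handler once; on failure requeue, or dead-letter after max_attempts."""
    try:
        handler(job.payload)
        return "succeeded"
    except Exception as exc:  # broad on purpose: any failure counts as an attempt
        job.attempts += 1
        job.last_error = repr(exc)  # tag the message for later diagnosis
        if job.attempts >= max_attempts:
            queues.dlq.append(job)
            return "dead-lettered"
        queues.main.append(job)  # a real queue would delay this by the backoff
        return "retry"


def replay(queues):
    """Move everything from the DLQ back to the main queue with a fresh budget."""
    moved = 0
    while queues.dlq:
        job = queues.dlq.pop(0)
        job.attempts = 0
        queues.main.append(job)
        moved += 1
    return moved
```

Keeping `last_error` on the job is what makes the DLQ useful for diagnosis: whoever inspects it sees why the final attempt failed without digging through logs.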
7. Idempotency — the consumer's job
Because at-least-once means one or more deliveries, your consumer must tolerate duplicates.
Pattern: request key + dedupe table.
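from datetime import timedelta

def process(job):
    # db, dedupe_table, and do_the_work are placeholders for your own storage and handler.
    if dedupe_table.exists(job.id):
        return  # already done
    with db.transaction():  # assumed to commit on exit and roll back on exception
        do_the_work(job)
        # A unique constraint on job.id makes a concurrent duplicate fail here.
        dedupe_table.insert(job.id, ttl=timedelta(hours=24))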
The dedupe insert + work are in one transaction. Either both happen or neither.
External side effects (sending an email, calling Stripe):
- Use the API's idempotency key feature (Stripe Idempotency-Key header).
- Or persist "side effect attempted" before the call; on retry, check if it succeeded.
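The request key + dedupe table pattern can be tried end to end with SQLite from the standard library. This is a sketch under assumptions: the `processed` and `ledger` tables are invented for illustration, the "work" is a single ledger insert, and TTL-based cleanup of old dedupe rows is left out.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed (job_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE ledger (job_id TEXT, amount INTEGER)")
    conn.commit()
    return conn


def process(conn, job_id, amount):
    """Apply the job at most once; return False for a duplicate delivery."""
    try:
        with conn:  # one transaction: commit on success, rollback on exception
            # The PRIMARY KEY makes a second insert of the same job_id fail.
            conn.execute("INSERT INTO processed (job_id) VALUES (?)", (job_id,))
            conn.execute(
                "INSERT INTO ledger (job_id, amount) VALUES (?, ?)",
                (job_id, amount),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Relying on the primary key rather than a separate existence check means two workers racing on the same job cannot both succeed: the database rejects the second insert and the whole transaction rolls back.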
8. Priorities and fairness
A single FIFO queue is rare in prod. Reality:
- High-priority lane (transactional emails) drains fast.
- Bulk lane (nightly recommendations recompute) drains as capacity allows.
- Per-tenant fairness — one customer can't starve others.
Implementations:
- Multiple queues with weighted workers.
- Per-tenant token bucket.
- Token bucket combined with weighted fair queueing.
Don't promise a per-queue SLA without metering it.
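As one illustration of per-tenant fairness, a token bucket per tenant limits how fast any single customer can have jobs dispatched. The sketch below is a simplified, single-process version; `TokenBucket`, `admit`, and the default rate and capacity are assumptions for the example.

```python
class TokenBucket:
    """Per-tenant budget: `rate` tokens per second, at most `capacity` stored."""

    def __init__(self, rate, capacity):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = 0.0

    def allow(self, now):
        """Spend one token if available; `now` is a timestamp in seconds."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


buckets = {}


def admit(tenant, now, rate=1.0, capacity=2):
    """Decide whether this tenant may dispatch another job right now."""
    bucket = buckets.setdefault(tenant, TokenBucket(rate, capacity))
    return bucket.allow(now)
```

A tenant that floods the queue exhausts only its own bucket, so jobs from other tenants keep flowing. Jobs that are not admitted would stay queued and be reconsidered later rather than dropped.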
9. Scaling the queue
Beyond a single Redis/SQS:
- Partition by job-key or tenant; workers per partition.
- Use Kafka topics with consumer groups; built-in partitioning + ordering per partition.
- Use SQS FIFO queues for ordering per group key.
Throughput scales roughly linearly with the number of partitions until you hit per-partition throughput limits.
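Partitioning by job key or tenant usually comes down to a stable hash of the key. A minimal sketch, assuming a fixed partition count and a hypothetical helper name `partition_for`:

```python
import hashlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a job key or tenant ID to a stable partition index."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

A cryptographic digest is used here instead of Python's built-in `hash()` because string hashes are randomised per process by default, so different workers would disagree on the partition. Note that changing `num_partitions` remaps most keys, which matters if ordering per key must be preserved.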
10. Common patterns
- Outbox (S40) — write the job to a DB outbox table in the same txn as the business write; a separate poller relays to the queue. Ensures atomicity DB↔queue.
- Saga — long multi-step transactional flow; each step is a job; failures trigger compensating jobs.
- Scheduled jobs — delayed delivery; persistent timer; many queues support natively (SQS up to 15min, Sidekiq scheduled set, Cloud Tasks).
- Workflow orchestrator — Temporal, Airflow, Step Functions — when you outgrow plain queues.
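The outbox pattern from the list above can be sketched with SQLite: the business row and the job record are written in one transaction, and a poller relays unsent rows. The table layout, the `send_receipt` job type, and the function names are assumptions made for this example.

```python
import json
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id TEXT PRIMARY KEY, total INTEGER)")
    conn.execute(
        "CREATE TABLE outbox (seq INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT, sent INTEGER DEFAULT 0)"
    )
    conn.commit()
    return conn


def place_order(conn, order_id, total):
    """Business write and job record commit together, or not at all."""
    with conn:
        conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, total))
        conn.execute(
            "INSERT INTO outbox (payload) VALUES (?)",
            (json.dumps({"type": "send_receipt", "order_id": order_id}),),
        )


def relay(conn, publish):
    """Poller: publish unsent rows in order, then mark them sent."""
    rows = conn.execute(
        "SELECT seq, payload FROM outbox WHERE sent = 0 ORDER BY seq"
    ).fetchall()
    for seq, payload in rows:
        publish(json.loads(payload))  # may run twice if we crash before the update
        with conn:
            conn.execute("UPDATE outbox SET sent = 1 WHERE seq = ?", (seq,))
    return len(rows)
```

Because `publish` runs before the row is marked sent, a crash between the two steps causes a duplicate publish on the next poll. The relay is therefore at-least-once, and the consumer still needs to be idempotent.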
11. Observability
- Queue depth per queue — backlog gauge.
- Oldest message age — leading indicator of falling behind.
- Process duration p50/p99 — performance.
- Failure rate — going up = something's broken.
- DLQ inflow — going up = poison messages.
Alert on age, not just depth. A backlog of 1M items where the oldest is 30s old is fine; 100 items where the oldest is 3 days old is broken.
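The "alert on age, not just depth" rule is easy to express in code. In this sketch, `QueueSnapshot` and `alerts_for` are hypothetical names, and the age threshold passed in is whatever your SLA says.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    depth: int           # backlog size; reported as a gauge, not alerted on here
    oldest_age_s: float  # age of the oldest message, in seconds
    dlq_depth: int


def alerts_for(snapshot, max_age_s):
    """Return the alerts to fire for one queue snapshot."""
    alerts = []
    if snapshot.oldest_age_s > max_age_s:
        alerts.append("oldest-message age over SLA")
    if snapshot.dlq_depth > 0:
        alerts.append("DLQ not empty")
    return alerts
```

With a 5-minute threshold, a snapshot of 1M items whose oldest message is 30 s old produces no alert, while 100 items with a 3-day-old message does.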
12. Tech choices in 2026
- SQS / Cloud Tasks / Azure Queue — managed, cheap, "good enough".
- Redis (Bull, BullMQ, Sidekiq) — fast, simple, manage yourself.
- Kafka — when ordering + replay matter more than queue semantics.
- RabbitMQ — old guard; routing flexibility (topic exchanges).
- Temporal / Cadence — when you need orchestration + state + retries baked in.
- Postgres LISTEN/NOTIFY + outbox — lightweight; great when you already have Postgres.
Start with managed; outgrow when you have justification.
13. Reality check
A new microservice's queue setup:
- Pick managed (SQS, Cloud Tasks).
- Set visibility timeout = 3× expected job duration.
- Set max retries = 5; DLQ on overflow.
- Implement consumer with dedupe table.
- Alert on DLQ depth > 0; alert on oldest-message age > SLA.
- Add structured logging with job ID; correlate.
You can ship reliable async work in an afternoon with this. Iterate from there.
Reading material
- Designing Data-Intensive Applications — messaging chapter
- SQS developer guide
- Marc Brooker — Exponential backoff and jitter
- Sidekiq best practices
In-depth research material
Video reference
▶︎ Designing a Job Queue (System Design Deep Dive)
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Design Task Scheduler
- Link: https://leetcode.com/problems/design-task-scheduler/
- Difficulty: Medium
- Why this problem: Schedule tasks across N workers with cooldowns — same primitives as visibility timeout + priority.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Compare at-most-once, at-least-once, exactly-once and pick for a use case.
- Draw the job lifecycle with retry + DLQ.
- Implement exponential backoff with full jitter (and explain why).
- Build an idempotent consumer with dedupe table + transactional update.
- Pick between SQS, Redis, Kafka, Temporal for a given workload.
- Solve design-task-scheduler: heap + cooldown bookkeeping; the scheduler core.