Production Error Handling — Retries, Circuit Breakers, Timeouts, Bulkheads
Session 47 of the 48-session learning series.
Date: Thu, 2026-07-16 · Time: 18:00–20:00 IST · Track: 🧱 OOP & Languages (OOP) · Parent 28-day topic: Day 10 · Est. read: 2 h
Why this session matters
This is Session 47 of 48 in the OOP track. The difference between a service that survives a downstream outage and one that takes the whole site down is the discipline of timeouts, retries, circuit breakers, and bulkheads. These patterns are 20 years old; most engineers still get them wrong on the first try.
Agenda
- Timeouts — every network call, every time, with sensible numbers
- Retry policies — when to retry, how to back off, when to stop
- Circuit breakers — fail fast when downstream is dead
- Bulkheads — isolation so one failure doesn't sink the ship
- Graceful degradation — partial success > total failure
Pre-read (skim before the session)
- Release It! (Michael Nygard) — chapters 5–6
- AWS Builders' Library — Timeouts, retries, and backoff
- Netflix Hystrix README (legacy but classic)
- resilience4j docs
Deep dive
1. Every network call has a timeout
If you make a network call without an explicit timeout, many clients default to waiting forever. Forever means:
- One slow downstream → all your threads/connections pinned waiting.
- Your service stops responding.
- Cascading failure across the system.
Rule: every HTTP client, every DB driver, every gRPC call gets explicit timeouts. No exceptions.
Timeout types:
- Connect — TCP handshake; 1–3 s typical.
- Read — waiting for response data (the first byte, or the gap between chunks); depends on the operation.
- Total / deadline — entire call; the safety net.
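Connect and read timeouts are normally set on the client library itself. The total deadline can be enforced around any call, as in this minimal sketch using `asyncio.wait_for`. The names `call_with_deadline`, `slow_dependency`, and `fast_dependency` are hypothetical stand-ins, not part of any library.

```python
import asyncio


async def call_with_deadline(coro_factory, total_timeout_s):
    """Run one outbound call under a hard total deadline.

    coro_factory is a zero-argument callable returning the coroutine for the
    call (hypothetical; in real code it would wrap an HTTP/DB/gRPC client).
    """
    try:
        return await asyncio.wait_for(coro_factory(), timeout=total_timeout_s)
    except asyncio.TimeoutError:
        # Surface one well-known error type the caller can handle.
        raise TimeoutError(f"call exceeded {total_timeout_s}s deadline") from None


async def slow_dependency():
    await asyncio.sleep(1.0)  # stand-in for a stuck downstream
    return "late"


async def fast_dependency():
    await asyncio.sleep(0.01)
    return "ok"
```

`wait_for` cancels the inner coroutine when the deadline passes, so the caller gets its thread of control back instead of staying pinned.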
2. Sensible timeout values
| Operation | Typical timeout |
|---|---|
| DB query (point lookup) | 100–500 ms |
| DB query (analytic) | 5–30 s |
| External API call | 1–5 s |
| LLM inference | 30–60 s (streaming!) |
| Internal RPC | 100 ms – 2 s |
| Cache lookup | 10–50 ms |
| File upload | per-MB budget |
Always: your outbound timeout < the timeout your own caller has set on you. Otherwise you keep working on requests no one is waiting for any more.
3. Retry — yes, no, sometimes
Retry only if:
- Failure is transient — timeout, 503, network blip.
- Operation is idempotent — same input → same outcome on repeat.
Do NOT retry:
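- Most 4xx errors (the request itself is wrong; repeating it won't change the answer). 408 and 429 are the usual exceptions, and 429 should honor Retry-After.
- POST with side effects unless an idempotency key is supplied.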
- After circuit breaker opens.
- When retry budget exhausted.
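The two lists above can be folded into one decision function. This is a sketch under stated assumptions: `should_retry`, its parameters, and the exact status and method sets are illustrative choices to tune per service, not a standard.

```python
# Assumed sets; adjust to what your dependencies actually return.
TRANSIENT_STATUSES = {502, 503, 504}
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}


def should_retry(method, status, *, has_idempotency_key=False,
                 breaker_open=False, attempts_left=0):
    """Return True only when a retry is both safe and likely to help.

    status is an HTTP status code, or None for a timeout / connection error.
    """
    if breaker_open or attempts_left <= 0:
        return False
    transient = status is None or status in TRANSIENT_STATUSES
    safe = method.upper() in IDEMPOTENT_METHODS or has_idempotency_key
    return transient and safe
```

Keeping the decision in one place makes it easy to audit: a retry happens only when the failure is transient and the operation is safe to repeat.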
4. Retry strategies
- No retry — fail fast.
- Fixed retry — N attempts, fixed delay. Worst pattern.
- Exponential backoff — delay = base * 2^attempt.
- Exponential backoff + jitter — delay = random(0, base * 2^attempt). Always this.
Max attempts: 2–4 typical. Beyond that, the downstream has a problem that retrying can't fix.
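Many HTTP clients and SDKs support this out of the box (requests via urllib3's Retry, the AWS SDKs). aiohttp has no built-in retry, so use an add-on such as aiohttp-retry. Prefer the built-in over a hand-rolled loop.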
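Where no built-in is available, the backoff-with-jitter formula fits in a few lines. A minimal sketch: `backoff_delay` and `retry` are hypothetical helpers, and the `cap_s` ceiling is an added assumption so that delays stay bounded.

```python
import random
import time


def backoff_delay(attempt, base_s=0.1, cap_s=5.0, rng=random):
    """Full jitter: uniform in [0, min(cap_s, base_s * 2^attempt)]."""
    return rng.uniform(0.0, min(cap_s, base_s * (2 ** attempt)))


def retry(fn, *, max_attempts=3, retry_on=(TimeoutError, ConnectionError),
          sleep=time.sleep, rng=random):
    """Call fn() up to max_attempts times, sleeping with jitter between tries."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the last error
            sleep(backoff_delay(attempt, rng=rng))
```

Only the exception types listed in `retry_on` are retried; anything else propagates immediately. `sleep` and `rng` are injectable so the policy can be tested without waiting.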
5. Circuit breakers
A circuit breaker is a state machine in front of a downstream call:
CLOSED ── success ──→ CLOSED
CLOSED ── N failures ──→ OPEN (all calls fail fast)
OPEN ── after timeout ──→ HALF-OPEN
HALF-OPEN ── success ──→ CLOSED
HALF-OPEN ── failure ──→ OPEN
- CLOSED — calls pass through normally.
- OPEN — calls fail immediately (without hitting downstream). Recovery window.
- HALF-OPEN — limited probe traffic; success closes, failure re-opens.
When downstream is dead, the breaker opens within seconds; your service stops piling on. Saves the dependency and your latency.
Libraries: resilience4j (Java), Polly (.NET), pybreaker (Python).
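To make the state machine concrete, here is a deliberately small sketch. It is not how any of the libraries above are implemented. Assumptions: it counts consecutive failures, it is not thread-safe, and in HALF-OPEN it lets every call through rather than limiting probe traffic. `CircuitBreaker` and `CircuitOpenError` are hypothetical names.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout_s=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout_s = recovery_timeout_s
        self._clock = clock
        self._failures = 0       # consecutive failures seen so far
        self._opened_at = None   # None means the breaker is CLOSED

    @property
    def state(self):
        if self._opened_at is None:
            return "CLOSED"
        if self._clock() - self._opened_at >= self.recovery_timeout_s:
            return "HALF-OPEN"
        return "OPEN"

    def call(self, fn, *args, **kwargs):
        state = self.state
        if state == "OPEN":
            raise CircuitOpenError("failing fast; downstream presumed dead")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._failures += 1
            if state == "HALF-OPEN" or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # (re)open, restart the timer
            raise
        # Any success resets the count and closes the breaker.
        self._failures = 0
        self._opened_at = None
        return result
```

The injected `clock` keeps the recovery window testable without sleeping. A production breaker would also need locking, a cap on concurrent probes, and metrics on state changes.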
6. Bulkheads
Bulkheads = isolate resources so one slow call doesn't starve others.
Patterns:
- Thread pool per dependency — slow service has its own pool; can't exhaust the global pool.
- Connection pool per dependency — same idea.
- Async with semaphore per dependency — limit concurrent calls per target.
- Process / pod separation — different microservices for different blast radius.
Without bulkheads: one slow vendor API → all your worker threads stuck → your service unavailable for requests that don't even use that vendor.
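The "async with semaphore per dependency" pattern can be sketched as follows. Assumptions: `Bulkhead` and `BulkheadFullError` are hypothetical names, and this variant rejects immediately when the compartment is full instead of queueing.

```python
import asyncio


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency limit is already reached."""


class Bulkhead:
    """Cap concurrent in-flight calls to one dependency."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._sem = asyncio.Semaphore(max_concurrent)

    async def run(self, coro_factory):
        # Fail fast rather than queue: a full compartment must not
        # hold up callers that could be served by a fallback.
        if self._sem.locked():
            raise BulkheadFullError(f"bulkhead '{self.name}' is full")
        async with self._sem:
            return await coro_factory()
```

One `Bulkhead` instance per dependency means a slow vendor can occupy at most its own `max_concurrent` slots. Create the instances inside the running event loop so they work across Python versions.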
7. Graceful degradation
When something fails, can you still partially succeed?
Examples:
- Recommendation service down → return popular items as fallback.
- Personalisation service slow → render generic feed.
- Image enrichment down → show placeholder, retry async.
- Auth-info enrichment down → show basic name, fetch later.
The discipline: every dependent feature has a fallback. Design with "what if this is down?" in mind.
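A fallback can be as simple as a wrapper that catches the expected failure types and nothing else. This is a sketch: `with_fallback` and `POPULAR_ITEMS` are hypothetical, and the returned flag is one possible way to mark a response as degraded.

```python
import logging

log = logging.getLogger("degradation")

POPULAR_ITEMS = ["item-1", "item-2", "item-3"]  # hypothetical static fallback


def with_fallback(primary, fallback, *, expected=(TimeoutError, ConnectionError)):
    """Return (result, degraded). Only expected failures trigger the fallback."""
    try:
        return primary(), False
    except expected as exc:
        log.warning("primary failed, serving fallback: %r", exc)
        return fallback(), True  # True marks the response as degraded
```

Limiting `expected` to dependency failures keeps programming errors visible: a bug in `primary` still raises instead of being hidden behind the fallback.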
8. Hedged requests
For low-latency systems with replicas:
- Send the request to replica A.
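- If there is no response within the 99th-percentile latency, also send to replica B.
- Use whichever responds first.
Trades extra load for tail-latency reduction: hedging at p99 adds roughly 1% more requests in steady state, and approaches 2× only if you always send both. Use it only for idempotent reads. Great for read-heavy KV stores.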
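The three steps above can be sketched with asyncio. Assumptions: `hedged` is a hypothetical helper, `call(replica)` is an idempotent read, and if the first finisher raised an error, that error propagates rather than waiting for the other replica.

```python
import asyncio


async def hedged(call, replicas, hedge_after_s):
    """Ask replicas[0]; if it is slow, also ask replicas[1]; first finisher wins."""
    primary = asyncio.create_task(call(replicas[0]))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:
        return primary.result()  # fast path: no hedge was needed

    backup = asyncio.create_task(call(replicas[1]))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop paying for the loser
    return next(iter(done)).result()
```

`hedge_after_s` would typically come from the observed latency distribution of the primary, so that only the slow tail triggers a second request.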
9. Timeouts and deadline propagation
When service A calls service B which calls service C:
- A sets 500 ms timeout for B.
- B should call C with at most 400 ms (saving 100 ms for B's own work).
- C must respect that 400 ms.
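gRPC sends the deadline with each call, and several implementations propagate it to downstream calls when you pass the call context along. Plain HTTP has no standard deadline header, so you need a custom one (for example X-Deadline) plus client logic. Most teams forget, and each hop then waits out its own full timeout after the original caller has already given up.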
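The A → B → C arithmetic can be captured in a small helper that each hop uses before calling downstream. A sketch, with `Deadline` and `timeout_for_downstream` as hypothetical names:

```python
import time


class Deadline:
    """An absolute deadline that each hop shrinks before calling downstream."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining_s(self):
        return max(0.0, self._expires_at - self._clock())

    def timeout_for_downstream(self, reserve_s):
        """Budget to hand to the next hop, keeping reserve_s for local work."""
        budget = self.remaining_s() - reserve_s
        if budget <= 0:
            # Nobody can use the answer any more: skip the call entirely.
            raise TimeoutError("deadline exhausted; not calling downstream")
        return budget
```

With a 500 ms budget and a 100 ms reserve, B would call C with about 400 ms, and less if B has already spent time on its own work. Passing the remaining budget rather than an absolute timestamp avoids depending on synchronized clocks between hosts.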
10. Idempotency revisited
Retry-safe operations require:
- Server stores (idempotency_key, response) for some TTL.
- Same key in second request → return cached response, not re-execute.
- 409 Conflict if same key + different body.
Mandatory for retries on POST. (Recap from S31.)
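The three server-side rules can be sketched with an in-memory store. Assumptions: `IdempotencyStore` and `IdempotencyConflict` are hypothetical names, the body is compared by SHA-256 hash, and a real service would use a shared store with atomic writes instead of a dict.

```python
import hashlib
import time


class IdempotencyConflict(Exception):
    """Same key, different request body: maps to HTTP 409 Conflict."""


class IdempotencyStore:
    def __init__(self, ttl_s=24 * 3600, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}  # key -> (body_hash, response, expires_at)

    def execute(self, key, body: bytes, handler):
        """Run handler(body) at most once per key within the TTL."""
        body_hash = hashlib.sha256(body).hexdigest()
        entry = self._entries.get(key)
        if entry is not None and entry[2] > self._clock():
            if entry[0] != body_hash:
                raise IdempotencyConflict(key)
            return entry[1]  # replay the stored response, no re-execution
        response = handler(body)
        self._entries[key] = (body_hash, response, self._clock() + self._ttl_s)
        return response
```

A retried POST with the same key and body gets the stored response back, so the side effect happens once even if the client never saw the first reply.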
11. Observability hooks
Per-dependency metrics:
- Call rate.
- Error rate.
- Latency p50, p99.
- Timeout rate.
- Retry rate.
- Circuit breaker state.
If you can't see these, you're flying blind. Standard exporters with sane labels make dashboards trivial.
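To show the shape of these metrics without committing to a particular exporter, here is an in-memory sketch. `DependencyMetrics` and the outcome labels are assumptions; a real service would emit the same numbers through its metrics library.

```python
import math
from collections import Counter, defaultdict


class DependencyMetrics:
    """In-memory per-dependency counters and latency samples."""

    def __init__(self):
        self.counts = defaultdict(Counter)     # dependency -> outcome -> count
        self.latencies_ms = defaultdict(list)  # dependency -> latency samples

    def record(self, dependency, outcome, latency_ms):
        # outcome is one of "ok", "error", "timeout" (assumed labels)
        self.counts[dependency][outcome] += 1
        self.latencies_ms[dependency].append(latency_ms)

    def percentile(self, dependency, pct):
        """Nearest-rank percentile; fine for a sketch, not for huge samples."""
        samples = sorted(self.latencies_ms[dependency])
        if not samples:
            return None
        rank = max(1, math.ceil(len(samples) * pct / 100))
        return samples[rank - 1]

    def error_rate(self, dependency):
        c = self.counts[dependency]
        total = sum(c.values())
        return (c["error"] + c["timeout"]) / total if total else 0.0
```

Keying everything by dependency name is the important part: it lets a dashboard show which downstream is slow or failing, not just that something is.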
12. Anti-patterns
- Catching and silently ignoring exceptions. (Always log, then propagate or fall back deliberately.)
- Retry every error class.
- Infinite retry loop.
- Circuit breaker with too-low threshold (flips on every blip).
- Timeout configured but ignored by the client library (looking at you, old Apache HttpClient).
- Bulkheads without monitoring — pool full, no alert.
13. Reality check
A new service's resilience checklist:
- Timeouts on every outbound call. Document the value.
- Retry with exponential backoff + jitter for idempotent operations.
- Circuit breaker around each external dependency.
- Bulkhead — separate connection pool per dependency.
- Fallback strategy for each user-visible feature.
- Deadline propagation for any multi-hop request.
- Metrics + alerts on each of the above.
Doing all of this is a 1-week investment per service. Skipping it is the difference between "vendor X had an outage" and "all our customers were down for 4 hours".
Reading material
- Release It! (Michael Nygard, 2nd ed.)
- AWS Builders' Library — patterns
- Designing Distributed Systems (Brendan Burns)
- Marc Brooker — Timeouts, retries, jitter
In-depth research material
- resilience4j docs
- Polly docs (.NET)
- Envoy Proxy — Timeouts and retries
- Tigerbeetle — Reliability discussions
Video reference
▶︎ Patterns for Resilient Microservices (Adrian Hornsby)
Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.
LeetCode — Design Circular Deque
- Link: https://leetcode.com/problems/design-circular-deque/
- Difficulty: Medium
- Why this problem: Bounded buffer / ring with front + rear — the data structure inside bulkhead pool implementations.
- Time-box: 30 minutes. Look up the editorial only after.
Post-session checklist
By the end of this session you should be able to:
- Set timeouts on every outbound call with justified values.
- Write retry policy with exponential backoff + jitter and bounded attempts.
- Implement a circuit breaker with CLOSED / OPEN / HALF-OPEN states.
- Apply bulkheads to isolate one dependency's failures.
- Design graceful-degradation fallbacks for user-visible features.
- Solve design-circular-deque — the bounded-queue primitive inside resilience pool implementations.