engineering · intermediate · 12m · 2026-06-09

Production Error Handling — Retries, Circuit Breakers, Timeouts, Bulkheads

Session 47 of the 48-session learning series.

Date: Thu, 2026-07-16 · Time: 18:00–20:00 IST · Track: 🧱 OOP & Languages (OOP) · Parent 28-day topic: Day 10 · Est. read: 2 h

Why this session matters

The difference between a service that survives a downstream outage and one that takes the whole site down is disciplined use of timeouts, retries, circuit breakers, and bulkheads. These patterns are 20 years old; most engineers still get them wrong on the first try.

Agenda

Timeouts — every network call, every time, with sensible numbers
Retry policies — when to retry, how to back off, when to stop
Circuit breakers — fail fast when downstream is dead
Bulkheads — isolation so one failure doesn't sink the ship
Graceful degradation — partial success > total failure

Deep dive

1. Every network call has a timeout

If you make a network call without setting a timeout, many clients will wait forever (Python's requests, for example, applies no timeout unless you pass one). Forever means:

One slow downstream → all your threads/connections pinned waiting.
Your service stops responding.
Cascading failure across the system.

Rule: every HTTP client, every DB driver, every gRPC call gets explicit timeouts. No exceptions.

Timeout types:

Connect — TCP handshake; 1–3 s typical.
Read — maximum wait for response data once connected (the first byte, or the gap between bytes); depends on operation.
Total / deadline — entire call; the safety net.
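
A minimal sketch of how the three timeout types nest, using only asyncio from the standard library. The connect and read callables are hypothetical stand-ins for a real client, and the default values are illustrative rather than recommendations.

```python
import asyncio


async def call_with_timeouts(connect, read, *, connect_s=2.0, read_s=1.0, total_s=2.5):
    """Run connect() then read(conn), each under its own timeout.

    total_s is the safety net: it bounds the whole call even if each
    phase individually stays just under its own limit.
    """
    async def _call():
        conn = await asyncio.wait_for(connect(), timeout=connect_s)
        return await asyncio.wait_for(read(conn), timeout=read_s)

    return await asyncio.wait_for(_call(), timeout=total_s)


# Hypothetical stand-ins for a real network client.
async def fake_connect():
    return "conn"


async def fast_read(conn):
    return conn + ":data"


async def slow_read(conn):
    await asyncio.sleep(0.5)  # simulates a stalled downstream
    return "too late"
```

Calling call_with_timeouts(fake_connect, slow_read, read_s=0.05) raises asyncio.TimeoutError after about 50 ms instead of pinning the caller for the full half second.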

2. Sensible timeout values

Operation	Typical timeout
DB query (point lookup)	100–500 ms
DB query (analytic)	5–30 s
External API call	1–5 s
LLM inference	30–60 s (streaming!)
Internal RPC	100 ms – 2 s
Cache lookup	10–50 ms
File upload	per-MB budget

Always: timeout < the upstream caller's timeout. Otherwise you waste work on requests no one's waiting for.

3. Retry — yes, no, sometimes

Retry only if:

Failure is transient — timeout, 503, network blip.
Operation is idempotent — same input → same outcome on repeat.

Do NOT retry:

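4xx errors (the request itself is wrong; repeating it won't change the answer). Exceptions: 408 and 429, which signal a transient condition; honour Retry-After when the server sends it.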
POST with side effects unless idempotent key supplied.
After circuit breaker opens.
When retry budget exhausted.
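
The two lists above can be folded into one decision function. This is a sketch under stated assumptions: the set of status codes treated as transient, the list of idempotent methods, and every parameter name are illustrative choices, and the retry budget is modelled as a simple per-request attempt limit.

```python
# Assumption: these statuses are treated as transient in this sketch.
TRANSIENT_STATUS = {408, 429, 502, 503, 504}
# Methods that HTTP defines as idempotent.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS"}


def should_retry(method, status, *, attempt, max_attempts=3,
                 has_idempotency_key=False, breaker_open=False, timed_out=False):
    """Return True only if another attempt is both safe and worthwhile.

    attempt is the number of attempts already made (1 after the first try).
    status may be None when the call timed out before any response arrived.
    """
    if breaker_open or attempt >= max_attempts:
        return False  # fail fast, or budget exhausted
    idempotent = method.upper() in IDEMPOTENT_METHODS or has_idempotency_key
    if not idempotent:
        return False  # a repeat could duplicate side effects
    if timed_out:
        return True
    return status in TRANSIENT_STATUS
```

For example, should_retry("POST", 503, attempt=1) is False, while the same call with has_idempotency_key=True is True.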

4. Retry strategies

No retry — fail fast.
Fixed retry — N attempts, fixed delay. Worst pattern.
Exponential backoff — delay = base * 2^attempt.
Exponential backoff + jitter — delay = random(0, base * 2^attempt). Always this.

Max attempts: 2–4 typical. After that, the dependency has problems a retry can't fix.

Many HTTP clients (requests + urllib3 Retry, AWS SDKs) support this out of the box; aiohttp has no built-in retry and needs a wrapper or an add-on package. Where a built-in exists, use it.
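
For cases where no built-in is available, here is a minimal sketch of exponential backoff with full jitter and bounded attempts. The cap on the delay, the default exception types, and the injectable sleep are assumptions added for illustration and testability.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Full jitter: a uniform delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * (2 ** attempt))


def retry(fn, *, max_attempts=3, retry_on=(TimeoutError, ConnectionError),
          base=0.1, cap=5.0, sleep=time.sleep):
    """Call fn() up to max_attempts times, sleeping with jitter in between.

    Only exceptions listed in retry_on are retried; anything else
    propagates immediately. The last failure is re-raised unchanged.
    """
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: surface the error
            sleep(backoff_delay(attempt, base, cap))
```

The jitter matters because clients that failed together would otherwise retry together, hitting the recovering dependency in synchronized waves.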

5. Circuit breakers

A circuit breaker is a state machine in front of a downstream call:

             N failures                  after timeout
  CLOSED ────────────────→ OPEN ──────────────────────→ HALF-OPEN
    ▲               (all calls fail fast)                   │
    │                       ▲                               │
    │                       └────────── probe fails ────────┤
    └──────────────────── probe succeeds ───────────────────┘

CLOSED — calls pass through normally.
OPEN — calls fail immediately (without hitting downstream). Recovery window.
HALF-OPEN — limited probe traffic; success closes, failure re-opens.

When downstream is dead, the breaker opens within seconds and your service stops piling on. That spares the struggling dependency extra load and keeps your own latency bounded.

Libraries: Resilience4j (Java), Polly (.NET), pybreaker (Python).
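
To make the state machine concrete, here is a minimal hand-rolled sketch. It is not how any of the libraries above are implemented; it counts consecutive failures, is not thread-safe, and does not limit the number of concurrent probes in HALF_OPEN. The class and parameter names are illustrative.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout  # seconds to stay OPEN before probing
        self._clock = clock                 # injectable for tests
        self._failures = 0                  # consecutive failures while CLOSED
        self._opened_at = None
        self.state = "CLOSED"

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            self.state = "HALF_OPEN"  # recovery window elapsed: allow a probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_success(self):
        self._failures = 0
        self.state = "CLOSED"

    def _on_failure(self):
        self._failures += 1
        # A failed probe re-opens immediately; otherwise wait for the threshold.
        if self.state == "HALF_OPEN" or self._failures >= self.failure_threshold:
            self.state = "OPEN"
            self._opened_at = self._clock()
```

Typical use is one breaker instance per dependency, wrapping every outbound call: breaker.call(fetch_profile, user_id), where fetch_profile is whatever function performs the request.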

6. Bulkheads

Bulkheads = isolate resources so one slow call doesn't starve others.

Patterns:

Thread pool per dependency — slow service has its own pool; can't exhaust the global pool.
Connection pool per dependency — same idea.
Async with semaphore per dependency — limit concurrent calls per target.
Process / pod separation — different microservices for different blast radius.

Without bulkheads: one slow vendor API → all your worker threads stuck → your service unavailable for requests that don't even use that vendor.
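
The semaphore-per-dependency pattern can be sketched in a few lines of asyncio. This version rejects immediately when the compartment is full rather than queueing; that fail-fast policy, and the names Bulkhead and BulkheadFull, are choices made for this example.

```python
import asyncio


class BulkheadFull(Exception):
    """Raised when a dependency's concurrency limit is already reached."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._sem = asyncio.Semaphore(max_concurrent)

    async def run(self, coro_fn, *args):
        # Reject instead of queueing, so a slow dependency cannot build
        # an unbounded backlog of waiting callers.
        if self._sem.locked():
            raise BulkheadFull(self.name)
        async with self._sem:
            return await coro_fn(*args)


async def demo():
    vendor = Bulkhead("vendor-api", max_concurrent=2)

    async def slow_vendor_call():
        await asyncio.sleep(0.05)  # simulates a slow vendor
        return "ok"

    # Five concurrent callers, but only two slots for this dependency.
    return await asyncio.gather(
        *(vendor.run(slow_vendor_call) for _ in range(5)),
        return_exceptions=True,
    )
```

Running asyncio.run(demo()) yields two successful results and three BulkheadFull rejections, and calls to other dependencies with their own Bulkhead are unaffected.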

7. Graceful degradation

When something fails, can you still partially succeed?

Examples:

Recommendation service down → return popular items as fallback.
Personalisation service slow → render generic feed.
Image enrichment down → show placeholder, retry async.
Auth-info enrichment down → show basic name, fetch later.

The discipline: every dependent feature has a fallback. Design with "what if this is down?" in mind.
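
One way to make the fallback explicit in code is a small wrapper that tries the primary path and degrades on failure. This is a sketch: the function name, the returned degraded flag, and the logger name are assumptions for illustration.

```python
import logging

logger = logging.getLogger("degradation")


def with_fallback(primary, fallback, *, feature):
    """Return (value, degraded). degraded is True when the fallback was used.

    The failure is logged rather than swallowed, so degraded responses
    remain visible in logs and can be counted in metrics.
    """
    try:
        return primary(), False
    except Exception:
        logger.warning("%s degraded; serving fallback", feature, exc_info=True)
        return fallback(), True
```

For the recommendation example above, primary would call the recommendation service and fallback would return a cached list of popular items; the degraded flag lets the caller tag the response or skip caching it.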

8. Hedged requests

For low-latency systems with replicas:

Send the request to replica A.
If no response arrives within the 99th-percentile latency, also send the request to replica B.
Use whichever responds first.

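Because only the slowest ~1% of requests are ever duplicated, the extra load is small, far below the ~2× cost of always sending to both replicas, and the tail latency drops sharply. Great for read-heavy KV stores.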
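
A minimal asyncio sketch of the three steps above, assuming exactly two replicas. make_request is a hypothetical coroutine function that takes a replica identifier, and hedge_after_s would be set from a measured latency percentile.

```python
import asyncio


async def hedged(make_request, replicas, hedge_after_s):
    """Send to replicas[0]; if it is still pending after hedge_after_s,
    also send to replicas[1] and return whichever finishes first."""
    tasks = [asyncio.create_task(make_request(replicas[0]))]
    done, _ = await asyncio.wait(tasks, timeout=hedge_after_s)
    if not done:
        # Primary is slower than the threshold: fire the hedge request.
        tasks.append(asyncio.create_task(make_request(replicas[1])))
        done, _ = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in tasks:
        if task not in done:
            task.cancel()  # stop paying for the loser
    # Note: if the first finisher failed, its exception is raised here.
    return next(iter(done)).result()
```

This sketch returns the first completion even if it is an error; a fuller version might wait for the other request in that case. Hedging only makes sense for idempotent reads, since both replicas may execute the request.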

9. Timeouts and deadline propagation

When service A calls service B which calls service C:

A sets 500 ms timeout for B.
B should call C with at most 400 ms (saving 100 ms for B's own work).
C must respect that 400 ms.

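gRPC sends the caller's deadline with each request, and some implementations (for example Go, via context) carry it forward to downstream calls automatically; in others you must pass the remaining time on yourself. Plain HTTP has no standard mechanism, so it takes a custom header (for example X-Deadline) plus client logic. Most teams forget, and downstream services keep working on requests whose callers have already timed out.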
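
The A → B → C arithmetic above can be captured in a small helper that each hop uses to derive its downstream timeout. This is a sketch: the Deadline class, its method names, and the millisecond clock are illustrative assumptions, and how the budget travels between services (header, metadata) is left out.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when no time budget remains for a downstream call."""


class Deadline:
    def __init__(self, timeout_ms, clock_ms=None):
        # Monotonic clock so wall-clock adjustments cannot distort the budget.
        self._clock_ms = clock_ms or (lambda: int(time.monotonic() * 1000))
        self._expires_at = self._clock_ms() + timeout_ms

    def remaining_ms(self):
        return max(0, self._expires_at - self._clock_ms())

    def child_timeout_ms(self, reserve_ms=0):
        """Timeout to give the next hop, keeping reserve_ms for local work."""
        budget = self.remaining_ms() - reserve_ms
        if budget <= 0:
            raise DeadlineExceeded("no budget left for downstream call")
        return budget
```

In the example above, B would build Deadline(500) on receiving A's request and call C with child_timeout_ms(reserve_ms=100), which is 400 ms if B has spent no time yet and shrinks as B's own work consumes the budget.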

10. Idempotency revisited

Retry-safe operations require:

Server stores (idempotency_key, response) for some TTL.
Same key in second request → return cached response, not re-execute.
409 Conflict (some APIs use 422 instead) if the same key arrives with a different body.

Mandatory for retries on POST. (Recap from S31.)
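
A minimal in-memory sketch of the three server-side rules above. A real service would use a shared store with an atomic insert so two concurrent requests with the same key cannot both execute; the class names, the body hashing, and the 24-hour default TTL are assumptions for illustration.

```python
import hashlib
import time


class IdempotencyConflict(Exception):
    """Same key reused with a different request body (maps to HTTP 409)."""


class IdempotencyStore:
    def __init__(self, ttl_s=24 * 3600, clock=time.monotonic):
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}  # key -> (body_hash, response, expires_at)

    def execute(self, key, body: bytes, handler):
        """Run handler(body) at most once per key within the TTL."""
        body_hash = hashlib.sha256(body).hexdigest()
        entry = self._entries.get(key)
        if entry is not None and entry[2] > self._clock():
            if entry[0] != body_hash:
                raise IdempotencyConflict(key)
            return entry[1]  # replay the stored response; do not re-execute
        response = handler(body)
        self._entries[key] = (body_hash, response, self._clock() + self._ttl_s)
        return response
```

With this in place, a client that retries a POST after a timeout sends the same key and gets the stored response back, even if the first attempt actually succeeded.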

11. Observability hooks

Per-dependency metrics:

Call rate.
Error rate.
Latency p50, p99.
Timeout rate.
Retry rate.
Circuit breaker state.

If you can't see these, you're flying blind. Standard exporters with sane labels make dashboards trivial.
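
As a sketch of what gets recorded per dependency, here is a tiny in-process collector. It is not a substitute for a real metrics exporter; the class, its method names, and the nearest-rank percentile are illustrative assumptions, and circuit breaker state (a gauge) is left out.

```python
import math
from collections import defaultdict


class DependencyMetrics:
    def __init__(self):
        self.counters = defaultdict(int)      # (dependency, name) -> count
        self.latencies_ms = defaultdict(list)  # dependency -> [latency in ms]

    def record(self, dependency, latency_ms, *, error=False, timeout=False, retry=False):
        """Record one outbound call and its outcome flags."""
        self.counters[(dependency, "calls")] += 1
        for name, flag in (("errors", error), ("timeouts", timeout), ("retries", retry)):
            if flag:
                self.counters[(dependency, name)] += 1
        self.latencies_ms[dependency].append(latency_ms)

    def percentile(self, dependency, p):
        """Nearest-rank percentile of recorded latencies, or None if empty."""
        samples = sorted(self.latencies_ms[dependency])
        if not samples:
            return None
        rank = max(1, math.ceil(p * len(samples) / 100))
        return samples[rank - 1]

    def error_rate(self, dependency):
        calls = self.counters[(dependency, "calls")]
        return self.counters[(dependency, "errors")] / calls if calls else 0.0
```

Keying everything by dependency name is the important part: a single global error rate hides which downstream is actually failing.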

12. Anti-patterns

Catching exceptions and silently ignoring them. (Always log, then propagate or handle deliberately.)
Retry every error class.
Infinite retry loop.
Circuit breaker with too-low threshold (flips on every blip).
Timeout configured but ignored by the client library (looking at you, old Apache HttpClient).
Bulkheads without monitoring — pool full, no alert.

13. Reality check

A new service's resilience checklist:

Timeouts on every outbound call. Document the value.
Retry with exponential backoff + jitter for idempotent operations.
Circuit breaker around each external dependency.
Bulkhead — separate connection pool per dependency.
Fallback strategy for each user-visible feature.
Deadline propagation for any multi-hop request.
Metrics + alerts on each of the above.

Doing all of this is a 1-week investment per service. Skipping it is the difference between "vendor X had an outage" and "all our customers were down for 4 hours".

Link: https://leetcode.com/problems/design-circular-deque/
Difficulty: Medium
Why this problem: Bounded buffer / ring with front + rear — the data structure inside bulkhead pool implementations.
Time-box: 30 minutes. Look up the editorial only after.

Post-session checklist

By the end of this session you should be able to:

Set timeouts on every outbound call with justified values.
Write retry policy with exponential backoff + jitter and bounded attempts.
Implement a circuit breaker with CLOSED / OPEN / HALF-OPEN states.
Apply bulkheads to isolate one dependency's failures.
Design graceful-degradation fallbacks for user-visible features.
Solve design-circular-deque — the bounded-queue primitive inside resilience pool implementations.
