engineering · intermediate · 12m · 2026-06-09

Production Error Handling — Retries, Circuit Breakers, Timeouts, Bulkheads

Session 47 of the 48-session learning series.

Date: Thu, 2026-07-16 · Time: 18:00–20:00 IST · Track: 🧱 OOP & Languages (OOP) · Parent 28-day topic: Day 10 · Est. read: 2 h

Why this session matters

The difference between a service that survives a downstream outage and one that takes the whole site down is the discipline of timeouts, retries, circuit breakers, and bulkheads. These patterns have been around for roughly 20 years, and most engineers still get them wrong on the first try.

Agenda

  • Timeouts — every network call, every time, with sensible numbers
  • Retry policies — when to retry, how to back off, when to stop
  • Circuit breakers — fail fast when downstream is dead
  • Bulkheads — isolation so one failure doesn't sink the ship
  • Graceful degradation — partial success > total failure

Deep dive

1. Every network call has a timeout

If you make a network call without an explicit timeout, many client libraries will wait forever (or for minutes). Forever means:

  • One slow downstream → all your threads/connections pinned waiting.
  • Your service stops responding.
  • Cascading failure across the system.

Rule: every HTTP client, every DB driver, every gRPC call gets explicit timeouts. No exceptions.

Timeout types:

  • Connect — TCP handshake; 1–3 s typical.
  • Read — waiting for response data to start arriving (and between chunks); depends on operation.
  • Total / deadline — entire call; the safety net.
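
As a minimal sketch of the total / deadline timeout, the snippet below bounds an entire call with asyncio.wait_for. The slow_downstream coroutine and the numbers are hypothetical stand-ins for a real client call; connect and read timeouts would still be set on the client itself.

```python
import asyncio


async def slow_downstream(delay_s: float) -> str:
    """Hypothetical downstream call that takes delay_s seconds to answer."""
    await asyncio.sleep(delay_s)
    return "ok"


async def call_with_deadline(delay_s: float, deadline_s: float) -> str:
    """Bound the whole call: never wait longer than deadline_s."""
    try:
        return await asyncio.wait_for(slow_downstream(delay_s), timeout=deadline_s)
    except asyncio.TimeoutError:
        # The pending call is cancelled, so the caller is freed immediately.
        return "timed out"


fast = asyncio.run(call_with_deadline(delay_s=0.01, deadline_s=0.5))
slow = asyncio.run(call_with_deadline(delay_s=5.0, deadline_s=0.05))
print(fast, "/", slow)
```

The second call gives up after 50 ms instead of holding a worker for five seconds, which is the safety-net behaviour described above.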

2. Sensible timeout values

Operation                  Typical timeout
DB query (point lookup)    100–500 ms
DB query (analytic)        5–30 s
External API call          1–5 s
LLM inference              30–60 s (streaming!)
Internal RPC               100 ms – 2 s
Cache lookup               10–50 ms
File upload                per-MB budget

Always: timeout < the upstream caller's timeout. Otherwise you waste work on requests no one's waiting for.

3. Retry — yes, no, sometimes

Retry only if:

  • Failure is transient — timeout, 503, network blip.
  • Operation is idempotent — same input → same outcome on repeat.

Do NOT retry:

  • 4xx errors (client's fault; the same request won't succeed), except 408 and 429, which are transient.
  • POST with side effects unless an idempotency key is supplied.
  • After circuit breaker opens.
  • When retry budget exhausted.
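
The two lists above can be folded into one decision function. This is a sketch under assumptions: the function name, its parameters, and the exact set of transient status codes are illustrative, and status=None stands for a timeout or network error with no HTTP response.

```python
# Assumption: which statuses count as transient varies by API; extend as needed
# (for example 429 when the server sends Retry-After).
TRANSIENT_STATUS = frozenset({502, 503, 504})
IDEMPOTENT_METHODS = frozenset({"GET", "HEAD", "PUT", "DELETE", "OPTIONS"})


def should_retry(method, status, attempts_made, *, max_attempts=3,
                 has_idempotency_key=False, breaker_open=False):
    """Return True only if a retry is both safe and likely to help."""
    if breaker_open:
        return False  # the breaker already decided downstream is unhealthy
    if attempts_made >= max_attempts:
        return False  # retry budget exhausted
    transient = status is None or status in TRANSIENT_STATUS
    idempotent = method.upper() in IDEMPOTENT_METHODS or has_idempotency_key
    return transient and idempotent


print(should_retry("GET", 503, attempts_made=1))
print(should_retry("POST", 503, attempts_made=1))
```

A POST only becomes retryable here once the caller states that an idempotency key was supplied.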

4. Retry strategies

  • No retry — fail fast.
  • Fixed retry — N attempts, fixed delay. Worst pattern.
  • Exponential backoff: delay = base * 2^attempt.
  • Exponential backoff + jitter: delay = random(0, base * 2^attempt). Always this.

Max attempts: 2–4 typical. Beyond that, the failure is usually something a retry can't fix.
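
To make the backoff + jitter formula concrete, here is a small hand-rolled sketch. TransientError, the cap parameter, and the injectable sleep are assumptions added for illustration; in production a client library's built-in retry support is usually the better choice.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeout, 503, ...)."""


def backoff_delay(attempt, base=0.1, cap=5.0):
    """Full jitter: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    return random.uniform(0, min(cap, base * 2 ** attempt))


def call_with_retry(fn, max_attempts=4, base=0.1, cap=5.0, sleep=time.sleep):
    """Call fn(), retrying TransientError with jittered exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # bounded attempts: give up and surface the error
            sleep(backoff_delay(attempt, base, cap))
```

The jitter matters because clients that fail at the same moment would otherwise retry at the same moment too, hitting the recovering service in synchronized waves.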

Many HTTP clients and SDKs (requests + urllib3 Retry, AWS SDKs) support this out of the box; aiohttp needs an add-on such as aiohttp-retry. Use the built-in where one exists.

5. Circuit breakers

A circuit breaker is a state machine in front of a downstream call:

               N failures
  CLOSED ──────────────────────→ OPEN
    ▲                            │  ▲
    │              after timeout │  │ failure
    │ success                    ▼  │
    └──────────────────────── HALF-OPEN

  • CLOSED — calls pass through normally.
  • OPEN — calls fail immediately (without hitting downstream). Recovery window.
  • HALF-OPEN — limited probe traffic; success closes, failure re-opens.

When downstream is dead, the breaker opens within seconds; your service stops piling on. Saves the dependency and your latency.
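
The state machine can be sketched in a few dozen lines. This is an illustrative version, not how any particular library implements it: it counts consecutive failures, is not thread-safe, does not limit the number of concurrent probes, and the class and parameter names are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without touching downstream."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock  # injectable so the timing can be tested
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("failing fast")
            self.state = "HALF-OPEN"  # recovery window elapsed: allow a probe
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        self.failures += 1
        # A failed probe re-opens at once; otherwise wait for the threshold.
        if self.state == "HALF-OPEN" or self.failures >= self.failure_threshold:
            self.state = "OPEN"
            self.opened_at = self.clock()

    def _on_success(self):
        self.failures = 0
        self.state = "CLOSED"
```

Usage would look like breaker.call(fetch_profile, user_id), with one breaker instance per downstream dependency (fetch_profile being a hypothetical client function).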

Libraries: Resilience4j (Java), Polly (.NET), pybreaker (Python).

6. Bulkheads

Bulkheads = isolate resources so one slow call doesn't starve others.

Patterns:

  • Thread pool per dependency — slow service has its own pool; can't exhaust the global pool.
  • Connection pool per dependency — same idea.
  • Async with semaphore per dependency — limit concurrent calls per target.
  • Process / pod separation — different microservices for different blast radius.

Without bulkheads: one slow vendor API → all your worker threads stuck → your service unavailable for requests that don't even use that vendor.
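
The semaphore-per-dependency pattern might look like the sketch below. The Bulkhead class, the dependency names, and the choice to reject when full (rather than queue) are assumptions made for illustration.

```python
import asyncio


class BulkheadFullError(Exception):
    """Raised when a dependency's concurrency budget is already in use."""


class Bulkhead:
    """Caps concurrent calls to one dependency; rejects instead of queueing."""

    def __init__(self, name, max_concurrent):
        self.name = name
        self._sem = asyncio.Semaphore(max_concurrent)

    async def run(self, coro_fn, *args):
        if self._sem.locked():  # no free slot: shed load right away
            raise BulkheadFullError(self.name)
        async with self._sem:
            return await coro_fn(*args)


async def slow_vendor():
    await asyncio.sleep(0.05)  # hypothetical slow vendor API
    return "vendor-ok"


async def fast_db():
    return "db-ok"  # hypothetical healthy dependency


async def main():
    vendor = Bulkhead("vendor", max_concurrent=2)
    db = Bulkhead("db", max_concurrent=10)
    calls = [vendor.run(slow_vendor) for _ in range(5)] + [db.run(fast_db)]
    return await asyncio.gather(*calls, return_exceptions=True)


results = asyncio.run(main())
print(results)
```

Only two vendor calls are in flight at once; the other three are rejected quickly, and the database call is unaffected by the vendor's slowness.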

7. Graceful degradation

When something fails, can you still partially succeed?

Examples:

  • Recommendation service down → return popular items as fallback.
  • Personalisation service slow → render generic feed.
  • Image enrichment down → show placeholder, retry async.
  • Auth-info enrichment down → show basic name, fetch later.

The discipline: every dependent feature has a fallback. Design with "what if this is down?" in mind.
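
A fallback can be as small as one wrapper. In this sketch, with_fallback, POPULAR_ITEMS, and fetch_recommendations are hypothetical names; the returned flag lets the caller record that the response was degraded.

```python
import logging

log = logging.getLogger("degrade")

POPULAR_ITEMS = ["item-1", "item-2", "item-3"]  # hypothetical precomputed list


def with_fallback(primary, fallback):
    """Return (value, degraded). Falls back if primary() raises."""
    try:
        return primary(), False
    except Exception as exc:
        # Never swallow silently: log it so the degradation is visible.
        log.warning("primary failed, serving fallback: %s", exc)
        return fallback(), True


def fetch_recommendations():
    raise ConnectionError("recommendation service down")  # simulated outage


items, degraded = with_fallback(fetch_recommendations, lambda: POPULAR_ITEMS)
print(items, degraded)
```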

8. Hedged requests

For low-latency systems with replicas:

  • Send the request to replica A.
  • If no response arrives within the 99th-percentile latency, also send to replica B.
  • Use whichever responds first.

Trades extra load for tail-latency reduction: roughly 1% more requests with a p99 trigger, up to ~2× if both replicas are always queried. Great for read-heavy KV stores.
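
The three steps can be sketched with asyncio. The replica names, simulated latencies, and the hedged helper are assumptions; a real version would also decide what to do when the first response to arrive is an error.

```python
import asyncio

LATENCY = {"replica-a": 1.0, "replica-b": 0.01}  # hypothetical, in seconds


async def fake_read(replica):
    await asyncio.sleep(LATENCY[replica])
    return f"value from {replica}"


async def hedged(call, replicas, hedge_after_s):
    """Ask replicas[0]; if it is slow, also ask replicas[1] and take the winner."""
    first = asyncio.create_task(call(replicas[0]))
    done, _ = await asyncio.wait({first}, timeout=hedge_after_s)
    if done:
        return first.result()  # common case: no hedge needed
    second = asyncio.create_task(call(replicas[1]))
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop paying for the slower request
    return done.pop().result()


winner = asyncio.run(hedged(fake_read, ["replica-a", "replica-b"], hedge_after_s=0.05))
print(winner)
```

Here replica A is simulated as slow, so the hedge fires after 50 ms and replica B's answer is used.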

9. Timeouts and deadline propagation

When service A calls service B which calls service C:

  • A sets 500 ms timeout for B.
  • B should call C with at most 400 ms (saving 100 ms for B's own work).
  • C must respect that 400 ms.

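gRPC sends the deadline with each call, and in some languages (for example Go, when the incoming context is passed along) it carries over to downstream calls automatically. Plain HTTP has no standard for this, so it needs a custom header (for example X-Deadline) plus client logic. Most teams forget, and each hop then waits out its full local timeout on requests the original caller has already abandoned.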
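
The A → B → C budget arithmetic can be sketched as a small helper. The Deadline class is hypothetical, and treating the X-Deadline value as "remaining milliseconds" is an assumption; services would need to agree on the header's meaning.

```python
import time


class Deadline:
    """Deadline for one request; each hop derives its own timeout from it."""

    def __init__(self, expires_at, clock=time.monotonic):
        self.expires_at = expires_at
        self.clock = clock

    @classmethod
    def after(cls, seconds, clock=time.monotonic):
        return cls(clock() + seconds, clock)

    def remaining(self):
        return max(0.0, self.expires_at - self.clock())

    def timeout_for_downstream(self, reserve_s):
        """Budget for the next hop after keeping reserve_s for our own work."""
        budget = self.remaining() - reserve_s
        if budget <= 0:
            raise TimeoutError("deadline exhausted; skip the downstream call")
        return budget


# B received a 500 ms budget from A and keeps 100 ms for its own work.
deadline = Deadline.after(0.5)
timeout_for_c = deadline.timeout_for_downstream(reserve_s=0.1)
# Assumption: the header carries the remaining budget in milliseconds.
headers = {"X-Deadline": str(round(timeout_for_c * 1000))}
print(headers)
```

Because the budget is recomputed from the clock at call time, any time B has already spent is automatically subtracted from what C is given.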

10. Idempotency revisited

Retry-safe operations require:

  • Server stores (idempotency_key, response) for some TTL.
  • Same key in second request → return cached response, not re-execute.
  • 409 Conflict if same key + different body.

Mandatory for retries on POST. (Recap from S31.)
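
The three requirements can be sketched with an in-memory store. This is illustrative only: IdempotencyStore and its method names are hypothetical, and a real service would keep the entries in a shared database or cache with an atomic insert so that concurrent duplicates are also caught.

```python
import hashlib
import time


class IdempotencyConflict(Exception):
    """Maps to HTTP 409: same key reused with a different request body."""


class IdempotencyStore:
    def __init__(self, ttl_s=86400.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._entries = {}  # key -> (expires_at, body_digest, response)

    def execute(self, key, body, handler):
        """Run handler(body) at most once per key within the TTL."""
        digest = hashlib.sha256(body).hexdigest()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > self.clock():
            _, stored_digest, response = entry
            if stored_digest != digest:
                raise IdempotencyConflict(key)
            return response  # replay the cached response, do not re-execute
        response = handler(body)
        self._entries[key] = (self.clock() + self.ttl_s, digest, response)
        return response
```

A retried POST with the same key and body gets the stored response back, so the side effect (a charge, an email, an order) happens once.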

11. Observability hooks

Per-dependency metrics:

  • Call rate.
  • Error rate.
  • Latency p50, p99.
  • Timeout rate.
  • Retry rate.
  • Circuit breaker state.

If you can't see these, you're flying blind. Standard exporters with sane labels make dashboards trivial.
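
As a rough sketch of what gets recorded per dependency, the class below keeps counters and latencies in memory. DependencyMetrics and the sample numbers are hypothetical; a real service would hand these values to its metrics exporter rather than store raw samples.

```python
import math
from collections import defaultdict


class DependencyMetrics:
    def __init__(self):
        self.counts = defaultdict(int)         # (dependency, event) -> count
        self.latencies_ms = defaultdict(list)  # dependency -> samples

    def record(self, dependency, latency_ms, outcome="ok", retries=0):
        self.counts[(dependency, "calls")] += 1
        self.counts[(dependency, "retries")] += retries
        if outcome != "ok":
            self.counts[(dependency, "errors")] += 1
        if outcome == "timeout":
            self.counts[(dependency, "timeouts")] += 1
        self.latencies_ms[dependency].append(latency_ms)

    def rate(self, dependency, event):
        """Events per call, for example the error rate or timeout rate."""
        calls = self.counts[(dependency, "calls")]
        return self.counts[(dependency, event)] / calls if calls else 0.0

    def percentile(self, dependency, p):
        """Nearest-rank percentile of recorded latencies (None if empty)."""
        data = sorted(self.latencies_ms[dependency])
        if not data:
            return None
        rank = max(1, math.ceil(p * len(data) / 100))
        return data[rank - 1]


metrics = DependencyMetrics()
for ms in range(1, 99):  # 98 healthy calls taking 1..98 ms
    metrics.record("vendor-api", ms)
metrics.record("vendor-api", 2000, outcome="timeout", retries=2)
metrics.record("vendor-api", 150, outcome="error")
print(metrics.percentile("vendor-api", 50), metrics.percentile("vendor-api", 99))
```

The p50 looks healthy while the p99 exposes the slow tail, which is why both are on the list.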

12. Anti-patterns

  • Catch and ignore exceptions silently. (Always log and propagate.)
  • Retry every error class.
  • Infinite retry loop.
  • Circuit breaker with too-low threshold (flips on every blip).
  • Timeout configured but ignored by the client library (looking at you, old Apache HttpClient).
  • Bulkheads without monitoring — pool full, no alert.

13. Reality check

A new service's resilience checklist:

  • Timeouts on every outbound call. Document the value.
  • Retry with exponential backoff + jitter for idempotent operations.
  • Circuit breaker around each external dependency.
  • Bulkhead — separate connection pool per dependency.
  • Fallback strategy for each user-visible feature.
  • Deadline propagation for any multi-hop request.
  • Metrics + alerts on each of the above.

Doing all of this is a 1-week investment per service. Skipping it is the difference between "vendor X had an outage" and "all our customers were down for 4 hours".

Video reference

▶︎ Patterns for Resilient Microservices (Adrian Hornsby)

Pick a quiet 30 minutes during this session to actually watch it. Don't multitask.

LeetCode — Design Circular Deque

Post-session checklist

By the end of this session you should be able to:

  • Set timeouts on every outbound call with justified values.
  • Write retry policy with exponential backoff + jitter and bounded attempts.
  • Implement a circuit breaker with CLOSED / OPEN / HALF-OPEN states.
  • Apply bulkheads to isolate one dependency's failures.
  • Design graceful-degradation fallbacks for user-visible features.
  • Solve design-circular-deque, a fixed-capacity double-ended queue of the kind used for bounded buffers in pool implementations.
