Day 15 — Memory Model & Garbage Collection — Heap, GC, Leaks, Profiling
High-throughput services live and die by GC. Knowing the heap layout, GC algorithms and how to read a flame graph is the difference between '99p = 80 ms' and '9…
Every managed runtime gives you the illusion of infinite memory. The bill comes due as pause times, OOMs, and gradually-rising RSS. Understanding the model lets you predict and prevent all three.
🧠 Concept
Why it matters & the mental model.
1. Two object lifecycle models
- Reference counting (CPython, Swift, Obj-C ARC): each object carries a count that is incremented when a reference is created and decremented when one is dropped; the object is freed when the count reaches 0. Pros: deterministic, low pause. Cons: doesn't handle cycles → CPython also runs a cycle collector periodically, and ARC requires the programmer to break cycles with weak refs.
- Tracing GC (JVM, Go, .NET, JavaScript): periodically traverse from GC roots, mark reachable, sweep / compact the rest. Pros: handles cycles, can compact. Cons: stop-the-world pauses (mitigated by generational / concurrent collectors).
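Both models are visible in CPython, so a short Python sketch can show them side by side. The Node class and the two helper functions are illustrative names, not part of any library; the calls used (sys.getrefcount, gc.collect, weakref.ref) are standard library.

```python
import gc
import sys
import weakref


class Node:
    """A tiny object that can point at another Node."""

    def __init__(self, name):
        self.name = name
        self.other = None


def refcount_demo():
    """Return how much the reference count rises when one alias is created."""
    obj = Node("a")
    before = sys.getrefcount(obj)  # absolute value includes a temporary reference
    alias = obj                    # one more reference to the same object
    after = sys.getrefcount(obj)
    del alias                      # dropping the alias decrements the count again
    return after - before


def cycle_demo():
    """Build a two-node cycle, drop it, and let the cycle collector reclaim it."""
    was_enabled = gc.isenabled()
    gc.disable()                   # keep automatic collection out of the way
    try:
        a, b = Node("a"), Node("b")
        a.other, b.other = b, a    # a -> b -> a: counts can never reach zero
        probe = weakref.ref(a)     # a weak reference does not keep `a` alive
        del a, b
        alive_after_del = probe() is not None
        gc.collect()               # tracing pass finds the unreachable cycle
        alive_after_collect = probe() is not None
    finally:
        if was_enabled:
            gc.enable()
    return alive_after_del, alive_after_collect
```

refcount_demo() reports the delta rather than the absolute count, because the absolute number depends on interpreter internals. cycle_demo() shows the cycle surviving `del` and disappearing only after the tracing pass.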
2. Generational hypothesis
Most objects die young. Split the heap into young (eden + survivor) and old generations; collect young frequently (cheap), old rarely (expensive). Most modern tracing GCs (JVM, .NET, V8) use this; Go's collector is a notable exception.
3. JVM collector zoo (high level)
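- Serial / Parallel: stop-the-world collectors. Serial is single-threaded and suits small heaps; Parallel is throughput-optimised. Pauses can reach seconds on large heaps.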
- CMS (deprecated): concurrent old gen, fragmented.
- G1: regional, target pause time, default since JDK 9. Good general choice.
- ZGC / Shenandoah: concurrent compaction, with pauses of around a millisecond or less that stay largely independent of heap size (ZGC scales to TB heaps). Use for latency-critical services.
Tune via -Xms = -Xmx (avoid resizing), -XX:MaxGCPauseMillis, -XX:+UseStringDeduplication.
🛠 Deep Dive
Internals, code, architecture.
4. Go's GC
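Concurrent mark-sweep, non-generational (yet), tri-colour, write barriers. Goal: sub-ms STW pauses. Goroutine stacks grow and shrink on demand. GOGC=100 (default) means a new GC cycle starts once the heap has grown 100% over the live heap left by the previous cycle, i.e. roughly doubled.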
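The GOGC rule is plain arithmetic, sketched here in Python for consistency with the other examples. This is a simplification: it ignores the GC roots (stacks and globals) that recent Go versions add to the formula, and it ignores any memory limit. The function name next_gc_target is hypothetical.

```python
def next_gc_target(live_heap_mib, gogc=100):
    """Approximate heap size (MiB) at which the next Go GC cycle starts.

    Simplified model: target = live heap * (1 + GOGC / 100).
    """
    if gogc < 0:
        raise ValueError("gogc must be non-negative in this sketch")
    return live_heap_mib * (1 + gogc / 100)
```

With 200 MiB live after the last cycle, the default GOGC=100 gives a target of 400 MiB. Lowering GOGC to 50 brings the target down to 300 MiB, trading more frequent collection for a smaller heap.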
5. .NET GC
Generational, server vs workstation modes. Server GC uses one heap per core, parallel collection — much higher throughput on multicore. Background GC does concurrent collection of gen 2.
6. Python (CPython)
Reference counting + cycle collector. No heap compaction (objects don't move). Small objects are served from pymalloc's pooled arenas. Freed memory is often returned to those pools rather than to the OS → RSS doesn't shrink even after del. A common workaround is process recycling (e.g. gunicorn max_requests).
7. Common leaks
- Global / module-level caches with no eviction.
- Long-lived listeners holding references to short-lived objects (event bus, signals).
- Closures capturing big locals (especially in async).
- Connection pools without size limit.
- Logging frameworks buffering records.
- In JVM: ThreadLocals in pooled threads; classloader leaks in app servers.
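The first leak on the list, a cache with no eviction, has a simple fix: give the cache a size bound. Below is a minimal LRU sketch in Python. BoundedCache is a hypothetical class written for illustration; for memoising function calls, the standard library's functools.lru_cache with a maxsize does the same job.

```python
from collections import OrderedDict


class BoundedCache:
    """A least-recently-used cache that never holds more than max_entries items."""

    def __init__(self, max_entries):
        if max_entries < 1:
            raise ValueError("max_entries must be at least 1")
        self._max = max_entries
        self._data = OrderedDict()

    def get(self, key, default=None):
        if key not in self._data:
            return default
        self._data.move_to_end(key)         # mark as most recently used
        return self._data[key]

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        while len(self._data) > self._max:
            self._data.popitem(last=False)  # evict the least recently used entry

    def __len__(self):
        return len(self._data)
```

With the bound in place, memory use plateaus at roughly max_entries times the average entry size instead of growing with traffic.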
🚀 In Practice
Trade-offs, exercises, what to ship today.
8. Profiling toolbox
- py-spy / austin / scalene for Python: sampling profilers with low overhead; py-spy and austin can attach to live processes.
- async-profiler for JVM: CPU + alloc + lock profiling, flame graph output.
- pprof for Go: built-in, go tool pprof.
- dotnet-trace / dotnet-counters for .NET.
- eBPF (bcc, bpftrace) for OS-level allocs, page faults, syscall flame graphs.
9. Reading a flame graph
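Width = share of samples in which a frame was on the stack, i.e. time spent in that function and its callees; the left-to-right order carries no time meaning. Look for wide flat plateaus at the top (true hot path: the frame itself is burning the time) and unexpected ancestors (a string formatting call eating 30% under your "fast" function). Diff two flame graphs to see what changed across deploys.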
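To make "width" concrete, here is a small Python sketch that turns raw stack samples into the semicolon-joined folded format used by flame graph tooling and computes two widths for a frame. The sample data and function names are made up for illustration.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse root-first stack tuples into 'a;b;c' -> sample count."""
    return Counter(";".join(stack) for stack in samples)


def inclusive_share(samples, frame):
    """Fraction of samples with frame anywhere on the stack (its drawn width)."""
    if not samples:
        return 0.0
    return sum(1 for stack in samples if frame in stack) / len(samples)


def self_share(samples, frame):
    """Fraction of samples with frame on top of the stack (its plateau width)."""
    if not samples:
        return 0.0
    return sum(1 for stack in samples if stack[-1] == frame) / len(samples)


# Hypothetical samples: 10 stacks captured by a sampling profiler.
SAMPLES = (
    [("main", "handle", "format_str")] * 3
    + [("main", "handle", "query")] * 6
    + [("main", "idle")] * 1
)
```

Here handle is wide (it is on the stack in 9 of 10 samples) but has no plateau of its own. The time is actually spent in its callees format_str and query, which is where optimisation effort belongs.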
10. Practical tips
- Pre-size collections (list(int), dict, StringBuilder).
- Pool large objects (buffers, regex, JSON encoders).
- Avoid object churn in hot loops; reuse instances or use array / numpy for primitives.
- Stream, don't slurp: line-by-line iteration beats f.read().
- For services: profile first deploy under load, set heap = 1.5× steady state, monitor gc_pause_p99.
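The "stream, don't slurp" tip can be sketched in Python as follows. Iterating over the file object keeps roughly one line in memory at a time, whereas f.read() materialises the whole file as a single string. The function name and the needle parameter are illustrative.

```python
def count_matching_lines(path, needle):
    """Count lines containing needle without loading the whole file."""
    count = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:  # the file object yields one line at a time
            if needle in line:
                count += 1
    return count
```

Peak memory for this loop depends on the longest line rather than on the file size, which matters once log or export files outgrow the heap budget.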
11. What to take away
"How would you debug a memory leak in production?" Strong answers name the profiler, the heap dump tool, the difference between RSS / committed / used, and the steady-state-after-warmup question. Bonus: distinguish leak from caching working as designed.
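One way to frame the steady-state-after-warmup question is to ask whether memory still trends upward once warmup samples are discarded. The Python sketch below fits a least-squares slope to the remaining samples. The function name, the sampling interval and any threshold for what counts as "flat" are assumptions to calibrate per service.

```python
def slope_after_warmup(rss_mib, warmup):
    """Least-squares slope (MiB per sample) of the samples after warmup.

    rss_mib: memory readings taken at a fixed interval.
    warmup:  number of leading samples to ignore.
    """
    ys = rss_mib[warmup:]
    n = len(ys)
    if n < 2:
        raise ValueError("need at least two post-warmup samples")
    mean_x = (n - 1) / 2
    mean_y = sum(ys) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(ys))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den
```

A cache filling up as designed tends to show a slope near zero after warmup, while a leak keeps a positive slope for as long as traffic continues. A short observation window can mislead in either direction, so treat the number as a prompt to take a heap dump, not as a verdict.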
Resources
- 🎥 JVM Garbage Collection — Aleksey Shipilev
- 📖 CPython object model — Python docs
- 📖 .NET GC fundamentals
- 📖 Brendan Gregg — Flame graphs
Practice Problem: Trapping Rain Water (Hard)