Overall Engineering Clarity — Data, Distributed Systems and AI
This is a single, long, primary study companion. Read it cover-to-cover the first time, then jump back into a section before you need to discuss it. Every section follows the same shape:
mental model → flow / diagram → internals → decision tables → code → Q&A with reasoning.
How to read each concept — the "Why → How → Trade-off" framework
Most engineers can recite what a thing is. The clarity gap shows up when you ask why it exists and what breaks. Use this template for every concept:
What it is → one sentence definition
Why it exists → the problem before this solution
How it works → internals / mechanism
When to use → correct fit
When NOT to use → anti-patterns
Trade-offs → latency, cost, consistency, complexity
Failure modes → what bites you in production
Concrete example → code or architecture you can point at
Intuition shortcut. Every technology in this post solves one of three pains:
- One machine isn't enough → distribution, partitioning, replication.
- Cheap object storage isn't transactional → lakehouse formats, commit protocols.
- Models don't know your private/fresh data, and they hallucinate → embeddings, retrieval, agents, evals.
One machine isn't enough → Spark · sharding · replication · consensus
Object storage isn't transactional → Iceberg · Delta · Hudi
LLM doesn't know your data and can hallucinate → embeddings · RAG · agents · evals
All three paths converge on → a reliable, scalable, governed data + AI platform
Default for every answer: problem first, mechanism second, trade-off third, failure mode fourth, example fifth.
Part I — Distributed Compute
1. Apache Spark — Internals That Actually Matter
1.1 Mental model
Spark is a distributed query engine. The single sentence to remember:
Spark performance is mostly about where data moves, not how much CPU you have.
The bottleneck is almost always shuffle — moving rows across the network so that all rows with the same key meet on the same reducer.
1.2 Execution flow — what really happens when you call an action
Driver (SparkContext) → DAG Scheduler (splits at shuffle boundaries)
  → Task Scheduler (1 task per partition per stage)
  → Cluster Manager (YARN / K8s / Standalone)
  → Executors 1..N (JVM + Python worker)
  → Shuffle Service (sort-based / push-based)
  → Reducer tasks (next stage) → Action result
Layers of work:
- Application → Job (per action) → Stage (split by shuffle) → Task (per partition).
- DAG Scheduler decides stages; Task Scheduler dispatches tasks; BlockManager stores intermediate data on each executor.
- Narrow transformations (map, filter, select, mapPartitions, union) keep partitions local.
- Wide transformations (groupBy, join, distinct, repartition) cause a shuffle.
- Catalyst rewrites your logical plan; Tungsten generates byte-efficient Java code; AQE (Adaptive Query Execution) flips plans at runtime once it sees actual partition sizes.
1.3 Catalyst — the optimizer pipeline
DataFrame / SQL → Parsed Logical Plan → Analyzed Plan (resolve cols, fns, types)
  → Optimized Logical Plan (rule-based) → Physical Plans (many candidates)
  → Cost-Based Pick (uses ANALYZE stats) → Whole-stage Codegen (Tungsten)
  → Execute on executors
Key rules Catalyst applies (you should be able to name them):
| Rule | What it does |
|---|---|
| Predicate pushdown | Push WHERE filters into the file scan (Parquet stats / partition pruning). |
| Projection pruning | Read only the columns the query references. |
| Constant folding | Evaluate constant expressions at plan time. |
| Boolean simplification | (x AND TRUE) → x; collapse always-false branches. |
| Join reordering | Use stats to pick the cheapest join order (CBO). |
| Reorder filters & projections | Push them as low as possible in the tree. |
1.4 Tungsten — what makes Spark fast on the CPU
- Whole-stage codegen collapses an entire operator tree (filter → project → aggregate) into a single Java function. JVM JIT can inline and vectorize that.
- Off-heap binary format avoids JVM object overhead and GC churn.
- Cache-friendly columnar processing keeps L1/L2 hot.
UDFs hurt because Catalyst can't see inside opaque Python/Scala lambdas, so codegen is broken and predicate pushdown is blocked.
1.5 AQE — Adaptive Query Execution
Three runtime optimizations that fire during query execution (default on since 3.2):
- Coalesce shuffle partitions — small post-shuffle partitions are merged so reducers aren't underutilized.
- Skew join handling — a heavily skewed partition is split into sub-partitions and the other side is replicated.
- Switch join strategy — if a planned sort-merge join's input turns out small enough at runtime, AQE flips it to broadcast hash.
1.6 Why shuffle is expensive — the full picture
Map tasks (partition by hash(key)) → local sort + spill to disk
  → network fetch (reducer pulls, or push-based shuffle)
  → reducers (merge + aggregate) → sort if needed (sort-merge join)
Costs that pile up:
- Serialization (Kryo/Java) on every record.
- Disk spill if shuffle data exceeds memory.
- Network I/O between executors.
- Sorting (sort-based shuffle is the default).
- Reducer pressure — one slow reducer becomes a straggler.
Push-based shuffle (3.2+): map outputs are pushed to the reducer's host and merged before the reduce stage starts. Big throughput win on K8s where no external shuffle service runs.
1.7 Memory model per executor
spark.executor.memory (heap)
├── reserved (300 MB)
├── user memory (40 %) ← your local objects
└── unified region (60 %) ← splits dynamically
├── execution (sort, join, agg, shuffle)
└── storage (cache, broadcast)
spark.executor.memoryOverhead ← off-heap, JVM overhead, Python workers
OOMs almost always mean: too few partitions, skewed partitions, oversized cache, or memoryOverhead too small for PySpark workers / native libs.
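To make the split concrete, here is a small worked calculation of the layout above. It is a sketch: the fractions mirror the defaults shown (300 MB reserved, 60 % unified), while Spark's exact accounting has more knobs.

```python
def executor_memory_breakdown(heap_mb: int,
                              reserved_mb: int = 300,
                              unified_fraction: float = 0.6) -> dict:
    """Approximate the unified-memory layout sketched above.

    reserved comes off the top; the remainder splits into user memory
    (1 - unified_fraction) and the unified execution/storage region.
    Illustrative only, not Spark's exact accounting.
    """
    usable = heap_mb - reserved_mb
    unified = usable * unified_fraction
    user = usable - unified
    return {"reserved": reserved_mb, "user": round(user), "unified": round(unified)}

breakdown = executor_memory_breakdown(8 * 1024)  # an 8 GB heap
```

For an 8 GB heap this leaves roughly 4.7 GB for execution + storage, which is why "8 GB executors" run out of room faster than people expect.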
1.8 The five things that make a Spark job slow
| Slowdown | Fix |
|---|---|
| Reading too much data | Predicate pushdown, column pruning, partition design |
| Huge shuffle | Broadcast the small side, bucketing, AQE |
| Skew | Skew salting, AQE skew join |
| Memory pressure / spill | Increase partitions, tune memoryOverhead, persist DISK_SER |
| Small files | Repartition before write, compact, target 128–512 MB |
1.9 Spark UI — your diagnostic flow
Always inspect in this order:
- SQL tab — physical plan + actual row counts. Spot accidental cartesians, wrong join strategy, missing pushdown.
- Stages tab — input size, shuffle read/write, spill, task duration distribution. Median vs max ratio reveals skew.
- Executors tab — GC time (>10% of task time → memory pressure), failed tasks, memoryUsed.
- Storage tab — cache hit ratio, blocks evicted.
- Environment tab — confirm config actually loaded what you set.
1.10 Join strategy cheat sheet
| Join | When chosen | Cost | Failure mode |
|---|---|---|---|
| Broadcast Hash | small side < spark.sql.autoBroadcastJoinThreshold (10 MB default) | no shuffle on big side; small side held in RAM on every executor | driver/executor OOM if the "small" side is actually large |
| Shuffle Hash | medium tables, AQE may pick | both sides shuffled, hash table on smaller side | OOM if hash table too big |
| Sort-Merge | both large, equi-join | shuffle + sort each side | sort spill, skew |
| Broadcast Nested Loop | non-equi join, small side | quadratic | catastrophic on big input |
| Cartesian | last resort, no condition | \|A\| × \|B\| rows | catastrophic on big inputs |
1.11 Skew — recognize and fix
Symptoms (Spark UI): one task at 30 min while peers finish in 30 s; one partition's shuffle read is 50× the median; max task time ≫ median.
Why it happens: Hash partitioning sends every row with the same key to the same reducer. One hot key = one overloaded reducer.
Fixes, in order of cheapness:
- Turn on AQE skew join (spark.sql.adaptive.skewJoin.enabled=true).
- Pre-aggregate before the join (map-side combine).
- Salt the hot key.
Salting code (manual, when AQE isn't enough):
from pyspark.sql import functions as F
N = 32 # salt fan-out
big = df_big.withColumn("salt", (F.rand() * N).cast("int"))
small = (df_small
.crossJoin(spark.range(N).withColumnRenamed("id", "salt")))
joined = (big.join(small, ["key", "salt"])
.drop("salt"))
The salt scatters the hot key across N reducers. The small side is replicated N times so the join still produces every original row.
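The mechanics can be sanity-checked without a cluster. In this toy model (plain Python, not PySpark; crc32 stands in for Spark's hash partitioner) the hot key pins to one reducer until salting spreads it:

```python
from collections import Counter
from zlib import crc32

N_REDUCERS = 8
N_SALT = 32

def reducer_for(key) -> int:
    # Stand-in for a hash partitioner: deterministic hash mod reducer count.
    return crc32(repr(key).encode()) % N_REDUCERS

rows = [("hot_key", i) for i in range(10_000)]  # one skewed key

# Without salting: every row of the hot key lands on a single reducer.
plain = Counter(reducer_for(k) for k, _ in rows)

# With salting: partition by (key, salt) so the rows fan out.
salted = Counter(reducer_for((k, i % N_SALT)) for k, i in rows)
```

`plain` has exactly one bucket holding all 10,000 rows; `salted` spreads them across reducers, which is the whole point of the trick.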
1.12 Coding patterns you should know cold
Top-K per group with a window:
from pyspark.sql import functions as F, Window
w = Window.partitionBy("category").orderBy(F.col("score").desc())
top3 = (df.withColumn("rk", F.row_number().over(w))
.where("rk <= 3")
.drop("rk"))
Dedupe — keep the latest record per key:
w = Window.partitionBy("id").orderBy(F.col("event_ts").desc())
latest = (df.withColumn("rk", F.row_number().over(w))
.where("rk = 1")
.drop("rk"))
Sessionization with watermark:
sessions = (events
.withWatermark("ts", "30 minutes")
.groupBy(F.session_window("ts", "15 minutes"), "user_id")
.agg(F.count("*").alias("hits")))
Per-column profile in one pass (governance use case):
agg_exprs = []
for c in df.columns:
agg_exprs += [
F.sum(F.col(c).isNull().cast("int")).alias(f"{c}_nulls"),
F.approx_count_distinct(c).alias(f"{c}_ndv"),
]
profile = df.agg(*agg_exprs) # single shuffle, all columns at once
Idempotent upsert into Iceberg:
MERGE INTO catalog.db.dim_customer t
USING updates u
ON t.cust_id = u.cust_id
WHEN MATCHED AND u.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
1.13 Spark on Kubernetes — what's different
- Each executor is a pod; the driver pod creates them.
- Dynamic allocation + shuffle tracking (3.2+) means you don't need an external shuffle service — executors are kept around if they hold shuffle data still in use.
- Pin executors to nodepools via taints/tolerations; spot pools for cost; non-spot pool for the driver.
1.14 Q&A with reasoning
Q1. Why is reduceByKey better than groupByKey().mapValues(_.sum)?
Because reduceByKey performs a map-side combine before the shuffle. The reducer receives one partial aggregate per partition per key instead of every raw value. Less network, less spill, less reducer memory.
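A toy model of why the combine shrinks shuffle traffic (plain Python, not Spark; partition contents are illustrative):

```python
from collections import defaultdict

def shuffle_records(partitions, map_side_combine: bool) -> int:
    """Count records crossing the network for a sum-by-key.

    groupByKey-style: every raw record is shuffled.
    reduceByKey-style: each partition emits one partial sum per key.
    """
    sent = 0
    for part in partitions:
        if map_side_combine:
            partial = defaultdict(int)
            for key, value in part:
                partial[key] += value       # combine locally first
            sent += len(partial)            # one record per key per partition
        else:
            sent += len(part)               # every raw record travels
    return sent

# 4 partitions, 2 keys, 1000 rows per key per partition.
parts = [[("a", 1)] * 1000 + [("b", 1)] * 1000 for _ in range(4)]
```

Here the combine cuts 8,000 shuffled records down to 8 partial aggregates, and the savings grow with row count.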
Q2. A task takes 30 min while peers finish in 30 s. What's happening?
Classic data skew — one key has many more rows. Spark UI shows max ≫ median task time and one task's shuffle read is huge. Fix with AQE skew join or manual salting; pre-aggregate where possible.
Q3. Why do Python UDFs hurt performance? They serialize each row out of the JVM, ship it to a Python worker, run Python, and bring results back. Catalyst can't see inside the UDF, so codegen and predicate pushdown are blocked. Use SQL functions or Pandas UDFs (Arrow batches; columnar, vectorized) instead.
Q4. cache() vs checkpoint() — when to use which?
cache keeps lineage; if a partition is lost, Spark recomputes from the lineage chain. checkpoint writes to reliable storage and truncates lineage. Use checkpoint to break long iterative chains (graph algorithms, ML loops) where lineage recomputation would be catastrophic on failure.
Q5. repartition vs coalesce?
repartition(N) does a full shuffle to evenly redistribute. coalesce(N) (where N < current partitions) merges existing partitions without shuffle, but produces unbalanced output. repartitionByRange(N, col) partitions by range using sampling — useful before sorted writes.
Q6. PySpark task crashes with Container killed by YARN, exceeding memory limits. Heap is fine. Why?
The Python worker memory and native libs live in memoryOverhead, not heap. Bump spark.executor.memoryOverhead to ~10–20 % of executor memory and avoid loading large objects into the Python worker.
Q7. Why does broadcast join risk OOM? The small side is collected to the driver, then sent to every executor. If "small" is actually 500 MB and you have 200 executors, you've allocated 100 GB of replicated memory and pulled 500 MB through the driver.
Q8. What does df.explain(true) show, and what would you look for?
Parsed → Analyzed → Optimized logical plans + physical plan. You scan for: chosen join type, presence of BroadcastExchange, HashAggregate vs SortAggregate, predicate pushdown into FileScan, partition filters, and AQE rules applied.
2. Spark Tuning — Diagnostic Recipes
These are the recipes you should rehearse aloud.
2.1 Tuning checklist (memorize)
- Right file format: Parquet/ORC with snappy/zstd; target file size 128–512 MB.
- Partitioning on disk: low-cardinality cols (date, region). NOT high-cardinality (user_id).
- Bucketing when you join the same large tables repeatedly on the same key.
- spark.sql.shuffle.partitions: tune so each shuffle partition processes ~128–256 MB.
- AQE on: spark.sql.adaptive.enabled=true, plus coalesce + skew-join.
- Broadcast the small side; hint broadcast(df) if Catalyst doesn't pick it up.
- Avoid UDFs when SQL functions exist; if needed, Pandas UDFs.
- Predicate pushdown: filter before join; let Parquet stats prune.
- Column pruning: select only what you need, early.
- Cache only re-used DataFrames; prefer MEMORY_AND_DISK_SER for large ones.
- Dynamic Partition Pruning for star-schema joins.
- Speculation (spark.speculation=true) catches stragglers.
- spark.executor.memoryOverhead for PySpark / native libs (10–20 %).
- Avoid collect() on big DFs; use toLocalIterator() or write out.
- Z-order / sort within partitions for selective queries (Iceberg/Delta).
2.2 The "this job is slow" interview answer
1. I would not guess. I'd open Spark UI.
2. SQL tab → physical plan, join strategy, scan size, partition filters.
3. Stages tab → shuffle read/write, spill, GC time, max vs median task time.
4. Executors tab → GC %, OOM, memory used, failed tasks.
5. Map symptom to fix:
huge scan → pushdown / partitioning / column pruning
huge shuffle → broadcast / bucket / AQE / skew
stragglers → AQE skew, salting, speculation
GC time high → more partitions, persist DISK_SER, raise memoryOverhead
small files → repartition before write, compaction
6. I'd validate the fix by re-checking the same UI metrics.
2.3 Q&A with reasoning
Q. Why does Dynamic Partition Pruning help star schemas? A fact-dim equi-join with a filter on the dim can be optimized: Spark first runs the dim filter, collects the matching dim keys, and pushes them as a partition filter on the fact scan. The fact scan only reads partitions that survived. Saves enormous I/O on large facts.
Q. Why are small files harmful on object stores? Each open + LIST has fixed overhead, throughput is per-prefix, and metadata grows. A 10 TB table in 1 KB files is unusable. Target 128–512 MB; compact regularly.
Q. How does AQE detect skew at runtime? After each shuffle stage, AQE looks at partition sizes. If one partition is much larger than the median (configurable factor), it splits that partition into sub-partitions and replicates the matching rows from the other side of the join.
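A simplified sketch of that detection logic (the factor and chunking here are illustrative; real AQE also applies an absolute size threshold):

```python
import math
from statistics import median

def split_skewed(partition_sizes, factor=5.0):
    """Split any partition larger than factor x median into ~median-sized
    chunks, as AQE's skew-join handling does conceptually."""
    med = median(partition_sizes)
    out = []
    for size in partition_sizes:
        if med > 0 and size > factor * med:
            n = math.ceil(size / med)       # number of sub-partitions
            base = size // n
            chunks = [base] * n
            chunks[-1] += size - base * n   # remainder goes to the last chunk
            out.append(chunks)
        else:
            out.append([size])
    return out

plan = split_skewed([100, 120, 110, 5000])  # MB per post-shuffle partition
```

The 5000 MB partition gets split into dozens of roughly median-sized sub-partitions, each processed by its own task, while the other side of the join is replicated to match.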
Part II — Storage and Lakehouse
3. Distributed Storage — Why Object Stores Need Lakehouses
Compute (Spark / Trino / Flink) — GET / PUT / LIST → object store (S3 / ADLS / GCS)
Object-store constraints: per-prefix throughput limits (~3.5k PUT, 5.5k GET per second),
no atomic rename across multiple objects, strong consistency now but not multi-object atomicity.
The lakehouse commit protocol exists to fix the last two.
3.1 HDFS vs object stores at a glance
| | HDFS | S3 | ADLS Gen2 | GCS |
|---|---|---|---|---|
| Hierarchy | true | flat | hierarchical (HNS) | flat |
| Atomic rename | yes (cheap) | no (copy + delete) | yes (HNS) | no |
| Append | yes | no | yes | no |
| Multi-region | no | with replication | with GRS | with multi-region |
3.2 Why "no atomic rename" matters
Old Hadoop committers wrote _temporary/.../task_attempt_X and then renamed to the final path on commit. On HDFS this is a metadata flip. On S3 it's O(file_count) × bytes of copy. With speculative execution, two attempts of the same task can both finish; the classic committer can leave both outputs.
That's why we have:
- S3A magic committer / directory committer — uses S3 multipart upload so the final commit is a single atomic operation.
- Lakehouse table formats (Iceberg/Delta/Hudi) — the "commit" is a metadata pointer swap in the catalog, not a file rename.
3.3 File format quick-take
| Format | Compression | Splittable | Schema evolution | Primary use |
|---|---|---|---|---|
| CSV / JSON | poor | partial | sloppy | landing only |
| Avro | good | yes | row-based, schema registry | Kafka topics, streaming |
| Parquet | great (snappy/zstd) | yes | column add/drop, by name | analytical default |
| ORC | great | yes | yes | Hive ecosystem |
3.4 Performance levers
- Avoid small files — compact to 128–512 MB Parquet.
- Partition layout:
tenant=.../date=.../hour=...for time-aligned scans. - Column projection + predicate pushdown with Parquet/ORC stats.
- Z-order / Hilbert for multi-column locality (Iceberg, Delta).
- Caching layer (Alluxio, EMRFS S3 cache) for repeated reads.
4. Lakehouse and Apache Iceberg — Tables on Top of Object Storage
4.1 The problem in one paragraph
Raw Parquet on S3/ADLS/GCS gives cheap storage. It does not give you atomic multi-file writes, safe schema evolution, row-level updates/deletes, time travel/audit, or fast pruning at petabyte scale. Iceberg, Delta, and Hudi all add a metadata layer that turns folders of files into transactional tables.
4.2 Iceberg layout — every level matters
Catalog (atomic CAS pointer) → metadata/v3.metadata.json (current snapshot pointer)
  → manifest list (snap-xxx.avro, with per-manifest partition summary stats)
  → manifests (lists of data files + per-file stats)
  → data files, e.g. data/part=2026-04-30/file-1.parquet
- Catalog holds the only pointer to the current metadata.json — that's where the atomic CAS happens.
- metadata.json records schema (with column IDs), partition spec, sort order, and snapshots.
- Manifest list holds per-manifest partition-summary stats so a planner can prune whole manifests.
- Manifests list data files plus per-file stats: row counts, null counts, min/max per column.
- Data files are Parquet/ORC.
4.3 Read path — why Iceberg scales beyond Hive
Catalog → metadata.json → manifest list — partition prune → surviving manifests
  — file-level prune (min/max, nulls) → surviving data files
  — column projection + row-group pruning → read
Hive listed directories on S3 (slow, eventually consistent). Iceberg reads a small set of metadata files that already enumerate data files plus stats — most pruning happens before any data file is opened.
4.4 Write path and commit protocol
- Each task writes Parquet to data/.
- On task completion, manifest entries are assembled.
- Driver builds a new metadata.json referencing a new snapshot.
- Driver calls catalog commit(currentSnap, newMetadata) — an atomic CAS at the catalog level.
- On conflict (someone else committed first), the loser rebases its changes onto the new snapshot and retries.
That's optimistic concurrency. It works because the only contention is the single catalog pointer.
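A toy sketch of that commit loop in plain Python (names are illustrative; a real Iceberg catalog implements the CAS server-side, and "metadata" is a full snapshot tree, not a string):

```python
import threading

class Catalog:
    """Toy catalog: one pointer guarded by compare-and-swap."""
    def __init__(self):
        self._lock = threading.Lock()
        self.current = "v0"

    def cas(self, expected: str, new: str) -> bool:
        with self._lock:
            if self.current != expected:
                return False          # another writer committed first
            self.current = new
            return True

def commit(catalog: Catalog, build_metadata, max_retries: int = 5) -> str:
    for _ in range(max_retries):
        seen = catalog.current
        new = build_metadata(seen)    # rebase changes onto the latest snapshot
        if catalog.cas(seen, new):
            return new
    raise RuntimeError("too many concurrent commits")

cat = Catalog()
commit(cat, lambda base: base + "+A")
commit(cat, lambda base: base + "+B")
```

The key property: losers never overwrite a winner's snapshot; they re-read, rebase, and retry against the single pointer.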
4.5 Killer features and why they matter
| Feature | What it gives you |
|---|---|
| Hidden partitioning | Declare PARTITIONED BY (days(ts), bucket(16, user_id)); users query natural columns and Iceberg rewrites predicates. |
| Partition evolution | Change strategy without rewriting old data. |
| Schema evolution by column ID | Renames are safe — identity stays the same. |
| Time travel | FOR VERSION AS OF / FOR TIMESTAMP AS OF for audit and rollback. |
| Row-level ops | Merge-on-read (delete files, position deletes, equality deletes) or copy-on-write. |
| Multi-engine | Spark, Trino, Flink, Snowflake, BigQuery all read the same table via the REST catalog. |
4.6 Iceberg vs Delta vs Hudi
| Need | Best |
|---|---|
| Multi-engine (Spark + Trino + Flink + Snowflake) | Iceberg |
| Tightly coupled with Databricks ecosystem | Delta (UniForm makes it Iceberg-readable too) |
| High-velocity streaming upserts (CDC) | Hudi (or Iceberg + Flink) |
| Time-travel + branch/tag for data science | Iceberg with Nessie or native tags |
Hudi has two flavors:
- Copy-on-Write rewrites whole files on update — fast reads, expensive writes.
- Merge-on-Read appends delete/delta files merged at read time — fast writes, slower reads until compaction.
4.7 Maintenance you cannot skip
| Job | Why |
|---|---|
| expire_snapshots | Bound metadata growth, free old data files. |
| rewrite_data_files | Compact small files into 128–512 MB chunks. |
| rewrite_manifests | Keep manifest count low for fast planning. |
| remove_orphan_files | Clean up files left by aborted writes. |
Skip these and the table eventually slows to a crawl.
4.8 Governance hooks (why this matters for catalogs)
Iceberg metadata is itself queryable:
SELECT * FROM catalog.db.table.snapshots;
SELECT * FROM catalog.db.table.files;
SELECT * FROM catalog.db.table.partitions;
SELECT * FROM catalog.db.table.history;
That powers lineage, audit, retention enforcement, and GDPR purges (DELETE + compact + expire snapshots).
4.9 Q&A with reasoning
Q1. Parquet vs Iceberg — same thing? No. Parquet is a file format (columnar layout inside one file). Iceberg is a table format (which files belong to the table at which snapshot, with what schema). Iceberg tables contain Parquet files.
Q2. Why does Iceberg scale where Hive directory listing doesn't? Hive reads tables by listing directories. Iceberg reads metadata files that already enumerate files plus min/max stats, so most pruning happens before any data file is opened.
Q3. How does Iceberg handle two writers at once? Optimistic concurrency: each writer prepares new metadata referencing the snapshot it read, then performs a single catalog-level CAS. The loser detects a conflict, rebases its changes onto the new snapshot, and retries.
Q4. Equality deletes vs position deletes? Position deletes identify rows by file + row index → cheap at read time (skip rows by offset). Equality deletes identify rows by predicate → require a join at read time. Position deletes are cheaper to read; equality deletes are easier to produce in streaming CDC.
Q5. How would you do GDPR delete on a 10 PB table?
1) DELETE FROM table WHERE user_id IN (...) — Iceberg writes equality/position delete files. 2) Run rewrite_data_files to compact, materializing the deletes physically. 3) expire_snapshots to remove old snapshots that still reference deleted rows. 4) remove_orphan_files. After step 3 the deleted data is unrecoverable through time travel.
Part III — Indexes and Search
5. Graph Databases — Why Lineage Is Naturally a Graph
Some questions are fundamentally about relationships: "what depends on this column, 5 hops downstream?" Recursive SQL CTEs become slow because every hop is another join. Graph databases store direct relationship pointers (index-free adjacency), so a hop is a pointer chase, not a join.
5.1 Property graph model for lineage
(Job) —WRITES→ (Dataset orders)
(Dataset orders) —HAS→ (Column total)
(Dataset orders_eu) —HAS→ (Column eu_total)
(Column eu_total) —DERIVED_FROM→ (Column total)
(Column eu_total) —TAGGED_AS→ (Tag PII)
(User Alice) —OWNS→ (Dataset orders)
(Policy mask_eu) —GOVERNS→ (Column eu_total)
5.2 Cypher essentials
// Anchor on an indexed node, bound depth
MATCH (start:Column {id:'sales.amount'})
MATCH (start)-[:DERIVED_FROM*1..5]->(down:Column)
RETURN DISTINCT down.id;
// Temporal lineage — what was upstream on a date?
MATCH (a:Dataset {id:$id})-[r:DERIVED_FROM]->(b)
WHERE r.valid_from <= datetime($asOf) <= coalesce(r.valid_to, datetime())
RETURN b;
// Shortest path between two assets
MATCH p = shortestPath(
(a:Dataset {id:'src'})-[*]->(b:Dataset {id:'dst'}))
RETURN p;
5.3 Things that explode at scale
- Unanchored traversals — start from indexed nodes, always.
- Supernodes — a Tag connected to millions of columns will blow up any traversal that touches it. Split by tenant/domain/time, or filter relationship types early.
- Unbounded depth — always cap with *1..N.
5.4 Q&A with reasoning
Q. Why is index-free adjacency fast? Once the start node is found via index, each hop follows stored relationship pointers. There's no "look up neighbors by foreign key" join cost like in relational databases.
Q. How do you model temporal lineage? Put valid_from and valid_to on relationships, or version snapshots. Then "what was upstream on 2026-04-30?" becomes a predicate filter.
Q. When is a graph DB the wrong choice? When the workload is mostly aggregations, transactional records, or wide tabular reporting. Use relational/Iceberg for source of truth and graph as a traversal index that can be rebuilt from the source of truth.
Q. What is a supernode and why is it bad? A node with extremely high degree (tag connected to millions of columns). Traversals through it explode. Mitigations: filter by relationship type at the supernode, partition the supernode by tenant/domain, or skip it in path expansions.
6. Elasticsearch — Inverted Index, Discovery, Hybrid Search
6.1 Why Elasticsearch exists
Full-text discovery needs ranking, tokenization, typo tolerance, filters, facets, and low latency across millions or billions of documents. SQL LIKE '%text%' scans too much and doesn't understand language.
6.2 Architecture
Shard (primary, a Lucene index) → Lucene segments 1..N (immutable)
  segments — merge → fewer, larger segments
Shard (primary) — replicate → shard (replica)
- A shard is a Lucene index. Segments are immutable inverted-index files; refresh creates new segments; merge consolidates.
- Primary shards scale indexing throughput; replicas scale read throughput and provide HA.
6.3 Inverted index + scoring
"customer" → [(d1, tf=3, pos=[1,5,10]), (d4, tf=1, pos=[2]), ...]
- BM25 scoring (default since 5.x).
- Analyzer = tokenizer + filters (lowercase, stemmers, synonyms).
- Mapping: text (analyzed for full-text) vs keyword (exact for filters/aggregations/sort).
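A toy inverted index in the posting-list shape shown above. This is a sketch: the "analyzer" is just lowercase + whitespace split, where a real Elasticsearch analyzer also strips punctuation, stems, and applies synonyms.

```python
from collections import defaultdict

def build_inverted_index(docs: dict) -> dict:
    """Map term -> [(doc_id, tf, positions)], mirroring the posting-list
    format above. Analyzer = lowercase + split, nothing more."""
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, token in enumerate(text.lower().split()):
            entry = postings[token].setdefault(doc_id, [0, []])
            entry[0] += 1               # term frequency
            entry[1].append(pos)        # token position
    return {term: [(d, tf, p) for d, (tf, p) in per_doc.items()]
            for term, per_doc in postings.items()}

idx = build_inverted_index({
    "d1": "customer orders and customer returns",
    "d2": "orders shipped",
})
```

Looking up "customer" returns the posting (d1, tf=2, pos=[0, 3]) directly, which is why term lookups are fast: the work was done at index time.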
6.4 Document lifecycle
- Index → translog + in-memory buffer.
- Refresh (default 1 s) → new segment becomes searchable.
- Flush → fsync segments + clear translog.
- Merge → fewer, larger segments; run force_merge on read-only indexes.
6.5 Query DSL — what to actually write
{
"query": {
"bool": {
"must": [{"match": {"name": "customer"}}],
"should": [{"match": {"description": "PII"}}],
"filter": [{"term": {"tenant": "acme"}}, {"term": {"sensitivity": "high"}}],
"must_not": [{"term": {"deleted": true}}]
}
},
"aggs": {"by_tag": {"terms": {"field": "tags.keyword"}}}
}
- filter doesn't score and is cacheable — put exact-match conditions there.
- match is analyzed; term is exact (use only on keyword fields).
- Multi-match for cross-field; phrase with slop; fuzzy for typo tolerance; k-NN for vector search (8.x+).
6.6 Capacity planning rules
- Shard size sweet spot 10–50 GB. Too small → overhead. Too large → slow recovery, slow merges.
- Heap ≤ 32 GB (compressed oops boundary); the rest of RAM should be OS file cache.
- Index lifecycle (ILM) — hot/warm/cold/frozen tiers.
6.7 Operational pitfalls
- Mapping explosion — dynamic mapping on user JSON creating thousands of fields. Use dynamic: strict or dynamic: false.
- Deep pagination (from + size > 10k) — use search_after with PIT.
- Refresh interval of 1 s is wasteful for bulk loads — set to -1, then restore.
- Hot shards — node-level imbalance; use shard allocation awareness.
6.8 Q&A with reasoning
Q. Why limit JVM heap to 32 GB? Above that, JVM stops using compressed object pointers and you actually lose memory savings. Plus most ES performance comes from the OS file cache anyway.
Q. match vs term? match analyzes the query text (lowercases, stems) and searches the analyzed field; term is exact byte match against the indexed token. Use term on keyword fields, not on text.
Q. What is hybrid retrieval? BM25 (keyword) + dense vector (semantic), fused with Reciprocal Rank Fusion (RRF). BM25 catches exact terms; dense catches paraphrases. Together they're more robust than either alone.
Part IV — AI / ML
7. LLMs — What's Actually Inside the Box
7.1 The honest definition
An LLM is a neural network trained to predict the next token given previous tokens. Everything that looks like reasoning, summarization, or chat emerges from next-token prediction plus instruction tuning and alignment. It is not a database. It compresses patterns into weights — that's why it can hallucinate.
7.2 Transformer architecture
Tokens → embedding + positional encoding (RoPE / ALiBi)
  → layer 1: multi-head self-attention + FFN + norm (RMSNorm)
  → layers 2..N → output projection → softmax → next token
  → append and repeat (each layer caches K, V in the KV cache)
- Self-attention: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. Cost is O(n²) in sequence length.
- Multi-head: project into h heads, run attention in parallel, concatenate.
- Decoder-only (GPT, Llama, Mistral, Claude) — generation. Encoder-only (BERT) — embeddings/classification. Encoder-decoder (T5) — translation/summarization.
- Modern: RMSNorm, SwiGLU activation, RoPE position encoding, GQA (grouped-query attention) to shrink the KV cache.
7.3 Why long context is expensive — and what saves it
- Attention is O(n²) in sequence length, so a 128k-token prompt costs ~16k× (128²) more than a 1k prompt for the attention matrix alone.
- The KV cache stores past keys/values so generation step t doesn't recompute attention over the t−1 previous tokens — turning each autoregressive step from O(n²) to O(n).
- FlashAttention is an IO-aware kernel that tiles attention to keep working sets in SRAM and avoids materializing the full attention matrix.
- RoPE scaling / NTK / YaRN extend a model's effective context length without retraining from scratch.
- Mixture-of-Experts (MoE): only a subset of expert FFNs activate per token (e.g., 2 of 8). Throughput per parameter improves dramatically.
7.4 Embeddings and ANN search
An embedding maps text to a dense vector (typically 768–3072 dimensions). Similarity is cosine (after normalization, equivalent to dot product). Searching billions of vectors fast uses approximate indexes:
| Index | Idea | Notes |
|---|---|---|
| HNSW | hierarchical navigable small-world graph | default in pgvector, FAISS, Elastic, Weaviate, Pinecone |
| IVF-PQ | coarse cluster + product quantization | memory-efficient |
| ScaNN | partition + quantization | Google's |
The trade-off triangle: recall ↔ latency ↔ memory. Tune efSearch/nprobe to move along that triangle.
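A quick plain-Python sanity check of the normalization point above: after unit-normalization, dot product equals cosine similarity.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v))        # L2 norm
    return [x / norm for x in v]

def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0], [6.0, 8.0]          # same direction, different magnitudes
```

dot(normalize(a), normalize(b)) and cosine(a, b) agree, and both equal 1.0 here because the vectors point the same way; raw dot(a, b) = 50 would mislead you into thinking b "matches more" just because it is longer.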
7.5 Inference optimizations
- Quantization — INT8 / INT4 (GPTQ, AWQ); 2–4× speed/memory at small quality loss.
- Speculative decoding — a small draft model proposes tokens; the big model verifies in parallel.
- Continuous batching (vLLM PagedAttention) — 5–20× throughput improvement.
- Distillation — train a smaller student from a larger teacher.
7.6 Fine-tuning vs RAG
| Goal | Tool |
|---|---|
| Inject up-to-date private knowledge | RAG |
| Change tone, format, domain style | LoRA / QLoRA |
| Teach a brand-new skill where prompting fails | full SFT |
| Align with human preferences | RLHF / DPO |
Default: start with RAG. Add LoRA only if RAG can't fix it. Full SFT is rarely worth it for enterprise apps.
7.7 Q&A with reasoning
Q. Why is attention O(n²)? Every token computes a similarity score against every other token: n queries × n keys = n² pairs.
Q. Why does the KV cache help? During autoregressive generation, attention only changes for the new token. Caching K and V for prior tokens makes each step O(n) instead of O(n²).
Q. What does FlashAttention break? The memory wall — it avoids writing the full attention matrix to HBM, instead streaming tiles through fast SRAM. Same math, much less DRAM traffic.
Q. Why do tokenizers matter for cost? Pricing and context limits are per token. A column name like customer_account_id_v2 may tokenize into 5–7 tokens; rare strings cost more.
Q. Why does normalization matter for cosine vs dot product? If vectors aren't unit-length, dot product favors longer vectors regardless of direction. Normalize first; then cosine ≡ dot.
Q. RLHF vs DPO? Both align the model to preferences over candidate pairs. RLHF trains a reward model and uses PPO. DPO skips the reward model and directly optimizes a preference loss — simpler, often more stable, comparable quality.
8. RAG and Agentic Systems — Models That Reach for Data and Tools
8.1 Vanilla RAG pipeline
User query → query rewrite (HyDE expansion) → hybrid retrieval (BM25 + dense + tenant filter) → reranker (cross-encoder) → prompt assembly (system + chunks + query) → LLM → answer + citations. Retrieval reads from a vector index + keyword index, populated by chunking and embedding the source docs and their metadata.
8.2 Quality levers, in order of impact
- Chunking — semantic, with 10–20 % overlap, with metadata (source, owner, tenant, sensitivity, timestamp). For tabular metadata, chunk = a dataset card with name, description, key columns, owner, tags.
- Hybrid retrieval — BM25 + dense, fused via Reciprocal Rank Fusion.
- Query rewriting — HyDE (have the LLM imagine an answer, embed it, retrieve real docs by similarity), multi-query, decomposition.
- Reranker — cross-encoder (bge-reranker, Cohere rerank) over top-50 → top-K.
- Pre-retrieval metadata filters — tenant, time, sensitivity. Always before retrieval, not after.
- Prompt — citations format, structured output via JSON schema or function calling.
- Caching answers by (tenant, query_hash, content_hash) with stale-while-revalidate.
8.3 Why hybrid retrieval beats either alone
- Dense embeddings nail semantic paraphrase ("client identifier" → customer_id).
- BM25 nails exact identifiers, code symbols, rare terms (PII_CUSTOMER_EMAIL_V2).
- RRF is parameter-free and robust:
score(d) = Σ over rankers 1 / (k + rank_r(d))   # k often 60
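The RRF formula is short enough to implement inline (a minimal sketch; ranking lists and doc ids are made up):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion over several ranked lists of doc ids.

    score(d) = sum over rankers of 1 / (k + rank_r(d)), rank starting at 1;
    docs absent from a ranker simply contribute nothing.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# BM25 and dense disagree; RRF promotes the doc both rankers like:
bm25  = ["d1", "d2", "d3"]
dense = ["d2", "d4", "d1"]
fused = rrf([bm25, dense])
assert fused[0] == "d2"   # ranked 2nd and 1st beats ranked 1st and 3rd
```

Note there is nothing to tune except k, which is why RRF is the default fusion choice.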
8.4 Evaluation (treat it as a test suite)
| Layer | Metrics |
|---|---|
| Retrieval | recall@k, precision@k, MRR, NDCG |
| Generation | faithfulness, answer relevance, context precision/recall |
| Agent | task success, tool error rate, iterations, cost per task |
Tools: Ragas, TruLens, DeepEval, Langfuse. Use LLM-as-judge with a calibration set, plus periodic human review.
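The retrieval-layer metrics are simple enough to compute by hand, which is worth doing before reaching for a framework (a minimal sketch; doc ids are made up):

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant docs that appear in the top-k retrieved."""
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def mrr(retrieved_lists, relevant_sets):
    """Mean reciprocal rank of the first relevant hit per query."""
    total = 0.0
    for retrieved, relevant in zip(retrieved_lists, relevant_sets):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(retrieved_lists)

retrieved = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
assert recall_at_k(retrieved, relevant, 2) == 0.5   # only d1 in top-2
assert mrr([retrieved], [relevant]) == 0.5          # first hit at rank 2
```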
8.5 Agentic loop
Plan → Act (call a tool) → Observe (tool result) → Reflect (done? need more?) → continue: back to Plan; stop: final answer + trace.
Useful agent = LLM + a small set of typed tools + a loop with hard limits:
- max tool calls per task,
- token + dollar budget,
- wall-clock timeout,
- deterministic fallbacks for parts that should never depend on an LLM.
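Those hard limits fit in a few lines of loop code. A minimal sketch (all names are illustrative, not a real framework; wall-clock timeout omitted for brevity, and per-call cost is an assumed flat rate):

```python
def run_agent(llm_step, tools, max_tool_calls=5, budget_dollars=0.05):
    """llm_step(history) returns ("call", tool_name, args) or ("final", answer).
    tools maps names to plain callables; the LLM never executes anything."""
    history, spent, calls = [], 0.0, 0
    while True:
        action = llm_step(history)
        if action[0] == "final":
            return action[1], history
        if calls >= max_tool_calls or spent >= budget_dollars:
            return "budget exceeded — deterministic fallback", history
        _, name, args = action
        result = tools[name](**args)      # your code dispatches the call
        history.append((name, args, result))
        calls += 1
        spent += 0.01                     # assumed flat cost per tool call

# A scripted "LLM" that keeps calling a tool until stopped by the cap:
def scripted(history):
    return ("call", "echo", {"x": len(history)})

answer, trace = run_agent(scripted, {"echo": lambda x: x})
assert answer.startswith("budget exceeded")
assert len(trace) == 5    # the hard cap on tool calls held
```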
8.6 Worked example — autonomous PII classifier
Column sample → regex rules (SSN, email, IBAN) → hit: auto-tagged + rationale. Miss → dictionary (cities, names, codes) → hit: auto-tagged. Miss → ML / NER classifier → high confidence: auto-tagged. Ambiguous → profile_column (Spark sample) → vector_search (similar past columns) → LLM judge (aggregate + rationale + confidence). Confidence ≥ threshold → write_to_catalog (Iceberg + graph); confidence < threshold → human steward review → labeled example → weekly retrain feeding the ML classifier.
Design rules to defend in interview:
- Determinism first, LLM only for ambiguity and rationale.
- Tools are narrow and typed (profile_column, vector_search, propose_tag, write_to_catalog).
- Per-tenant authorization on every tool.
- Hard caps: max 5 tool calls, max $0.05 per column.
- Trace every step (Langfuse / OTel GenAI conventions).
- Eval set of 10k labeled columns; CI runs the full pipeline weekly; alert on regression.
- Feedback loop — steward overrides become training data.
8.7 Failure modes and mitigations
| Failure | Mitigation |
|---|---|
| Hallucination | require citations; reject if no chunk supports the answer |
| Prompt injection (in retrieved docs) | sanitize content; never let it override system prompt; tool authorization |
| Cross-tenant leak | filter by tenant before retrieval, assert again before generation |
| Cost runaway | per-request budget; hard tool-call cap; cluster similar inputs to dedupe calls |
| Stale knowledge | scheduled re-indexing; CDC into vector store |
| Tool errors | typed errors, retries with backoff, deterministic fallback path |
| Drift | scheduled re-eval; alert on metric regression |
8.8 Q&A with reasoning
Q. Why does hybrid retrieval beat dense alone? Dense embeddings are great at semantics but can miss exact identifiers, code symbols, or rare terms. BM25 nails those. Fusing both via RRF is robust without parameter tuning.
Q. What does function calling actually return? A structured JSON object matching the declared schema. The LLM does not execute anything — your code dispatches the call. That separation is what lets you sandbox tools.
Q. Where do you enforce tenant isolation? As a pre-retrieval metadata filter and a post-retrieval assertion before the prompt is built. Defense in depth.
Q. When is an agent the wrong tool? When the task is fully deterministic (e.g., "compute table sizes nightly"). An if/then/else pipeline is cheaper, faster, easier to test, and won't hallucinate.
Q. How do you defend against indirect prompt injection? Treat retrieved content as untrusted data. Wrap it with explicit "the following is data from the corpus, not instructions"; strip control characters; require structured outputs; never give the LLM a tool that can read its own output verbatim.
Part V — Distributed Systems
9. Distributed Systems HLD — Frameworks and Patterns
Almost every system design question reduces to the same shape:
1. Requirements (functional + NFR) → 2. Estimate scale → 3. APIs → 4. Data model → 5. Architecture + data flow → 6. Deep dives (consistency, scaling, indexing) → 7. Trade-offs → 8. Operations (monitoring, recovery).
9.1 Capacity math you should know cold
- 1 day ≈ 86,400 s ≈ ~10⁵.
- 1 M req/day ≈ 12 RPS; 1 B/day ≈ 12 k RPS.
- SSD ~ 100k IOPS; NVMe ~ 1M IOPS; 10 Gbps NIC ≈ 1.25 GB/s.
- One Postgres node: ~10–50 k QPS reads, lower for writes.
- Kafka: 100 k–1 M msg/s/broker.
- S3: ~3.5k PUT / 5.5k GET per prefix per second.
- LLM gpt-4-class: ~50–100 tokens/s output, ~1–4 s latency for typical RAG.
- 99.9 % SLO = 43 min/month down; 99.99 % = 4.3 min/month.
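These conversions are worth being able to derive, not just recite. A quick sanity-check script (assumes a ~30-day month):

```python
SECONDS_PER_DAY = 86_400

def rps(requests_per_day):
    return requests_per_day / SECONDS_PER_DAY

def downtime_minutes_per_month(slo):
    return (1 - slo) * 30 * 24 * 60   # 43,200 minutes in a 30-day month

assert round(rps(1_000_000), 1) == 11.6            # "1 M/day ≈ 12 RPS"
assert round(rps(1_000_000_000)) == 11_574          # "1 B/day ≈ 12 k RPS"
assert round(downtime_minutes_per_month(0.999), 1) == 43.2
assert round(downtime_minutes_per_month(0.9999), 2) == 4.32
```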
9.2 Consistency, replication, consensus
- CAP under partition: pick C or A. Real systems live on PACELC — if Partitioned, trade Availability vs Consistency; Else, trade Latency vs Consistency.
- Quorum: R + W > N for strong reads.
- Replication models:
- Leader-follower (Postgres) — strong, simple, write throughput limited by leader.
- Multi-leader — write anywhere, conflicts must be resolved (CRDT, last-writer-wins, application logic).
- Leaderless (Dynamo / Cassandra) — quorum reads/writes, hinted handoff.
- Consensus — Raft / Paxos for metadata stores (etcd, ZK, Spanner). Need 3 nodes to tolerate 1 failure, 5 to tolerate 2.
- Time — NTP/PTP, hybrid logical clocks, vector clocks, Lamport timestamps, TrueTime (Spanner).
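The quorum and consensus sizing rules above reduce to two one-liners (a sketch for intuition, not a library):

```python
def strong_read(n, r, w):
    """R + W > N guarantees read and write quorums overlap on >= 1 replica."""
    return r + w > n

def consensus_nodes(f):
    """Raft/Paxos need a live majority: 2f + 1 nodes tolerate f failures."""
    return 2 * f + 1

# Classic Dynamo-style configs over N = 3 replicas:
assert strong_read(3, 2, 2)          # overlap guaranteed (overlap = R+W-N = 1)
assert not strong_read(3, 1, 2)      # fast reads, but may miss the latest write
assert consensus_nodes(1) == 3 and consensus_nodes(2) == 5
```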
9.3 Building blocks (2026 default picks)
| Block | Tech |
|---|---|
| Load balancer | Envoy, ALB, GCLB |
| API gateway | Kong, Apigee, AWS API GW |
| Cache | Redis Cluster, Memcached |
| Pub/sub | Kafka, Pulsar, Kinesis, Pub/Sub |
| Queue | SQS, RabbitMQ, Service Bus |
| OLTP DB | Postgres / Aurora / Cloud SQL / Spanner / CockroachDB |
| Wide-column / KV | DynamoDB, Cassandra, ScyllaDB, Bigtable |
| Search | Elasticsearch / OpenSearch |
| Vector | pgvector, Pinecone, Weaviate, Elastic k-NN |
| Lake / Lakehouse | S3 + Iceberg + Trino / Spark |
| DW | Snowflake / BigQuery / Redshift |
| Stream proc | Spark Structured Streaming, Flink |
| Observability | Prom + Grafana + OTel + Loki/Tempo |
| Coordination | etcd, ZK, Consul |
9.4 Patterns you'll reach for
- Outbox pattern — atomic DB write + event emit. Eliminates the dual-write inconsistency.
- Saga — cross-service transactions with compensating actions when 2PC isn't viable.
- Idempotency keys — make at-least-once safe.
- CQRS / event sourcing — separate write model from read model.
- Backpressure + bounded queues.
- Retry + exponential backoff + jitter + circuit breaker + bulkhead.
- Cache stampede mitigation: request coalescing, jittered TTL, stale-while-revalidate.
- Hot key — salt, secondary indexes, replicate.
- Schema migration without downtime — expand → migrate → contract.
- Backfills without overload — rate-limited, idempotent, resumable.
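The retry pattern from the list is mostly about the delay schedule. A minimal sketch of exponential backoff with full jitter (each delay drawn uniformly from the exponential envelope, AWS-style; parameter names are illustrative):

```python
import random

def backoff_delays(attempts, base=0.1, cap=10.0):
    """Delay before each retry: uniform in [0, min(cap, base * 2**attempt)].
    Full jitter de-correlates clients so retries don't arrive in waves."""
    return [random.uniform(0, min(cap, base * 2 ** a)) for a in range(attempts)]

random.seed(0)
delays = backoff_delays(5)
for a, d in enumerate(delays):
    assert 0 <= d <= min(10.0, 0.1 * 2 ** a)   # stays inside the envelope
```

In production you would wrap this in a loop that stops on success, counts attempts, and trips a circuit breaker after repeated failures.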
9.5 The outbox pattern in detail
Business transaction (writes the business row + the outbox row) → relay (reads outbox table) → publish → Kafka topic → Consumer A (idempotent), Consumer B (idempotent).
What it solves: the dual-write problem. If you write to the DB and then publish to Kafka as two separate calls, a crash between them leaves the system inconsistent. Outbox writes the event row in the same DB transaction as the business row. A separate relay reads outbox entries and publishes them; consumers must be idempotent (dedup by event id).
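The whole mechanism fits in a few lines against SQLite (a sketch of the pattern, not production code — table names and payloads are made up):

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE outbox (event_id INTEGER PRIMARY KEY, payload TEXT,
                         published INTEGER DEFAULT 0);
""")

# Business write + event row in ONE transaction — no dual-write window:
with db:
    db.execute("INSERT INTO orders VALUES (1, 99.5)")
    db.execute("INSERT INTO outbox (payload) VALUES ('order_created:1')")

def relay(publish):
    """Polls unpublished rows, publishes, then marks them published.
    Consumers must dedup by event_id (the relay is at-least-once)."""
    rows = db.execute(
        "SELECT event_id, payload FROM outbox WHERE published = 0").fetchall()
    for event_id, payload in rows:
        publish(event_id, payload)
        db.execute("UPDATE outbox SET published = 1 WHERE event_id = ?",
                   (event_id,))
    db.commit()

seen = []
relay(lambda eid, p: seen.append(p))
assert seen == ["order_created:1"]
```

If the process crashes between publish and the UPDATE, the event is re-published on the next poll — which is exactly why consumer-side idempotency is part of the pattern.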
9.6 Worked example — multi-tenant lineage service
Producers (Spark / Airflow / dbt) emit OpenLineage RunEvents → Kafka (topic per tenant) → streaming ingestor (validate, dedupe, enrich) → writes to both Iceberg (raw run events, for audit + replay) and a graph store (Neo4j / Neptune) → query API → UI / agents for upstream/downstream traversal, column-level diff, impact analysis.
Why two stores? Iceberg is the durable source of truth (cheap, replayable, auditable). The graph store is a traversal index that can be rebuilt from Iceberg if it corrupts. Writes go to both; the graph is eventually consistent with Iceberg.
9.7 Common HLD designs to be ready for
- Distributed metadata catalog (Google-scale).
- Lineage service spanning lakehouse + DW + dashboards.
- PII discovery / classification platform.
- Real-time data quality scoring on streams.
- Multi-tenant search service.
- Vector search at scale (multi-tenant RAG).
- Idempotent ingestion service with exactly-once into Iceberg.
- Job orchestrator for Spark DAGs with retries and SLA.
- Distributed rate limiter / API gateway.
- CDC pipeline Postgres → Kafka → Iceberg → DW.
- Feature store for ML.
- Observability platform.
9.8 Q&A with reasoning
Q. What does the outbox pattern solve? The dual-write problem. You want to update a DB row and publish an event. If you do them as two separate calls, a crash between them leaves the system inconsistent. Outbox writes the event row in the same DB transaction, then a relay reads the outbox table and publishes. Consumers must be idempotent.
Q. At-least-once vs exactly-once? At-least-once may deliver duplicates after retry. "Exactly-once" is really effectively once: at-least-once + idempotent consumer (dedup via key) or transactional sink (Kafka transactions, Iceberg/Delta MERGE).
Q. Why use consistent hashing? When a node is added or removed, only about K/N keys move (K total keys, N nodes), instead of remapping nearly everything. Virtual nodes smooth load distribution.
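A minimal ring with virtual nodes makes the K/N claim checkable (a sketch; node names and vnode count are arbitrary):

```python
import bisect, hashlib

class HashRing:
    """Consistent hash ring with virtual nodes (vnodes smooth the load)."""
    def __init__(self, nodes, vnodes=100):
        self.ring = sorted(
            (self._h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _h(s):
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key):
        # First ring point clockwise from the key's hash (wrap with %):
        i = bisect.bisect(self.keys, self._h(key)) % len(self.keys)
        return self.ring[i][1]

ring = HashRing(["a", "b", "c"])
before = {k: ring.node_for(k) for k in map(str, range(1000))}
ring2 = HashRing(["a", "b", "c", "d"])        # add one node
moved = sum(before[k] != ring2.node_for(k) for k in before)
assert 0 < moved < 500   # roughly K/N ≈ 250 keys move, not nearly all
```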
Q. How do you avoid hot-key on a global counter? Bucket the key into many shards (counter:42#0..M-1), increment a random shard, sum periodically. Trade read cost for write scalability.
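The sharded-counter idea in a few lines (a sketch; in a real store each shard would be a separate key like counter:42#7):

```python
import random

class ShardedCounter:
    """Spread a hot counter across M shards: writes hit a random shard,
    reads sum all shards — trading read cost for write scalability."""
    def __init__(self, shards=16):
        self.shards = [0] * shards

    def incr(self, n=1):
        self.shards[random.randrange(len(self.shards))] += n

    def value(self):
        return sum(self.shards)

c = ShardedCounter()
for _ in range(10_000):
    c.incr()
assert c.value() == 10_000        # reads stay exact, just cost M lookups
assert max(c.shards) < 10_000     # no single shard absorbed all the writes
```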
Q. Why is R + W > N the strong-read condition? Because read and write quorums then overlap on at least one replica, so a read is guaranteed to see the latest committed write — provided you have a tie-breaker (vector clocks, version numbers, last-writer-wins).
Q. Saga vs 2PC? 2PC is synchronous and blocking — coordinator failure can leave participants stuck holding locks. Saga is async, with compensating actions for each step; eventual consistency, no global lock. Saga fits microservices; 2PC fits a single trusted coordinator on top of capable resource managers.
10. SQL and RDBMS — Indexes, Plans, Isolation
10.1 Mental model
A database must answer three hard questions:
- How do I store data safely? → transactions, WAL, durability.
- How do I find data quickly? → indexes, statistics, optimizer.
- How do I handle many users at once? → isolation, locks, MVCC.
Most SQL performance reduces to: how many rows must we touch, and can an index reduce that?
10.2 ACID + isolation levels
| Level | Dirty | Non-repeatable | Phantom |
|---|---|---|---|
| Read Uncommitted | yes | yes | yes |
| Read Committed (PG default) | no | yes | yes |
| Repeatable Read (MySQL InnoDB default) | no | no | yes (PG: no — its RR is snapshot isolation) |
| Serializable | no | no | no |
- MVCC (Postgres, Oracle, InnoDB) — readers don't block writers via tuple versions + visibility map.
- Long-running transactions prevent vacuum cleanup → table bloat.
10.3 Indexes — what they're for
| Index | Use |
|---|---|
| B-Tree | range + equality |
| Hash | equality only |
| GIN/GiST (PG) | full-text, JSONB, geospatial |
| BRIN | block range, perfect for append-only time series |
| Bitmap | low cardinality (warehouse OLAP) |
| Covering / INCLUDE | index includes payload columns; avoids heap fetch |
| Composite | leftmost-prefix rule |
Selectivity rule: indexes pay off when the filter selects < ~5–10 % of rows.
10.4 Why WHERE date_trunc('day', ts) = '2026-04-30' is slow
Applying a function to the indexed column prevents direct B-tree lookup. Rewrite as a range:
WHERE ts >= '2026-04-30' AND ts < '2026-05-01'
Now Postgres does a B-tree range scan.
10.5 Reading EXPLAIN ANALYZE
Operators to recognize:
- Seq Scan — full table; expected on tiny tables, scary on big ones.
- Index Scan / Index-Only Scan — wins when covering index avoids heap.
- Bitmap Heap Scan — combine multiple indexes, then fetch.
- Hash Join — build hash on smaller side.
- Merge Join — both sides sorted on join key.
- Nested Loop — fast when outer is small and inner is indexed.
Look for estimated vs actual rows — if they diverge by 10×+, run ANALYZE to refresh stats; the optimizer is making bad decisions.
10.6 Online schema migration — expand / contract
Add new column (nullable, default lazy) → backfill in batches (rate-limited, resumable) → app writes BOTH old + new columns → app reads NEW (verify parity) → drop old column.
Never block production with a NOT NULL DEFAULT on a billion-row table.
10.7 Q&A with reasoning
Q. How does a composite index (a, b, c) help queries? Usable for filters on a, (a, b), or (a, b, c) — not for b alone (leftmost-prefix rule). Order columns by selectivity and common filter shape.
Q. Why is UUID a worse PK than BIGINT for clustered indexes? UUIDs are random → inserts hit random pages → page splits, fragmentation, poor cache locality. ULID / UUIDv7 (time-ordered) restore sequentiality.
Q. Hash join vs merge join vs nested loop — when each?
- Nested loop — small outer + indexed inner; O(outer × log inner).
- Hash join — equi-joins, both moderate; build hash on smaller side, O(N+M).
- Merge join — equi-join with both sides already sorted on the key; O(N+M) without hash memory.
Q. Why does VACUUM FULL block writers? It rewrites the table to reclaim space and takes an ACCESS EXCLUSIVE lock. Use pg_repack for online reclamation.
Part VI — Modeling and Governance
11. Data Modeling — Choose by Query Shape
The right model depends on who asks what, not on which model is fashionable.
11.1 OLTP vs OLAP
| OLTP | OLAP | |
|---|---|---|
| Goal | transactions | analytics |
| Schema | 3NF, normalized | denormalized (star/snowflake) |
| Workload | small point reads/writes | wide scans, aggregations |
| Engine | PG / MySQL / Oracle | Snowflake / BigQuery / Iceberg |
11.2 Dimensional (Kimball) modeling
- Fact tables — measurements + FKs to dimensions; e.g. sales_fact (sale_id, date_key, product_key, store_key, qty, amount).
- Dimension tables — descriptive context (product, customer, date, store).
- Star = fact + flat dims. Snowflake = normalized dims (more joins, less duplication).
- Grain — the most atomic level a fact row represents. Define grain first.
- SCD types:
- Type 1 — overwrite, no history.
- Type 2 — new row + effective_from/to, is_current.
- Type 3 — column for previous value.
- Type 4 — separate history table.
- Type 6 — 1+2+3 hybrid.
11.3 Other modeling approaches
- Data Vault 2.0 — Hubs (business keys), Links (relationships), Satellites (descriptive + history). Insert-only, auditable; great for raw enterprise lakes feeding star marts downstream.
- One Big Table (OBT) — wide tables on Iceberg/BigQuery for ML/feature engineering.
- Graph modeling — for lineage, entity resolution, knowledge graphs.
11.4 Modeling decision table
| Need | Choose |
|---|---|
| Auditable raw history | Data Vault / event log |
| BI dashboards (fast, simple) | Star schema |
| ML feature engineering | OBT / feature store |
| Lineage queries | Graph |
| Governance metadata | Hybrid: relational core + graph for lineage |
11.5 Q&A with reasoning
Q. What is grain and why does it matter? Grain is what one fact row represents. If grain is unclear, aggregations either double-count or lose meaning. Always define grain first.
Q. SCD Type 2 implementation in Spark + Iceberg? MERGE with two clauses: when matched and any tracked attr changed, close the existing row (is_current=false, effective_to=now()) and insert a new current row; when not matched, insert. Use a stable surrogate key separate from the natural key.
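The two-clause MERGE semantics can be sketched in plain Python over in-memory rows (illustrative field names; a real implementation would be the Iceberg MERGE described above, with a surrogate key):

```python
from datetime import datetime, timezone

def scd2_apply(dim_rows, incoming, now=None):
    """SCD Type 2: close the current row when tracked attrs change,
    insert a new current version; insert brand-new keys directly."""
    now = now or datetime.now(timezone.utc).isoformat()
    current = {r["key"]: r for r in dim_rows if r["is_current"]}
    for key, attrs in incoming.items():
        row = current.get(key)
        if row is None:                          # WHEN NOT MATCHED: insert
            dim_rows.append({"key": key, **attrs, "is_current": True,
                             "effective_from": now, "effective_to": None})
        elif any(row[k] != v for k, v in attrs.items()):
            row["is_current"] = False            # close the old version
            row["effective_to"] = now
            dim_rows.append({"key": key, **attrs, "is_current": True,
                             "effective_from": now, "effective_to": None})
    return dim_rows

dim = [{"key": "c1", "city": "Oslo", "is_current": True,
        "effective_from": "2020-01-01", "effective_to": None}]
scd2_apply(dim, {"c1": {"city": "Bergen"}}, now="2026-01-01")
versions = [r for r in dim if r["key"] == "c1"]
assert len(versions) == 2
assert versions[0]["is_current"] is False and versions[0]["effective_to"] == "2026-01-01"
assert versions[1]["city"] == "Bergen" and versions[1]["is_current"] is True
```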
Q. When is Data Vault the right answer vs Kimball? Data Vault for raw, auditable, source-system-faithful layers in large enterprise lakes where multiple inconsistent sources land. Build Kimball star marts on top for analyst consumption.
12. Data Governance — Catalog, Quality, Lineage, Privacy
12.1 The six questions a governed platform must answer
- What data do we have? — catalog
- Who owns it? — stewardship
- Can I trust it? — quality
- Where did it come from? — lineage
- Who can access it? — policy / RBAC / ABAC
- Is it sensitive? — classification / PII detection
12.2 Architecture of a modern catalog
Sources (Iceberg / Snowflake / RDBMS / Kafka / BI tools / Spark + Airflow runs) → crawlers + OpenLineage → ingest (schema, stats, runs, lineage) → three stores: entity store (Iceberg / RDBMS), search index (Elasticsearch), graph (Neo4j / Neptune) → services (Search · Lineage · Quality · Classification · Policy · Glossary · Stewardship) → UI / API / AI agents.
Why three stores? Because three different access patterns:
- Entity store for canonical "what is this asset?"
- Search index for "find me anything about X."
- Graph for "what's upstream/downstream of X?"
12.3 Metadata types (memorize)
- Technical — schema, types, partitioning, formats, sizes.
- Business — glossary terms, domains, owners, descriptions.
- Operational — ingestion times, freshness, last-updated, run logs.
- Social — usage, ratings, certified status.
- Lineage — who reads / writes / derives.
- Quality — rule definitions, scores, incidents.
12.4 Data quality dimensions
| Dimension | Example check |
|---|---|
| Completeness | % non-null on critical columns |
| Validity | values match regex / domain / enum |
| Accuracy | matches a reference (gold) source |
| Uniqueness | PK has no duplicates |
| Timeliness | data freshness < SLA |
| Consistency | cross-table referential integrity |
| Reasonableness | distribution / volume in expected range |
12.5 Lineage granularity
- Table-level — jobs read/write tables (cheap, coarse).
- Column-level — which output column derives from which input columns. Built by parsing SQL or capturing Spark logical plans (OpenLineage's Spark integration).
- Row-level — full provenance per record. Expensive; rarely worth it.
12.6 OpenLineage — the standard
Three core entities: Job, Run, Dataset (inputs/outputs). RunEvents emit start/complete/fail. Spark / Airflow / Flink / dbt have integrations. Backends: Marquez, Datahub, OpenMetadata.
12.7 PII classification — layered, not "just call the LLM"
Column → regex (SSN, email) → hit: auto-tagged. Miss → dictionary (cities, names) → hit: done. Miss → ML / NER classifier → high confidence: done. Ambiguous → LLM judge + rationale → low confidence: human steward review → done.
Deterministic layers handle the obvious 80%; the LLM is only invoked for ambiguity, and a human approves low-confidence cases. This keeps cost, latency, and audit trails sane.
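The layered dispatch is just a fall-through chain. A minimal sketch (regexes, dictionary, tags, and thresholds are all illustrative):

```python
import re

EMAIL = re.compile(r"[^@\s]+@[^@\s]+\.[a-z]{2,}", re.I)
SSN   = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CITY_DICT = {"oslo", "paris", "tokyo"}

def classify(values, llm_judge=None):
    """Return (tag, layer) for a non-empty column sample. The expensive
    llm_judge is only called when every cheaper layer abstains."""
    sample = [str(v) for v in values]
    if any(EMAIL.search(v) for v in sample):
        return "PII_EMAIL", "regex"
    if any(SSN.search(v) for v in sample):
        return "PII_SSN", "regex"
    if sum(v.lower() in CITY_DICT for v in sample) / len(sample) > 0.5:
        return "CITY", "dictionary"
    if llm_judge is not None:
        return llm_judge(sample), "llm"
    return "UNKNOWN", "steward_review"

assert classify(["a@b.com", "c@d.org"]) == ("PII_EMAIL", "regex")
assert classify(["Oslo", "Paris", "Tokyo"]) == ("CITY", "dictionary")
assert classify(["x9", "??"])[1] == "steward_review"   # everything abstained
```

The structure is the point: each layer returns early, so cost and latency are dominated by the cheap deterministic checks.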
12.8 Policy enforcement
- Tag-based access — "if tagged PII and user not in privacy_role, mask."
- Row-level filters — tenant/region.
- Column masking — hash, redact, tokenize.
- Enforced in compute (Snowflake / Databricks / Trino) via integrations or proxies (Immuta, Privacera).
12.9 Q&A with reasoning
Q. RBAC vs ABAC — what's the practical difference? RBAC assigns permissions to roles; users inherit role permissions. ABAC evaluates attributes of subject, resource, and environment in a policy expression. ABAC composes better at scale because you don't get a role explosion when permissions depend on tenant, region, sensitivity, time, etc.
Q. What is a "data contract"? A versioned, testable agreement between producer and consumer covering schema, semantics, freshness, and SLAs. Enforced in CI: schema changes that break the contract fail the build.
Q. Why graph for lineage rather than recursive SQL? Multi-hop relationship traversal is what graph databases are built for. Recursive CTEs work for shallow lineage but cost grows poorly with depth and edge count.
Q. Where does OpenLineage capture column-level lineage? A Spark listener intercepts the logical plan before execution and emits a column-resolution graph. SQL parsing alone misses dynamic SQL or UDFs — the runtime listener is required for fidelity.
Part VII — Algorithms, Languages, Platform
13. Algorithms and Data Structures — The Patterns That Repeat
Don't memorize 200 problems. Learn the recognition table.
13.1 Pattern recognition
| Signal in the problem | Pattern |
|---|---|
| contiguous subarray / substring | sliding window, prefix sum |
| sorted input or sorting helps | two pointers, binary search |
| top-K / Kth largest / smallest | heap |
| relationships / connectivity | BFS, DFS, Union-Find, topological sort |
| optimal value with overlapping subproblems | dynamic programming |
| enumerate all valid configurations | backtracking |
| huge stream, bounded memory | reservoir sampling, Bloom filter, HLL, Count-Min Sketch |
13.2 Big-data sketches
- HyperLogLog — approximate distinct count in ~16 KB at 1 % error. Used by Spark's approx_count_distinct.
- Count-Min Sketch — approximate frequency. Heavy-hitter detection.
- Bloom filter — probabilistic set membership. False positives possible, false negatives never. Used inside Parquet row groups, RocksDB, ad-blocking.
- MinHash + LSH — near-duplicate detection at scale (entity resolution).
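A Bloom filter is small enough to write from scratch, which makes the "no false negatives" property concrete (a minimal sketch using an int as the bit array; sizes are arbitrary):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash functions over an m-bit array.
    False positives possible, false negatives never."""
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def __contains__(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

bf = BloomFilter()
for word in ["iceberg", "delta", "hudi"]:
    bf.add(word)

# Every added element is found — a "no" from the filter is always reliable:
assert all(w in bf for w in ["iceberg", "delta", "hudi"])
# A "yes" for an unseen element is possible but rare at this load factor.
```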
13.3 Patterns in code
Sliding window — longest substring without repeating chars (O(n))
def length_of_longest(s):
seen = {}
left = best = 0
for right, c in enumerate(s):
if c in seen and seen[c] >= left:
left = seen[c] + 1
seen[c] = right
best = max(best, right - left + 1)
return best
Top-K with a min-heap of size K (O(n log k))
import heapq
def top_k(stream, k):
h = []
for x in stream:
if len(h) < k:
heapq.heappush(h, x)
elif x > h[0]:
heapq.heapreplace(h, x)
return sorted(h, reverse=True)
Why a min-heap of size k? The smallest of the top-K is on top; you only displace it if a new element is larger.
Union-Find with path compression + union by rank (~O(α(n)) per op)
class DSU:
def __init__(self, n):
self.p = list(range(n))
self.r = [0] * n
def find(self, x):
while self.p[x] != x:
self.p[x] = self.p[self.p[x]] # path compression
x = self.p[x]
return x
def union(self, a, b):
a, b = self.find(a), self.find(b)
if a == b:
return False
if self.r[a] < self.r[b]:
a, b = b, a
self.p[b] = a
if self.r[a] == self.r[b]:
self.r[a] += 1
return True
Used for connectivity (lineage clusters, account merging, friend graphs).
Topological sort — Kahn's algorithm (O(V+E))
from collections import deque
def topo(graph): # graph: {u: [v, ...]}
indeg = {u: 0 for u in graph}
for u in graph:
for v in graph[u]:
indeg[v] = indeg.get(v, 0) + 1
indeg.setdefault(u, 0)
q = deque([u for u, d in indeg.items() if d == 0])
order = []
while q:
u = q.popleft()
order.append(u)
for v in graph.get(u, []):
indeg[v] -= 1
if indeg[v] == 0:
q.append(v)
if len(order) != len(indeg):
raise ValueError("cycle detected")
return order
Used for job DAGs, compile dependencies, lineage recompute order.
LRU Cache (OrderedDict)
from collections import OrderedDict
class LRU:
def __init__(self, cap):
self.cap = cap
self.d = OrderedDict()
def get(self, k):
if k not in self.d:
return -1
self.d.move_to_end(k)
return self.d[k]
def put(self, k, v):
if k in self.d:
self.d.move_to_end(k)
self.d[k] = v
if len(self.d) > self.cap:
self.d.popitem(last=False)
13.4 Q&A with reasoning
Q. Why is heapifying a list O(n), not O(n log n)? Sift-down from the bottom up — leaves are already heaps, and most nodes are near the bottom (cheap sift). The geometric series sums to O(n).
Q. When does a Bloom filter return a false negative? Never. It only has false positives. That's why it's used as a negative cache — a "no" is reliable, a "yes" requires confirmation.
Q. BFS vs DFS for shortest unweighted path? BFS — it expands by distance layer, so the first time you reach the target you've used the fewest hops.
Q. Why does Dijkstra fail on negative edges? It commits to the shortest known distance once a node is finalized; a later negative edge could discover a shorter path. Use Bellman-Ford for negatives (and detect negative cycles).
Q. How does HyperLogLog estimate distinct counts? It hashes each element, looks at the run of leading zeros in the hash, and tracks the maximum across many "registers" partitioned by hash prefix. The harmonic mean of 2^max_zeros across registers estimates cardinality. ~16 KB gives ~1 % error for billions of unique elements.
14. Java and the JVM — What Production Engineers Must Know
14.1 Memory model
JVM Process
├── Method Area / Metaspace (class metadata, off-heap)
├── Heap
│ ├── Young Gen (Eden + 2 Survivors)
│ └── Old Gen (Tenured)
├── Per-thread: PC reg, JVM stack, native stack
└── Direct memory (NIO, off-heap caches like Tungsten, Arrow)
- Objects are born in Eden; survive minor GC → Survivor → eventually Tenured.
- Metaspace replaces PermGen (Java 8+); it grows in native memory. Bound it with -XX:MaxMetaspaceSize.
14.2 Garbage collectors
| GC | Pause | Throughput | Heap sweet spot | Use case |
|---|---|---|---|---|
| Serial | High | Low | <100 MB | Tiny / CLI |
| Parallel (Throughput) | High | Highest | <8 GB | Batch / Spark executors |
| G1GC | Predictable | Good | 4–32 GB | General services; default since Java 9 |
| ZGC | <1 ms | Good | 8 GB – 16 TB | Low-latency services, very large heaps |
| Shenandoah | <1 ms | Good | Large | Latency-critical |
Key flags: -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=35.
14.3 Java Memory Model — happens-before
- volatile → visibility + ordering, not atomicity. Use AtomicLong / LongAdder for atomic counters.
- synchronized → mutual exclusion + happens-before.
- final → safe publication of immutable fields after constructor exits.
- Locks — ReentrantLock (fairness, tryLock), ReadWriteLock, StampedLock.
14.4 Concurrency toolkit
- ConcurrentHashMap — striped locking pre-Java 8, CAS + bin-level sync from Java 8.
- CompletableFuture — async pipelines (thenCompose, thenCombine).
- Executors — newFixedThreadPool, ForkJoinPool (work-stealing), and virtual threads (Project Loom, Java 21) for massive I/O concurrency without callback hell.
- Backpressure — bounded queues + RejectedExecutionHandler (CallerRuns is the safe default for ingestion).
14.5 Common pitfalls
- Unbounded Executors.newCachedThreadPool → OOM under burst.
- String.intern() in hot path → metaspace pressure.
- synchronized on autoboxed Integer/Long → small boxed values are cached and shared, so unrelated code can lock the same object.
- Humongous allocations in G1 (objects ≥ 50 % of region size) bypass Eden, hurt collection.
- Forgetting try-with-resources → leaked file descriptors eventually exhaust the process limit.
- Virtual threads can be pinned by synchronized blocks holding I/O. Use ReentrantLock if blocking inside a lock is expected.
14.6 Q&A with reasoning
Q. Why does GC affect p99 latency more than average? Stop-the-world pauses are visible only to requests unlucky enough to hit them. Increasing the heap reduces frequency of full GCs but increases their duration. ZGC / Shenandoah trade some throughput for sub-ms pauses.
Q. Why is volatile not enough for a counter? volatile gives visibility but not atomicity — count++ is read-modify-write, three steps, not atomic. Use AtomicLong (or LongAdder under high contention).
Q. PySpark executor is killed by YARN/K8s but heap looks fine. Why? Python worker memory and native libs (Arrow, ONNX, etc.) live in memoryOverhead, outside the heap. Increase spark.executor.memoryOverhead.
Q. What's a humongous object in G1? Object ≥ 50 % of region size. Allocated directly into old gen → may fragment regions, may trigger long full collections. Avoid by tuning region size or rewriting to chunked processing.
15. Python — What Matters for Data Engineering
15.1 Object model
In CPython, almost everything is an object. Variables hold references, not values. b = a copies the reference, not the underlying object — that's why mutation is visible through both names.
15.2 The GIL
Only one thread executes Python bytecode at a time in CPython.
| Workload | Tool |
|---|---|
| CPU-bound math | multiprocessing.Pool, NumPy/Numba (release the GIL) |
| I/O bound (HTTP, DB) | asyncio + aiohttp/asyncpg |
| Mixed | concurrent.futures.ThreadPoolExecutor for simplicity |
| Big data | PySpark (Python driver, JVM executors) |
15.3 Iterators, generators, async
from itertools import islice

def chunks(iterable, n):
it = iter(iterable)
while True:
chunk = list(islice(it, n))
if not chunk:
return
yield chunk
Generators are lazy → constant memory for streaming ETL. asyncio is great for I/O concurrency; never call blocking I/O inside an async function without run_in_executor.
15.4 PySpark specifics
- Driver in Python ↔ JVM via Py4J. Executors run Python workers per task.
- Python UDFs serialize each row → 10–100× slower than DataFrame ops.
- Pandas UDFs (vectorized) with Apache Arrow run on columnar batches → 10–100× speedup.
- Avoid collect() on big DFs; use toLocalIterator().
15.5 Idiomatic performance
- Vectorize with NumPy/Pandas; avoid per-row Python loops.
- dict.get, setdefault, defaultdict, Counter.
- __slots__ to remove __dict__ overhead for millions of objects.
- functools.lru_cache for memoization; cached_property for instance-level.
15.6 Q&A with reasoning
Q. Why does len({1, True, 1.0}) == 1? 1 == True == 1.0 and their hashes are equal, so the set treats them as the same key.
Q. Why is mutable default arg def f(x=[]) dangerous? The default list is created once at function definition time. Each call mutating it accumulates state across calls. Use x=None with if x is None: x = [] inside (x = x or [] would also silently replace an intentionally passed empty list).
Q. Why are dicts and sets average O(1)? Open addressing hash table (CPython uses a perturb-based probe sequence); it resizes when load factor exceeds ~⅔ to keep probe chains short.
Q. Why does pandas.DataFrame.apply not release the GIL? Pandas applies the user function row-by-row in Python; the function executes Python bytecode. NumPy releases the GIL because its vectorized ops are in C.
16. Cloud, Containers, Kubernetes — Just Enough
Image (Docker / OCI) → registry → Kubernetes cluster → Deployment (replicas, rolling update) → Pods (1+ containers each). A Service (stable virtual IP) load-balances across the pods; an Ingress / Gateway API (L7 routing) fronts the Service.
16.1 What bites engineers in production
- Requests vs limits — requests reserve capacity, limits cap usage. Set both, especially on memory.
- Liveness vs readiness probes — readiness gates traffic; liveness restarts crashed pods. Mixing them up kills healthy pods under load. Add a startup probe for slow-starting apps to avoid liveness false-positives during boot.
- Spark on K8s — each executor is a pod; use dynamic allocation with shuffle tracking. Pin to nodepools with taints; spot for cost.
- Secrets — don't bake them into images; mount via Secret/ExternalSecret/CSI driver.
- Workload identity — prefer per-workload cloud identities (IRSA / Azure WI / GCP WIF) over long-lived keys.
- Image supply chain — multi-stage builds, distroless base, SBOM (Syft), signing (cosign), admission control.
16.2 Cloud building blocks (cross-cloud)
| Layer | AWS | Azure | GCP |
|---|---|---|---|
| Object storage | S3 | ADLS Gen2 / Blob | GCS |
| Compute (VM) | EC2 | VM | Compute Engine |
| Containers | EKS / ECS / Fargate | AKS / ACA / ACI | GKE / Cloud Run |
| Function | Lambda | Functions | Cloud Functions |
| Spark | EMR / Glue / Databricks | HDInsight / Synapse / Databricks | Dataproc / Databricks |
| DW | Redshift | Synapse | BigQuery |
| Streaming | Kinesis / MSK | Event Hubs / Service Bus | Pub/Sub |
| Catalog | Glue Data Catalog | Purview | Dataplex |
| Vector | OpenSearch / Aurora pgvector | AI Search | Vertex AI Search / AlloyDB |
16.3 Observability trinity
- Metrics — Prometheus + Grafana (or Cloud Monitoring).
- Logs — Loki / OpenSearch / Cloud Logging.
- Traces — OpenTelemetry SDKs → Jaeger / Tempo / Cloud Trace.
- GenAI semantic conventions (OTel) for tracing prompts/retrievals/tool calls.
17. CI/CD and Testing — Confidence to Ship
A complete test pyramid for a data service:
```mermaid
graph TD
  U["Unit tests (fast, pure)"] --> I["Integration tests (Testcontainers: Postgres, Kafka, S3-mock, Spark)"]
  I --> C["Contract tests (producer ↔ consumer, Pact)"]
  C --> D["Data quality tests (Great Expectations, Soda)"]
  D --> E2E["E2E + smoke (against staging)"]
  E2E --> Eval["AI / agent evals (Ragas, golden sets, LLM-as-judge)"]
```
17.1 What each layer guards against
- Unit — pure logic regressions. Fast feedback.
- Integration — real DB, Spark, Kafka, S3 (LocalStack/MinIO). Most production bugs live at boundaries.
- Contract — schema and semantic agreements between services.
- Data quality — datasets failing even when code passes (schema drift, missing rows, distribution change).
- E2E — system behavior under realistic flows.
- AI evals — faithfulness, recall@k, tool-call correctness; gate merges on regression.
17.2 CI/CD patterns worth defending
- Trunk-based development + short-lived PR branches.
- Caching of Maven/Gradle/uv lockfile-based downloads (saves minutes).
- Reproducible builds — pinned base images, lockfiles, SLSA provenance.
- Security gates — SAST (Semgrep), dependency scan (Dependabot/Renovate), secrets scan (gitleaks), container scan (Trivy/Grype), SBOM (Syft) + signing (cosign).
- Deploy — GitOps (Argo CD / Flux); progressive delivery (Argo Rollouts canary/blue-green); auto-rollback on SLO regression.
17.3 Quality gates worth setting
- ≥ 80 % coverage on changed files (not absolute).
- 0 critical CVEs in shipped image.
- Mutation score on critical modules ≥ 60 %.
- Performance regression budget on integration tests.
- Contract test on every consumer/provider before deploy.
Part VIII — Putting It Together
18. One Coherent Mental Map
```mermaid
graph TD
  subgraph Ingest
    S["Batch + streaming ingest"]
  end
  subgraph Storage
    Ice[("Iceberg lakehouse (audit / replay)")]
    G[("Graph (lineage)")]
    ES[("Search index")]
    V[("Vector index")]
  end
  S --> Ice
  S --> G
  S --> ES
  S --> V
  subgraph Serve
    Q["Query / API"]
    R["RAG / Agents"]
  end
  Ice --> Q
  G --> Q
  ES --> Q
  V --> R
  Q --> R
  subgraph Govern
    GovSvc["Catalog · DQ · Classification · Lineage · Policy · Stewardship"]
  end
  Ice --> GovSvc
  G --> GovSvc
  ES --> GovSvc
  GovSvc --> R
  R --> UI["UI · Notebooks · Apps"]
  Q --> UI
```
One sentence per layer:
- Ingest — absorb data with backpressure and idempotent / exactly-once semantics.
- Storage — keep cheap durable truth in a lakehouse; add specialized indexes (graph, search, vector) that can be rebuilt from the lake.
- Serve — expose query APIs and AI features that combine retrieval, traversal, and generation.
- Govern — make every asset discoverable, trusted, and auditable; let humans approve what models can't.
19. The Ten Self-Check Questions
Answer each out loud using the What → Why → How → Trade-off → Failure → Example framework. If you can't, that's the topic to revisit.
- Walk a Spark query through Catalyst → Tungsten → AQE → physical execution. Where do plans change at runtime, and why?
- Why is shuffle the dominant cost in distributed query engines, and how does push-based shuffle improve it?
- Explain Iceberg's commit protocol and how two concurrent writers reconcile.
- Compare column-level lineage in a graph DB vs recursive SQL at 100 M edges. Where does each break?
- Derive the cost of self-attention and explain KV cache + FlashAttention.
- Design a multi-tenant RAG system: where do you enforce tenant isolation, and how do you defend against indirect prompt injection?
- Sketch an idempotent CDC pipeline Postgres → Kafka → Iceberg with exactly-once semantics. Where can duplicates appear, and how do you dedupe?
- Choose between broadcast hash, shuffle hash, and sort-merge join — what stats and runtime signals drive AQE's decision?
- Explain the outbox pattern and why dual-write is unsafe without it.
- Layered PII classification: what's deterministic, what's ML, what's an LLM, and where does a human approve?
20. Closing — How to Compound These Ideas
The point of going this deep isn't trivia. Once you internalize why each component exists, you can compose them. A new architecture is just an old set of trade-offs in a new shape. If you can articulate the original problem, the rest is engineering.
Three habits make this compound:
- Speak every concept out loud with the What → Why → How → Trade-off → Failure → Example framework. Silent reading creates a false sense of mastery.
- For each topic, draw the flow before answering. The diagram forces you to commit to a structure; the words follow.
- For each "why," chase one more "why." "Shuffle is expensive." → Why? "Network + sort + spill + reducer pressure." → Why does that hurt p99? "One straggler bounds stage completion." → How do I see it? "Spark UI: max vs median task time." That fourth-level answer is what senior interviewers listen for.
Anything weaker than that, treat as a draft.
Part IX — In-Depth Quiz Bank with Reasoning
Each quiz item below is answer + reasoning, not just the answer. Speak each one out loud. If you only know the answer but not the why, you don't know it yet.
21. Spark Core — Quiz With Reasoning
- Default `spark.sql.shuffle.partitions` is 200. Historical safe middle ground. Modern jobs resize per data volume so each shuffle partition is ~128–256 MB. AQE coalesces tiny partitions automatically, so 200 is "OK" out of the box but rarely optimal.
- Wide vs narrow. `groupByKey` is wide because rows for the same key must meet on one reducer, requiring shuffle. `map`, `filter`, `mapPartitions` are narrow because each output partition derives from one input partition.
- `reduceByKey` vs `groupByKey().mapValues(_.sum)`. `reduceByKey` does a map-side combine — it sends only partial sums across the network. `groupByKey` ships every value, blowing up shuffle and reducer memory. Same result, very different cost.
- `spark.sql.adaptive.enabled` defaults to `true` in 3.5. AQE coalesces tiny partitions, switches sort-merge to broadcast when actual stats are small, and splits skewed partitions. Default-on enables runtime adaptation without manual hints.
- `MEMORY_ONLY` storage level. Stores deserialized objects on heap. Fastest read but heaviest memory; eviction on memory pressure. Prefer `MEMORY_AND_DISK_SER` for large reused DataFrames so spilled partitions are still recoverable.
- Unit of parallelism = partition. One task per partition per stage. Concurrency is capped at cores per executor × number of executors; total tasks per stage equal the partition count.
- `persist` vs `checkpoint`. `persist` keeps lineage; if a partition is lost, Spark recomputes from lineage. `checkpoint` writes to reliable storage and truncates lineage — critical for very long chains and iterative algorithms where recomputation would be catastrophic.
- Broadcast variable vs `collect`. Broadcast distributes a value once per executor; tasks read it locally. `collect` returns the entire DataFrame to the driver, paying full network cost and risking driver OOM.
- `df.explain(true)`. Shows Parsed, Analyzed, Optimized logical plans and the Physical plan. Use it to confirm predicate pushdown, projection pruning, partition filters, and join strategy.
- Whole-stage codegen disabled when. Inside UDFs, when a plan stage has too many input/output fields (~100+), on fallback rules, or when `spark.sql.codegen.wholeStage=false`. Codegen merges operators into one Java function; UDFs are opaque to Catalyst, so it cannot fuse them.
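The shuffle-volume argument behind `reduceByKey` can be simulated without a cluster. A toy sketch (plain Python, not Spark's API) that counts how many records would cross the network under each strategy:

```python
from collections import defaultdict

def shuffle_volume_groupbykey(partitions):
    """groupByKey: every (key, value) record is shipped to its reducer."""
    return sum(len(p) for p in partitions)

def shuffle_volume_reducebykey(partitions):
    """reduceByKey: map-side combine first, so each partition ships
    at most one partial sum per distinct key it holds."""
    shipped = 0
    for p in partitions:
        partials = defaultdict(int)
        for k, v in p:
            partials[k] += v          # combine locally, before the shuffle
        shipped += len(partials)      # one record per key per partition
    return shipped

# Two partitions with heavy key repetition — the typical aggregation shape.
parts = [[("a", 1)] * 1000 + [("b", 1)] * 500,
         [("a", 1)] * 800 + [("c", 1)] * 200]

assert shuffle_volume_groupbykey(parts) == 2500   # every row crosses the wire
assert shuffle_volume_reducebykey(parts) == 4     # one partial per key per partition
```

Same final sums either way; the network cost differs by orders of magnitude.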
22. Spark Tuning — Quiz With Reasoning
- `autoBroadcastJoinThreshold` default = 10 MB. Keeps broadcasts safe by default. Stats outside this size disable broadcast; you can hint `broadcast(df)` if Catalyst underestimates.
- Disable broadcast for one join. Set `spark.sql.autoBroadcastJoinThreshold=-1`, or use the `/*+ NO_BROADCAST_HASH(t) */` hint per join (3.4+). Why: if the "small" side is large in reality, broadcast OOMs.
- `spark.sql.adaptive.skewJoin.enabled`. Splits a skewed partition into sub-partitions and replicates the other side, shifting work from one slow reducer to several fast ones — stage runtime is bounded by the slowest task, so cutting stragglers wins.
- Small-files threshold ~< 64 MB. Below this, listing/opening overhead dominates read time on object stores. Compaction targets 128–512 MB for analytics.
- Default write format = Parquet since Spark 2.0. Columnar, predicate pushdown, footer stats, splittable.
- `coalesce(1)` vs `repartition(1)`. `coalesce` avoids a shuffle and may produce uneven partitions; `repartition(1)` triggers a full shuffle to evenly redistribute, then collapse. Use `coalesce` for small downsizing without shuffle, `repartition` to force redistribution.
- Best storage for a very large reused DF. `MEMORY_AND_DISK_SER`. Serialized form fits more rows per byte and spills under pressure instead of OOM.
- GC time signal in Spark UI. Stages/Executors GC time column; if more than ~10 % of task time, tune memory/allocation (more partitions, smaller cache, raise overhead).
- Approximate distinct. `approx_count_distinct` uses HyperLogLog++. Exact distinct requires shuffling all values; HLL needs KB-sized sketches at ~1 % error.
- Bucketing vs broadcasting. For repeated large-large joins on the same key, bucketed tables avoid shuffle every time the join runs. Broadcasting only works for small dimensions.
23. SQL / RDBMS — Quiz With Reasoning
- Default isolation: PG = Read Committed; MySQL InnoDB = Repeatable Read. Read Committed sees only committed values per statement, allowing non-repeatable reads but maximizing concurrency. MySQL's Repeatable Read uses gap locking to prevent phantoms in many cases.
- Covering index. Index that contains every column the query needs, so the engine answers directly from the index without a heap fetch. Why it's faster: random heap I/O is the slowest part of an index lookup.
- `LEFT JOIN` vs `LEFT SEMI JOIN`. LEFT JOIN keeps both tables' columns and emits NULLs for non-matches. LEFT SEMI returns only the left rows where a match exists; useful for existence filters where you don't need right-side columns.
- `VACUUM FULL` vs `VACUUM`. Regular `VACUUM` marks dead tuples reusable. `VACUUM FULL` rewrites the table and indexes to reclaim space; it takes an ACCESS EXCLUSIVE lock while it rewrites blocks.
- Prefix search index in Postgres. B-tree with `text_pattern_ops` for `LIKE 'foo%'` because it sorts lexicographically; trigram GIN (`pg_trgm`) supports more flexible patterns.
- MySQL InnoDB clustered index. The PRIMARY KEY — table rows are physically stored in PK order. Choosing a wide or non-monotonic PK fragments inserts.
- Phantom-read prevention. Serializable isolation; PG's Serializable level uses Serializable Snapshot Isolation (SSI) to detect dangerous conflicts and abort (its Repeatable Read is plain snapshot isolation).
- Leftmost-prefix rule. Composite index `(a, b, c)` is usable for filters on `a`, `(a, b)`, or `(a, b, c)` — not for `b` alone — because B-tree ordering is anchored on the leading column.
- BIGINT vs UUID PK. BIGINT is small, sequential, fast to insert. UUIDv4 is globally unique without a central allocator but destroys clustered-index locality. Use ULID or UUIDv7 for time-ordered uniqueness.
- `EXPLAIN (BUFFERS)`. Shows shared/local/temp buffer hits, reads, dirtied pages per node — the actual I/O profile, not just the cost estimate.
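The leftmost-prefix rule is easy to observe with SQLite's planner (stdlib, so no server needed; the B-tree logic is the same as in PG/MySQL even though the `EXPLAIN` output differs). A minimal sketch:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INT, b INT, c INT, d TEXT)")
conn.execute("CREATE INDEX idx_abc ON t (a, b, c)")

def plan(query):
    """Concatenate the planner's detail strings for a query."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return " ".join(r[-1] for r in rows)

# Filter on the leading column: the composite index is usable, and
# (a, b, c) covers the projection, so no heap fetch is needed.
covered = plan("SELECT a, b, c FROM t WHERE a = 1")

# Filter on b alone: the index is anchored on a, so the planner
# falls back to a full table scan.
unanchored = plan("SELECT * FROM t WHERE b = 1")
```

`covered` mentions `idx_abc` (a covering-index search); `unanchored` is a plain scan.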
24. Distributed Storage — Quiz With Reasoning
- S3 max object size = 5 TB. Single-object cap; multipart upload is required above 5 GB so failed chunks can be retried.
- HDFS default block size = 128 MB. Large blocks reduce NameNode metadata pressure and amortize seek cost over reads.
- Spark on S3 committer (2026). S3A magic committer. It uses S3 multipart-upload metadata to atomically publish task outputs and avoids slow rename. Classic FileOutputCommitter is unsafe with speculative execution because two attempts can both produce the same output.
- `LIST` consistency on S3. Strongly consistent since Dec 2020. Before that, eventual consistency caused stale listings; lakehouse formats sidestep listings via metadata files.
- Cheapest archival tier on AWS. S3 Glacier Deep Archive. Hours to retrieve; 180-day minimum storage. Use for compliance archive, not active reads.
- Speculative execution risk on S3. Two attempts may write the same task output. The classic committer's rename can leave duplicates; the magic committer assigns unique multipart paths and commits via metadata.
- HDFS Erasure Coding RS-6-3. Stores 6 data + 3 parity blocks; effective replication ratio = 1.5×. Saves storage at CPU cost for encoding/decoding.
- ADLS protocol prefix in Spark. `abfss://` (secure). Use it instead of older `wasbs://` Blob URIs.
- Parquet has native column-level encryption (Modular Encryption since 1.12). Allows per-column keys and integrity checks for governance.
- Best partition column for daily logs. `date` (low cardinality, time-aligned). Avoid high-cardinality columns like `user_id` to prevent partition explosion.
25. Iceberg / Lakehouse — Quiz With Reasoning
- Iceberg atomic commit mechanism. Catalog-level atomic swap (HMS / Glue / REST / Nessie). The catalog provides compare-and-set on the table pointer; readers see only committed snapshots.
- Manifest list contents. Pointers to manifest files for a snapshot, with partition-summary stats used for partition-level pruning before reading the manifests themselves.
- Iceberg schema evolution by column ID, not name. Renames preserve identity; old data still maps to the right logical column.
- `expire_snapshots`. Removes old snapshots beyond retention and unreferenced data files. Without it, table metadata grows indefinitely and storage cost balloons.
- Hudi CoW vs MoR. Copy-on-Write rewrites whole files on update; reads stay simple but writes are heavier. Merge-on-Read appends delta files merged at read; faster writes, slower reads, periodic compaction needed.
- Delta transaction log location. `_delta_log/` directory at table root holding JSON commits and Parquet checkpoints.
- Default Iceberg catalog standard in 2026. REST catalog. Cross-engine, cloud-native, language-agnostic.
- Iceberg time-travel SQL. `FOR VERSION AS OF <id>` or `FOR TIMESTAMP AS OF <ts>`. Engines resolve via snapshot history.
- Equality vs position deletes. Position deletes point at file + row and are cheap to apply at read. Equality deletes specify a predicate and require a join at read time, more expensive but more flexible (good for streaming CDC).
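The catalog's compare-and-set can be sketched abstractly — a hypothetical in-memory "catalog", not Iceberg's real API. Each writer reads the current snapshot id, prepares new metadata on top of it, and the swap succeeds only if the pointer hasn't moved:

```python
import threading

class TablePointer:
    """Toy stand-in for a catalog entry: an atomically swapped snapshot id."""
    def __init__(self):
        self._lock = threading.Lock()
        self.snapshot_id = 0

    def compare_and_set(self, expected, new):
        with self._lock:
            if self.snapshot_id != expected:
                return False          # someone else committed first
            self.snapshot_id = new
            return True

def commit(ptr, make_snapshot, max_retries=5):
    """Optimistic commit loop: read, rewrite metadata, CAS, retry on conflict."""
    for _ in range(max_retries):
        base = ptr.snapshot_id
        new = make_snapshot(base)     # build new metadata on top of `base`
        if ptr.compare_and_set(base, new):
            return new
    raise RuntimeError("too many concurrent commits")

ptr = TablePointer()
s1 = commit(ptr, lambda base: base + 1)   # writer A commits snapshot 1
s2 = commit(ptr, lambda base: base + 1)   # writer B rebases on A's snapshot
```

Readers only ever dereference the pointer, so they see snapshot 0, 1, or 2 — never a half-written state.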
26. Data Modeling — Quiz With Reasoning
- Grain. The level of detail one fact-table row represents. Grain decisions drive every aggregation. Mixing grains causes silent double counting.
- Star vs snowflake. Snowflake normalizes dimensions into multiple tables; star keeps them flat. Star = simpler, fewer joins, faster scans. Snowflake = less duplication, harder reads.
- SCD Type 2. Track full history with `effective_from`, `effective_to`, `is_current`. Each change creates a new row; old rows remain for point-in-time queries.
- Surrogate keys. Insulate facts from natural-key changes, allow SCD2, and enable smaller integer joins. Natural keys are still stored for traceability.
- Bridge table. Resolves many-to-many between fact and dimension. Example: a healthcare fact links to multiple diagnoses through a bridge.
- Additive measures sum correctly across all dimensions. Semi-additive (e.g., balance) sums across some dimensions; non-additive (e.g., ratios) needs special handling.
- Junk dimension. Combines low-cardinality flags into one dim to avoid many tiny dims and reduce fact-table key counts.
- Conformed dimension. Same dim used identically across multiple fact tables, enabling cross-process analysis.
- Factless fact table. Records events with no measure column. Example: student-attendance event used for counts.
- Late-arriving dimension fix. Insert "Unknown" placeholder in the dim immediately, update the fact with the proper key when the dim arrives, or look up the natural key with a backfill job.
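The SCD Type 2 mechanics above can be sketched as plain dict manipulation (hypothetical column names matching the bullet: `effective_from`, `effective_to`, `is_current`):

```python
from datetime import date

def scd2_apply(rows, natural_key, new_attrs, today):
    """Close the current row for `natural_key`, then open a new current row."""
    out = []
    for r in rows:
        if r["key"] == natural_key and r["is_current"]:
            # Close the old version; keep it for point-in-time queries.
            out.append({**r, "effective_to": today, "is_current": False})
        else:
            out.append(r)
    out.append({"key": natural_key, **new_attrs,
                "effective_from": today, "effective_to": None,
                "is_current": True})
    return out

dim = [{"key": "cust-1", "tier": "silver",
        "effective_from": date(2024, 1, 1), "effective_to": None,
        "is_current": True}]
dim = scd2_apply(dim, "cust-1", {"tier": "gold"}, date(2025, 6, 1))
# dim now holds the closed "silver" row plus the current "gold" row.
```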
27. Graph Databases — Quiz With Reasoning
- Index-free adjacency. Each node stores direct pointers to its relationships. Once you anchor a query at a node, traversal is pointer-following, not global index lookup.
- `OPTIONAL MATCH` returns NULLs when patterns are absent, similar to LEFT OUTER JOIN.
- Uniqueness constraint. `CREATE CONSTRAINT FOR (d:Dataset) REQUIRE d.id IS UNIQUE`. Enforces a single dataset per `id` and creates a backing index.
- `*1..5`. Variable-length path of 1 to 5 hops. Always bound length to avoid traversal explosion.
- Relationship direction. Stored directed; queries can ignore direction with `-[r]-`.
- One-hop neighbors. O(degree). For supernodes this is huge; bound traversal or filter by relationship type.
- Plan inspection. `EXPLAIN` (no execution) and `PROFILE` (execution + DB hits).
- Best for in-memory graph workloads. Memgraph (or Neo4j tuned with sufficient page cache).
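Why bounding `*1..5` matters is visible in a toy adjacency-list traversal — a bounded BFS touches only the frontier you paid for, while each extra hop through a supernode compounds the fan-out:

```python
from collections import deque

def neighbors_within(adj, start, max_hops):
    """BFS bounded to `max_hops`; mirrors a `*1..k` variable-length match."""
    seen = {start: 0}
    q = deque([start])
    while q:
        node = q.popleft()
        if seen[node] == max_hops:
            continue                  # don't expand past the hop bound
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen[nxt] = seen[node] + 1
                q.append(nxt)
    seen.pop(start)
    return set(seen)

# "hub" is a supernode fanning out to 1,000 nodes, each chained one deeper.
adj = {"hub": [f"n{i}" for i in range(1000)]}
adj.update({f"n{i}": [f"leaf{i}"] for i in range(1000)})

one_hop = neighbors_within(adj, "hub", 1)   # 1,000 nodes
two_hop = neighbors_within(adj, "hub", 2)   # 2,000 nodes — fan-out compounds
```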
28. Elasticsearch — Quiz With Reasoning
- Default scoring = BM25 since 5.x. Improves over TF-IDF by saturating term frequency and normalizing for document length.
- `text` vs `keyword`. `text` is analyzed (tokenized) for full-text search; `keyword` is stored as-is for exact match, sorting, aggregations. Most catalog fields need both via multi-field mapping.
- Translog. Per-shard write-ahead log replayed on crash. Tunable via durability settings; flush makes the shard durable to disk.
- Default refresh interval = 1 second. Set to `-1` during bulk indexing for throughput, then restore.
- Phased rollover indexes. Index Lifecycle Management (ILM) with rollover by size/age/docs.
- Cheap top-K aggregations. `terms` aggregation with `size` and `shard_size` tuned. Be careful with high-cardinality fields; use composite aggregation for paging.
- Vector search type. Dense vector with k-NN (HNSW indexing, since 8.x).
- `match` vs `term`. `match` analyzes the query text (tokenizes) before matching. `term` does an exact match on the indexed token; use it only on `keyword` fields.
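BM25's term-frequency saturation — the thing that distinguishes it from linear TF-IDF — fits in a few lines. A minimal single-term sketch with the usual `k1`/`b` defaults:

```python
import math

def bm25_term(tf, doc_len, avg_len, df, n_docs, k1=1.2, b=0.75):
    """Score contribution of one term: saturating tf, length-normalized."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = k1 * (1 - b + b * doc_len / avg_len)
    return idf * (tf * (k1 + 1)) / (tf + norm)

# Each extra occurrence adds less than the previous one, and the score
# approaches an asymptote of idf * (k1 + 1) instead of growing linearly.
s1 = bm25_term(tf=1, doc_len=100, avg_len=100, df=10, n_docs=1000)
s2 = bm25_term(tf=2, doc_len=100, avg_len=100, df=10, n_docs=1000)
s10 = bm25_term(tf=10, doc_len=100, avg_len=100, df=10, n_docs=1000)

assert s1 < s2 < s10        # monotone in tf...
assert s2 - s1 < s1         # ...but with diminishing returns
```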
29. LLMs — Quiz With Reasoning
- Self-attention complexity. O(n² · d). Each token attends to every other token, and each attention head computes a d_k-dimensional dot product per pair. Long context blows up compute and memory.
- Default tokenizer for Llama family. SentencePiece BPE with byte fallback. Byte fallback ensures any text encodes, even unseen scripts/symbols.
- Decoder-only models. GPT, Llama, Mistral, Claude, Gemini. They generate left-to-right with causal attention and dominate chat/agents.
- Cosine vs dot. Without normalization, dot favors longer vectors. After normalization, cosine equals dot. Most embedding APIs return normalized vectors.
- Most popular ANN index in 2026 = HNSW (hierarchical navigable small world). Good recall vs latency, supports filtered search.
- Continuous batching tool. vLLM with PagedAttention. New requests slot into a running batch, raising throughput dramatically.
- SFT vs RLHF/DPO. SFT trains the model to imitate reference answers. RLHF/DPO trains it to prefer outputs humans rated higher. RLHF/DPO improves helpfulness and safety alignment beyond imitation.
- System prompt. Top-of-conversation instruction defining role, constraints, format. Treated as trusted context; never let retrieved data override it.
- KV cache contents. Per-layer attention K and V tensors for all already-generated tokens. Memory grows linearly with batch × layers × seq length.
- Smallest GPU tier where Llama-3-70B Q4 fits. ~40 GB (e.g., A100 40 GB or H100 partition with offload), or 2×24 GB consumer GPUs with tensor parallel.
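The KV-cache growth rule above is just arithmetic. A sketch for hypothetical Llama-2-7B-like dimensions (32 layers, 32 KV heads, head_dim 128, fp16 — the numbers are illustrative, not a vendor spec):

```python
def kv_cache_bytes(batch, seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """K and V tensors (hence the factor 2) per layer, per token, per head."""
    return 2 * batch * seq_len * n_layers * n_kv_heads * head_dim * dtype_bytes

# One 4k-token sequence on a 7B-class model with full (non-GQA) KV heads:
gib = kv_cache_bytes(batch=1, seq_len=4096, n_layers=32,
                     n_kv_heads=32, head_dim=128) / 2**30
# gib == 2.0 — two full GiB for a single sequence, linear in batch and length.
```

This linearity is why grouped-query attention (fewer KV heads) and PagedAttention matter for serving throughput.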
30. RAG and Agents — Quiz With Reasoning
- RRF = Reciprocal Rank Fusion. Combines two ranked lists by summing 1 / (k + rank) per document across lists (k ≈ 60 is the common constant). Good when scores are not directly comparable (BM25 vs cosine).
- HyDE = Hypothetical Document Embeddings. Have the LLM imagine an answer, embed it, and retrieve real docs by similarity to that. Helps when queries are short and ambiguous.
- Reranker purpose. Improves top-K precision by re-scoring candidates with a more expensive cross-encoder.
- Vector DB graph index = HNSW.
- Long-context failure that RAG sidesteps. "Lost in the middle" — recall drops for content placed in the middle of long contexts. RAG sidesteps it by retrieving only what is needed.
- Ragas faithfulness metric. Measures whether the generated answer is supported by retrieved context. Low faithfulness = hallucination.
- ReAct = Reasoning + Acting interleaved with tools. The model reasons, calls a tool, observes the result, and continues.
- Function-calling output. Model emits a structured JSON object matching the declared schema; your code performs the actual call. That separation is what lets you sandbox tools.
- Tenant filters in RAG. Apply pre-retrieval as a metadata filter; assert again pre-generation. Post-only filtering can leak counts/snippets and waste compute.
- Chunk size. ~500–1,000 tokens with 10–20 % overlap. Big enough for context, small enough for precise retrieval.
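The RRF item above is a few lines of code; a sketch with the common k = 60 (document names are hypothetical):

```python
def rrf(ranked_lists, k=60):
    """Fuse ranked lists: score(d) = sum over lists of 1 / (k + rank_d)."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_a", "doc_b", "doc_c"]     # lexical ranking
vector_hits = ["doc_c", "doc_a", "doc_d"]   # semantic ranking
fused = rrf([bm25_hits, vector_hits])
# fused == ["doc_a", "doc_c", "doc_b", "doc_d"]:
# doc_a and doc_c appear in both lists, so they outrank single-list hits.
```

Because only ranks are used, BM25's unbounded scores and cosine's [-1, 1] range never need calibration against each other.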
31. Distributed Systems HLD — Quiz With Reasoning
- CAP under partition. You must choose C or A. Partition tolerance is mandatory for distributed systems. Real systems also obey PACELC: when no partition, choose Latency vs Consistency.
- Quorum for strong consistency in N=5. R + W > N. Examples: R = 3, W = 3, or R = 2, W = 4. Overlapping quorums guarantee a read sees the latest write.
- Consistent hashing benefit. Re-shards minimally when nodes are added/removed (only keys move). Virtual nodes balance load.
- Outbox pattern. Atomically write DB row + outbox event in same transaction; relay publishes events to broker. Solves dual-write inconsistency.
- At-least-once vs exactly-once. At-least-once may deliver duplicates. "Exactly-once" is really at-least-once + idempotent consumer or transactional sink.
- 2PC vs Saga. 2PC is synchronous, blocking, strong but fragile. Saga is async, eventual, with compensating actions; better for microservices.
- Postgres CDC standard. Logical replication / WAL via Debezium emitting Avro/JSON/Protobuf.
- Avoid hot key on a global counter. Bucket/salt the key, distribute across many partitions, aggregate periodically.
- Raft fault tolerance. Need 3 nodes to tolerate 1 failure. Quorum is ⌊n/2⌋ + 1.
- Idempotency-key location. Header or body, validated server-side; persisted with the record for dedupe across retries.
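The minimal-movement claim for consistent hashing can be checked with a bisect ring — a sketch with virtual nodes, not a production implementation:

```python
import bisect
import hashlib

def _h(s):
    """Stable hash onto the ring (md5 used only for determinism, not security)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=64):
        # Each node owns many points on the ring, which evens out load.
        self._points = sorted((_h(f"{n}#{i}"), n)
                              for n in nodes for i in range(vnodes))
        self._keys = [p for p, _ in self._points]

    def owner(self, key):
        # First ring point clockwise of the key's hash, wrapping at the end.
        i = bisect.bisect(self._keys, _h(key)) % len(self._keys)
        return self._points[i][1]

ring3 = Ring(["a", "b", "c"])
ring4 = Ring(["a", "b", "c", "d"])          # add one node
keys = [f"k{i}" for i in range(1000)]
moved = sum(ring3.owner(k) != ring4.owner(k) for k in keys)
# Adding 1 node out of 4 remaps roughly 1/4 of the keys — not all of them,
# which is what naive `hash(key) % n` would do.
```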
32. Algorithms and Data Structures — Quiz With Reasoning
- HLL memory for ε = 0.01 ≈ 16 KB. HyperLogLog tracks the maximum run of leading zeros of hashed values across m registers; standard error ≈ 1.04/√m, so fewer registers means more error.
- Bloom filter and false negatives. Never. Insertion sets all hash bits; if any bit is missing during lookup, the item was definitely never inserted. False positives only.
- Heap from list complexity. O(n) using `heapify` (sift-down from the middle). Pushing one-by-one is O(n log n).
- Best sort for nearly-sorted data. Insertion sort (O(n) when nearly sorted), or Timsort (Python's default), which detects natural runs and merges them.
- Path compression cost. Effectively constant — amortized O(α(n)), inverse Ackermann — when combined with union-by-rank.
- Trie space for 10k words avg length 8. ~80k nodes worst case, often less with shared prefixes. Use compact trie / DAWG to save more.
- Dijkstra failure. Negative edges; use Bellman-Ford for negative weights (it also detects negative cycles), or Johnson's reweighting for sparse all-pairs shortest paths.
- Shortest unweighted path = BFS. Each level expansion equals one hop, so first time you reach a node is via the shortest path.
- HLL `k`. Number of registers. Typical 2¹⁴ = 16,384 registers gives ~0.8 % error (error ≈ 1.04/√m).
- Quickselect. Average O(n), worst O(n²). Median-of-medians gives O(n) worst case.
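"Never a false negative" falls straight out of the Bloom-filter mechanism; a toy sketch (bit count and hash count are arbitrary here, not tuned):

```python
import hashlib

class Bloom:
    def __init__(self, m=1024, k=3):
        self.m, self.k = m, k
        self.bits = 0                  # an int used as an m-bit array

    def _positions(self, item):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos      # insertion sets EVERY hash position

    def might_contain(self, item):
        # If ANY position is unset, the item was definitely never added.
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = Bloom()
for word in ["spark", "iceberg", "kafka"]:
    bf.add(word)

# Inserted items can never be missed; absent items may still collide (false positive).
assert all(bf.might_contain(w) for w in ["spark", "iceberg", "kafka"])
```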
33. Java / JVM — Quiz With Reasoning
- Default GC in Java 11 = G1GC. Java 9 made G1 default because it provides predictable pause times via region-based collection, suiting server workloads with multi-GB heaps. Parallel GC maximizes throughput but has long pauses, hurting p99. ZGC/Shenandoah trade some throughput for sub-millisecond pauses.
- Two ways to safely publish a value without `synchronized`. A `volatile` field gives happens-before for visibility/ordering. A `final` field gives safe publication after the constructor finishes. `AtomicReference` provides atomic CAS.
- `-XX:+HeapDumpOnOutOfMemoryError` writes a `.hprof` when the JVM throws `OutOfMemoryError`. Heap dumps captured at the moment of failure expose the real retained set, hard to reconstruct later.
- `ConcurrentHashMap.size()` complexity. An approximation aggregated via `sumCount` over `LongAdder`-style counter cells. Bin-level CAS removed the global lock, so a coherent size requires summing per-cell counters.
- `Runnable` vs `Callable`. `Runnable.run()` returns void and cannot throw checked exceptions. `Callable<T>.call()` returns `T` and can throw. Executors use `Callable` and surface results via `Future`.
- Humongous region in G1. A region holding an object ≥ 50 % of region size, allocated directly into old gen. Many humongous allocations indicate big arrays/strings that should be streamed.
- `LongAdder` vs `AtomicLong`. `LongAdder` shards the counter into multiple cells, reducing CAS contention; reads sum the cells. Use it for high write contention; `AtomicLong` when reads must be exact and writes are infrequent.
- Effectively immutable class. All fields `final`, no leaking `this` from the constructor, defensive copies of mutable refs, no mutator methods. JMM safe-publication guarantees other threads see constructor-completed state.
- Finalizer thread. A JVM-internal thread that processes the `Object.finalize()` queue. Deprecated for removal because finalization is unpredictable, can resurrect objects, and runs on a single bottleneck thread. Prefer try-with-resources and `Cleaner`.
- Print all GC events with timestamps (Java 9+). `-Xlog:gc*=info:file=gc.log:time,uptime,level,tags`. Unified logging replaces fragmented Java 8 flags.
34. Python — Quiz With Reasoning
- `bool([])` is `False`. Empty containers are falsy; `__bool__` falls back to `__len__`, which returns 0. That's why `if not items:` is idiomatic.
- `a = 1000; b = 1000` — `is` vs `==`. `==` is True. `is` may be False because identity holds only when CPython interns the value; only small ints in `[-5, 256]` are guaranteed cached. Avoid `is` for value comparison.
- `*args, **kwargs`. Desugar to a positional `tuple` and keyword `dict`. Useful in decorators with `functools.wraps`.
- Hashable class. Implement `__hash__` and `__eq__` consistently. Hashable classes should be effectively immutable; mutating after insertion into a set/dict corrupts the structure.
- Shallow vs deep copy. Shallow `copy.copy` duplicates the outer container, sharing inner references. Deep `copy.deepcopy` recursively duplicates. Choose based on whether mutation of inner objects must be isolated.
- `yield from` delegates iteration, exceptions, `send`, and the return value of a sub-generator. Composes generators without re-implementing the protocol.
- Mutable default argument trap. Default values evaluate once at function-definition time. `def f(x=[])` shares one list across calls. Use `def f(x=None)` with `x = [] if x is None else x` (`x = x or []` also replaces legitimately falsy arguments).
- `len({1, True, 1.0}) == 1`. `1 == True == 1.0`, and their hashes match (`hash(1) == hash(1.0) == hash(True)`), so the set collapses them.
- Membership: list vs set. `set` is O(1) average via hashing; `list` is O(n).
- `@functools.wraps`. Preserves the wrapped function's `__name__`, `__doc__`, `__qualname__`, `__module__`, `__wrapped__`. Without it, decorators hide identities and break introspection.
35. OOP / LLD Patterns — Quiz With Reasoning
- `if`-`instanceof` cascade violates OCP (and often LSP). OCP says behavior should extend through new types, not edits to existing code. The cascade adds new behavior by editing one method.
- Strategy vs State. Strategy = caller picks an algorithm and swaps it. State = the object's behavior changes with its internal lifecycle. Strategy fits pluggable PII classifiers; State fits a job FSM (PENDING → RUNNING → SUCCEEDED).
- Add tracing without modifying repositories. Decorator pattern wraps a repository implementing the same interface. Proxy is similar but typically adds control or remote concerns.
- Singleton trade-off. Hidden global state, hard to mock, tests interfere across cases, parallel tests become unsafe. Mitigate by injecting via a DI container with proper scoping.
- DTO location in hexagonal architecture. Application layer (input/output boundary) or shared kernel; never in domain. The domain must remain free from delivery and persistence concerns.
- Aggregate vs Entity. An aggregate is a cluster of entities and value objects with one root that enforces invariants. An entity has identity but no transactional boundary by itself.
- Visitor adds operations easily but resists adding new types. True. New operation = new Visitor; new type forces editing every Visitor.
- Build a complex immutable object with optional fields → Builder. Telescoping constructors and many setters break immutability and readability; Builder validates and constructs once with `.build()`.
- Composition over inheritance. Composition gives flexibility, supports delegation, avoids fragile-base-class problems, and simplifies testing because dependencies can be swapped.
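The tracing-decorator answer above, sketched in Python (duck-typed rather than interface-based; the repository and its method are hypothetical):

```python
import time

class UserRepository:
    """The real component being wrapped."""
    def find(self, user_id):
        return {"id": user_id, "name": "ada"}

class TracingRepository:
    """Decorator: same interface as the wrapped repo, adds timing spans."""
    def __init__(self, inner, spans):
        self._inner = inner
        self._spans = spans

    def find(self, user_id):
        start = time.perf_counter()
        try:
            return self._inner.find(user_id)   # delegate the real work
        finally:
            self._spans.append(("find", time.perf_counter() - start))

spans = []
repo = TracingRepository(UserRepository(), spans)
user = repo.find(42)
# Callers see the same interface; the repository class was never edited.
```

Because the wrapper satisfies the same contract, it can be stacked (tracing over caching over the real repo) or removed without touching call sites — exactly the OCP-friendly property the quiz item claims.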
36. Cloud & Kubernetes — Quiz With Reasoning
- K8s scheduler factors. Resource requests + node fit + affinity + taints/tolerations + topology spread. Scheduling failures show in `Events`.
- Memory limit exceeded. OOMKilled by kernel cgroups; Kubernetes restarts the container per restart policy.
- PodDisruptionBudget. Maximum number of pods that can be voluntarily disrupted at once during drains/upgrades. Protects availability.
- Run-to-completion = `Job` (and `CronJob` for scheduled). Track `succeeded`/`failed` counts; clean up with the TTL controller.
- In-cluster service type = ClusterIP (default).
- AWS pod-to-IAM mapping = IRSA (IAM Roles for Service Accounts) using OIDC trust between EKS and IAM.
- Image layer storage. Tar layers stored content-addressed by sha256 digest; identical layers shared between images.
- `kubectl drain` safely evicts pods from a node before maintenance, respecting PDBs and graceful termination.
37. CI/CD & Testing — Quiz With Reasoning
- JUnit 5 lifecycle hook before all tests = `@BeforeAll` (static method), runs once per test class. Use it for shared, expensive resources like Testcontainers.
- `@Mock` vs `@Spy`. `@Mock` returns defaults for unstubbed methods; `@Spy` wraps a real object so unstubbed methods run real code. Spies are useful for partial mocking but easy to misuse.
- PyTest fixture scopes. `function` (default), `class`, `module`, `session`. Wider scopes save setup time but require careful state management.
- Testing code with `Thread.sleep`. Inject a clock or scheduler abstraction; never sleep in tests. Sleeps cause flakiness.
- `assertDataFrameEqual` ignores row order by default (sorts before comparing). Use `checkRowOrder=True` if order matters.
- SBOM tool = Syft (or Trivy SBOM, Anchore). SBOMs enable vulnerability scanning and supply-chain audits.
- SAST vs DAST. SAST analyzes source code at rest; DAST tests a running app. Both are needed; neither alone catches all classes of bugs.
- Flaky-test budget. Allowed % of intermittent failures before a test is auto-quarantined. Forces teams to fix flakes rather than ignore them.
- Trunk-based development. Faster integration, fewer merge conflicts, continuous delivery friendly. Long-lived branches drift and break.
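The `Thread.sleep` point above can be made concrete: inject a clock abstraction so the test advances time explicitly instead of sleeping. A minimal sketch — `RateLimiter` and `FakeClock` are hypothetical names invented for illustration, not a real library API:

```python
import time
from typing import Callable

class RateLimiter:
    """Allows one call per `interval` seconds. The clock is injected,
    so tests control time and never need to sleep."""
    def __init__(self, interval: float, clock: Callable[[], float] = time.monotonic):
        self.interval = interval
        self.clock = clock
        self._last = float("-inf")

    def allow(self) -> bool:
        now = self.clock()
        if now - self._last >= self.interval:
            self._last = now
            return True
        return False

class FakeClock:
    """Test double: time only moves when the test advances it."""
    def __init__(self):
        self.now = 0.0
    def advance(self, seconds: float):
        self.now += seconds
    def __call__(self) -> float:
        return self.now

# Deterministic test, zero wall-clock waiting:
clock = FakeClock()
limiter = RateLimiter(interval=10.0, clock=clock)
assert limiter.allow() is True      # first call passes
assert limiter.allow() is False     # too soon
clock.advance(10.0)                 # jump time forward instantly
assert limiter.allow() is True
```

The same shape works for schedulers and timeouts: anything that reads the time takes it as a dependency, and production code passes the real clock by default.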
38. Data Governance — Quiz With Reasoning
- OpenLineage core entities = Job, Run, Dataset. Lineage must represent what code (Job) ran in a specific Run, reading/writing which Datasets.
- Lineage granularity = table + column level. Column-level requires SQL parsing or runtime plan capture; table-level is the fallback.
- Catalog vs metastore. Catalog = governance/business metadata + UX. Metastore = technical schema registry used by query engines (Hive metastore, Glue).
- Current vs historical state. Both. Current state for query; snapshots/history for audit and time-travel.
- Glossary term vs tag. Glossary terms are curated business vocabulary with definitions and ownership. Tags are lightweight labels.
- Lineage edge storage. Graph DB (Neo4j/Neptune) for traversal; or columnar with adjacency for very large scale; often both with replay from raw events.
- DQ rule types. Completeness, uniqueness, validity, freshness, referential integrity, reasonableness.
- RBAC vs ABAC. RBAC assigns by role; ABAC composes attributes (subject + resource + env). ABAC scales better at enterprise scale and supports tag-based policies.
- Data contract. Producer-consumer agreement (schema, semantics, SLAs) version-controlled and tested. Breaks the silent-coupling problem.
- PII propagation. Tag at source, propagate through lineage automatically; downstream columns derived from a PII column inherit sensitivity until proven otherwise.
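The data-contract answer above can be sketched as a small, version-controlled schema check that runs in the producer's CI. This is a hedged illustration, not a real framework; the dataset, field names, and the `validate` helper are all invented:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ColumnSpec:
    name: str
    dtype: str
    nullable: bool = False
    pii: bool = False          # sensitivity tag, propagated via lineage

@dataclass(frozen=True)
class DataContract:
    dataset: str
    version: str
    columns: tuple
    freshness_sla_minutes: int

    def validate(self, observed_schema: dict) -> list:
        """Return a list of violations against an observed {name: dtype} schema."""
        errors = []
        for col in self.columns:
            if col.name not in observed_schema:
                errors.append(f"missing column: {col.name}")
            elif observed_schema[col.name] != col.dtype:
                errors.append(f"type drift on {col.name}: "
                              f"expected {col.dtype}, got {observed_schema[col.name]}")
        return errors

contract = DataContract(
    dataset="orders", version="1.2.0",
    columns=(ColumnSpec("order_id", "bigint"),
             ColumnSpec("email", "string", pii=True)),
    freshness_sla_minutes=60,
)
# A producer change that retypes or drops a column fails loudly, not silently:
assert contract.validate({"order_id": "bigint", "email": "string"}) == []
assert contract.validate({"order_id": "string"}) == [
    "type drift on order_id: expected bigint, got string",
    "missing column: email",
]
```

The point is the coupling it breaks: consumers depend on the contract artifact, not on whatever the producer happens to emit today.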
39. Agentic AI Design — Quiz With Reasoning
- ReAct = Reasoning + Acting interleaved with tools. Visible reasoning helps debugging; structured tool calls allow safe execution.
- Separate "judge" LLM is an independent quality check decoupled from generation; less prone to self-confirmation bias.
- Cheap dedupe across LLM calls. Cluster inputs by template/embedding similarity; reuse outputs for similar inputs.
- Trace store. Dedicated trace store (Langfuse, Phoenix, Arize) or OTel-compatible backend. Generic logging is too coarse for prompt + tool + retrieval traces.
- Anthropic-style "tool use". Model emits structured `tool_use` blocks; the orchestrator executes them and feeds results back.
- Bound max iterations. Cost + safety; prevent infinite loops, runaway tool use, and budget overrun.
- RAG vs fine-tune for changing data → RAG. Fine-tuning is for stable behavior, not freshness.
- GraphRAG. Retrieval that traverses a knowledge graph, expanding context via edges (lineage, ownership, policy).
- Agent failure mode #1. Tool errors / schema mismatches; second is over-iteration / cost.
- Tenant filters. Enforce pre-retrieval as filter and pre-generation as assert.
40. Coding-Round Pattern Recognition — Quiz With Reasoning
- FIFO queue in Python = `collections.deque` because `popleft` is O(1). `list.pop(0)` is O(n).
- Top-K largest streaming numbers. Min-heap of size K. Each new element is compared to the heap top; replace if larger. O(n log k).
- Subarray sum equals K. Prefix sum + hash map. Tracks how many prefix sums equal `current_prefix - K`.
- Dependencies and ordering = topological sort. Detects cycles when nodes never reach in-degree 0.
- Connected components / repeated merges = Union-Find. With path compression + union by rank, operations are nearly O(1).
- Next greater element = monotonic stack of indices in decreasing value order. Pop until top is greater than current.
- Shortest unweighted path = BFS. Each level is one hop, so first reach is shortest.
- Shortest weighted path with non-negative weights = Dijkstra. Heap maintains current best; popped distance is final because edge weights are non-negative.
- Minimum speed/capacity/time = binary search on answer. Requires monotonic feasibility: if X works, all values ≥ X work.
- All combinations = backtracking. Grow a partial solution; backtrack on invalid branches.
- Sort lower bound = O(n log n). Comparison sorts have lower bound Ω(n log n); Timsort hits it on real-world data.
- Heap push/pop = O(log n). Sift-up/down along height.
- `bisect_left` returns the first index where the value can be inserted to keep sorted order; equivalent to the first index with `arr[i] >= target`.
- Recursion on a deep graph. Stack-overflow risk with deep recursion. Use iterative DFS or raise the recursion limit cautiously.
- BFS vs DFS memory. BFS holds an entire level (can be large). DFS holds path depth. Both are O(V) worst case but with different shapes.
- State in DP is a representation of the subproblem whose answer composes the bigger answer. State must capture all decisions affecting the future.
- Monotonic feasibility. Property where if X works, every X' on one side also works (or fails). Required for binary search on the answer.
- No false negatives, possible false positives = Bloom filter. Hash bits set at insert; absent bit guarantees absence; collisions cause false positives.
- Approximate distinct count = HyperLogLog. Counts via leading-zero patterns of hashes.
- Spark top-N per group = `row_number()` over `Window.partitionBy(...).orderBy(...)`. Deterministic and efficient because Spark partitions and sorts within each partition.
41. Final Exam — Mixed Drill
Twenty rapid questions to take cold; force yourself to answer each in two sentences using the What → Why → How shape.
- Why does AQE flip a sort-merge join to broadcast at runtime?
- What's the difference between predicate pushdown and dynamic partition pruning?
- Why does Iceberg use column IDs for schema evolution instead of names?
- How does an outbox row + Debezium relay deliver effectively-once?
- Why is `ts >= '...' AND ts < '...'` indexable but `date_trunc('day', ts) = '...'` not?
- Why does a heap of size K beat sorting for top-K?
- Why does `volatile` not make `count++` thread-safe?
- Why is BFS the right algorithm for shortest unweighted path?
- Why is hybrid retrieval more robust than dense-only?
- Why does the KV cache turn generation into O(n) work per step instead of re-running attention over the whole prefix?
- Why is `MEMORY_AND_DISK_SER` usually safer than `MEMORY_ONLY` for big DataFrames?
- Why is index-free adjacency fast for graph traversal but irrelevant for aggregations?
- Why is partition-by-`user_id` a bad idea for daily logs?
- Why does Iceberg need maintenance jobs (`expire_snapshots`, `rewrite_data_files`)?
- Why does ABAC compose better than RBAC for multi-tenant data access?
- Why do agents need hard caps on iterations and tokens?
- Why does `groupByKey().mapValues(_.sum)` lose to `reduceByKey`?
- Why is the 32 GB heap line important for Elasticsearch?
- Why does R + W > N guarantee strong reads in a leaderless quorum store?
- Why is "small files" a worse problem on object stores than on HDFS?
If you can answer all twenty in 60 seconds each with confidence, you're ready.