Overall Engineering Clarity — Data, Distributed Systems and AI (Deep Dive)

A long-form, primary study companion. Internals, flows, decision trees, code, and Q&A with reasoning across Spark, lakehouse, graphs, search, LLMs, RAG/agents, distributed HLD, governance, modeling, SQL, JVM, Python, K8s and CI/CD.

Overall Engineering Clarity — Data, Distributed Systems and AI

This is a single, long, primary study companion. Read it cover-to-cover the first time, then jump back into a section before you need to discuss it. Every section follows the same shape:

mental model → flow / diagram → internals → decision tables → code → Q&A with reasoning.

How to read each concept — the "Why → How → Trade-off" framework

Most engineers can recite what a thing is. The clarity gap shows up when you ask why it exists and what breaks. Use this template for every concept:

What it is        →  one sentence definition
Why it exists     →  the problem before this solution
How it works      →  internals / mechanism
When to use       →  correct fit
When NOT to use   →  anti-patterns
Trade-offs        →  latency, cost, consistency, complexity
Failure modes     →  what bites you in production
Concrete example  →  code or architecture you can point at

Intuition shortcut. Every technology in this post solves one of three pains:

  1. One machine isn't enough → distribution, partitioning, replication.
  2. Cheap object storage isn't transactional → lakehouse formats, commit protocols.
  3. Models don't know your private/fresh data, and they hallucinate → embeddings, retrieval, agents, evals.
flowchart LR
  P1["One machine not enough"] --> S1["Spark · Sharding · Replication · Consensus"]
  P2["Object storage not transactional"] --> S2["Iceberg · Delta · Hudi"]
  P3["LLM doesn't know your data + can hallucinate"] --> S3["Embeddings · RAG · Agents · Evals"]
  S1 --> O["Reliable, scalable, governed data + AI platform"]
  S2 --> O
  S3 --> O

Default for every answer: problem first, mechanism second, trade-off third, failure mode fourth, example fifth.


Part I — Distributed Compute

1. Apache Spark — Internals That Actually Matter

1.1 Mental model

Spark is a distributed query engine. The single sentence to remember:

Spark performance is mostly about where data moves, not how much CPU you have.

The bottleneck is almost always shuffle — moving rows across the network so that all rows with the same key meet on the same reducer.

1.2 Execution flow — what really happens when you call an action

flowchart TB
  A["Driver (SparkContext)"] --> B["DAG Scheduler: splits at shuffle boundaries"]
  B --> C["Task Scheduler: 1 task per partition per stage"]
  C --> D["Cluster Manager: YARN / K8s / Standalone"]
  D --> E1["Executor 1 (JVM + Python worker)"]
  D --> E2["Executor 2"]
  D --> E3["Executor N"]
  E1 --> S["Shuffle Service (sort-based / push-based)"]
  E2 --> S
  E3 --> S
  S --> R["Reducer tasks (next stage)"]
  R --> Out["Action result"]

Layers of work:

  • Application → Job (per action) → Stage (split by shuffle) → Task (per partition).
  • DAG Scheduler decides stages; Task Scheduler dispatches tasks; BlockManager stores intermediate data on each executor.
  • Narrow transformations (map, filter, select, mapPartitions, union) keep partitions local.
  • Wide transformations (groupBy, join, distinct, repartition) cause shuffle.
  • Catalyst rewrites your logical plan; Tungsten generates byte-efficient Java code; AQE (Adaptive Query Execution) flips plans at runtime once it sees actual partition sizes.

1.3 Catalyst — the optimizer pipeline

flowchart LR
  SQL["DataFrame / SQL"] --> P["Parsed Logical Plan"]
  P --> A["Analyzed Plan (resolve cols, fns, types)"]
  A --> O["Optimized Logical Plan (rule-based)"]
  O --> Phys["Physical Plans (many candidates)"]
  Phys --> Cost["Cost-Based Pick (uses ANALYZE stats)"]
  Cost --> CG["Whole-stage Codegen (Tungsten)"]
  CG --> Exec["Execute on executors"]

Key rules Catalyst applies (you should be able to name them):

Rule What it does
Predicate pushdown Push WHERE filters into the file scan (Parquet stats / partition pruning).
Projection pruning Read only the columns the query references.
Constant folding Evaluate constant expressions at plan time.
Boolean simplification (x AND TRUE) → x; collapse always-false branches.
Join reordering Use stats to pick the cheapest join order (CBO).
Reorder filters & projections Push them as low as possible in the tree.

1.4 Tungsten — what makes Spark fast on the CPU

  • Whole-stage codegen collapses an entire operator tree (filter → project → aggregate) into a single Java function. JVM JIT can inline and vectorize that.
  • Off-heap binary format avoids JVM object overhead and GC churn.
  • Cache-friendly columnar processing keeps L1/L2 hot.

UDFs hurt because Catalyst can't see inside opaque Python/Scala lambdas, so codegen is broken and predicate pushdown is blocked.
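
To make the UDF penalty concrete, here is a minimal sketch contrasting a row-at-a-time Python UDF, an Arrow-batched Pandas UDF, and a built-in expression. It assumes an existing SparkSession `spark` and a DataFrame `df` with a numeric `amount` column (both hypothetical).

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Opaque Python UDF: row-at-a-time serialization, no codegen, no pushdown.
slow_tax = F.udf(lambda x: x * 1.2 if x is not None else None, DoubleType())

# Pandas UDF: Arrow-batched and vectorized; still opaque to Catalyst, but far cheaper.
@F.pandas_udf(DoubleType())
def fast_tax(amount: pd.Series) -> pd.Series:
    return amount * 1.2

# Best: a built-in expression — Catalyst sees it, so codegen and pushdown stay intact.
df_out = df.withColumn("with_tax", F.col("amount") * 1.2)   # df is a hypothetical DataFrame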

1.5 AQE — Adaptive Query Execution

Three runtime optimizations that fire during query execution (on by default since Spark 3.2); a config sketch follows the list:

  1. Coalesce shuffle partitions — small post-shuffle partitions are merged so reducers aren't underutilized.
  2. Skew join handling — a heavily skewed partition is split into sub-partitions and the other side is replicated.
  3. Switch join strategy — if a planned sort-merge join's input turns out small enough at runtime, AQE flips it to broadcast hash.
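
A minimal configuration sketch for the three behaviors above, assuming an existing SparkSession named `spark`; the property names follow the Spark 3.x docs, the values are purely illustrative.

# Enable AQE and its three runtime behaviors (illustrative values, not recommendations)
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")  # 1. merge tiny post-shuffle partitions
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")            # 2. split skewed partitions
spark.conf.set("spark.sql.adaptive.autoBroadcastJoinThreshold", "64m")   # 3. allow runtime flip to broadcast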

1.6 Why shuffle is expensive — the full picture

flowchart LR
  M1["Map task: partition by hash(key)"] --> W1["Local sort + spill to disk"]
  M2["Map task"] --> W2["Local sort + spill"]
  W1 --> N["Network fetch (reducer pulls, or push-based shuffle)"]
  W2 --> N
  N --> R1["Reducer: merge + aggregate"]
  N --> R2["Reducer"]
  R1 --> S["Sort if needed (SMJ)"]
  R2 --> S

Costs that pile up:

  • Serialization (Kryo/Java) on every record.
  • Disk spill if shuffle data exceeds memory.
  • Network I/O between executors.
  • Sorting (sort-based shuffle is the default).
  • Reducer pressure — one slow reducer becomes a straggler.

Push-based shuffle (3.2+): map outputs are pushed to remote shuffle servers and merged per reduce partition before the reduce stage starts, so reducers read a few large sequential blocks instead of many small random fetches. Big throughput win for huge shuffles (the reference implementation targets YARN with the external shuffle service).

1.7 Memory model per executor

spark.executor.memory (heap)
  ├── reserved (300 MB)
  ├── user memory (40 %)         ← your local objects
  └── unified region (60 %)      ← splits dynamically
        ├── execution (sort, join, agg, shuffle)
        └── storage   (cache, broadcast)
spark.executor.memoryOverhead   ← off-heap, JVM overhead, Python workers

OOMs almost always mean: too few partitions, skewed partitions, oversized cache, or memoryOverhead too small for PySpark workers / native libs.
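
A hedged sizing sketch showing where these knobs live; the numbers are placeholders, not recommendations — size against your own data volumes and cluster.

from pyspark.sql import SparkSession

# Illustrative executor sizing for a PySpark job (placeholder values)
spark = (SparkSession.builder
         .appName("memory-tuning-sketch")
         .config("spark.executor.memory", "8g")           # heap: unified execution + storage regions
         .config("spark.executor.memoryOverhead", "2g")   # off-heap: Python workers, native libs
         .config("spark.sql.shuffle.partitions", "400")   # more partitions → smaller per-task footprint
         .getOrCreate())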

1.8 The five things that make a Spark job slow

flowchart LR
  Slow["Slow Spark Job"] --> A["Reading too much?"]
  Slow --> B["Shuffling too much?"]
  Slow --> C["One straggler task?"]
  Slow --> D["GC / OOM?"]
  Slow --> E["Tiny output files?"]
  A --> A1["Predicate pushdown, column pruning, partition design"]
  B --> B1["Broadcast small side, bucketing, AQE"]
  C --> C1["Skew salting, AQE skew join"]
  D --> D1["Increase partitions, tune memoryOverhead, persist DISK_SER"]
  E --> E1["Repartition before write, compact, target 128–512 MB"]

1.9 Spark UI — your diagnostic flow

Always inspect in this order:

  1. SQL tab — physical plan + actual row counts. Spot accidental cartesians, wrong join strategy, missing pushdown.
  2. Stages tab — input size, shuffle read/write, spill, task duration distribution. Median vs max ratio reveals skew.
  3. Executors tab — GC time (>10% of task time → memory pressure), failed tasks, memoryUsed.
  4. Storage tab — cache hit ratio, blocks evicted.
  5. Environment tab — confirm config actually loaded what you set.

1.10 Join strategy cheat sheet

| Join | When chosen | Cost | Failure mode |
|---|---|---|---|
| Broadcast Hash | small side < spark.sql.autoBroadcastJoinThreshold (10 MB default) | no shuffle on the big side; small side copied into RAM on every executor | driver/executor OOM if the "small" side isn't actually small |
| Shuffle Hash | medium tables; AQE may pick it | both sides shuffled, hash table built on the smaller side | OOM if the hash table is too big |
| Sort-Merge | both large, equi-join | shuffle + sort each side | sort spill, skew |
| Broadcast Nested Loop | non-equi join, small side | quadratic | catastrophic on big input |
| Cartesian | last resort, no join condition | rows(A) × rows(B) | catastrophic at any real scale |

1.11 Skew — recognize and fix

Symptoms (Spark UI): one task at 30 min while peers finish in 30 s; one partition's shuffle read is 50× the median; max task time ≫ median.

Why it happens: Hash partitioning sends every row with the same key to the same reducer. One hot key = one overloaded reducer.

Fixes, in order of cheapness:

  1. Turn on AQE skew join (spark.sql.adaptive.skewJoin.enabled=true).
  2. Pre-aggregate before the join (combine map-side).
  3. Salt the hot key.

Salting code (manual, when AQE isn't enough):

from pyspark.sql import functions as F

N = 32  # salt fan-out
big = df_big.withColumn("salt", (F.rand() * N).cast("int"))
small = (df_small
         .crossJoin(spark.range(N).withColumnRenamed("id", "salt")))
joined = (big.join(small, ["key", "salt"])
            .drop("salt"))

The salt scatters the hot key across N reducers. The small side is replicated N times so the join still produces every original row.

1.12 Coding patterns you should know cold

Top-K per group with a window:

from pyspark.sql import functions as F, Window

w = Window.partitionBy("category").orderBy(F.col("score").desc())
top3 = (df.withColumn("rk", F.row_number().over(w))
          .where("rk <= 3")
          .drop("rk"))

Dedupe — keep the latest record per key:

w = Window.partitionBy("id").orderBy(F.col("event_ts").desc())
latest = (df.withColumn("rk", F.row_number().over(w))
            .where("rk = 1")
            .drop("rk"))

Sessionization with watermark:

sessions = (events
    .withWatermark("ts", "30 minutes")
    .groupBy(F.session_window("ts", "15 minutes"), "user_id")
    .agg(F.count("*").alias("hits")))

Per-column profile in one pass (governance use case):

agg_exprs = []
for c in df.columns:
    agg_exprs += [
        F.sum(F.col(c).isNull().cast("int")).alias(f"{c}_nulls"),
        F.approx_count_distinct(c).alias(f"{c}_ndv"),
    ]
profile = df.agg(*agg_exprs)   # single shuffle, all columns at once

Idempotent upsert into Iceberg:

MERGE INTO catalog.db.dim_customer t
USING updates u
ON t.cust_id = u.cust_id
WHEN MATCHED AND u.op = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;

1.13 Spark on Kubernetes — what's different

  • Each executor is a pod; the driver pod creates them.
  • Dynamic allocation + shuffle tracking (3.2+) means you don't need an external shuffle service — executors are kept around if they hold shuffle data still in use (config sketch after this list).
  • Pin executors to nodepools via taints/tolerations; spot pools for cost; non-spot pool for the driver.
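
A possible configuration sketch for dynamic allocation with shuffle tracking on K8s; the keys are standard Spark properties, the values are illustrative only.

# Dynamic allocation without an external shuffle service (illustrative values)
conf = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.shuffleTracking.enabled": "true",  # keep executors holding live shuffle data
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "50",
    "spark.kubernetes.executor.request.cores": "2",
}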

1.14 Q&A with reasoning

Q1. Why is reduceByKey better than groupByKey().mapValues(_.sum)? Because reduceByKey performs a map-side combine before the shuffle. The reducer receives one partial aggregate per partition per key instead of every raw value. Less network, less spill, less reducer memory.

Q2. A task takes 30 min while peers finish in 30 s. What's happening? Classic data skew — one key has many more rows. Spark UI shows max ≫ median task time and one task's shuffle read is huge. Fix with AQE skew join or manual salting; pre-aggregate where possible.

Q3. Why do Python UDFs hurt performance? They serialize each row out of the JVM, ship it to a Python worker, run Python, and bring results back. Catalyst can't see inside the UDF, so codegen and predicate pushdown are blocked. Use SQL functions or Pandas UDFs (Arrow batches; columnar, vectorized) instead.

Q4. cache() vs checkpoint() — when to use which? cache keeps lineage; if a partition is lost, Spark recomputes from the lineage chain. checkpoint writes to reliable storage and truncates lineage. Use checkpoint to break long iterative chains (graph algorithms, ML loops) where lineage recomputation would be catastrophic on failure.

Q5. repartition vs coalesce? repartition(N) does a full shuffle to evenly redistribute. coalesce(N) (where N < current partitions) merges existing partitions without shuffle, but produces unbalanced output. repartitionByRange(N, col) partitions by range using sampling — useful before sorted writes.

Q6. PySpark task crashes with Container killed by YARN, exceeding memory limits. Heap is fine. Why? The Python worker memory and native libs live in memoryOverhead, not heap. Bump spark.executor.memoryOverhead to ~10–20 % of executor memory and avoid loading large objects into the Python worker.

Q7. Why does broadcast join risk OOM? The small side is collected to the driver, then sent to every executor. If "small" is actually 500 MB and you have 200 executors, you've allocated 100 GB of replicated memory and pulled 500 MB through the driver.

Q8. What does df.explain(true) show, and what would you look for? Parsed → Analyzed → Optimized logical plans + physical plan. You scan for: chosen join type, presence of BroadcastExchange, HashAggregate vs SortAggregate, predicate pushdown into FileScan, partition filters, and AQE rules applied.


2. Spark Tuning — Diagnostic Recipes

These are the recipes you should rehearse aloud.

2.1 Tuning checklist (memorize)

  1. Right file format: Parquet/ORC with snappy/zstd; target file size 128–512 MB.
  2. Partitioning on disk: low-cardinality cols (date, region). NOT high-cardinality (user_id).
  3. Bucketing when you join the same large tables repeatedly on the same key.
  4. spark.sql.shuffle.partitions: tune so each shuffle partition processes ~128–256 MB.
  5. AQE on: spark.sql.adaptive.enabled=true, plus coalesce + skew-join.
  6. Broadcast the small side; hint broadcast(df) if Catalyst doesn't pick it up.
  7. Avoid UDFs when SQL functions exist; if needed, Pandas UDFs.
  8. Predicate pushdown: filter before join; let Parquet stats prune.
  9. Column pruning: select only what you need, early.
  10. Cache only re-used DataFrames; prefer MEMORY_AND_DISK_SER for large.
  11. Dynamic Partition Pruning for star-schema joins.
  12. Speculation (spark.speculation=true) catches stragglers.
  13. spark.executor.memoryOverhead for PySpark / native libs (10–20 %).
  14. Avoid collect() on big DFs; use toLocalIterator() or write out.
  15. Z-order / sort within partitions for selective queries (Iceberg/Delta).

2.2 The "this job is slow" interview answer

1. I would not guess. I'd open Spark UI.
2. SQL tab → physical plan, join strategy, scan size, partition filters.
3. Stages tab → shuffle read/write, spill, GC time, max vs median task time.
4. Executors tab → GC %, OOM, memory used, failed tasks.
5. Map symptom to fix:
     huge scan      → pushdown / partitioning / column pruning
     huge shuffle   → broadcast / bucket / AQE / skew
     stragglers     → AQE skew, salting, speculation
     GC time high   → more partitions, persist DISK_SER, raise memoryOverhead
     small files    → repartition before write, compaction
6. I'd validate the fix by re-checking the same UI metrics.

2.3 Q&A with reasoning

Q. Why does Dynamic Partition Pruning help star schemas? A fact-dim equi-join with a filter on the dim can be optimized: Spark first runs the dim filter, collects the matching dim keys, and pushes them as a partition filter on the fact scan. The fact scan only reads partitions that survived. Saves enormous I/O on large facts.

Q. Why are small files harmful on object stores? Each open + LIST has fixed overhead, throughput is per-prefix, and metadata grows. A 10 TB table in 1 KB files is unusable. Target 128–512 MB; compact regularly.

Q. How does AQE detect skew at runtime? After each shuffle stage, AQE looks at partition sizes. If one partition is much larger than the median (configurable factor), it splits that partition into sub-partitions and replicates the matching rows from the other side of the join.


Part II — Storage and Lakehouse

3. Distributed Storage — Why Object Stores Need Lakehouses

flowchart LR
  App["Compute: Spark / Trino / Flink"] -->|"GET / PUT / LIST"| Obj["Object Store: S3 / ADLS / GCS"]
  Obj --> P["Per-prefix limits: ~3.5k PUT, 5.5k GET / sec"]
  Obj --> R["No atomic rename across multiple objects"]
  Obj --> Lc["Strong consistency now, but not multi-object atomic"]
  R --> LH["Lakehouse commit protocol fixes this"]
  Lc --> LH

3.1 HDFS vs object stores at a glance

| | HDFS | S3 | ADLS Gen2 | GCS |
|---|---|---|---|---|
| Hierarchy | true | flat | hierarchical (HNS) | flat |
| Atomic rename | yes (cheap) | no (copy + delete) | yes (HNS) | no |
| Append | yes | no | yes | no |
| Multi-region | no | with replication | with GRS | with multi-region |

3.2 Why "no atomic rename" matters

Old Hadoop committers wrote _temporary/.../task_attempt_X and then renamed to the final path on commit. On HDFS this is a metadata flip. On S3 it's O(file_count) × bytes of copy. With speculative execution, two attempts of the same task can both finish; the classic committer can leave both outputs.

That's why we have:

  • S3A magic committer / directory committer — uses S3 multipart upload so the final commit is a single atomic operation.
  • Lakehouse table formats (Iceberg/Delta/Hudi) — the "commit" is a metadata pointer swap in the catalog, not a file rename.

3.3 File format quick-take

| Format | Compression | Splittable | Schema evolution | Primary use |
|---|---|---|---|---|
| CSV / JSON | poor | partial | sloppy | landing only |
| Avro | good | yes | row-based, schema registry | Kafka topics, streaming |
| Parquet | great (snappy/zstd) | yes | column add/drop, by name | analytical default |
| ORC | great | yes | yes | Hive ecosystem |

3.4 Performance levers

  • Avoid small files — compact to 128–512 MB Parquet.
  • Partition layout: tenant=.../date=.../hour=... for time-aligned scans.
  • Column projection + predicate pushdown with Parquet/ORC stats.
  • Z-order / Hilbert for multi-column locality (Iceberg, Delta).
  • Caching layer (Alluxio, EMRFS S3 cache) for repeated reads.

4. Lakehouse and Apache Iceberg — Tables on Top of Object Storage

4.1 The problem in one paragraph

Raw Parquet on S3/ADLS/GCS gives cheap storage. It does not give you atomic multi-file writes, safe schema evolution, row-level updates/deletes, time travel/audit, or fast pruning at petabyte scale. Iceberg, Delta, and Hudi all add a metadata layer that turns folders of files into transactional tables.

4.2 Iceberg layout — every level matters

flowchart TB
  Cat["Catalog (REST / Glue / HMS / Nessie): atomic CAS pointer"] --> M0["metadata/v3.metadata.json (current snapshot pointer)"]
  M0 --> ML["Manifest list (snap-xxx.avro) + partition summary stats"]
  ML --> M1["Manifest 1 + per-file stats"]
  ML --> M2["Manifest 2"]
  M1 --> D1["data/part=2026-04-30/file-1.parquet"]
  M1 --> D2["data/part=2026-04-30/file-2.parquet"]
  M2 --> D3["data/part=2026-04-29/file-3.parquet"]
  • Catalog holds the only pointer to the current metadata.json — that's where atomic CAS happens.
  • metadata.json records schema (with column IDs), partition spec, sort order, and snapshots.
  • Manifest list holds per-manifest partition-summary stats so a planner can prune whole manifests.
  • Manifests list data files plus per-file stats: row counts, null counts, min/max per column.
  • Data files are Parquet/ORC.

4.3 Read path — why Iceberg scales beyond Hive

flowchart LR
  Q["Query"] --> C["Catalog: current metadata.json"]
  C --> ML["Manifest list"]
  ML -->|"partition prune"| M["Surviving manifests"]
  M -->|"file-level prune (min/max, nulls)"| F["Surviving data files"]
  F -->|"column projection + row-group pruning"| Read["Read"]

Hive listed directories on S3 (slow, eventually consistent). Iceberg reads a small set of metadata files that already enumerate data files plus stats — most pruning happens before any data file is opened.

4.4 Write path and commit protocol

  1. Each task writes Parquet to data/.
  2. On task completion, manifest entries are assembled.
  3. Driver builds a new metadata.json referencing a new snapshot.
  4. Driver calls catalog commit(currentSnap, newMetadata) — atomic CAS at the catalog level.
  5. On conflict (someone else committed first), the loser rebases changes onto the new snapshot and retries.

That's optimistic concurrency. It works because the only contention is the single catalog pointer.
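
The commit loop as a hypothetical pseudocode sketch — the catalog object and its methods are stand-ins for illustration, not the real Iceberg API:

# Sketch of the optimistic-concurrency commit loop (all names hypothetical)
def commit_with_retry(catalog, table, new_files, max_attempts=5):
    for attempt in range(max_attempts):
        base = catalog.current_metadata(table)           # the snapshot this writer read
        new_meta = base.with_new_snapshot(new_files)      # rebase our changes on it
        if catalog.compare_and_swap(table, expected=base, new=new_meta):
            return new_meta                               # single atomic pointer swap succeeded
        # someone else committed first — loop: re-read the pointer and rebase
    raise RuntimeError("commit failed after retries")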

4.5 Killer features and why they matter

Feature What it gives you
Hidden partitioning Declare PARTITIONED BY (days(ts), bucket(16, user_id)); users query natural columns and Iceberg rewrites predicates.
Partition evolution Change strategy without rewriting old data.
Schema evolution by column ID Renames are safe — identity stays the same.
Time travel FOR VERSION AS OF / FOR TIMESTAMP AS OF for audit and rollback.
Row-level ops Merge-on-read (delete files, position deletes, equality deletes) or copy-on-write.
Multi-engine Spark, Trino, Flink, Snowflake, BigQuery all read the same table via the REST catalog.

4.6 Iceberg vs Delta vs Hudi

Need Best
Multi-engine (Spark + Trino + Flink + Snowflake) Iceberg
Tightly coupled with Databricks ecosystem Delta (UniForm makes it Iceberg-readable too)
High-velocity streaming upserts (CDC) Hudi (or Iceberg + Flink)
Time-travel + branch/tag for data science Iceberg with Nessie or native tags

Hudi has two flavors:

  • Copy-on-Write — rewrites whole files on update: fast reads, expensive writes.
  • Merge-on-Read — appends delete/delta files that are merged at read: fast writes, slower reads until compaction.

4.7 Maintenance you cannot skip

Job Why
expire_snapshots Bound metadata growth, free old data files.
rewrite_data_files Compact small files into 128–512 MB chunks.
rewrite_manifests Keep manifest count low for fast planning.
remove_orphan_files Clean up files left by aborted writes.

Skip these and the table eventually slows to a crawl.
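
A sketch of what scheduling this maintenance can look like from Spark SQL, using Iceberg's stored procedures; the catalog/table names and the retention timestamp are placeholders, and an Iceberg-enabled SparkSession `spark` is assumed.

# Placeholder table name 'catalog.db.events'; run these on a schedule, not ad hoc
spark.sql("CALL catalog.system.expire_snapshots(table => 'db.events', older_than => TIMESTAMP '2026-04-01 00:00:00')")
spark.sql("CALL catalog.system.rewrite_data_files(table => 'db.events')")      # compact small files
spark.sql("CALL catalog.system.rewrite_manifests('db.events')")                # keep planning fast
spark.sql("CALL catalog.system.remove_orphan_files(table => 'db.events')")     # clean aborted writes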

4.8 Governance hooks (why this matters for catalogs)

Iceberg metadata is itself queryable:

SELECT * FROM catalog.db.table.snapshots;
SELECT * FROM catalog.db.table.files;
SELECT * FROM catalog.db.table.partitions;
SELECT * FROM catalog.db.table.history;

That powers lineage, audit, retention enforcement, and GDPR purges (DELETE + compact + expire snapshots).

4.9 Q&A with reasoning

Q1. Parquet vs Iceberg — same thing? No. Parquet is a file format (columnar layout inside one file). Iceberg is a table format (which files belong to the table at which snapshot, with what schema). Iceberg tables contain Parquet files.

Q2. Why does Iceberg scale where Hive directory listing doesn't? Hive reads tables by listing directories. Iceberg reads metadata files that already enumerate files plus min/max stats, so most pruning happens before any data file is opened.

Q3. How does Iceberg handle two writers at once? Optimistic concurrency: each writer prepares new metadata referencing the snapshot it read, then performs a single catalog-level CAS. The loser detects a conflict, rebases its changes onto the new snapshot, and retries.

Q4. Equality deletes vs position deletes? Position deletes identify rows by file + row index → cheap at read time (skip rows by offset). Equality deletes identify rows by predicate → require a join at read time. Position deletes are cheaper to read; equality deletes are easier to produce in streaming CDC.

Q5. How would you do GDPR delete on a 10 PB table? 1) DELETE FROM table WHERE user_id IN (...) — Iceberg writes equality/position delete files. 2) Run rewrite_data_files to compact, materializing the deletes physically. 3) expire_snapshots to remove old snapshots that still reference deleted rows. 4) remove_orphan_files. After step 3 the deleted data is unrecoverable through time travel.


Part III — Indexes and Search

5. Graph Databases — Why Lineage Is Naturally a Graph

Some questions are fundamentally about relationships: "what depends on this column, 5 hops downstream?" Recursive SQL CTEs become slow because every hop is another join. Graph databases store direct relationship pointers (index-free adjacency), so a hop is a pointer chase, not a join.

5.1 Property graph model for lineage

flowchart LR
  J["Job :Job"] -->|WRITES| D["Dataset orders :Dataset"]
  D -->|HAS| C1["Column total :Column"]
  D2["Dataset orders_eu :Dataset"] -->|HAS| C2["Column eu_total :Column"]
  C2 -->|DERIVED_FROM| C1
  C2 -->|TAGGED_AS| T["Tag PII :Tag"]
  U["User Alice :User"] -->|OWNS| D
  P["Policy mask_eu :Policy"] -->|GOVERNS| C2

5.2 Cypher essentials

// Anchor on an indexed node, bound depth
MATCH (start:Column {id:'sales.amount'})
MATCH (start)-[:DERIVED_FROM*1..5]->(down:Column)
RETURN DISTINCT down.id;

// Temporal lineage — what was upstream on a date?
MATCH (a:Dataset {id:$id})-[r:DERIVED_FROM]->(b)
WHERE r.valid_from <= datetime($asOf) <= coalesce(r.valid_to, datetime())
RETURN b;

// Shortest path between two assets
MATCH p = shortestPath(
  (a:Dataset {id:'src'})-[*]->(b:Dataset {id:'dst'}))
RETURN p;

5.3 Things that explode at scale

  • Unanchored traversals — start from indexed nodes, always.
  • Supernodes — a Tag connected to millions of columns will blow up any traversal that touches it. Split by tenant/domain/time, or filter relationship types early.
  • Unbounded depth — always cap with *1..N.

5.4 Q&A with reasoning

Q. Why is index-free adjacency fast? Once the start node is found via index, each hop follows stored relationship pointers. There's no "look up neighbors by foreign key" join cost like in relational databases.

Q. How do you model temporal lineage? Put valid_from and valid_to on relationships, or version snapshots. Then "what was upstream on 2026-04-30?" becomes a predicate filter.

Q. When is a graph DB the wrong choice? When the workload is mostly aggregations, transactional records, or wide tabular reporting. Use relational/Iceberg for source of truth and graph as a traversal index that can be rebuilt from the source of truth.

Q. What is a supernode and why is it bad? A node with extremely high degree (tag connected to millions of columns). Traversals through it explode. Mitigations: filter by relationship type at the supernode, partition the supernode by tenant/domain, or skip it in path expansions.


6. Elasticsearch — Inverted Index, Discovery, Hybrid Search

6.1 Why Elasticsearch exists

Full-text discovery needs ranking, tokenization, typo tolerance, filters, facets, and low latency across millions or billions of documents. SQL LIKE '%text%' scans too much and doesn't understand language.

6.2 Architecture

flowchart TB
  Doc["Document"] -->|"index"| Sh["Shard (primary)"]
  Sh --> Seg1["Lucene segment 1 (immutable)"]
  Sh --> Seg2["Lucene segment 2"]
  Sh --> Seg3["Lucene segment 3"]
  Seg1 -.->|"merge"| Big["Larger segment"]
  Seg2 -.->|"merge"| Big
  Sh -->|"replicate"| Sh2["Shard (replica)"]
  • A shard is a Lucene index. Segments are immutable inverted-index files; refresh creates new segments; merge consolidates.
  • Primary shards scale indexing throughput; replicas scale read throughput and provide HA.

6.3 Inverted index + scoring

"customer" → [(d1, tf=3, pos=[1,5,10]), (d4, tf=1, pos=[2]), ...]
  • BM25 scoring (default since 5.x).
  • Analyzer = tokenizer + filters (lowercase, stemmers, synonyms).
  • Mapping: text (analyzed for full-text) vs keyword (exact for filters/aggregations/sort).

6.4 Document lifecycle

  1. Index → translog + in-memory buffer.
  2. Refresh (default 1 s) → new segment becomes searchable.
  3. Flush → fsync segments + clear translog.
  4. Merge → fewer larger segments. force_merge on read-only indexes.

6.5 Query DSL — what to actually write

{
  "query": {
    "bool": {
      "must":   [{"match": {"name": "customer"}}],
      "should": [{"match": {"description": "PII"}}],
      "filter": [{"term": {"tenant": "acme"}}, {"term": {"sensitivity": "high"}}],
      "must_not": [{"term": {"deleted": true}}]
    }
  },
  "aggs": {"by_tag": {"terms": {"field": "tags.keyword"}}}
}
  • filter doesn't score and is cacheable — put exact-match conditions there.
  • match is analyzed; term is exact (use only on keyword fields).
  • Multi-match for cross-field; phrase with slop; fuzzy for typo tolerance; k-NN for vector search (8.x+).

6.6 Capacity planning rules

  • Shard size sweet spot 10–50 GB. Too small → overhead. Too large → slow recovery, slow merges.
  • Heap ≤ 32 GB (compressed oops boundary); the rest of RAM should be OS file cache.
  • Index lifecycle (ILM) — hot/warm/cold/frozen tiers.

6.7 Operational pitfalls

  • Mapping explosion — dynamic mapping on user JSON creating thousands of fields. Use dynamic: strict or dynamic: false.
  • Deep pagination (from + size > 10k) — use search_after with PIT.
  • Refresh interval of 1 s is wasteful for bulk loads — set to -1 then back.
  • Hot shards — node-level imbalance; use shard allocation awareness.

6.8 Q&A with reasoning

Q. Why limit JVM heap to 32 GB? Above that, JVM stops using compressed object pointers and you actually lose memory savings. Plus most ES performance comes from the OS file cache anyway.

Q. match vs term? match analyzes the query text (lowercases, stems) and searches the analyzed field; term is exact byte match against the indexed token. Use term on keyword fields, not on text.

Q. What is hybrid retrieval? BM25 (keyword) + dense vector (semantic), fused with Reciprocal Rank Fusion (RRF). BM25 catches exact terms; dense catches paraphrases. Together they're more robust than either alone.


Part IV — AI / ML

7. LLMs — What's Actually Inside the Box

7.1 The honest definition

An LLM is a neural network trained to predict the next token given previous tokens. Everything that looks like reasoning, summarization, or chat emerges from next-token prediction plus instruction tuning and alignment. It is not a database. It compresses patterns into weights — that's why it can hallucinate.

7.2 Transformer architecture

flowchart LR
  T["Tokens"] --> E["Embedding + positional encoding (RoPE / ALiBi)"]
  E --> L1["Layer 1: Multi-head Self-Attention + FFN + Norm (RMSNorm)"]
  L1 --> L2["Layer 2 ... N"]
  L2 --> O["Output projection → softmax → next token"]
  O -->|"append, repeat"| T
  L1 -.->|"cache K, V"| KV["KV cache (per layer)"]
  L2 -.-> KV
  • Self-attention: Attention(Q, K, V) = softmax(QKᵀ / √d_k) · V. Cost is O(n²) in sequence length.
  • Multi-head: project into h heads, run attention in parallel, concatenate.
  • Decoder-only (GPT, Llama, Mistral, Claude) — generation. Encoder-only (BERT) — embeddings/classification. Encoder-decoder (T5) — translation/summarization.
  • Modern: RMSNorm, SwiGLU activation, RoPE position encoding, GQA (grouped-query attention) to shrink KV cache.

7.3 Why long context is expensive — and what saves it

  • Attention is O(n²) in sequence length, so a 128k-token prompt costs ~16,000× more than a 1k prompt for the attention matrix alone.
  • KV cache stores past keys/values so generation step t doesn't recompute attention over the t−1 previous tokens — it turns each autoregressive step from O(n²) to O(n).
  • FlashAttention is an IO-aware kernel that tiles attention to keep working sets in SRAM and avoids materializing the full matrix.
  • RoPE scaling / NTK / YaRN extend a model's effective context length without retraining from scratch.
  • Mixture-of-Experts (MoE): only a subset of expert FFNs activate per token (e.g., 2 of 8). Throughput per parameter improves dramatically.

7.4 Embeddings and ANN search

An embedding maps text to a dense vector (typically 768–3072 dimensions). Similarity is cosine (after normalization, equivalent to dot product). Searching billions of vectors fast uses approximate indexes:

| Index | Idea | Notes |
|---|---|---|
| HNSW | hierarchical navigable small-world graph | default in pgvector, FAISS, Elastic, Weaviate, Pinecone |
| IVF-PQ | coarse clustering + product quantization | memory-efficient |
| ScaNN | partitioning + quantization | Google's |

The trade-off triangle: recall ↔ latency ↔ memory. Tune efSearch/nprobe to move along that triangle.
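
A tiny numpy illustration of why embeddings are normalized before cosine/dot-product similarity: two vectors that point the same way but have different lengths.

import numpy as np

a = np.array([3.0, 4.0])
b = np.array([6.0, 8.0])          # same direction, twice the length

cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # 1.0 — direction only
dot = a @ b                                                 # 50.0 — rewards length
a_n, b_n = a / np.linalg.norm(a), b / np.linalg.norm(b)
print(cosine, dot, a_n @ b_n)     # 1.0  50.0  1.0 — after normalization, dot ≡ cosine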

7.5 Inference optimizations

  • Quantization — INT8 / INT4 (GPTQ, AWQ); 2–4× speed/memory at small quality loss.
  • Speculative decoding — a small draft model proposes tokens; the big model verifies in parallel.
  • Continuous batching (vLLM PagedAttention) — 5–20× throughput improvement.
  • Distillation — train a smaller student from a larger teacher.

7.6 Fine-tuning vs RAG

Goal Tool
Inject up-to-date private knowledge RAG
Change tone, format, domain style LoRA / QLoRA
Teach a brand-new skill where prompting fails full SFT
Align with human preferences RLHF / DPO

Default: start with RAG. Add LoRA only if RAG can't fix it. Full SFT is rarely worth it for enterprise apps.

7.7 Q&A with reasoning

Q. Why is attention O(n²)? Every token computes a similarity score against every other token: n queries × n keys = n² pairs.

Q. Why does the KV cache help? During autoregressive generation, attention only changes for the new token. Caching K and V for prior tokens makes each step O(n) instead of O(n²).

Q. What does FlashAttention break? The memory wall — it avoids writing the full attention matrix to HBM, instead streaming tiles through fast SRAM. Same math, much less DRAM traffic.

Q. Why do tokenizers matter for cost? Pricing and context limits are per token. A column name like customer_account_id_v2 may tokenize into 5–7 tokens; rare strings cost more.

Q. Why does normalization matter for cosine vs dot product? If vectors aren't unit-length, dot product favors longer vectors regardless of direction. Normalize first; then cosine ≡ dot.

Q. RLHF vs DPO? Both align the model to preferences over candidate pairs. RLHF trains a reward model and uses PPO. DPO skips the reward model and directly optimizes a preference loss — simpler, often more stable, comparable quality.


8. RAG and Agentic Systems — Models That Reach for Data and Tools

8.1 Vanilla RAG pipeline

flowchart LR
  U["User query"] --> QR["Query rewrite / HyDE expansion"]
  QR --> H["Hybrid retrieval: BM25 + dense + tenant filter"]
  H --> RR["Reranker (cross-encoder)"]
  RR --> P["Prompt assembly: system + chunks + query"]
  P --> LLM["LLM"]
  LLM --> A["Answer + citations"]
  Idx[("Vector index + keyword index")] --> H
  Docs["Source docs / metadata"] -->|"chunk + embed"| Idx

8.2 Quality levers, in order of impact

  1. Chunking — semantic, with 10–20 % overlap, with metadata (source, owner, tenant, sensitivity, timestamp). For tabular metadata, chunk = a dataset card with name, description, key columns, owner, tags. A minimal chunking sketch follows this list.
  2. Hybrid retrieval — BM25 + dense, fused via Reciprocal Rank Fusion.
  3. Query rewriting — HyDE (have the LLM imagine an answer, embed it, retrieve real docs by similarity), multi-query, decomposition.
  4. Reranker — cross-encoder (bge-reranker, Cohere rerank) over top-50 → top-K.
  5. Pre-retrieval metadata filters — tenant, time, sensitivity. Always before retrieval, not after.
  6. Prompt — citations format, structured output via JSON schema or function calling.
  7. Caching answers by (tenant, query_hash, content_hash) with stale-while-revalidate.
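
The chunking sketch referenced in lever 1 — a naive fixed-size splitter with overlap and metadata. Real chunkers split on semantic boundaries; the sizes here are arbitrary.

# Naive character-based chunker with overlap; metadata is attached to every chunk
def chunk(text, meta, size=800, overlap=120):
    chunks, start = [], 0
    while start < len(text):
        end = min(start + size, len(text))
        chunks.append({"text": text[start:end], **meta})
        if end == len(text):
            break
        start = end - overlap          # overlap keeps context across chunk boundaries
    return chunks

docs = chunk("..." * 1000, {"source": "dataset_card.md", "tenant": "acme", "sensitivity": "low"})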

8.3 Why hybrid retrieval beats either alone

  • Dense embeddings nail semantic paraphrase ("client identifier" → customer_id).
  • BM25 nails exact identifiers, code symbols, rare terms (PII_CUSTOMER_EMAIL_V2).
  • RRF is parameter-free and robust:
score(d) = Σ over rankers   1 / (k + rank_r(d))         # k often 60
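
A small sketch of that fusion in code (k = 60 is the commonly used constant; the document IDs are made up):

# Reciprocal Rank Fusion over any number of ranked lists
def rrf(rankings, k=60):
    scores = {}
    for ranked in rankings:                           # each: list of doc ids, best first
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf([["d3", "d1", "d7"],      # BM25 order
             ["d1", "d9", "d3"]])     # dense order
# d1 and d3 rise to the top because both rankers agree on them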

8.4 Evaluation (treat it as a test suite)

Layer Metrics
Retrieval recall@k, precision@k, MRR, NDCG
Generation faithfulness, answer relevance, context precision/recall
Agent task success, tool error rate, iterations, cost per task

Tools: Ragas, TruLens, DeepEval, Langfuse. Use LLM-as-judge with a calibration set, plus periodic human review.

8.5 Agentic loop

flowchart TB
  P["Plan"] --> A["Act (call a tool)"]
  A --> O["Observe (tool result)"]
  O --> R["Reflect: done? need more?"]
  R -->|"continue"| P
  R -->|"stop"| Out["Final answer + trace"]

Useful agent = LLM + a small set of typed tools + a loop with hard limits (a skeleton follows this list):

  • max tool calls per task,
  • token + dollar budget,
  • wall-clock timeout,
  • deterministic fallbacks for parts that should never depend on an LLM.
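
The skeleton referenced above; `call_llm` and `TOOLS` are hypothetical stand-ins — the point is that every path out of the loop is bounded.

import time

MAX_TOOL_CALLS, MAX_COST_USD, TIMEOUT_S = 5, 0.05, 30   # illustrative caps

def run_agent(task, call_llm, TOOLS):
    start, cost, trace = time.time(), 0.0, []
    for step in range(MAX_TOOL_CALLS):                   # hard cap on tool calls
        if cost > MAX_COST_USD or time.time() - start > TIMEOUT_S:
            return {"status": "budget_exceeded", "trace": trace}
        decision = call_llm(task, trace)                 # hypothetical: returns {"tool":..., "args":...} or {"final":...}
        cost += decision.get("cost", 0.0)
        if "final" in decision:
            return {"status": "ok", "answer": decision["final"], "trace": trace}
        result = TOOLS[decision["tool"]](**decision["args"])   # narrow, typed, authorized tools
        trace.append({"tool": decision["tool"], "result": result})
    return {"status": "max_steps", "trace": trace}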

8.6 Worked example — autonomous PII classifier

flowchart TB
  In["New column"] --> Reg["Regex (SSN, email, IBAN)"]
  Reg -->|"hit"| Done["Auto-tagged + rationale"]
  Reg -->|"miss"| Dict["Dictionary (cities, names, codes)"]
  Dict -->|"hit"| Done
  Dict -->|"miss"| ML["ML / NER classifier"]
  ML -->|"high conf"| Done
  ML -->|"ambiguous"| Sample["profile_column (Spark sample)"]
  Sample --> VS["vector_search (similar past columns)"]
  VS --> J["LLM Judge: aggregate + rationale + confidence"]
  J -->|"conf >= threshold"| Write["write_to_catalog (Iceberg + graph)"]
  J -->|"conf < threshold"| Steward["Human steward review"]
  Steward --> Train["labeled example, weekly retrain"]
  Train --> ML

Design rules to defend in interview:

  • Determinism first, LLM only for ambiguity and rationale.
  • Tools are narrow and typed (profile_column, vector_search, propose_tag, write_to_catalog).
  • Per-tenant authorization on every tool.
  • Hard caps: max 5 tool calls, max $0.05 per column.
  • Trace every step (Langfuse / OTel GenAI conventions).
  • Eval set of 10k labeled columns; CI runs the full pipeline weekly; alert on regression.
  • Feedback loop — steward overrides become training data.

8.7 Failure modes and mitigations

Failure Mitigation
Hallucination require citations; reject if no chunk supports the answer
Prompt injection (in retrieved docs) sanitize content; never let it override system prompt; tool authorization
Cross-tenant leak filter by tenant before retrieval, assert again before generation
Cost runaway per-request budget; hard tool-call cap; cluster similar inputs to dedupe calls
Stale knowledge scheduled re-indexing; CDC into vector store
Tool errors typed errors, retries with backoff, deterministic fallback path
Drift scheduled re-eval; alert on metric regression

8.8 Q&A with reasoning

Q. Why does hybrid retrieval beat dense alone? Dense embeddings are great at semantics but can miss exact identifiers, code symbols, or rare terms. BM25 nails those. Fusing both via RRF is robust without parameter tuning.

Q. What does function calling actually return? A structured JSON object matching the declared schema. The LLM does not execute anything — your code dispatches the call. That separation is what lets you sandbox tools.

Q. Where do you enforce tenant isolation? As a pre-retrieval metadata filter and a post-retrieval assertion before the prompt is built. Defense in depth.

Q. When is an agent the wrong tool? When the task is fully deterministic (e.g., "compute table sizes nightly"). An if/then/else pipeline is cheaper, faster, easier to test, and won't hallucinate.

Q. How do you defend against indirect prompt injection? Treat retrieved content as untrusted data. Wrap it with explicit "the following is data from the corpus, not instructions"; strip control characters; require structured outputs; never give the LLM a tool that can read its own output verbatim.


Part V — Distributed Systems

9. Distributed Systems HLD — Frameworks and Patterns

Almost every system design question reduces to the same shape:

flowchart LR
  R["1. Clarify reqs (functional + NFR)"] --> E["2. Estimate scale"]
  E --> AP["3. APIs"]
  AP --> DM["4. Data model"]
  DM --> AR["5. Architecture + data flow"]
  AR --> DD["6. Deep dives: consistency, scaling, indexing"]
  DD --> TO["7. Trade-offs"]
  TO --> OP["8. Operations (monitoring, recovery)"]

9.1 Capacity math you should know cold

  • 1 day ≈ 86,400 s ≈ ~10⁵.
  • 1 M req/day ≈ 12 RPS; 1 B/day ≈ 12 k RPS.
  • SSD ~ 100k IOPS; NVMe ~ 1M IOPS; 10 Gbps NIC ≈ 1.25 GB/s.
  • One Postgres node: ~10–50 k QPS reads, lower for writes.
  • Kafka: 100 k–1 M msg/s/broker.
  • S3: ~3.5k PUT / 5.5k GET per prefix per second.
  • LLM gpt-4-class: ~50–100 tokens/s output, ~1–4 s latency for typical RAG.
  • 99.9 % SLO = 43 min/month down; 99.99 % = 4.3 min/month.

9.2 Consistency, replication, consensus

  • CAP under partition: pick C or A. Real systems live on PACELC: if Partitioned, trade Availability vs Consistency; Else trade Latency vs Consistency.
  • Quorum: R + W > N for strong reads.
  • Replication models:
  • Leader-follower (Postgres) — strong, simple, write throughput limited by leader.
  • Multi-leader — write anywhere, conflicts must be resolved (CRDT, last-writer-wins, application logic).
  • Leaderless (Dynamo / Cassandra) — quorum reads/writes, hinted handoff.
  • Consensus — Raft / Paxos for metadata stores (etcd, ZK, Spanner). Need 3 nodes to tolerate 1 failure, 5 to tolerate 2.
  • Time — NTP/PTP, hybrid logical clocks, vector clocks, Lamport timestamps, TrueTime (Spanner).

9.3 Building blocks (2026 default picks)

Block Tech
Load balancer Envoy, ALB, GCLB
API gateway Kong, Apigee, AWS API GW
Cache Redis Cluster, Memcached
Pub/sub Kafka, Pulsar, Kinesis, Pub/Sub
Queue SQS, RabbitMQ, Service Bus
OLTP DB Postgres / Aurora / Cloud SQL / Spanner / CockroachDB
Wide-column / KV DynamoDB, Cassandra, ScyllaDB, Bigtable
Search Elasticsearch / OpenSearch
Vector pgvector, Pinecone, Weaviate, Elastic k-NN
Lake / Lakehouse S3 + Iceberg + Trino / Spark
DW Snowflake / BigQuery / Redshift
Stream proc Spark Structured Streaming, Flink
Observability Prom + Grafana + OTel + Loki/Tempo
Coordination etcd, ZK, Consul

9.4 Patterns you'll reach for

  • Outbox pattern — atomic DB write + event emit. Eliminates the dual-write inconsistency.
  • Saga — cross-service transactions with compensating actions when 2PC isn't viable.
  • Idempotency keys — make at-least-once safe.
  • CQRS / event sourcing — separate write model from read model.
  • Backpressure + bounded queues.
  • Retry + exponential backoff + jitter + circuit breaker + bulkhead.
  • Cache stampede mitigation: request coalescing, jittered TTL, stale-while-revalidate.
  • Hot key — salt, secondary indexes, replicate.
  • Schema migration without downtime — expand → migrate → contract.
  • Backfills without overload — rate-limited, idempotent, resumable.

9.5 The outbox pattern in detail

flowchart LR
  Svc["Service"] -->|"single tx: insert order + insert outbox row"| DB[("Postgres")]
  DB --> Relay["Relay / Debezium (reads outbox table)"]
  Relay -->|"publish"| K["Kafka topic"]
  K --> C1["Consumer A (idempotent)"]
  K --> C2["Consumer B (idempotent)"]

What it solves: the dual-write problem. If you write to the DB and then publish to Kafka as two separate calls, a crash between them leaves the system inconsistent. Outbox writes the event row in the same DB transaction as the business row. A separate relay reads outbox entries and publishes them; consumers must be idempotent (dedup by event id).
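
A minimal sketch of the "same transaction" part, using sqlite3 as a stand-in for Postgres; table names and payload shape are illustrative.

import json, sqlite3, uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
  CREATE TABLE orders (id TEXT PRIMARY KEY, amount REAL);
  CREATE TABLE outbox (event_id TEXT PRIMARY KEY, payload TEXT, published INTEGER DEFAULT 0);
""")

order_id = str(uuid.uuid4())
with conn:                                   # one transaction: both rows commit, or neither does
    conn.execute("INSERT INTO orders VALUES (?, ?)", (order_id, 99.0))
    conn.execute("INSERT INTO outbox (event_id, payload) VALUES (?, ?)",
                 (str(uuid.uuid4()), json.dumps({"type": "OrderCreated", "order_id": order_id})))
# A separate relay polls `outbox WHERE published = 0`, publishes to Kafka, then marks the row published.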

9.6 Worked example — multi-tenant lineage service

flowchart LR
  Pr["Producers: Spark / Airflow / dbt"] -->|"OpenLineage RunEvents"| K["Kafka (topic per tenant)"]
  K --> Ing["Streaming ingestor: validate, dedupe, enrich"]
  Ing --> Ice[("Iceberg: raw run events (audit + replay)")]
  Ing --> G[("Graph store: Neo4j / Neptune")]
  Ice --> Q["Query API"]
  G --> Q
  Q --> UI["UI / agents: upstream/downstream, column-level diff, impact analysis"]

Why two stores? Iceberg is the durable source of truth (cheap, replayable, auditable). The graph store is a traversal index that can be rebuilt from Iceberg if it corrupts. Writes go to both; the graph is eventually consistent with Iceberg.

9.7 Common HLD designs to be ready for

  1. Distributed metadata catalog (Google-scale).
  2. Lineage service spanning lakehouse + DW + dashboards.
  3. PII discovery / classification platform.
  4. Real-time data quality scoring on streams.
  5. Multi-tenant search service.
  6. Vector search at scale (multi-tenant RAG).
  7. Idempotent ingestion service with exactly-once into Iceberg.
  8. Job orchestrator for Spark DAGs with retries and SLA.
  9. Distributed rate limiter / API gateway.
  10. CDC pipeline Postgres → Kafka → Iceberg → DW.
  11. Feature store for ML.
  12. Observability platform.

9.8 Q&A with reasoning

Q. What does the outbox pattern solve? The dual-write problem. You want to update a DB row and publish an event. If you do them as two separate calls, a crash between them leaves the system inconsistent. Outbox writes the event row in the same DB transaction, then a relay reads the outbox table and publishes. Consumers must be idempotent.

Q. At-least-once vs exactly-once? At-least-once may deliver duplicates after retry. "Exactly-once" is really effectively once: at-least-once + idempotent consumer (dedup via key) or transactional sink (Kafka transactions, Iceberg/Delta MERGE).

Q. Why use consistent hashing? When a node is added or removed, only K/N keys move (where N is number of nodes), instead of remapping nearly everything. Virtual nodes smooth load distribution.
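
A toy consistent-hash ring with virtual nodes to make the K/N claim concrete (hash choice and vnode count are illustrative, not production values):

import bisect, hashlib

def h(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=100):
        # each physical node appears `vnodes` times on the ring to smooth load
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))
        self.keys = [k for k, _ in self.ring]

    def node_for(self, key):
        i = bisect.bisect(self.keys, h(key)) % len(self.ring)   # first vnode clockwise of the key
        return self.ring[i][1]

ring = Ring(["a", "b", "c"])
print(ring.node_for("user:42"))   # adding node "d" would remap only ~1/4 of keys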

Q. How do you avoid hot-key on a global counter? Bucket the key into many shards (counter:42#0..M-1), increment a random shard, sum periodically. Trade read cost for write scalability.

Q. Why is R + W > N the strong-read condition? Because read and write quorums then overlap on at least one replica, so a read is guaranteed to see the latest committed write — provided you have a tie-breaker (vector clocks, version numbers, last-writer-wins).

Q. Saga vs 2PC? 2PC is synchronous and blocking — coordinator failure can leave participants stuck holding locks. Saga is async, with compensating actions for each step; eventual consistency, no global lock. Saga fits microservices; 2PC fits a single trusted coordinator on top of capable resource managers.


10. SQL and RDBMS — Indexes, Plans, Isolation

10.1 Mental model

A database must answer three hard questions:

  1. How do I store data safely? → transactions, WAL, durability.
  2. How do I find data quickly? → indexes, statistics, optimizer.
  3. How do I handle many users at once? → isolation, locks, MVCC.

Most SQL performance reduces to: how many rows must we touch, and can an index reduce that?

10.2 ACID + isolation levels

| Level | Dirty read | Non-repeatable read | Phantom |
|---|---|---|---|
| Read Uncommitted | yes | yes | yes |
| Read Committed (PG default) | no | yes | yes |
| Repeatable Read (MySQL InnoDB default) | no | no | yes (PG: no — its Repeatable Read is snapshot isolation) |
| Serializable | no | no | no |
  • MVCC (Postgres, Oracle, InnoDB) — readers don't block writers via tuple versions + visibility map.
  • Long-running transactions prevent vacuum cleanup → table bloat.

10.3 Indexes — what they're for

Index Use
B-Tree range + equality
Hash equality only
GIN/GiST (PG) full-text, JSONB, geospatial
BRIN block range, perfect for append-only time series
Bitmap low cardinality (warehouse OLAP)
Covering / INCLUDE index includes payload columns; avoids heap fetch
Composite leftmost-prefix rule

Selectivity rule: indexes pay off when the filter selects < ~5–10 % of rows.

10.4 Why WHERE date_trunc('day', ts) = '2026-04-30' is slow

Applying a function to the indexed column prevents direct B-tree lookup. Rewrite as a range:

WHERE ts >= '2026-04-30' AND ts < '2026-05-01'

Now Postgres does a B-tree range scan.

10.5 Reading EXPLAIN ANALYZE

Operators to recognize:

  • Seq Scan — full table; expected on tiny tables, scary on big ones.
  • Index Scan / Index-Only Scan — wins when covering index avoids heap.
  • Bitmap Heap Scan — combine multiple indexes, then fetch.
  • Hash Join — build hash on smaller side.
  • Merge Join — both sides sorted on join key.
  • Nested Loop — fast when outer is small and inner is indexed.

Look for estimated vs actual rows — if they diverge by 10×+, run ANALYZE to refresh stats; the optimizer is making bad decisions.

10.6 Online schema migration — expand / contract

flowchart LR
  A["Add new column (nullable, default lazy)"] --> B["Backfill in batches (rate-limited, resumable)"]
  B --> C["App writes BOTH old + new columns"]
  C --> D["App reads NEW (verify parity)"]
  D --> E["Drop old column"]

Never block production with a NOT NULL DEFAULT on a billion-row table.

10.7 Q&A with reasoning

Q. How does a composite index (a, b, c) help queries? Usable for filters on a, (a, b), or (a, b, c) — not for b alone (leftmost-prefix rule). Order columns by selectivity and common filter shape.

Q. Why is UUID a worse PK than BIGINT for clustered indexes? UUIDs are random → inserts hit random pages → page splits, fragmentation, poor cache locality. ULID / UUIDv7 (time-ordered) restore sequentiality.

Q. Hash join vs merge join vs nested loop — when each?
- Nested loop — small outer + indexed inner; O(outer × log inner).
- Hash join — equi-joins, both sides moderate; build hash on the smaller side, O(N + M).
- Merge join — equi-join with both sides already sorted on the key; O(N + M) without hash memory.

Q. Why does VACUUM FULL block writers? It rewrites the table to reclaim space and takes an ACCESS EXCLUSIVE lock. Use pg_repack for online reclamation.


Part VI — Modeling and Governance

11. Data Modeling — Choose by Query Shape

The right model depends on who asks what, not on which model is fashionable.

11.1 OLTP vs OLAP

| | OLTP | OLAP |
|---|---|---|
| Goal | transactions | analytics |
| Schema | 3NF, normalized | denormalized (star/snowflake) |
| Workload | small point reads/writes | wide scans, aggregations |
| Engine | PG / MySQL / Oracle | Snowflake / BigQuery / Iceberg |

11.2 Dimensional (Kimball) modeling

  • Fact tables — measurements + FKs to dimensions; e.g. sales_fact (sale_id, date_key, product_key, store_key, qty, amount).
  • Dimension tables — descriptive context (product, customer, date, store).
  • Star = fact + flat dims. Snowflake = normalized dims (more joins, less duplication).
  • Grain — the most atomic level a fact row represents. Define grain first.
  • SCD types:
  • Type 1 — overwrite, no history.
  • Type 2 — new row + effective_from/to, is_current.
  • Type 3 — column for previous value.
  • Type 4 — separate history table.
  • Type 6 — 1+2+3 hybrid.

11.3 Other modeling approaches

  • Data Vault 2.0 — Hubs (business keys), Links (relationships), Satellites (descriptive + history). Insert-only, auditable; great for raw enterprise lakes feeding star marts downstream.
  • One Big Table (OBT) — wide tables on Iceberg/BigQuery for ML/feature engineering.
  • Graph modeling — for lineage, entity resolution, knowledge graphs.

11.4 Modeling decision table

Need Choose
Auditable raw history Data Vault / event log
BI dashboards (fast, simple) Star schema
ML feature engineering OBT / feature store
Lineage queries Graph
Governance metadata Hybrid: relational core + graph for lineage

11.5 Q&A with reasoning

Q. What is grain and why does it matter? Grain is what one fact row represents. If grain is unclear, aggregations either double-count or lose meaning. Always define grain first.

Q. SCD Type 2 implementation in Spark + Iceberg? MERGE with two clauses: when matched and any tracked attr changed, close the existing row (is_current=false, effective_to=now()) and insert a new current row; when not matched, insert. Use a stable surrogate key separate from the natural key.
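
A hedged sketch of that MERGE, with placeholder table and column names and assuming an Iceberg-enabled SparkSession `spark`; note the practical wrinkle in the closing comment.

# Placeholder names: catalog.db.dim_customer, staged_updates (a temp view of changed rows)
spark.sql("""
MERGE INTO catalog.db.dim_customer t
USING staged_updates u
ON t.customer_nk = u.customer_nk AND t.is_current = true
WHEN MATCHED AND (t.name <> u.name OR t.segment <> u.segment) THEN
  UPDATE SET t.is_current = false, t.effective_to = current_timestamp()
WHEN NOT MATCHED THEN
  INSERT (customer_sk, customer_nk, name, segment, effective_from, effective_to, is_current)
  VALUES (uuid(), u.customer_nk, u.name, u.segment, current_timestamp(), NULL, true)
""")
# Each source row fires at most one clause, so closing the old row AND inserting the new
# current version of a changed customer usually needs a second pass (or a staged source
# that unions the "changed" rows back in with a NULL join key).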

Q. When is Data Vault the right answer vs Kimball? Data Vault for raw, auditable, source-system-faithful layers in large enterprise lakes where multiple inconsistent sources land. Build Kimball star marts on top for analyst consumption.


12. Data Governance — Catalog, Quality, Lineage, Privacy

12.1 The six questions a governed platform must answer

  1. What data do we have? — catalog
  2. Who owns it? — stewardship
  3. Can I trust it? — quality
  4. Where did it come from? — lineage
  5. Who can access it? — policy / RBAC / ABAC
  6. Is it sensitive? — classification / PII detection

12.2 Architecture of a modern catalog

flowchart TB
  Sources["Sources: Iceberg / Snowflake / RDBMS / Kafka / BI tools / Spark + Airflow runs"] -->|"crawlers + OpenLineage"| Ing["Ingest: schema, stats, runs, lineage"]
  Ing --> Ent[("Entity store: Iceberg / RDBMS")]
  Ing --> Idx[("Search index: Elasticsearch")]
  Ing --> Gph[("Graph: Neo4j / Neptune")]
  Ent --> Svc["Services: Search · Lineage · Quality · Classification · Policy · Glossary · Stewardship"]
  Idx --> Svc
  Gph --> Svc
  Svc --> UI["UI / API / AI agents"]

Why three indexes? Because there are three different access patterns:

  • Entity store for canonical "what is this asset?"
  • Search index for "find me anything about X."
  • Graph for "what's upstream/downstream of X?"

12.3 Metadata types (memorize)

  • Technical — schema, types, partitioning, formats, sizes.
  • Business — glossary terms, domains, owners, descriptions.
  • Operational — ingestion times, freshness, last-updated, run logs.
  • Social — usage, ratings, certified status.
  • Lineage — who reads / writes / derives.
  • Quality — rule definitions, scores, incidents.

12.4 Data quality dimensions

Dimension Example check
Completeness % non-null on critical columns
Validity values match regex / domain / enum
Accuracy matches a reference (gold) source
Uniqueness PK has no duplicates
Timeliness data freshness < SLA
Consistency cross-table referential integrity
Reasonableness distribution / volume in expected range

12.5 Lineage granularity

  • Table-level — jobs read/write tables (cheap, coarse).
  • Column-level — which output column derives from which input columns. Built by parsing SQL or capturing Spark logical plans (OpenLineage's Spark integration).
  • Row-level — full provenance per record. Expensive; rarely worth it.

12.6 OpenLineage — the standard

Three core entities: Job, Run, Dataset (inputs/outputs). RunEvents emit start/complete/fail. Spark / Airflow / Flink / dbt have integrations. Backends: Marquez, Datahub, OpenMetadata.

12.7 PII classification — layered, not "just call the LLM"

flowchart LR
  Col["Column sample"] --> R["Regex (SSN, email)"]
  R -->|"hit"| Done["Auto-tagged"]
  R -->|"miss"| Dict["Dictionary (cities, names)"]
  Dict -->|"hit"| Done
  Dict -->|"miss"| ML["ML / NER classifier"]
  ML -->|"high conf"| Done
  ML -->|"ambiguous"| LLMJ["LLM judge + rationale"]
  LLMJ -->|"low conf"| Steward["Human steward review"]
  Steward --> Done

Deterministic layers handle the obvious 80%; the LLM is only invoked for ambiguity, and a human approves low-confidence cases. This keeps cost, latency, and audit trails sane.
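
A minimal sketch of the layering, with made-up patterns and a toy dictionary; the real pipeline would plug the final branch into the ML/NER classifier and LLM judge described above.

import re

EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[a-z]{2,}$", re.I)
SSN = re.compile(r"^\d{3}-\d{2}-\d{4}$")
CITY_DICT = {"berlin", "madrid", "paris"}            # stand-in for a real dictionary

def classify(sample_values):
    vals = [v for v in sample_values if v]
    if vals and all(EMAIL.match(v) for v in vals):   # layer 1: regex
        return ("PII:email", "regex", 0.99)
    if vals and all(SSN.match(v) for v in vals):
        return ("PII:ssn", "regex", 0.99)
    if vals and sum(v.lower() in CITY_DICT for v in vals) / len(vals) > 0.8:  # layer 2: dictionary
        return ("location:city", "dictionary", 0.9)
    return (None, "escalate_to_ml_then_llm", 0.0)    # ambiguous → ML/NER, then LLM judge, then steward

print(classify(["a@b.com", "c@d.org"]))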

12.8 Policy enforcement

  • Tag-based access — "if tagged PII and user not in privacy_role, mask."
  • Row-level filters — tenant/region.
  • Column masking — hash, redact, tokenize.
  • Enforced in compute (Snowflake / Databricks / Trino) via integrations or proxies (Immuta, Privacera).

12.9 Q&A with reasoning

Q. RBAC vs ABAC — what's the practical difference? RBAC assigns permissions to roles; users inherit role permissions. ABAC evaluates attributes of subject, resource, and environment in a policy expression. ABAC composes better at scale because you don't get a role explosion when permissions depend on tenant, region, sensitivity, time, etc.

Q. What is a "data contract"? A versioned, testable agreement between producer and consumer covering schema, semantics, freshness, and SLAs. Enforced in CI: schema changes that break the contract fail the build.

Q. Why graph for lineage rather than recursive SQL? Multi-hop relationship traversal is what graph databases are built for. Recursive CTEs work for shallow lineage but cost grows poorly with depth and edge count.

Q. Where does OpenLineage capture column-level lineage? A Spark listener intercepts the logical plan before execution and emits a column-resolution graph. SQL parsing alone misses dynamic SQL or UDFs — the runtime listener is required for fidelity.


Part VII — Algorithms, Languages, Platform

13. Algorithms and Data Structures — The Patterns That Repeat

Don't memorize 200 problems. Learn the recognition table.

13.1 Pattern recognition

| Signal in the problem | Pattern |
|---|---|
| contiguous subarray / substring | sliding window, prefix sum |
| sorted input or sorting helps | two pointers, binary search |
| top-K / Kth largest / smallest | heap |
| relationships / connectivity | BFS, DFS, Union-Find, topological sort |
| optimal value with overlapping subproblems | dynamic programming |
| enumerate all valid configurations | backtracking |
| huge stream, bounded memory | reservoir sampling, Bloom filter, HLL, Count-Min Sketch |

13.2 Big-data sketches

  • HyperLogLog — approximate distinct count in ~16 KB at 1 % error. Used by Spark's approx_count_distinct.
  • Count-Min Sketch — approximate frequency. Heavy-hitter detection.
  • Bloom filter — probabilistic set membership. False positives possible, false negatives never. Used inside Parquet row groups, RocksDB, ad-blocking. A toy implementation follows this list.
  • MinHash + LSH — near-duplicate detection at scale (entity resolution).
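
The toy Bloom filter referenced above — just enough to show why a "no" is reliable and a "yes" is probabilistic (sizes and hashing are illustrative):

import hashlib

class Bloom:
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0       # m-bit array stored as one integer

    def _positions(self, item):
        for i in range(self.k):                   # k independent hash positions
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def might_contain(self, item):
        return all(self.bits >> p & 1 for p in self._positions(item))

b = Bloom()
b.add("customer_id")
print(b.might_contain("customer_id"), b.might_contain("order_id"))  # True, almost surely False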

13.3 Patterns in code

Sliding window — longest substring without repeating chars (O(n))

def length_of_longest(s):
    seen = {}
    left = best = 0
    for right, c in enumerate(s):
        if c in seen and seen[c] >= left:
            left = seen[c] + 1
        seen[c] = right
        best = max(best, right - left + 1)
    return best

Top-K with a min-heap of size K (O(n log k))

import heapq

def top_k(stream, k):
    h = []
    for x in stream:
        if len(h) < k:
            heapq.heappush(h, x)
        elif x > h[0]:
            heapq.heapreplace(h, x)
    return sorted(h, reverse=True)

Why a min-heap of size k? The smallest of the top-K is on top; you only displace it if a new element is larger.

Union-Find with path compression + union by rank (~O(α(n)) per op)

class DSU:
    def __init__(self, n):
        self.p = list(range(n))
        self.r = [0] * n

    def find(self, x):
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]   # path compression
            x = self.p[x]
        return x

    def union(self, a, b):
        a, b = self.find(a), self.find(b)
        if a == b:
            return False
        if self.r[a] < self.r[b]:
            a, b = b, a
        self.p[b] = a
        if self.r[a] == self.r[b]:
            self.r[a] += 1
        return True

Used for connectivity (lineage clusters, account merging, friend graphs).

Topological sort — Kahn's algorithm (O(V+E))

from collections import deque

def topo(graph):                      # graph: {u: [v, ...]}
    indeg = {u: 0 for u in graph}            # every key node starts at in-degree 0
    for u in graph:
        for v in graph[u]:
            indeg[v] = indeg.get(v, 0) + 1   # also covers sink-only nodes
    q = deque([u for u, d in indeg.items() if d == 0])
    order = []
    while q:
        u = q.popleft()
        order.append(u)
        for v in graph.get(u, []):
            indeg[v] -= 1
            if indeg[v] == 0:
                q.append(v)
    if len(order) != len(indeg):
        raise ValueError("cycle detected")
    return order

Used for job DAGs, compile dependencies, lineage recompute order.

LRU Cache (OrderedDict)

from collections import OrderedDict

class LRU:
    def __init__(self, cap):
        self.cap = cap
        self.d = OrderedDict()

    def get(self, k):
        if k not in self.d:
            return -1
        self.d.move_to_end(k)
        return self.d[k]

    def put(self, k, v):
        if k in self.d:
            self.d.move_to_end(k)
        self.d[k] = v
        if len(self.d) > self.cap:
            self.d.popitem(last=False)

13.4 Q&A with reasoning

Q. Why is heapifying a list O(n), not O(n log n)? Sift-down from the bottom up — leaves are already heaps, and most nodes are near the bottom (cheap sift). The geometric series sums to O(n).

Q. When does a Bloom filter return a false negative? Never. It only has false positives. That's why it's used as a negative cache — a "no" is reliable, a "yes" requires confirmation.

Q. BFS vs DFS for shortest unweighted path? BFS — it expands by distance layer, so the first time you reach the target you've used the fewest hops.

Q. Why does Dijkstra fail on negative edges? It commits to the shortest known distance once a node is finalized; a later negative edge could discover a shorter path. Use Bellman-Ford for negatives (and detect negative cycles).

Q. How does HyperLogLog estimate distinct counts? It hashes each element, looks at the run of leading zeros in the hash, and tracks the maximum across many "registers" partitioned by hash prefix. The harmonic mean of 2^max_zeros across registers estimates cardinality. ~16 KB gives ~1 % error for billions of unique elements.
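A toy estimator of that idea. It has no bias correction or sparse mode, so it is nowhere near a production HLL, and the register count p is illustrative:

import hashlib

def hll_estimate(items, p=10):                   # m = 2^p registers
    m = 1 << p
    registers = [0] * m
    for x in items:
        h = int(hashlib.sha256(str(x).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = h >> (64 - p)                      # first p bits pick the register
        rest = h & ((1 << (64 - p)) - 1)
        # rank = position of the leftmost 1-bit in the remaining bits (leading zeros + 1)
        rank = (64 - p) - rest.bit_length() + 1
        registers[idx] = max(registers[idx], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    # harmonic mean of 2^register across registers, scaled by m^2
    return alpha * m * m / sum(2.0 ** -r for r in registers)

print(round(hll_estimate(range(100_000))))       # roughly 100k, within a few percent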


14. Java and the JVM — What Production Engineers Must Know

14.1 Memory model

JVM Process
├── Method Area / Metaspace  (class metadata, off-heap)
├── Heap
│   ├── Young Gen (Eden + 2 Survivors)
│   └── Old Gen (Tenured)
├── Per-thread: PC reg, JVM stack, native stack
└── Direct memory (NIO, off-heap caches like Tungsten, Arrow)
  • Objects are born in Eden; survive minor GC → Survivor → eventually Tenured.
  • Metaspace replaces PermGen (Java 8+); it grows in native memory. Bound it with -XX:MaxMetaspaceSize.

14.2 Garbage collectors

| GC | Pause | Throughput | Heap sweet spot | Use case |
| --- | --- | --- | --- | --- |
| Serial | High | Low | <100 MB | Tiny / CLI |
| Parallel (Throughput) | High | Highest | <8 GB | Batch / Spark executors |
| G1GC | Predictable | Good | 4–32 GB | General services; default since Java 9 |
| ZGC | <1 ms | Good | 8 GB – 16 TB | Low-latency services, very large heaps |
| Shenandoah | <1 ms | Good | Large | Latency-critical |

Key flags: -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:InitiatingHeapOccupancyPercent=35.

14.3 Java Memory Model — happens-before

  • volatile → visibility + ordering, not atomicity. Use AtomicLong / LongAdder for atomic counters.
  • synchronized → mutual exclusion + happens-before.
  • final → safe publication of immutable fields after constructor exits.
  • Locks → ReentrantLock (fairness, tryLock), ReadWriteLock, StampedLock.

14.4 Concurrency toolkit

  • ConcurrentHashMap — striped locking pre-Java 8, CAS + bin-level sync from Java 8.
  • CompletableFuture — async pipelines (thenCompose, thenCombine).
  • Executors — newFixedThreadPool, ForkJoinPool (work-stealing), and virtual threads (Project Loom, Java 21) for massive I/O concurrency without callback hell.
  • Backpressure — bounded queues + RejectedExecutionHandler (CallerRuns is the safe default for ingestion).

14.5 Common pitfalls

  • Unbounded Executors.newCachedThreadPool → OOM under burst.
  • String.intern() in hot path → metaspace pressure.
  • synchronized on autoboxed Integer / Long → cross-thread aliasing bug.
  • Humongous allocations in G1 (objects ≥ 50% of region size) bypass Eden, hurt collection.
  • Skipping try-with-resources (it closes resources automatically, in reverse declaration order): leaked file descriptors eventually exhaust the process.
  • Virtual threads can be pinned by synchronized blocks holding I/O. Use ReentrantLock if blocking inside lock is expected.

14.6 Q&A with reasoning

Q. Why does GC affect p99 latency more than average? Stop-the-world pauses are visible only to requests unlucky enough to hit them. Increasing the heap reduces frequency of full GCs but increases their duration. ZGC / Shenandoah trade some throughput for sub-ms pauses.

Q. Why is volatile not enough for a counter? volatile gives visibility but not atomicity — count++ is read-modify-write, three steps, not atomic. Use AtomicLong (or LongAdder under high contention).

Q. PySpark executor is killed by YARN/K8s but heap looks fine. Why? Python worker memory and native libs (Arrow, ONNX, etc.) live in memoryOverhead, outside the heap. Increase spark.executor.memoryOverhead.

Q. What's a humongous object in G1? Object ≥ 50 % of region size. Allocated directly into old gen → may fragment regions, may trigger long full collections. Avoid by tuning region size or rewriting to chunked processing.


15. Python — What Matters for Data Engineering

15.1 Object model

In CPython, almost everything is an object. Variables hold references, not values. b = a copies the reference, not the underlying object — that's why mutation is visible through both names.

15.2 The GIL

Only one thread executes Python bytecode at a time in CPython.

| Workload | Tool |
| --- | --- |
| CPU-bound math | multiprocessing.Pool, NumPy/Numba (release the GIL) |
| I/O bound (HTTP, DB) | asyncio + aiohttp/asyncpg |
| Mixed | concurrent.futures.ThreadPoolExecutor for simplicity |
| Big data | PySpark (Python driver, JVM executors) |

15.3 Iterators, generators, async

from itertools import islice

def chunks(iterable, n):
    it = iter(iterable)
    while True:
        chunk = list(islice(it, n))
        if not chunk:
            return
        yield chunk

Generators are lazy → constant memory for streaming ETL. asyncio is great for I/O concurrency; never call blocking I/O inside an async function without run_in_executor.
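A minimal run_in_executor sketch; blocking_read is a stand-in for any synchronous driver or file call:

import asyncio, time

def blocking_read(path):               # stand-in for a sync DB/file/HTTP call
    time.sleep(0.5)
    return f"contents of {path}"

async def fetch(path):
    loop = asyncio.get_running_loop()
    # Run the blocking call on the default thread pool so the event loop
    # keeps serving other coroutines instead of stalling on it.
    return await loop.run_in_executor(None, blocking_read, path)

async def main():
    results = await asyncio.gather(fetch("a.csv"), fetch("b.csv"))
    print(results)

asyncio.run(main())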

15.4 PySpark specifics

  • Driver in Python ↔ JVM via Py4J. Executors run Python workers per task.
  • Python UDFs serialize each row → 10–100× slower than DataFrame ops.
  • Pandas UDFs (vectorized) with Apache Arrow run on columnar batches → 10–100× speedup (sketch after this list).
  • Avoid collect() on big DFs; use toLocalIterator().
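A minimal pandas UDF sketch for contrast with a row-wise Python UDF. It assumes a live SparkSession and PyArrow installed; the column name and tax rate are made up:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, udf
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000).withColumnRenamed("id", "amount")

# Row-at-a-time Python UDF (shown only for contrast): every value crosses
# the JVM <-> Python boundary one row at a time.
slow_tax = udf(lambda x: x * 0.18, DoubleType())

# Vectorized pandas UDF: Arrow ships whole columnar batches to the Python worker.
@pandas_udf("double")
def fast_tax(amount: pd.Series) -> pd.Series:
    return amount * 0.18

df.select(fast_tax("amount").alias("tax")).show(3)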

15.5 Idiomatic performance

  • Vectorize with NumPy/Pandas; avoid per-row Python loops.
  • dict.get, setdefault, defaultdict, Counter.
  • __slots__ to remove __dict__ overhead for millions of objects.
  • functools.lru_cache for memoization; cached_property for instance-level.

15.6 Q&A with reasoning

Q. Why does len({1, True, 1.0}) == 1? 1 == True == 1.0 and their hashes are equal, so the set treats them as the same key.

Q. Why is mutable default arg def f(x=[]) dangerous? The default list is created once at function definition time. Each call mutating it accumulates state across calls. Use x=None and if x is None: x = [] inside.

Q. Why are dicts and sets average O(1)? Open addressing hash table with a perturbation-based probe sequence; it resizes when the load factor exceeds ~⅔ to keep probe chains short.

Q. Why does pandas.DataFrame.apply not release the GIL? Pandas applies the user function row-by-row in Python; the function executes Python bytecode. NumPy releases the GIL because its vectorized ops are in C.


16. Cloud, Containers, Kubernetes — Just Enough

flowchart TB Code["Code"] --> Img["Container image
(Docker / OCI)"] Img --> Reg["Registry"] Reg --> K8s["Kubernetes cluster"] K8s --> Dep["Deployment
(replicas, rolling update)"] Dep --> Pod1["Pod 1
(1+ containers)"] Dep --> Pod2["Pod 2"] Svc["Service
(stable virtual IP)"] --> Pod1 Svc --> Pod2 Ing["Ingress / Gateway API
(L7 routing)"] --> Svc

16.1 What bites engineers in production

  • Requests vs limits — requests reserve capacity, limits cap usage. Set both, especially on memory.
  • Liveness vs readiness probes — readiness gates traffic; liveness restarts crashed pods. Mixing them up kills healthy pods under load. Add a startup probe for slow-starting apps to avoid liveness false-positives during boot.
  • Spark on K8s — each executor is a pod; use dynamic allocation with shuffle tracking. Pin to nodepools with taints; spot for cost.
  • Secrets — don't bake them into images; mount via Secret/ExternalSecret/CSI driver.
  • Workload identity — prefer per-workload cloud identities (IRSA / Azure WI / GCP WIF) over long-lived keys.
  • Image supply chain — multi-stage builds, distroless base, SBOM (Syft), signing (cosign), admission control.

16.2 Cloud building blocks (cross-cloud)

| Layer | AWS | Azure | GCP |
| --- | --- | --- | --- |
| Object storage | S3 | ADLS Gen2 / Blob | GCS |
| Compute (VM) | EC2 | VM | Compute Engine |
| Containers | EKS / ECS / Fargate | AKS / ACA / ACI | GKE / Cloud Run |
| Function | Lambda | Functions | Cloud Functions |
| Spark | EMR / Glue / Databricks | HDInsight / Synapse / Databricks | Dataproc / Databricks |
| DW | Redshift | Synapse | BigQuery |
| Streaming | Kinesis / MSK | Event Hubs / Service Bus | Pub/Sub |
| Catalog | Glue Data Catalog | Purview | Dataplex |
| Vector | OpenSearch / Aurora pgvector | AI Search | Vertex AI Search / AlloyDB |

16.3 Observability trinity

  • Metrics — Prometheus + Grafana (or Cloud Monitoring).
  • Logs — Loki / OpenSearch / Cloud Logging.
  • Traces — OpenTelemetry SDKs → Jaeger / Tempo / Cloud Trace.
  • GenAI semantic conventions (OTel) for tracing prompts/retrievals/tool calls.

17. CI/CD and Testing — Confidence to Ship

A complete test pyramid for a data service:

flowchart TB U["Unit tests
(fast, pure)"] --> I["Integration tests
(Testcontainers: Postgres, Kafka, S3-mock, Spark)"] I --> C["Contract tests
(producer ↔ consumer, Pact)"] C --> D["Data quality tests
(Great Expectations, Soda)"] D --> E2E["E2E + smoke
(against staging)"] E2E --> Eval["AI / agent evals
(Ragas, golden sets, LLM-as-judge)"]

17.1 What each layer guards against

  • Unit — pure logic regressions. Fast feedback.
  • Integration — real DB, Spark, Kafka, S3 (LocalStack/MinIO). Most production bugs live at boundaries.
  • Contract — schema and semantic agreements between services.
  • Data quality — datasets failing even when code passes (schema drift, missing rows, distribution change).
  • E2E — system behavior under realistic flows.
  • AI evals — faithfulness, recall@k, tool-call correctness; gate merges on regression.

17.2 CI/CD patterns worth defending

  • Trunk-based development + short-lived PR branches.
  • Caching of Maven/Gradle/uv lockfile-based downloads (saves minutes).
  • Reproducible builds — pinned base images, lockfiles, SLSA provenance.
  • Security gates — SAST (Semgrep), dependency scan (Dependabot/Renovate), secrets scan (gitleaks), container scan (Trivy/Grype), SBOM (Syft) + signing (cosign).
  • Deploy — GitOps (Argo CD / Flux); progressive delivery (Argo Rollouts canary/blue-green); auto-rollback on SLO regression.

17.3 Quality gates worth setting

  • ≥ 80 % coverage on changed files (not absolute).
  • 0 critical CVEs in shipped image.
  • Mutation score on critical modules ≥ 60 %.
  • Performance regression budget on integration tests.
  • Contract test on every consumer/provider before deploy.

Part VIII — Putting It Together

18. One Coherent Mental Map

flowchart LR subgraph Ingest K["Kafka / CDC"] --> S["Spark / Flink
streaming ingest"] end subgraph Storage S --> Ice[("Iceberg lakehouse
(audit / replay)")] S --> G[("Graph
(lineage)")] S --> ES[("Search index")] S --> V[("Vector index")] end subgraph Serve Ice --> Q["Query / API"] G --> Q ES --> Q V --> R["RAG / Agents"] Q --> R end subgraph Govern Ice --> GovSvc["Catalog · DQ · Classification ·
Lineage · Policy · Stewardship"] G --> GovSvc ES --> GovSvc GovSvc --> R end R --> UI["UI · Notebooks · Apps"] Q --> UI

One sentence per layer:

  • Ingest — absorb data with backpressure and idempotent / exactly-once semantics.
  • Storage — keep cheap durable truth in a lakehouse; add specialized indexes (graph, search, vector) that can be rebuilt from the lake.
  • Serve — expose query APIs and AI features that combine retrieval, traversal, and generation.
  • Govern — make every asset discoverable, trusted, and auditable; let humans approve what models can't.

19. The Ten Self-Check Questions

Answer each out loud using the What → Why → How → Trade-off → Failure → Example framework. If you can't, that's the topic to revisit.

  1. Walk a Spark query through Catalyst → Tungsten → AQE → physical execution. Where do plans change at runtime, and why?
  2. Why is shuffle the dominant cost in distributed query engines, and how does push-based shuffle improve it?
  3. Explain Iceberg's commit protocol and how two concurrent writers reconcile.
  4. Compare column-level lineage in a graph DB vs recursive SQL at 100 M edges. Where does each break?
  5. Derive the cost of self-attention and explain KV cache + FlashAttention.
  6. Design a multi-tenant RAG system: where do you enforce tenant isolation, and how do you defend against indirect prompt injection?
  7. Sketch an idempotent CDC pipeline Postgres → Kafka → Iceberg with exactly-once semantics. Where can duplicates appear, and how do you dedupe?
  8. Choose between broadcast hash, shuffle hash, and sort-merge join — what stats and runtime signals drive AQE's decision?
  9. Explain the outbox pattern and why dual-write is unsafe without it.
  10. Layered PII classification: what's deterministic, what's ML, what's an LLM, and where does a human approve?

20. Closing — How to Compound These Ideas

The point of going this deep isn't trivia. Once you internalize why each component exists, you can compose them. A new architecture is just an old set of trade-offs in a new shape. If you can articulate the original problem, the rest is engineering.

Three habits make this compound:

  1. Speak every concept out loud with the What → Why → How → Trade-off → Failure → Example framework. Silent reading creates a false sense of mastery.
  2. For each topic, draw the flow before answering. The diagram forces you to commit to a structure; the words follow.
  3. For each "why," chase one more "why." "Shuffle is expensive." → Why? "Network + sort + spill + reducer pressure." → Why does that hurt p99? "One straggler bounds stage completion." → How do I see it? "Spark UI: max vs median task time." That fourth-level answer is what senior interviewers listen for.

Anything weaker than that, treat as a draft.


Part IX — In-Depth Quiz Bank with Reasoning

Each quiz item below is answer + reasoning, not just the answer. Speak each one out loud. If you only know the answer but not the why, you don't know it yet.

21. Spark Core — Quiz With Reasoning

  1. Default spark.sql.shuffle.partitions is 200. Historical safe middle ground. Modern jobs resize per data volume so each shuffle partition is ~128–256 MB. AQE coalesces tiny partitions automatically, so 200 is "OK" out of the box but rarely optimal.
  2. Wide vs narrow. groupByKey is wide because rows for the same key must meet on one reducer, requiring shuffle. map, filter, mapPartitions are narrow because each output partition derives from one input partition.
  3. reduceByKey vs groupByKey().mapValues(_.sum). reduceByKey does map-side combine — it sends only partial sums across the network. groupByKey ships every value, blowing up shuffle and reducer memory. Same result, very different cost.
  4. spark.sql.adaptive.enabled has defaulted to true since Spark 3.2. AQE coalesces tiny partitions, switches sort-merge to broadcast when actual stats are small, and splits skewed partitions. Default-on enables runtime adaptation without manual hints.
  5. MEMORY_ONLY storage level. Stores deserialized objects in heap. Fastest read but heaviest memory; eviction on memory pressure. Prefer MEMORY_AND_DISK_SER for large reused DataFrames so spilled partitions are still recoverable.
  6. Unit of parallelism = partition. One task per partition per stage. Total parallelism scales with partitions × cores per executor × number of executors.
  7. persist vs checkpoint. persist keeps lineage; if a partition is lost, Spark recomputes from lineage. checkpoint writes to reliable storage and truncates lineage — critical for very long chains and iterative algorithms where recomputation would be catastrophic.
  8. Broadcast variable vs collect. Broadcast distributes a value once per executor; tasks read it locally. collect returns the entire DataFrame to the driver, paying full network cost and risking driver OOM.
  9. df.explain(true). Shows Parsed, Analyzed, Optimized logical plans and the Physical plan. Use it to confirm predicate pushdown, projection pruning, partition filters, and join strategy.
  10. Whole-stage codegen disabled when. Inside UDFs (non-codegen), with > 100 inputs/outputs, hitting fallback rules, or when spark.sql.codegen.wholeStage=false. Codegen merges operators into one Java function; UDFs are opaque to Catalyst, so it cannot fuse them.

22. Spark Tuning — Quiz With Reasoning

  1. autoBroadcastJoinThreshold default = 10 MB. Keeps broadcasts safe by default. Tables estimated above this size aren't auto-broadcast; you can hint broadcast(df) if Catalyst underestimates (see the sketch after this list).
  2. Disable broadcast for one join. Set spark.sql.autoBroadcastJoinThreshold=-1 for the session, or hint a different strategy on that join (e.g., /*+ MERGE(t) */ or /*+ SHUFFLE_HASH(t) */) so Catalyst won't pick broadcast. Why: if the "small" side is large in reality, broadcast OOMs the executors or the driver.
  3. spark.sql.adaptive.skewJoin.enabled. Splits a skewed partition into sub-partitions and replicates the other side, shifting work from one slow reducer to several fast ones — task runtime is bounded by the slowest task, so cutting stragglers wins.
  4. Small-files threshold ~ < 64 MB. Below this, listing/opening overhead dominates read time on object stores. Compaction targets 128–512 MB for analytics.
  5. Default write format = Parquet since Spark 2.0. Columnar, predicate pushdown, footer stats, splittable.
  6. coalesce(1) vs repartition(1). coalesce avoids shuffle and may produce uneven partitions; repartition(1) triggers a full shuffle to evenly redistribute then collapse. Use coalesce for small downsizing without shuffle, repartition to force redistribution.
  7. Best storage for a very large reused DF. MEMORY_AND_DISK_SER. Serialized form fits more rows per byte and spills under pressure instead of OOM.
  8. GC time signal in Spark UI. Stages/Executors GC time column; if more than ~10 % of task time, tune memory/allocation (more partitions, smaller cache, raise overhead).
  9. Approximate distinct. approx_count_distinct uses HyperLogLog++. Exact distinct requires shuffling all values; HLL needs KB-sized sketches at ~1 % error.
  10. Bucketing vs broadcasting. For repeated large-large joins on the same key, bucketed tables avoid shuffle every time the join runs. Broadcasting only works for small dimensions.
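A small PySpark sketch of the broadcast hint and the session-level off switch mentioned in items 1 and 2 (table and column names are hypothetical):

from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.getOrCreate()
facts = spark.table("sales")        # hypothetical large fact table
stores = spark.table("stores")      # hypothetical small dimension

# Force a broadcast hash join even if Catalyst's stats underestimate the dimension.
joined = facts.join(broadcast(stores), "store_id")

# Session-level off switch: disable automatic broadcast joins entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)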

23. SQL / RDBMS — Quiz With Reasoning

  1. Default isolation: PG = Read Committed; MySQL InnoDB = Repeatable Read. Read Committed sees only committed values per statement, allowing non-repeatable reads but maximizing concurrency. MySQL's Repeatable Read uses gap locking to prevent phantoms in many cases.
  2. Covering index. Index that contains every column the query needs, so the engine answers directly from the index without a heap fetch. Why it's faster: random heap I/O is the slowest part of an index lookup.
  3. LEFT JOIN vs LEFT SEMI JOIN. LEFT JOIN keeps both tables' columns and emits NULLs for non-matches. LEFT SEMI returns only the left rows where a match exists; useful for existence filters where you don't need right-side columns.
  4. VACUUM FULL vs VACUUM. Regular VACUUM marks dead tuples reusable. VACUUM FULL rewrites the table and indexes to reclaim space; takes ACCESS EXCLUSIVE lock and rewrites blocks.
  5. Prefix search index in Postgres. B-tree with text_pattern_ops for LIKE 'foo%' because it sorts lexicographically; trigram GIN supports more flexible patterns.
  6. MySQL InnoDB clustered index. The PRIMARY KEY — table rows are physically stored in PK order. Choosing a wide or non-monotonic PK fragments inserts.
  7. Phantom-read prevention. Serializable isolation. In Postgres, SERIALIZABLE is implemented as Serializable Snapshot Isolation (SSI), which detects dangerous read/write dependencies and aborts one transaction; PG's Repeatable Read (plain snapshot isolation) avoids most phantoms in practice but not every anomaly.
  8. Leftmost-prefix rule. Composite index (a, b, c) is usable for filters on a, (a, b), or (a, b, c) — not for b alone — because B-tree ordering is anchored on the leading column.
  9. BIGINT vs UUID PK. BIGINT is small, sequential, fast to insert. UUIDv4 is globally unique without a central allocator but destroys clustered-index locality. Use ULID or UUIDv7 for time-ordered uniqueness.
  10. EXPLAIN (BUFFERS). Shows shared/local/temp buffer hits, reads, dirtied pages per node — the actual I/O profile, not just the cost estimate.

24. Distributed Storage — Quiz With Reasoning

  1. S3 max object size = 5 TB. Single-object cap; multipart upload is required above 5 GB so failed chunks can be retried.
  2. HDFS default block size = 128 MB. Large blocks reduce NameNode metadata pressure and amortize seek cost over reads.
  3. Spark on S3 committer (2026). S3A magic committer. It uses S3 multipart-upload metadata to atomically publish task outputs and avoids slow rename. Classic FileOutputCommitter is unsafe with speculative execution because two attempts can both produce the same output.
  4. LIST consistency on S3. Strongly consistent since Dec 2020. Before that, eventual consistency caused stale listings; lakehouse formats sidestep listings via metadata files.
  5. Cheapest archival tier on AWS. S3 Glacier Deep Archive. Hours to retrieve; 180-day minimum storage. Use for compliance archive, not active reads.
  6. Speculative execution risk on S3. Two attempts may write the same task output. The classic committer's rename can leave duplicates; the magic committer assigns unique multipart paths and commits via metadata.
  7. HDFS Erasure Coding RS-6-3. Stores 6 data + 3 parity blocks; effective replication ratio = 1.5×. Saves storage at CPU cost for encoding/decoding.
  8. ADLS protocol prefix in Spark. abfss:// (secure). Use it instead of older wasbs:// Blob URIs.
  9. Parquet has native column-level encryption (Modular Encryption since 1.12). Allows per-column keys and integrity checks for governance.
  10. Best partition column for daily logs. date (low cardinality, time-aligned). Avoid high-cardinality columns like user_id to prevent partition explosion.

25. Iceberg / Lakehouse — Quiz With Reasoning

  1. Iceberg atomic commit mechanism. Catalog-level atomic swap (HMS / Glue / REST / Nessie). The catalog provides compare-and-set on the table pointer; readers see only committed snapshots.
  2. Manifest list contents. Pointers to manifest files for a snapshot, with partition-summary stats used for partition-level pruning before reading the manifests themselves.
  3. Iceberg schema evolution by column ID, not name. Renames preserve identity; old data still maps to the right logical column.
  4. expire_snapshots. Removes old snapshots beyond retention and unreferenced data files. Without it, table metadata grows indefinitely and storage cost balloons.
  5. Hudi CoW vs MoR. Copy-on-Write rewrites whole files on update; reads stay simple but writes are heavier. Merge-on-Read appends delta files merged at read; faster writes, slower reads, periodic compaction needed.
  6. Delta transaction log location. _delta_log/ directory at table root holding JSON commits and Parquet checkpoints.
  7. Default Iceberg catalog standard in 2026. REST catalog. Cross-engine, cloud-native, language-agnostic.
  8. Iceberg time-travel SQL. FOR VERSION AS OF <id> or FOR TIMESTAMP AS OF <ts>. Engines resolve via snapshot history.
  9. Z-order improves multi-column data skipping by spatial locality. Records with similar values land in the same files, so predicate pruning skips more files.
  10. Equality vs position deletes. Position deletes point at file + row and are cheap to apply at read. Equality deletes specify a predicate and require a join at read time, more expensive but more flexible (good for streaming CDC).

26. Data Modeling — Quiz With Reasoning

  1. Grain. The level of detail one fact-table row represents. Grain decisions drive every aggregation. Mixing grains causes silent double counting.
  2. Star vs snowflake. Snowflake normalizes dimensions into multiple tables; star keeps them flat. Star = simpler, fewer joins, faster scans. Snowflake = less duplication, harder reads.
  3. SCD Type 2. Track full history with effective_from, effective_to, is_current. Each change creates a new row; old rows remain for point-in-time queries.
  4. Surrogate keys. Insulate fact from natural-key changes, allow SCD2, enable smaller integer joins. Natural keys are still stored for traceability.
  5. Bridge table. Resolves many-to-many between fact and dimension. Example: a healthcare fact links to multiple diagnoses through a bridge.
  6. Additive measures sum correctly across all dimensions. Semi-additive (e.g., balance) sums across some dimensions; non-additive (e.g., ratios) needs special handling.
  7. Junk dimension. Combines low-cardinality flags into one dim to avoid many tiny dims and reduce fact-table key counts.
  8. Conformed dimension. Same dim used identically across multiple fact tables, enabling cross-process analysis.
  9. Factless fact table. Records events with no measure column. Example: student-attendance event used for counts.
  10. Late-arriving dimension fix. Insert "Unknown" placeholder in the dim immediately, update the fact with the proper key when the dim arrives, or look up the natural key with a backfill job.

27. Graph Databases — Quiz With Reasoning

  1. Index-free adjacency. Each node stores direct pointers to its relationships. Once you anchor a query at a node, traversal is pointer-following, not global index lookup.
  2. OPTIONAL MATCH returns NULLs when patterns are absent, similar to LEFT OUTER JOIN.
  3. Uniqueness constraint. CREATE CONSTRAINT FOR (d:Dataset) REQUIRE d.id IS UNIQUE. Enforces a single dataset per id and creates a backing index.
  4. *1..5. Variable-length path of 1 to 5 hops. Always bound length to avoid traversal explosion.
  5. Relationship direction. Stored directed; queries can ignore direction with -[r]-.
  6. Neptune query languages. openCypher and Gremlin; SPARQL via a separate endpoint.
  7. One-hop neighbors. O(degree). For supernodes this is huge; bound traversal or filter by relationship type.
  8. Plan inspection. EXPLAIN (no execution) and PROFILE (execution + DB hits).
  9. Supernode. A node with millions of relationships; traversals through it explode. Mitigate by tag/tenant partitioning, indexed predicates, or relationship-level filters.
  10. Best for in-memory graph workloads. Memgraph (or Neo4j tuned with sufficient page cache).

28. Elasticsearch — Quiz With Reasoning

  1. Default scoring = BM25 since 5.x. Improves over TF-IDF by saturating term frequency and normalizing for document length.
  2. text vs keyword. text is analyzed (tokenized) for full-text search; keyword stored as-is for exact match, sorting, aggregations. Most catalog fields need both via multi-field mapping.
  3. JVM heap ≤ 32 GB. Above 32 GB the JVM uses 64-bit object pointers; compressed oops are lost, raising memory cost. Reserve remaining RAM for OS file cache (Lucene loves it).
  4. Translog. Per-shard write-ahead log replayed on crash. Tunable via durability settings; flush makes the shard durable to disk.
  5. Default refresh interval = 1 second. Set to -1 during bulk indexing for throughput, then restore.
  6. Change primary shards after creation? Not directly. Use reindex into a new index with desired shards, or shrink/split.
  7. Phased rollover indexes. Index Lifecycle Management (ILM) with rollover by size/age/docs.
  8. Cheap top-K aggregations. terms aggregation with size and shard_size tuned. Be careful with high-cardinality fields; use composite aggregation for paging.
  9. Vector search type. Dense vector with k-NN (HNSW indexing, since 8.x).
  10. match vs term. match analyzes the query text (tokenizes) before matching. term does exact match on the indexed token; use only on keyword fields.

29. LLMs — Quiz With Reasoning

  1. Self-attention complexity. O(n² · d). Each token attends to every other token, and each attention head computes a d-dimensional dot product per token pair. Long context blows up compute and memory.
  2. Default tokenizer for Llama family. SentencePiece BPE with byte fallback (Llama 1/2); Llama 3 moved to a tiktoken-style BPE with a ~128K vocabulary. Byte fallback ensures any text encodes, even unseen scripts/symbols.
  3. Decoder-only models. GPT, Llama, Mistral, Claude, Gemini. They generate left-to-right with causal attention and dominate chat/agents.
  4. Cosine vs dot. Without normalization, dot favors longer vectors. After normalization, cosine equals dot. Most embedding APIs return normalized vectors.
  5. Most popular ANN index in 2026 = HNSW (hierarchical navigable small world). Good recall vs latency, supports filtered search.
  6. Continuous batching tool. vLLM with PagedAttention. New requests slot into a running batch, raising throughput dramatically.
  7. SFT vs RLHF/DPO. SFT trains the model to imitate reference answers. RLHF/DPO trains it to prefer outputs humans rated higher. RLHF/DPO improves helpfulness and safety alignment beyond imitation.
  8. System prompt. Top-of-conversation instruction defining role, constraints, format. Treated as trusted context; never let retrieved data override it.
  9. KV cache contents. Per-layer attention K and V tensors for all already-generated tokens. Memory grows linearly with batch × layers × seq length.
  10. Smallest GPU tier where Llama-3-70B Q4 fits. ~40 GB (e.g., A100 40 GB or H100 partition with offload), or 2×24 GB consumer GPUs with tensor parallel.

30. RAG and Agents — Quiz With Reasoning

  1. RRF = Reciprocal Rank Fusion. Scores each document by summing 1 / (k + rank) over every list it appears in (k ≈ 60), then re-sorts by the fused score. Good when scores are not directly comparable (BM25 vs cosine); see the sketch after this list.
  2. HyDE = Hypothetical Document Embeddings. Have the LLM imagine an answer, embed it, and retrieve real docs by similarity to that. Helps when queries are short and ambiguous.
  3. Reranker purpose. Improves top-K precision by re-scoring candidates with a more expensive cross-encoder.
  4. Vector DB graph index = HNSW.
  5. Long-context failure that RAG sidesteps. "Lost in the middle" — recall drops for content placed in the middle of long contexts. RAG sidesteps it by retrieving only what is needed.
  6. Ragas faithfulness metric. Measures whether the generated answer is supported by retrieved context. Low faithfulness = hallucination.
  7. ReAct = Reasoning + Acting interleaved with tools. The model reasons, calls a tool, observes the result, and continues.
  8. Function-calling output. Model emits a structured JSON object matching the declared schema; your code performs the actual call. That separation is what lets you sandbox tools.
  9. Tenant filters in RAG. Apply pre-retrieval as a metadata filter; assert again pre-generation. Post-only filtering can leak counts/snippets and waste compute.
  10. Chunk size. ~500–1,000 tokens with 10–20 % overlap. Big enough for context, small enough for precise retrieval.
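A minimal RRF sketch for item 1; k = 60 is the commonly used constant and the doc IDs are illustrative:

def rrf(rankings, k=60):
    """Fuse ranked lists of doc ids; rankings = [list_of_ids_best_first, ...]."""
    scores = {}
    for ranked in rankings:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25 = ["d3", "d1", "d7"]       # lexical ranking
dense = ["d1", "d9", "d3"]      # vector ranking
print(rrf([bm25, dense]))        # d1 and d3 float to the top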

31. Distributed Systems HLD — Quiz With Reasoning

  1. CAP under partition. You must choose C or A. Partition tolerance is mandatory for distributed systems. Real systems also obey PACELC: when no partition, choose Latency vs Consistency.
  2. Quorum for strong consistency in N=5. R + W > N. Examples: R=3, W=3 or R=2, W=4. Overlapping quorums guarantee a read sees the latest write.
  3. Consistent hashing benefit. Re-shards minimally when nodes are added/removed (only keys move). Virtual nodes balance load.
  4. Outbox pattern. Atomically write DB row + outbox event in same transaction; relay publishes events to broker. Solves dual-write inconsistency.
  5. At-least-once vs exactly-once. At-least-once may deliver duplicates. "Exactly-once" is really at-least-once + idempotent consumer or transactional sink.
  6. 2PC vs Saga. 2PC is synchronous, blocking, strong but fragile. Saga is async, eventual, with compensating actions; better for microservices.
  7. Postgres CDC standard. Logical replication / WAL via Debezium emitting Avro/JSON/Protobuf.
  8. Avoid hot key on a global counter. Bucket/salt the key, distribute across many partitions, aggregate periodically.
  9. Raft fault tolerance. Need 3 nodes to tolerate 1 failure. Quorum is ⌊N/2⌋ + 1, i.e., a majority.
  10. Idempotency-key location. Header or body, validated server-side; persisted with the record for dedupe across retries.

32. Algorithms and Data Structures — Quiz With Reasoning

  1. HLL memory for ε = 0.01 ≈ 16 KB. HyperLogLog uses leading-zeros of hashes across registers; fewer registers means more error.
  2. Bloom filter and false negatives. Never. Insertion sets all hash bits; if any bit is missing during lookup, the item was definitely never inserted. False positives only.
  3. Heap from list complexity. O(n) using heapify (sift-down from middle). Pushing one-by-one is O(n log n).
  4. Best sort for nearly-sorted data. Insertion sort (O(n) when nearly sorted), or Timsort (Python default) which detects natural runs and merges.
  5. Path compression cost. Effectively ≈ constant. Combined with union-by-rank.
  6. Trie space for 10k words avg length 8. ~80k nodes worst case, often less with shared prefixes. Use compact trie / DAWG to save more.
  7. Dijkstra failure. Negative edges; use Bellman-Ford for negative weights (no negative cycles), or Johnson's algorithm for all-pairs shortest paths on sparse graphs.
  8. Shortest unweighted path = BFS. Each level expansion equals one hop, so first time you reach a node is via the shortest path.
  9. HLL k. Number of registers (m = 2^p). Typical m = 2^14 = 16,384 registers gives ~1 % error, since the standard error is ≈ 1.04 / √m.
  10. Quickselect. Average O(n), worst O(n²). Median-of-medians gives O(n) worst case.

33. Java / JVM — Quiz With Reasoning

  1. Default GC in Java 11 = G1GC. Java 9 made G1 default because it provides predictable pause times via region-based collection, suiting server workloads with multi-GB heaps. Parallel GC maximizes throughput but has long pauses, hurting p99. ZGC/Shenandoah trade some throughput for sub-millisecond pauses.
  2. Two ways to safely publish a value without synchronized. A volatile field gives happens-before for visibility/ordering. A final field gives safe publication after the constructor finishes. AtomicReference provides atomic CAS.
  3. -XX:+HeapDumpOnOutOfMemoryError writes a .hprof when the JVM throws OutOfMemoryError. Heap dumps captured at the moment of failure expose the real retained set, hard to reconstruct later.
  4. ConcurrentHashMap.size() cost. Approximate under concurrent writes: sumCount() adds a base counter plus striped CounterCells (LongAdder-style). Bin-level CAS removed the global lock, so increments stay cheap but there is no exact, moment-in-time size while writers are active.
  5. Runnable vs Callable. Runnable.run() returns void and cannot throw checked exceptions. Callable<T>.call() returns T and can throw. Executors use Callable and surface results via Future.
  6. Humongous region in G1. A region holding an object ≥ 50 % of region size, allocated directly into old gen. Many humongous allocations indicate big arrays/strings that should be streamed.
  7. LongAdder vs AtomicLong. LongAdder shards the counter into multiple cells, reducing CAS contention; reads sum cells. Use it for high write contention; AtomicLong when reads must be exact and writes are infrequent.
  8. Effectively immutable class. All fields final, no leaking this from constructor, defensive copies of mutable refs, no mutator methods. JMM safe-publication guarantees other threads see constructor-completed state.
  9. Finalizer thread. A JVM-internal thread that processes Object.finalize() queue. Deprecated for removal because finalization is unpredictable, can resurrect objects, and runs on a single bottleneck thread. Prefer try-with-resources and Cleaner.
  10. Print all GC events with timestamps (Java 9+). -Xlog:gc*=info:file=gc.log:time,uptime,level,tags. Unified logging replaces fragmented Java 8 flags.

34. Python — Quiz With Reasoning

  1. bool([]) is False. Empty containers are falsy; __bool__ falls back to __len__, which returns 0. That's why if not items: is idiomatic.
  2. a = 1000; b = 1000: is vs ==. == is True. is may be False because identity holds only when CPython interns the value; only small ints in [-5, 256] are guaranteed cached. Avoid is for value comparison.
  3. *args, **kwargs. Desugar to a positional tuple and keyword dict. Useful in decorators with functools.wraps.
  4. Hashable class. Implement __hash__ and __eq__ consistently. Hashable classes should be effectively immutable; mutating after insertion into a set/dict corrupts the structure.
  5. Shallow vs deep copy. Shallow copy.copy duplicates the outer container, sharing inner references. Deep copy.deepcopy recursively duplicates. Choose based on whether mutation of inner objects must be isolated.
  6. yield from delegates iteration, exceptions, send, and return-value of a sub-generator. Composes generators without re-implementing the protocol.
  7. Mutable default argument trap. Default values evaluate once at function-definition time. def f(x=[]) shares one list across calls. Use def f(x=None) with if x is None: x = [] in the body.
  8. len({1, True, 1.0}) == 1. 1 == True == 1.0, and their hashes match (hash(1) == hash(1.0) == hash(True)), so the set collapses them.
  9. Membership: list vs set. set is O(1) average via hashing; list is O(n).
  10. @functools.wraps. Preserves wrapped function's __name__, __doc__, __qualname__, __module__, __wrapped__. Without it, decorators hide identities and break introspection.

35. OOP / LLD Patterns — Quiz With Reasoning

  1. if-instanceof cascade violates OCP (and often LSP). OCP says behavior should extend through new types, not edits to existing code. The cascade adds new behavior by editing one method.
  2. Strategy vs State. Strategy = caller picks an algorithm and swaps it. State = the object's behavior changes due to internal lifecycle. Strategy fits pluggable PII classifiers; State fits a job FSM (PENDING → RUNNING → SUCCEEDED).
  3. Add tracing without modifying repositories. Decorator pattern wraps a repository implementing the same interface (sketch after this list). Proxy is similar but typically adds control or remote concerns.
  4. Singleton trade-off. Hidden global state, hard to mock, tests interfere across cases, parallel tests become unsafe. Mitigate by injecting via a DI container with proper scoping.
  5. DTO location in hexagonal architecture. Application layer (input/output boundary) or shared kernel; never in domain. The domain must remain free from delivery and persistence concerns.
  6. Aggregate vs Entity. An aggregate is a cluster of entities and value objects with one root that enforces invariants. An entity has identity but no transactional boundary by itself.
  7. Visitor adds operations easily but resists adding new types. True. New operation = new Visitor; new type forces editing every Visitor.
  8. Build a complex immutable object with optional fields → Builder. Telescoping constructors and many setters break immutability and readability; Builder validates and constructs once with .build().
  9. CoR vs Pipeline. Chain of Responsibility lets any handler short-circuit by handling and returning. Pipeline always passes through every stage transforming data. CoR for validation/auth chains; pipeline for ETL.
  10. Composition over inheritance. Composition gives flexibility, supports delegation, avoids fragile-base-class problems, and simplifies testing because dependencies can be swapped.
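A minimal decorator-pattern sketch for item 3; the repository and tracer here are hypothetical stand-ins:

import time

class UserRepository:                        # hypothetical repository
    def find(self, user_id):
        return {"id": user_id, "name": "Ada"}

class TracingRepository:
    """Decorator: same interface, adds tracing, delegates to the wrapped repo."""
    def __init__(self, inner, tracer=print):
        self.inner, self.tracer = inner, tracer

    def find(self, user_id):
        start = time.perf_counter()
        try:
            return self.inner.find(user_id)
        finally:
            self.tracer(f"find({user_id}) took {time.perf_counter() - start:.6f}s")

repo = TracingRepository(UserRepository())
repo.find(42)    # traced without touching UserRepository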

36. Cloud & Kubernetes — Quiz With Reasoning

  1. K8s scheduler factors. Resource requests + node fit + affinity + taints/tolerations + topology spread. Scheduling failures show in Events.
  2. Requests vs limits. Requests = guaranteed (used by scheduler); limits = maximum allowed. Memory beyond limit → OOMKilled; CPU beyond limit → throttled.
  3. Memory limit exceeded. OOMKilled by kernel cgroups; Kubernetes restarts the container per restart policy.
  4. PodDisruptionBudget. Caps voluntary disruptions: you declare minAvailable or maxUnavailable, and drains/upgrades cannot evict pods past that budget. Protects availability.
  5. Run-to-completion = Job (and CronJob for scheduled). Track succeeded/failed counts; clean up with TTL controller.
  6. Mounting secrets as files. Files allow stricter perms, atomic rotation, and avoid leakage in process listings, container metadata, or crash dumps.
  7. In-cluster service type = ClusterIP (default).
  8. AWS pod-to-IAM mapping = IRSA (IAM Roles for Service Accounts) using OIDC trust between EKS and IAM.
  9. Image layer storage. Tar layers stored content-addressed by sha256 digest; identical layers shared between images.
  10. kubectl drain safely evicts pods from a node before maintenance, respecting PDBs and graceful termination.

37. CI/CD & Testing — Quiz With Reasoning

  1. JUnit 5 lifecycle hook before all tests = @BeforeAll (static method), runs once per test class. Use it for shared, expensive resources like Testcontainers.
  2. @Mock vs @Spy. @Mock returns defaults for unstubbed methods; @Spy wraps a real object so unstubbed methods run real code. Spies are useful for partial mocking but easy to misuse.
  3. PyTest fixture scopes. function (default), class, module, session. Wider scopes save setup time but require careful state management.
  4. Selenium W3C standard. Standardized WebDriver protocol; replaced JSON Wire and unified browser drivers.
  5. Testing code with Thread.sleep. Inject a clock or scheduler abstraction; never sleep in tests. Sleeps cause flakiness.
  6. assertDataFrameEqual ignores row order by default (sorts before comparing). Use checkRowOrder=True if order matters.
  7. SBOM tool = Syft (or Trivy SBOM, Anchore). SBOMs enable vulnerability scanning and supply-chain audits.
  8. SAST vs DAST. SAST analyzes source code at rest; DAST tests a running app. Both are needed; neither alone catches all classes of bugs.
  9. Flaky-test budget. Allowed % of intermittent failures before a test is auto-quarantined. Forces teams to fix flakes rather than ignore them.
  10. Trunk-based development. Faster integration, fewer merge conflicts, continuous delivery friendly. Long-lived branches drift and break.

38. Data Governance — Quiz With Reasoning

  1. OpenLineage core entities = Job, Run, Dataset. Lineage must represent what code (Job) ran in a specific Run, reading/writing which Datasets.
  2. Lineage granularity = table + column level. Column-level requires SQL parsing or runtime plan capture; table-level is the fallback.
  3. Catalog vs metastore. Catalog = governance/business metadata + UX. Metastore = technical schema registry used by query engines (Hive metastore, Glue).
  4. Current vs historical state. Both. Current state for query; snapshots/history for audit and time-travel.
  5. Glossary term vs tag. Glossary terms are curated business vocabulary with definitions and ownership. Tags are lightweight labels.
  6. Lineage edge storage. Graph DB (Neo4j/Neptune) for traversal; or columnar with adjacency for very large scale; often both with replay from raw events.
  7. DQ rule types. Completeness, uniqueness, validity, freshness, referential integrity, reasonableness.
  8. RBAC vs ABAC. RBAC assigns by role; ABAC composes attributes (subject + resource + env). ABAC scales better at enterprise scale and supports tag-based policies.
  9. Data contract. Producer-consumer agreement (schema, semantics, SLAs) version-controlled and tested. Breaks the silent-coupling problem.
  10. PII propagation. Tag at source, propagate through lineage automatically; downstream columns derived from a PII column inherit sensitivity until proven otherwise.

39. Agentic AI Design — Quiz With Reasoning

  1. ReAct = Reasoning + Acting interleaved with tools. Visible reasoning helps debugging; structured tool calls allow safe execution.
  2. Separate "judge" LLM is an independent quality check decoupled from generation; less prone to self-confirmation bias.
  3. Cheap dedupe across LLM calls. Cluster inputs by template/embedding similarity; reuse outputs for similar inputs.
  4. Trace store. Dedicated trace store (Langfuse, Phoenix, Arize) or OTel-compatible backend. Generic logging is too coarse for prompt + tool + retrieval traces.
  5. Anthropic-style "tool use". Model emits structured tool_use blocks; orchestrator executes and feeds results back.
  6. Bound max iterations. Cost + safety; prevent infinite loops, runaway tool use, and budget overrun.
  7. RAG vs fine-tune for changing data → RAG. Fine-tuning is for stable behavior, not freshness.
  8. GraphRAG. Retrieval that traverses a knowledge graph, expanding context via edges (lineage, ownership, policy).
  9. Agent failure mode #1. Tool errors / schema mismatches; second is over-iteration / cost.
  10. Tenant filters. Enforce pre-retrieval as filter and pre-generation as assert.

40. Coding-Round Pattern Recognition — Quiz With Reasoning

  1. FIFO queue in Python = collections.deque because popleft is O(1). list.pop(0) is O(n).
  2. Top-K largest streaming numbers. Min-heap of size K. New element compared to heap top; replace if larger. O(n log k).
  3. Subarray sum equals K. Prefix sum + hash map: track how many earlier prefix sums equal current_prefix - K (sketch after this list).
  4. Dependencies and ordering = topological sort. Detects cycles when nodes never reach in-degree 0.
  5. Connected components / repeated merges = Union-Find. With path compression + union by rank, operations are nearly O(1).
  6. Next greater element = monotonic stack of indices in decreasing value order. Pop until top is greater than current.
  7. Shortest unweighted path = BFS. Each level is one hop, so first reach is shortest.
  8. Shortest weighted path with non-negative weights = Dijkstra. Heap maintains current best; popped distance is final because edge weights are non-negative.
  9. Minimum speed/capacity/time = binary search on answer. Requires monotonic feasibility: if X works, all values ≥ X work.
  10. All combinations = backtracking. Grow a partial solution; backtrack on invalid branches.
  11. Sort lower bound = O(n log n). Comparison sorts have lower bound Ω(n log n); Timsort meets it while exploiting existing runs, dropping to O(n) on nearly sorted input.
  12. Heap push/pop = O(log n). Sift-up/down along height.
  13. bisect_left returns the first index where the value can be inserted to keep sorted order; equivalent to first index with arr[i] >= target.
  14. Recursion on deep graph. Stack-overflow risk with deep recursion. Use iterative DFS or raise recursion limit cautiously.
  15. BFS vs DFS memory. BFS holds an entire level (can be large). DFS holds path depth. Both are O(V) worst case but with different shapes.
  16. State in DP is a representation of the subproblem whose answer composes the bigger answer. State must capture all decisions affecting the future.
  17. Monotonic feasibility. Property where if X works, every X' on one side also works (or fails). Required for binary search on the answer.
  18. No false negatives, possible false positives = Bloom filter. Hash bits set at insert; absent bit guarantees absence; collisions cause false positives.
  19. Approximate distinct count = HyperLogLog. Counts via leading-zero patterns of hashes.
  20. Spark top-N per group = row_number() over Window.partitionBy(...).orderBy(...). Deterministic and efficient because Spark partitions and sorts within partition.
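A minimal prefix-sum + hash-map sketch for item 3:

from collections import defaultdict

def count_subarrays_with_sum(nums, k):
    seen = defaultdict(int)          # prefix sum -> how many times it has occurred
    seen[0] = 1                      # the empty prefix
    prefix = count = 0
    for x in nums:
        prefix += x
        # any earlier prefix equal to (prefix - k) closes a subarray summing to k
        count += seen[prefix - k]
        seen[prefix] += 1
    return count

print(count_subarrays_with_sum([1, 1, 1], 2))   # 2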

41. Final Exam — Mixed Drill

Twenty rapid questions to take cold; force yourself to answer each in two sentences using the What → Why → How shape.

  1. Why does AQE flip a sort-merge join to broadcast at runtime?
  2. What's the difference between predicate pushdown and dynamic partition pruning?
  3. Why does Iceberg use column IDs for schema evolution instead of names?
  4. How does an outbox row + Debezium relay deliver effectively-once?
  5. Why is ts >= '...' AND ts < '...' indexable but date_trunc('day', ts) = '...' is not?
  6. Why does a heap of size K beat sorting for top-K?
  7. Why does volatile not make count++ thread-safe?
  8. Why is BFS the right algorithm for shortest unweighted path?
  9. Why is hybrid retrieval more robust than dense-only?
  10. Why does the KV cache make each generation step cost O(n) instead of recomputing O(n²) attention over the whole prefix?
  11. Why is MEMORY_AND_DISK_SER usually safer than MEMORY_ONLY for big DataFrames?
  12. Why is index-free adjacency fast for graph traversal but irrelevant for aggregations?
  13. Why is partition-by-user_id a bad idea for daily logs?
  14. Why does Iceberg need maintenance jobs (expire_snapshots, rewrite_data_files)?
  15. Why does ABAC compose better than RBAC for multi-tenant data access?
  16. Why do agents need hard caps on iterations and tokens?
  17. Why does groupByKey().mapValues(_.sum) lose to reduceByKey?
  18. Why is the 32 GB heap line important for Elasticsearch?
  19. Why does R + W > N guarantee strong reads in a leaderless quorum store?
  20. Why is "small files" a worse problem on object stores than on HDFS?

If you can answer all twenty in 60 seconds each with confidence, you're ready.
