Azure Data Explorer — Deep Dive
What you'll leave with: a correct mental model of how ADX stores and queries data, enough KQL to get real work done (including time-series, joins, and advanced analytics), production-grade ingestion patterns, and a checklist of performance & cost levers you can pull before you ever ask support for help.
Try everything below on the free shared cluster: https://dataexplorer.azure.com/clusters/help/.
1. What ADX actually is (and what it isn't)
Azure Data Explorer (internal codename Kusto) is a columnar, append-only, time-series–optimised analytics store with its own query language (KQL). Think "log analytics engine" — in fact, Log Analytics, Application Insights, Microsoft Sentinel, Defender, Fabric Real-Time Intelligence, and even parts of Teams telemetry all run on ADX under the hood.
What it's great at:
- Ingesting hundreds of thousands of events per second per cluster.
- Running ad-hoc analytical queries over billions of rows in seconds.
- Time-series analytics (sessionisation, sliding windows, anomaly detection).
- Text search (full-text `has`, `contains`, regex).
- Geo, IP, and graph operators out of the box.
What it's not:
- Not a transactional DB. No ACID row-level updates. You append; you don't UPDATE single rows.
- Not a data warehouse for complex star-schema OLAP with joins across huge fact tables. It can do joins, but an ADX cluster shines on a one-big-table layout plus small dimension tables.
- Not a replacement for Azure SQL / Cosmos DB for operational workloads.
Clarification worth making up front: ADX, Fabric KQL Database, Log Analytics, and Sentinel all speak the same KQL against the same engine. The skills transfer cleanly.
2. Architecture — the mental model that explains performance
```mermaid
graph LR
    EX["Extents: immutable shards on Azure Storage"]
    subgraph Cluster
        subgraph Engine["Engine nodes"]
            C1["Hot cache NVMe SSD"] --- C2["Hot cache NVMe SSD"]
        end
    end
    EX --> C1
    EX --> C2
    Q["KQL Query"] --> Admin["Admin node"]
    Admin --> Engine
    Engine --> Result["Result set"]
```
Key pieces to internalise:
- Extent (a.k.a. shard) — the unit of storage. Each extent is columnar, compressed, immutable, and covers a time range. Extents live in Azure Blob Storage; the hot cache pulls recent/frequent ones to local NVMe on the engine nodes.
- Columnar storage + encoding — strings are dictionary-encoded, numbers are delta-encoded, etc. This is why `where Level == "Error"` over a billion rows returns in milliseconds.
- Hot cache vs cold storage — your cache policy per table decides how much history is kept on local NVMe. Queries over hot data are ~10–100× faster than cold.
- Merge policy — small extents are periodically merged into bigger ones (compaction). Lots of tiny ingestions → lots of tiny extents → slow queries until merge catches up. Prefer batched ingestion.
- Update policy — a server-side function that runs on ingestion into a source table and writes transformed rows into a target table. This is how you implement streaming ETL inside ADX without any external pipeline (see the sketch after this list).
- Materialized views — pre-aggregated, continuously maintained queries. Think "indexed view". Great for rollups ("events per minute per tenant") you query constantly.
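A minimal sketch of the update-policy pattern. The table names (`Raw_events`, `Clean_events`), the function name, and the payload fields are all hypothetical — substitute your own schema:

```kql
// Hypothetical source and target tables for streaming ETL inside ADX.
.create table Raw_events (Payload: dynamic)

.create table Clean_events (Timestamp: datetime, UserId: string, Action: string)

// Transformation the update policy runs on every ingested batch.
.create-or-alter function TransformRaw() {
    Raw_events
    | project Timestamp = todatetime(Payload.ts),
              UserId    = tostring(Payload.user),
              Action    = tostring(Payload.action)
}

// Wire it up: rows landing in Raw_events are transformed into Clean_events.
.alter table Clean_events policy update
@'[{"IsEnabled": true, "Source": "Raw_events", "Query": "TransformRaw()", "IsTransactional": false}]'
```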
The two policies you will touch the most:
| Policy | What it controls | Default | Typical tweak |
|---|---|---|---|
| Retention | How long data is kept before deletion | 365 days | Shorten aggressively for high-volume raw telemetry |
| Cache | How much of retention is on NVMe | 31 days | Extend for frequently-queried tables; shrink for archival |
Rule of thumb: cache < retention ≤ real business need. Paying to keep cold data is cheap; paying to keep everything hot is not.
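Both policies are one-line control commands. A sketch with an example table name and illustrative periods:

```kql
// Keep two years of data, but only the last 14 days on local NVMe.
.alter table Raw_events policy retention softdelete = 730d recoverability = enabled
.alter table Raw_events policy caching hot = 14d
```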
3. KQL — from zero to useful in one page
KQL reads top-to-bottom, left-to-right, like a Unix pipeline. You start with a table and pipe (`|`) it through operators. Unlike SQL, you name the source first instead of starting with SELECT.
3.1 The operators you will use 90% of the time
```kql
// Canonical shape
StormEvents
| where StartTime > ago(7d) and State == "FLORIDA"
| project StartTime, EventType, DamageProperty, DeathsDirect
| summarize Events = count(), Damage = sum(DamageProperty) by EventType
| top 5 by Damage desc  // top already sorts; no separate order by needed
```
- `where` — filter rows (must come as early as possible for performance).
- `project` / `project-away` / `project-rename` — choose, drop, or rename columns.
- `extend` — add computed columns.
- `summarize ... by ...` — GROUP BY with aggregations (`count()`, `sum()`, `avg()`, `dcount()`, `percentile()`, `make_set()`, `make_list()`).
- `top N by col` — k-largest; cheaper than `order by` + `take N`.
- `join kind=inner|leftouter|innerunique ...` — like SQL joins, but the right side is the in-memory build side, so keep it small.
- `union Table1, Table2` — vertical stack.
- `lookup` — optimised join (default `kind=leftouter`) against a small dimension table on the right.
- `render` — ask the UI to chart (`timechart`, `columnchart`, `piechart`).
3.2 Time-series — the real superpower
```kql
Perf
| where TimeGenerated > ago(1d) and CounterName == "% Processor Time"
| summarize avg(CounterValue) by bin(TimeGenerated, 5m), Computer
| render timechart
```
`bin(TimeGenerated, 5m)` buckets rows into 5-minute windows. Beyond `bin`, the operators to reach for:

- `make-series` — gap-fills missing buckets, returns arrays per series:

```kql
Perf
| make-series avg(CounterValue) default=0 on TimeGenerated step 5m by Computer
| extend anomalies = series_decompose_anomalies(avg_CounterValue)
| render anomalychart
```

- `series_decompose` / `series_decompose_anomalies` / `series_decompose_forecast` — built-in seasonal decomposition, anomaly detection, and forecasting. No ML pipeline required.
- `hopping_window` / `session_window` — sliding and session-based aggregations.
3.3 Joins and the hint.strategy
Joins in ADX are not symmetric. The right side is the build side, loaded into memory (and shipped to every node under `hint.strategy=broadcast`); the left side is the probe side, streamed past it. Put the small table on the right.
```kql
Events
| join kind=leftouter hint.strategy=broadcast (
    Users | project UserId, Tier
) on UserId
```
If the right side is too large to broadcast, use `hint.strategy=shuffle`, which repartitions both sides by the join key across the nodes.
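A sketch of the shuffle form — both table names here are hypothetical large tables:

```kql
// Both sides are big: shuffle partitions the join by key instead of broadcasting.
BigEvents
| join kind=inner hint.strategy=shuffle (BigSessions) on SessionId
```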
3.4 User-defined functions
```kql
let topN = 10;
let errorsFor = (app: string) {
    Logs
    | where AppName == app and Level == "Error"
    | summarize Count = count() by bin(Timestamp, 1h)
};
errorsFor("payments")
| top topN by Count desc
```
Save reusable logic as stored functions on the cluster: they appear alongside tables and show up in IntelliSense for consumers.
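Persisting the same helper as a stored function is one control command. A sketch — the folder and docstring values are illustrative:

```kql
.create-or-alter function with (folder = "monitoring", docstring = "Hourly error counts for one app")
errorsFor(app: string) {
    Logs
    | where AppName == app and Level == "Error"
    | summarize Count = count() by bin(Timestamp, 1h)
}
```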
3.5 Dynamic (JSON) columns
ADX has first-class JSON:
```kql
AuditLog
| extend userId = tostring(Properties.userId),
         roles  = Properties.roles              // dynamic array
| mv-expand role = roles to typeof(string)      // explode array to rows
| summarize count() by userId, role
```
Prefer flattening at ingest time via an update policy if you query the same nested field constantly — it's indexed and faster.
4. Ingestion — the patterns that actually scale
Ingestion is where most ADX deployments go wrong. Rules:
- Batch, don't drip. Each ingest call creates extents. Aim for batches of ≥ 1000 rows or ≥ 1 MB. Use the managed ingestion endpoint (`https://ingest-<cluster>.<region>.kusto.windows.net`), which batches for you.
- Prefer the managed pipelines to writing your own. Event Hubs → ADX data connection, IoT Hub → ADX, Event Grid → ADX blob ingestion. Each is zero-code and handles retries, compression, and batching.
- Use streaming ingestion only when you need p99 latency < 10s and your throughput is modest. It bypasses batching and has overhead.
- Compress — ingest `.gz`, `.snappy`, or `.parquet`. ADX decompresses server-side.
- Schema-on-write — define the table schema; ADX is not schema-on-read. Use ingestion mappings for JSON/CSV/Parquet (see the sketch after this list).
- Partition on time. Don't invent a custom partition key unless you've proven it helps; the default time partitioning is almost always right.
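A minimal ingestion-mapping sketch — the table, mapping name, and JSON paths below are all hypothetical:

```kql
.create table Raw_api_logs (Timestamp: datetime, Ip: string, Route: string)

// Map fields of the incoming JSON payload onto typed columns at ingest time.
.create table Raw_api_logs ingestion json mapping "ApiLogsMapping"
'[{"column":"Timestamp","path":"$.ts","datatype":"datetime"},{"column":"Ip","path":"$.client_ip","datatype":"string"},{"column":"Route","path":"$.route","datatype":"string"}]'
```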
A common reference pipeline:
```mermaid
graph LR
    DC["Data connection: Event Hubs → ADX"] --> Raw["Raw_events table"]
    Raw -->|update policy| Clean["Clean_events table"]
    Clean -->|materialized view| Hourly["Hourly_rollup MV"]
```
The update policy performs parsing/flattening exactly once at ingest; downstream queries hit the clean table. The materialized view pre-computes hot rollups so dashboards don't re-aggregate billions of rows on every refresh.
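The rollup stage as a materialized view — a sketch assuming the hypothetical `Clean_events` schema from section 2:

```kql
// Continuously maintained hourly rollup; dashboards query the view, not the raw table.
.create materialized-view Hourly_rollup on table Clean_events
{
    Clean_events
    | summarize Count = count() by UserId, bin(Timestamp, 1h)
}
```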
5. Performance — the levers that move things 10×
Before you scale up, pull these levers:
- Push filters down. `where` on time + partition key first, always. The engine uses extent metadata to skip entire shards.
- Avoid `contains` / regex when you can. `has` is indexed (token-based); `contains` is not. For substring matching, consider storing a tokenised column.
- Project early. `project` right after `where` reduces the columns dragged through subsequent operators (see the sketch after this list).
- Use `summarize by` instead of `distinct`: `T | summarize by X` is cheaper than `T | distinct X` in many versions.
- Watch your joins. Always check `.show queryplan`. If you see a giant right-side build, flip the sides or add a `hint.strategy=shuffle`.
- Materialized views for hot dashboards. A 30-day `summarize` over a trillion rows should run in < 1 s if the MV is warm.
- Cache policy tuning. If a query is slow and `.show cluster extents` shows the extents aren't cached, either extend the cache or narrow the query's time window.
- Follower databases. For read-heavy scenarios, attach a follower cluster that reads the same storage — it isolates query load from ingestion load.
- Engine + data management SKUs. The engine SKU family (Dv5/Lsv3/EC) matters more than instance count for interactive workloads. Turn on auto-scale.
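What "filter on time first, project early" looks like in one query — the table and column names here are hypothetical:

```kql
ApiLogs
| where Timestamp > ago(1h)              // time filter first: whole extents get pruned
| where Tenant == "contoso"              // then the cheaper equality predicates
| project Timestamp, Route, DurationMs   // drop every column later operators don't need
| summarize p95 = percentile(DurationMs, 95) by bin(Timestamp, 5m)
```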
Diagnostics you should know:
```kql
.show queries
| where StartedOn > ago(1h) and State == "Failed"

.show table MyTable details   // size, hot cache, retention
.show cluster extents         // per-extent stats
.show commands                // control commands history
```
6. Cost model — what you're actually paying for
Two cost axes:
- Compute — cluster SKU × number of instances × hours. This is the bulk of the bill.
- Storage — Azure Blob Storage for extents. Cheap; typically < 10% of compute cost.
Savings playbook:
- Stop the cluster when nobody is querying. Dev/test clusters should run on a schedule.
- Autoscale based on cache utilisation or CPU.
- Shorten cache policy on high-volume tables you rarely query beyond 7 days.
- Use the right SKU family. `Dv2` is legacy; `Lsv3` (local NVMe) is the modern default for most interactive workloads; `EC*` for confidential compute.
- Partition / rollups. Keeping raw telemetry for 7 days plus daily rollups for 2 years is often far cheaper than keeping raw telemetry for 2 years (see the sketch after this list).
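The raw-plus-rollup split expressed as retention policies — table names are hypothetical:

```kql
// Raw telemetry: 7 days. Daily rollups: 2 years.
.alter table Raw_telemetry policy retention softdelete = 7d
.alter table Daily_rollup policy retention softdelete = 730d
```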
7. Operations — the non-negotiables
- Principals & RBAC — Admin, User, Viewer, Ingestor, Monitor at cluster/database/table scope. Prefer Entra groups over individual principals.
- Row Level Security (RLS) — a function on the table that returns only the rows a caller may see; ADX applies it transparently (see the sketch after this list).
- Private endpoints + managed identity — disable public network access, use MI for ingestion from Event Hubs / ADF.
- Backup — ADX persists to Blob Storage and supports follower databases and cluster restore; design for region redundancy if you need DR.
- Deployment as code — `.kql` scripts in a repo, applied via `kusto-cli`, the ADX CLI, or Bicep/Terraform. Treat schema, functions, update policies, and materialized views like source code.
- Observability on the cluster itself — `.show diagnostics`, `.show metrics`, and Azure Monitor integration for ingestion lag, cache utilisation, and query duration.
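A minimal RLS sketch. `ApiLogs` and `TenantAccess` (a principal-to-tenant mapping table) are hypothetical, as is the assumption that `Principal` stores the value returned by `current_principal()`:

```kql
// Each caller sees only rows for tenants mapped to their principal.
.create-or-alter function TenantFilter() {
    ApiLogs
    | where Tenant in ((TenantAccess | where Principal == current_principal() | project Tenant))
}

.alter table ApiLogs policy row_level_security enable "TenantFilter"
```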
8. A realistic end-to-end mini-project
Building a "who is abusing our public API" detector on top of ADX:
- Ingestion: API gateway access logs → Event Hubs → ADX `raw_api_logs` (append only).
- Update policy: parse URL, IP, user-agent, extract tenant & route → `clean_api_logs`.
- Materialized view: `requests_per_min_per_ip` grouped by `bin(Timestamp, 1m), Ip, Route`.
- Detection KQL: `make-series` over the MV + `series_decompose_anomalies` → rows where an IP's traffic spikes 5σ above its own 7-day baseline (see the sketch after this list).
- Notebook / Grafana: chart the anomalies; the on-call gets a Slack alert via an Azure Function that runs the KQL on a schedule.
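A sketch of the detection step, assuming the MV exposes `Timestamp`, `Ip`, and a per-minute `Count`:

```kql
requests_per_min_per_ip
| where Timestamp > ago(7d)
| make-series Requests = sum(Count) default=0 on Timestamp step 1m by Ip
| extend Anomalies = series_decompose_anomalies(Requests, 5.0)  // ~5σ threshold
| mv-expand Timestamp to typeof(datetime), Requests to typeof(long), Anomalies to typeof(int)
| where Anomalies != 0
```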
Every single step above uses only ADX features covered on this page.
9. Corrections to common misunderstandings
- ❌ "KQL is like T-SQL." → It reads top-down like a pipeline, has different scoping rules, and
letis not a CTE — it binds a single expression for the rest of the query. - ❌ "More ingestion connections = faster ingestion." → The opposite; each connection spawns extents. Batch.
- ❌ "I'll UPDATE this row." → You won't. Append a corrected row and use
arg_max(Timestamp, *) by Idto read the latest. - ❌ "Joins are free." → Joins broadcast the right side; large joins are the #1 cause of slow queries. Use
lookupfor small dim tables and mind the side order. - ❌ "Cache policy = retention." → They are independent. You can keep 2 years of retention with only 30 days in cache.
10. TODO — self-sufficient action list
Everything below is achievable using only this page + the free help cluster.
Fluency
- [ ] On the `help` cluster, solve 10 KQL exercises on `StormEvents`, `SecurityEvent`, and `Perf`: top-N, time-bucketing, joins, `make-series` + anomalies, `mv-expand`.
- [ ] Write three KQL stored functions and invoke them from a parent query.
Your own cluster
- [ ] Provision a Dev SKU ADX cluster + database via Azure CLI and via Bicep. Commit the Bicep.
- [ ] Enable stop/start automation (Azure Automation runbook or simple Function).
- [ ] Enforce private endpoint + system-assigned managed identity.
Ingestion pipeline
- [ ] Create an Event Hubs namespace + hub and a producer (5 lines of Python) sending 1 event / second.
- [ ] Wire an Event Hubs data connection into an ADX table. Verify rows appear without code.
- [ ] Add an update policy to parse a JSON payload into typed columns in a cleaned table. Verify it end to end and check for errors with `.show ingestion failures`.
Modelling + perf
- [ ] Set a realistic retention and cache policy on each table and record rationale.
- [ ] Build a materialized view for a hot rollup. Compare its query latency vs the raw query. Put the numbers in the post.
- [ ] Run `.show queryplan` on a slow query, identify the bottleneck, fix it, and document before/after.
Analytics
- [ ] Use `make-series` + `series_decompose_anomalies` on a real metric; produce a chart with at least one labelled anomaly.
- [ ] Write a sessionisation query (session window by user) and extract median session length.
- [ ] Implement a simple RLS policy by tenant, verify with a second principal.
Ship
- [ ] Build a tiny Grafana (or Fabric) dashboard driven entirely by ADX queries.
- [ ] Schedule a KQL-based alert via Azure Monitor that emails on anomaly.
- [ ] Commit all `.kql` (schemas, functions, policies, views) to the repo; apply via CLI on PR merge.
When every box above is ticked and you can answer "what will my cluster cost next month and why?" without guessing, flip the post to status: published.