Exploring Azure Data Explorer and Best Practices

A self-sufficient deep-dive on Azure Data Explorer (ADX/Kusto) — architecture, the KQL language from zero to advanced, ingestion patterns, performance/cost levers, and operational best practices.

Azure Data Explorer — Deep Dive

What you'll leave with: a correct mental model of how ADX stores and queries data, enough KQL to get real work done (including time-series, joins, and advanced analytics), production-grade ingestion patterns, and a checklist of performance & cost levers you can pull before you ever ask support for help.

Try everything below on the free shared cluster: https://dataexplorer.azure.com/clusters/help/.


1. What ADX actually is (and what it isn't)

Azure Data Explorer (internal codename Kusto) is a columnar, append-only, time-series–optimised analytics store with its own query language (KQL). Think "log analytics engine" — in fact, Log Analytics, Application Insights, Microsoft Sentinel, Defender, Fabric Real-Time Intelligence, and even parts of Teams telemetry all run on ADX under the hood.

What it's great at:

  • Ingesting hundreds of thousands of events per second per cluster.
  • Running ad-hoc analytical queries over billions of rows in seconds.
  • Time-series analytics (sessionisation, sliding windows, anomaly detection).
  • Text search (full-text has, contains, regex).
  • Geo, IP, and graph operators out of the box.

What it's not:

  • Not a transactional DB. No ACID row-level updates. You append; you don't UPDATE single rows.
  • Not a data warehouse for complex star-schema OLAP with joins across huge fact tables. It can do joins, but ADX shines on one big fact table plus small dimension tables.
  • Not a replacement for Azure SQL / Cosmos DB for operational workloads.

Clarification worth making up front: ADX, Fabric KQL Database, Log Analytics, and Sentinel all speak the same KQL against the same engine. The skills transfer cleanly.


2. Architecture — the mental model that explains performance

flowchart TB
  subgraph Ingest
    E["Event Hubs / Kafka / IoT Hub / ADF / Fluent-bit"] --> IN["Ingestion nodes"]
  end
  IN --> EX["Extents: compressed, columnar, immutable shards on Azure Storage"]
  subgraph Cluster
    subgraph EngineNodes["Engine nodes"]
      C1["Hot cache NVMe SSD"] --- C2["Hot cache NVMe SSD"]
    end
  end
  EX --> C1
  EX --> C2
  Q["KQL query"] --> Admin["Admin node"] --> Engine["Engine nodes"]
  Engine --> Result["Result set"]

Key pieces to internalise:

  • Extent (a.k.a. shard) — the unit of storage. Each extent is columnar, compressed, immutable, and covers a time range. Extents live in Azure Blob Storage; the hot cache pulls recent/frequent ones to local NVMe on the engine nodes.
  • Columnar storage + encoding — strings are dictionary-encoded, numbers are delta-encoded, etc. This is why where Level == "Error" over a billion rows returns in milliseconds.
  • Hot cache vs cold storage — your cache policy per table decides how much history is kept on local NVMe. Queries over hot data are ~10–100× faster than cold.
  • Merge policy — small extents are periodically merged into bigger ones (compaction). Lots of tiny ingestions → lots of tiny extents → slow queries until merge catches up. Prefer batched ingestion.
  • Update policy — a server-side function that runs on ingestion into a source table and writes a transformed row into a target table. This is how you implement streaming ETL inside ADX without any external pipeline.
  • Materialized views — pre-aggregated, continuously maintained queries. Think "indexed view". Great for rollups ("events per minute per tenant") you query constantly.
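Both mechanisms are declared with control commands. A sketch with hypothetical names — RawEvents (with a dynamic RawJson column) as the source and CleanEvents as the target; run the commands one at a time or wrap them in a .execute database script:

// Target table the update policy writes into
.create table CleanEvents (Timestamp: datetime, TenantId: string, Level: string)

// Transformation function the update policy runs on every ingest batch
.create-or-alter function ParseRaw() {
    RawEvents
    | extend d = todynamic(RawJson)
    | project Timestamp = todatetime(d.ts),
              TenantId  = tostring(d.tenant),
              Level     = tostring(d.level)
}

// Attach the update policy: batches landing in RawEvents are transformed into CleanEvents
.alter table CleanEvents policy update
@'[{"IsEnabled": true, "Source": "RawEvents", "Query": "ParseRaw()", "IsTransactional": false}]'

// Continuously maintained rollup over the clean table
.create materialized-view EventsPerMinute on table CleanEvents
{
    CleanEvents
    | summarize count() by TenantId, bin(Timestamp, 1m)
}

The IsTransactional flag decides whether a failed transformation also fails the source-table ingestion, or just drops the derived rows.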

The two policies you will touch the most:

| Policy    | What it controls                       | Default  | Typical tweak                                             |
|-----------|----------------------------------------|----------|-----------------------------------------------------------|
| Retention | How long data is kept before deletion  | 365 days | Shorten aggressively for high-volume raw telemetry        |
| Cache     | How much of retention is on NVMe       | 31 days  | Extend for frequently-queried tables; shrink for archival |

Rule of thumb: cache < retention ≤ real business need. Paying to keep cold data is cheap; paying to keep everything hot is not.
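Both policies are one-line control commands. A sketch against a hypothetical RawTelemetry table — keep 90 days total, only the last 7 hot:

.alter-merge table RawTelemetry policy retention softdelete = 90d
.alter table RawTelemetry policy caching hot = 7d

// Inspect what's currently in effect
.show table RawTelemetry policy retention
.show table RawTelemetry policy caching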


3. KQL — from zero to useful in one page

KQL reads top-to-bottom, left-to-right, like a Unix pipeline. You start with a table and pipe (|) it through operators. Unlike SQL you never write SELECT before knowing the source.

3.1 The operators you will use 90% of the time

// Canonical shape
StormEvents
| where StartTime > ago(7d) and State == "FLORIDA"
| project StartTime, EventType, DamageProperty, DeathsDirect
| summarize Events = count(), Damage = sum(DamageProperty) by EventType
| top 5 by Damage desc          // top already sorts; no separate order by needed
  • where — filter rows (must come as early as possible for performance).
  • project / project-away / project-rename — choose or rename columns.
  • extend — add computed columns.
  • summarize ... by ... — GROUP BY with aggregations (count(), sum(), avg(), dcount(), percentile(), make_set(), make_list()).
  • top N by col — k-largest; cheaper than order by + take N.
  • join kind=inner|leftouter|innerunique ... — like SQL joins, but side order matters for performance: official guidance is to put the smaller table on the left (see §3.3).
  • union Table1, Table2 — vertical stack.
  • lookup — optimised inner join where right side is a small dimension table.
  • render — ask the UI to chart (timechart, columnchart, piechart).
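A quick sketch combining several of these, with hypothetical Sales (large fact) and Products (small dimension) tables:

Sales
| where Timestamp > ago(30d)
| lookup kind=leftouter (Products | project ProductId, Category) on ProductId
| summarize Revenue = sum(Amount) by Category
| render columnchart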

3.2 Time-series — the real superpower

Perf
| where TimeGenerated > ago(1d) and CounterName == "% Processor Time"
| summarize avg(CounterValue) by bin(TimeGenerated, 5m), Computer
| render timechart

bin(TimeGenerated, 5m) buckets rows into 5-minute windows. Related operators worth knowing:

  • make-series — gap-fills missing buckets and returns arrays per series:

    Perf
    | make-series avg(CounterValue) default=0 on TimeGenerated step 5m by Computer
    | extend (anomalies, score, baseline) = series_decompose_anomalies(avg_CounterValue)
    | render anomalychart
  • series_decompose / series_decompose_anomalies / series_decompose_forecast — built-in seasonal decomposition and anomaly detection. No ML pipeline required.
  • Sessionisation and window-style aggregations — row_window_session() (over serialised rows) and the scan operator cover session windows; hopping/sliding windows can be built from bin() plus summarize.
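As a sketch of the forecast variant, assuming a hypothetical Telemetry table with a Timestamp column — build a series over 7 days of history plus an empty day ahead, then fill the last 24 hourly points with the forecast:

Telemetry
| make-series Requests = count() on Timestamp from ago(7d) to now() + 1d step 1h
| extend Forecast = series_decompose_forecast(Requests, 24)
| render timechart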

3.3 Joins and the hint.strategy

Joins in ADX are not symmetric: the engine materialises one side and streams the other, so side order matters. Official guidance is to put the smaller table on the left. When the left side is small (up to a few tens of MB), hint.strategy=broadcast copies it to every node so the large side never has to move:

Users                                    // small dimension table on the left
| project UserId, Tier
| join kind=rightouter hint.strategy=broadcast (Events) on UserId

If both sides are large, use hint.strategy=shuffle — ideally with hint.shufflekey set to a high-cardinality join key — to partition the work across nodes.

3.4 User-defined functions

let topN = 10;
let errorsFor = (app:string) {
    Logs
    | where AppName == app and Level == "Error"
    | summarize Count = count() by bin(Timestamp, 1h)
};
errorsFor("payments")
| top topN by Count desc

Save reusable logic as stored functions on the cluster: they appear alongside tables and show up in IntelliSense for consumers.
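Persisting the example above as a stored function is one control command; the folder and docstring are optional metadata that surface in the UI:

.create-or-alter function with (folder = "helpers", docstring = "Hourly error counts for an app")
errorsFor(app: string) {
    Logs
    | where AppName == app and Level == "Error"
    | summarize Count = count() by bin(Timestamp, 1h)
}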

3.5 Dynamic (JSON) columns

ADX has first-class JSON:

AuditLog
| extend userId = tostring(Properties.userId),
         roles  = Properties.roles           // dynamic array
| mv-expand role = roles to typeof(string)   // explode array to rows
| summarize count() by userId, role

Prefer flattening at ingest time via an update policy if you query the same nested field constantly — it's indexed and faster.


4. Ingestion — the patterns that actually scale

Ingestion is where most ADX deployments go wrong. Rules:

  • Batch, don't drip. Each ingest call creates extents. Aim for batches of ≥ 1000 rows or ≥ 1 MB. Use the managed ingestion endpoint (https://ingest-<cluster>.<region>.kusto.windows.net) which batches for you.
  • Prefer the managed pipelines to writing your own. Event Hubs → ADX data connection, IoT Hub → ADX, Event Grid → ADX blob ingestion. Each is zero-code and handles retries, compression, and batching.
  • Use streaming ingestion only when you need p99 latency < 10s and your throughput is modest. It bypasses batching and has overhead.
  • Compress — ingest .gz, .snappy, or .parquet. ADX decompresses server-side.
  • Schema-on-write — define the table schema; ADX is not schema-on-read. Use mapping schemas for JSON/CSV/Parquet.
  • Partition on time. Don't invent a custom partition key unless you've proven it helps; the default time partitioning is almost always right.
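The batching behaviour itself is governed by the table's ingestion batching policy (defaults: seal a batch after 5 minutes, 1000 items, or ~1 GB — whichever comes first). A sketch against a hypothetical RawEvents table that caps ingest latency at 30 seconds while keeping the size limits:

.alter table RawEvents policy ingestionbatching
@'{"MaximumBatchingTimeSpan": "00:00:30", "MaximumNumberOfItems": 1000, "MaximumRawDataSizeMB": 1024}'

Shortening the time span trades batching efficiency for latency, so only tighten it when dashboards genuinely need fresher data.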

A common reference pipeline:

flowchart LR
  K["Kafka / Event Hubs"] --> DC["Data connection (Event Hubs → ADX)"]
  DC --> Raw["Raw_events table"]
  Raw -->|update policy| Clean["Clean_events table"]
  Clean -->|materialized view| Hourly["Hourly_rollup MV"]

The update policy performs parsing/flattening exactly once at ingest; downstream queries hit the clean table. The materialized view pre-computes hot rollups so dashboards don't re-aggregate billions of rows on every refresh.


5. Performance — the levers that move things 10×

Before you scale up, pull these levers:

  1. Push filters down. where on time + partition key first, always. The engine uses extent metadata to skip entire shards.
  2. Avoid contains / regex when you can. has is indexed (token-based); contains is not. For substring matching, consider storing a tokenised column.
  3. Project early. project right after where reduces the columns dragged through subsequent operators.
  4. Use summarize by instead of distinct: T | summarize by X is often cheaper than T | distinct X.
  5. Watch your joins. Check the query statistics (the Stats tab in the web UI, or resource utilisation in .show queries). If one side of a join is huge, flip the sides or add hint.strategy=shuffle.
  6. Materialized views for hot dashboards. A 30-day summarize over a trillion rows should run in < 1s if the MV is warm.
  7. Cache policy tuning. If a query is slow and .show cluster extents shows the extents aren't cached, either extend cache or narrow the query's time window.
  8. Follower databases. For read-heavy scenarios, attach a follower cluster that reads the same storage — isolates query load from ingestion load.
  9. Engine + data management SKUs. Engine SKU family (Dv5/Lsv3/EC) matters more than instance count for interactive workloads. Turn on optimized autoscale.

Diagnostics you should know:

.show queries
| where StartedOn > ago(1h) and State == "Failed"

.show table MyTable details        // size, hot cache, retention
.show cluster extents              // per-extent stats
.show commands                     // control commands history

6. Cost model — what you're actually paying for

Two cost axes:

  • Compute — cluster SKU × number of instances × hours. This is the bulk of the bill.
  • Storage — Azure Blob Storage for extents. Cheap; typically < 10% of compute cost.

Savings playbook:

  • Stop the cluster when nobody is querying. Dev/test clusters should run on a schedule.
  • Autoscale based on cache utilisation or CPU.
  • Shorten cache policy on high-volume tables you rarely query beyond 7 days.
  • Use the right SKU family. Dv2 is legacy; Lsv3 (local NVMe) is the modern default for most interactive workloads; EC* for confidential compute.
  • Partition / rollups. Keeping raw telemetry 7 days + daily rollups 2 years is often far cheaper than keeping raw telemetry 2 years.

7. Operations — the non-negotiables

  • Principals & RBAC — Admin, User, Viewer, Ingestor, Monitor at cluster/database/table scope. Prefer Entra groups over individual principals.
  • Row Level Security (RLS) — function on the table that returns the rows a caller can see; ADX applies it transparently.
  • Private endpoints + managed identity — disable public network access, use MI for ingestion from Event Hubs / ADF.
  • Backup — ADX persists to Blob Storage and supports follower databases and cluster restore; design for region redundancy if you need DR.
  • Deployment as code — .kql scripts in a repo, applied via Kusto.Cli, the az kusto CLI, or Bicep/Terraform. Treat schema, functions, update policies, and materialized views like source code.
  • Observability on the cluster itself — .show diagnostics, .show capacity, and Azure Monitor integration for ingestion lag, cache utilisation, and query duration.
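As a sketch of the RLS mechanism, with hypothetical table, group, and function names — callers in the Entra group see only their tenant's rows, everyone else sees nothing:

.create-or-alter function TenantAView() {
    Orders
    | where current_principal_is_member_of('aadgroup=tenant-a-readers@contoso.com')
            and TenantId == "tenant-a"
}

.alter table Orders policy row_level_security enable "TenantAView()"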

8. A realistic end-to-end mini-project

Building a "who is abusing our public API" detector on top of ADX:

  1. Ingestion: API gateway access logs → Event Hubs → ADX raw_api_logs (append only).
  2. Update policy: parse URL, IP, user-agent, extract tenant & route → clean_api_logs.
  3. Materialized view: requests_per_min_per_ip grouped by bin(Timestamp, 1m), Ip, Route.
  4. Detection KQL: make-series over the MV + series_decompose_anomalies → rows where an IP's traffic spikes 5σ above its own 7-day baseline.
  5. Notebook / Grafana: chart the anomalies; the on-call gets a Slack alert via an Azure Function that runs the KQL on a schedule.
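Step 4 might look roughly like this, assuming the MV exposes Timestamp, Ip, and Count columns (the 5.0 is the series_decompose_anomalies score threshold, standing in for the "5σ above baseline" idea):

requests_per_min_per_ip
| make-series Requests = sum(Count) on Timestamp from ago(7d) to now() step 1m by Ip
| extend (anomalies, score, baseline) = series_decompose_anomalies(Requests, 5.0)
| mv-expand Timestamp to typeof(datetime), Requests to typeof(long), anomalies to typeof(int)
| where anomalies != 0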

Every single step above uses only ADX features covered on this page.


9. Corrections to common misunderstandings

  • ❌ "KQL is like T-SQL." → It reads top-down like a pipeline, has different scoping rules, and let is not a CTE — it binds a single expression for the rest of the query.
  • ❌ "More ingestion connections = faster ingestion." → The opposite; each connection spawns extents. Batch.
  • ❌ "I'll UPDATE this row." → You won't. Append a corrected row and use arg_max(Timestamp, *) by Id to read the latest.
  • ❌ "Joins are free." → Joins broadcast the right side; large joins are the #1 cause of slow queries. Use lookup for small dim tables and mind the side order.
  • ❌ "Cache policy = retention." → They are independent. You can keep 2 years of retention with only 30 days in cache.
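The read-latest pattern from the UPDATE point, over a hypothetical Devices table where corrections are appended as new rows:

Devices
| summarize arg_max(Timestamp, *) by DeviceId    // latest row per device wins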

10. TODO — self-sufficient action list

Everything below is achievable using only this page + the free help cluster.

Fluency

  • [ ] On the help cluster, solve 10 KQL exercises on StormEvents, SecurityEvent, and Perf: top-N, time-bucketing, joins, make-series + anomalies, mv-expand.
  • [ ] Write three KQL stored functions and invoke them from a parent query.

Your own cluster

  • [ ] Provision a Dev SKU ADX cluster + database via Azure CLI and via Bicep. Commit the Bicep.
  • [ ] Enable stop/start automation (Azure Automation runbook or simple Function).
  • [ ] Enforce private endpoint + system-assigned managed identity.

Ingestion pipeline

  • [ ] Create an Event Hubs namespace + hub and a producer (5 lines of Python) sending 1 event / second.
  • [ ] Wire an Event Hubs data connection into an ADX table. Verify rows appear without code.
  • [ ] Add an update policy to parse a JSON payload into typed columns in a cleaned table. Verify with .show ingestion failures and the ingestion latency metrics in Azure Monitor.

Modelling + perf

  • [ ] Set a realistic retention and cache policy on each table and record rationale.
  • [ ] Build a materialized view for a hot rollup. Compare its query latency vs the raw query. Put the numbers in the post.
  • [ ] Inspect the query statistics for a slow query, identify the bottleneck, fix it, and document before/after.

Analytics

  • [ ] Use make-series + series_decompose_anomalies on a real metric; produce a chart with at least one labelled anomaly.
  • [ ] Write a sessionisation query (session window by user) and extract median session length.
  • [ ] Implement a simple RLS policy by tenant, verify with a second principal.

Ship

  • [ ] Build a tiny Grafana (or Fabric) dashboard driven entirely by ADX queries.
  • [ ] Schedule a KQL-based alert via Azure Monitor that emails on anomaly.
  • [ ] Commit all .kql (schemas, functions, policies, views) to the repo; apply via CLI on PR merge.

When every box above is ticked and you can answer "what will my cluster cost next month and why?" without guessing, flip the post to status: published.
