system design · intermediate · 32m read

MS Stack Ch 15 — Observability

Serilog structured logging, Application Insights, OpenTelemetry, correlation IDs, distributed tracing, KQL queries for telemetry, alerts, dashboards. Knowing what your app is doing in production.

Chapter 15 of From Novice to Fluent on the Modern Microsoft Web Stack — a 22-chapter self-study plan.

Why this chapter

You ship code. Then you stop seeing it. Without observability, every production deploy is a leap of faith and every incident is a 3am archaeology session through raw text logs. Done well, observability turns "the API feels slow" into "p99 on /checkout for the EU region jumped from 180ms to 1.2s at 14:03, correlated with a Kusto cluster CPU spike, after deploy d4e5f6". Done badly, observability is a six-figure App Insights bill, a dashboard nobody reads, and every alert ignored as noise.

Shipping-level observability means: structured logging via Serilog or ILogger<T>, Application Insights wired up, a handful of standard alerts and a single dashboard per service. Expert-level observability means: a deliberate sampling policy you can defend at cost-review, custom telemetry processors that redact PII before it hits the wire, a per-service SLO with an error budget burn-rate alert, and a fluency with KQL that lets you author new analytic queries in real time during an incident.

You finish this chapter when you can stand up a fresh service with logging, metrics, tracing and correlation IDs flowing through every hop, write the KQL that surfaces the worst-performing endpoint in the last hour, and explain to a teammate why message templates beat string interpolation every single time.

Log structurally

Message templates with named properties, never string interpolation.

Correlate every hop

W3C `traceparent` propagated through every service, log line and dependency call.

Pick the right pillar

Logs for narratives, metrics for rates, traces for latency causality.

Sample deliberately

Adaptive sampling sized to budget; pinned exceptions and rare events.

Alert on signal

Six canonical alerts per service; tune to baseline; respect alert fatigue.

Query like a native

KQL fluency turns App Insights into a microscope, not a graveyard.

Concepts and depth

Logging philosophies: levels, structured vs string, sampling

Levels exist to let downstream consumers filter without losing data at the source. The .NET ILogger vocabulary (Serilog uses the same ladder but calls Trace "Verbose" and Critical "Fatal"):

Trace — firehose; per-statement breadcrumbs; off in production.
Debug — useful in dev, off in prod by default.
Information — business events and lifecycle (request handled, user created, queue dequeued).
Warning — degraded behaviour that did not fail: a retry, a fallback, a circuit half-open.
Error — an operation failed; a user-visible problem.
Critical — process is in trouble (OOM, can't serve at all).

Pick the level at the source, filter at the sink. The classic mistake is logging at Information from inside a hot loop — the next mistake is "fixing" it by raising the minimum level to Warning, which throws away the Information lines you actually need from elsewhere.

Structured logging means the log line is a key-value record, not a flat string. The same event "Created user alice@contoso.com in 234ms" becomes a structured record { "message": "Created user {Email} in {ElapsedMs}ms", "Email": "alice@contoso.com", "ElapsedMs": 234 }. The sink can filter, group and aggregate on the properties. String interpolation throws all of that away — the engine cannot distinguish "in 234ms" from "in 2.34s" without re-parsing.
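
As a rough illustration of what a structured sink receives, the toy binder below pairs each named hole in a template with its argument. It is a sketch only, not how Serilog or ILogger are implemented; TemplateSketch and Bind are made-up names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Toy illustration only: real libraries (Serilog, ILogger) do this for you.
public static class TemplateSketch
{
    // Matches {Name} and {@Name} holes and captures the property name.
    private static readonly Regex Hole = new Regex(@"\{@?(\w+)\}");

    // Pairs each named hole with the argument in the same position.
    // The template text itself is returned untouched.
    public static (string Template, Dictionary<string, object?> Properties) Bind(
        string template, params object?[] args)
    {
        var props = new Dictionary<string, object?>();
        var matches = Hole.Matches(template);
        for (var i = 0; i < matches.Count && i < args.Length; i++)
        {
            props[matches[i].Groups[1].Value] = args[i];
        }
        return (template, props);
    }
}
```

Calling TemplateSketch.Bind("Created user {Email} in {ElapsedMs}ms", "alice@contoso.com", 234) leaves the template text identical for every user, so a backend can group on it, while ElapsedMs stays the number 234 rather than a substring of a rendered message.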

Sampling is the universal control for cost. The choices: drop a fixed percentage uniformly (cheap, lossy); target a fixed item rate and let the percentage float (App Insights' adaptive sampling, on by default in the classic SDK); exempt selected types such as exceptions from sampling; or sample by tenant/customer/operation (custom processor). Sample at the source, where dropping is cheapest and the most context is still available.

Good enough to ship

• Information at the source; Warning at the sink for chatty libs.
• Structured templates only; never $"..." in log calls.
• Adaptive sampling at a defensible rate.

Expert tier

• Per-namespace level overrides codified in config.
• Tenant-aware sampling for cost fairness.
• Compliance redaction processors before the wire.

Serilog: sinks, enrichers, message templates, LogContext, JSON config

Serilog is the canonical structured-logging library on .NET. Its mental model is small:

Sink — where logs go (Console, File, ApplicationInsights, Seq, Elasticsearch, …). Multiple sinks per logger.
Enricher — augments every log event with extra properties (machine name, service name, trace id, correlation id).
Message template — a string with {Named} placeholders that the sink captures as properties.
LogContext — a thread/async-local scope of properties; everything inside the scope inherits them.

Log.Logger = new LoggerConfiguration()
    .MinimumLevel.Information()
    .MinimumLevel.Override("Microsoft.AspNetCore", LogEventLevel.Warning)
    .Enrich.FromLogContext()
    .Enrich.WithMachineName()
    .Enrich.WithProperty("Service", "queries-api")
    .Enrich.WithProperty("Env", builder.Environment.EnvironmentName)
    .WriteTo.Console(new Serilog.Formatting.Compact.CompactJsonFormatter())
    .WriteTo.ApplicationInsights(
        builder.Configuration["ApplicationInsights:ConnectionString"]!,
        TelemetryConverter.Traces)
    .CreateLogger();
builder.Host.UseSerilog();

public class UserService(ILogger<UserService> log)
{
    public async Task<User> CreateAsync(NewUser dto, CancellationToken ct)
    {
        using var _ = LogContext.PushProperty("Email", dto.Email);
        var sw = Stopwatch.StartNew(); // System.Diagnostics; times the whole operation
        log.LogInformation("Creating user {@Dto}", dto);
        // ... DB work that produces `user` ...
        log.LogInformation("Created user {UserId} in {ElapsedMs}ms", user.Id, sw.ElapsedMilliseconds);
        return user;
    }
}

Two micro-rules worth internalising:

{@Object} destructures the object into structured properties (most sinks render them as JSON); bare {Object} calls ToString(). Use @ for DTOs you actually want to inspect later, and keep secrets and PII out of them.
LogContext.PushProperty adds a property to every log event for the duration of the using block — perfect for request-scoped or operation-scoped properties.
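
The "async-local scope" idea behind LogContext can be sketched with AsyncLocal<T> from the base class library. This is a toy model under that assumption, not Serilog's code; AmbientScope, Push and Restore are hypothetical names.

```csharp
using System;
using System.Collections.Immutable;
using System.Threading;

// Toy model of an ambient property scope; Serilog's LogContext is the real thing.
public static class AmbientScope
{
    // AsyncLocal values flow with the logical call, including across awaits.
    private static readonly AsyncLocal<ImmutableDictionary<string, object?>?> Current = new();

    // The properties visible to code running inside the current scope.
    public static ImmutableDictionary<string, object?> Properties =>
        Current.Value ?? ImmutableDictionary<string, object?>.Empty;

    // Adds a property until the returned handle is disposed.
    public static IDisposable Push(string name, object? value)
    {
        var previous = Properties;
        Current.Value = previous.SetItem(name, value);
        return new Restore(previous);
    }

    // Puts back the dictionary that was current before Push.
    private sealed class Restore : IDisposable
    {
        private readonly ImmutableDictionary<string, object?> _previous;
        public Restore(ImmutableDictionary<string, object?> previous) => _previous = previous;
        public void Dispose() => Current.Value = _previous;
    }
}
```

Inside using (AmbientScope.Push("Email", dto.Email)) { ... }, any code reached from the block sees Email in AmbientScope.Properties; once the handle is disposed the property is gone. An enricher in a real logger would copy those properties onto each event.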

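JSON-configured Serilog (the Serilog section of appsettings.json, loaded with ReadFrom.Configuration) lets you change levels without redeploying: turn one chatty namespace down at 2am when an alert is exploding. Minimum-level changes are picked up when the configuration reloads; sink changes take effect on the next restart.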

{
  "Serilog": {
    "MinimumLevel": {
      "Default": "Information",
      "Override": {
        "Microsoft.AspNetCore": "Warning",
        "System.Net.Http": "Warning"
      }
    },
    "Enrich": ["FromLogContext", "WithMachineName"],
    "WriteTo": [
      { "Name": "Console" },
      { "Name": "ApplicationInsights", "Args": { "telemetryConverter": "Serilog.Sinks.ApplicationInsights.TelemetryConverters.TraceTelemetryConverter, Serilog.Sinks.ApplicationInsights" } }
    ]
  }
}

ASP.NET Core logging abstraction: ILogger<T> and Serilog plug-in

ILogger<T> is the framework's logging abstraction. You take a dependency on ILogger<MyService> from DI and call log.LogInformation(...). The framework routes the call through whatever ILoggerProviders are registered.

When builder.Host.UseSerilog() runs, Serilog replaces the default providers with its own pipeline. From the consumer's side nothing changes — you keep using ILogger<T>; Serilog catches the call and applies templates, enrichers and sinks. This is why "switch from default logging to Serilog" is a one-line change for the entire codebase.

// Consumer: framework-style; Serilog routes behind the scenes
public class CheckoutHandler(ILogger<CheckoutHandler> log) { /* ... */ }

The corollary: do not call Log.Logger.Information(...) directly from your code. That couples you to Serilog at every call site. Stay on ILogger<T>; the abstraction is the point.

Application Insights data model

App Insights organises telemetry into a small, fixed set of tables:

requests: every inbound HTTP request handled by your app. Columns include name, url, duration, resultCode, success, operation_Id.
dependencies: every outbound call your app made (HTTP, SQL, Storage, Service Bus, Kusto, custom). Columns include target, type, data, duration, success.
exceptions: uncaught and TrackException-reported exceptions. Columns include type, outerMessage, details, severityLevel.
traces: log output from ILogger or Serilog. The built-in Application Insights ILogger provider captures Warning and above by default; the Serilog sink forwards whatever passes Serilog's minimum level (Information in this chapter's configuration). Columns include message, severityLevel, customDimensions.
customEvents: TrackEvent-emitted business events. Columns include name, customDimensions, customMeasurements.
customMetrics: TrackMetric/Meter API output. Columns include name, value, valueCount, valueSum, customDimensions.
pageViews: browser-side telemetry (only with the App Insights JS SDK).

Every row in every table has correlation columns: operation_Id (the trace ID), operation_ParentId (the parent span ID). You join across tables on operation_Id to reconstruct end-to-end behaviour.

Correlation: operation_Id, operation_ParentId, propagating across services

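App Insights uses the W3C Trace Context spec. Every HTTP request carries a traceparent header of the form version-traceid-parentid-flags; the sample value from the spec looks like this:

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01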

When your service handles an inbound request, ASP.NET Core reads traceparent, sets the Activity.Current trace/span ids, and stamps every telemetry item that follows with operation_Id = traceId. When your service calls an outbound HTTP API via HttpClient, the HttpClient instrumentation injects a fresh traceparent with a new span id and the original trace id. The downstream service sees the same trace id, links itself as a child of your span, and so on.

var activity = Activity.Current;
log.LogInformation("Processing {TraceId} / {SpanId}", activity?.TraceId, activity?.SpanId);
activity?.SetTag("user.id", userId);
activity?.SetTag("query.id", queryId);

The corollary: log the TraceId (or operation_Id) on every interesting event and your KQL queries can join requests, dependencies, traces and exceptions in one shot.
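
To see which ids a traceparent value carries, it can be taken apart with the parser that ships in System.Diagnostics. A minimal sketch; TraceParentSketch and Parse are hypothetical names, and the sample value in the note is the one used in the W3C spec.

```csharp
using System.Diagnostics;

public static class TraceParentSketch
{
    // Splits a W3C traceparent value into trace id, parent span id and the sampled flag.
    // Returns null when the value is not a well-formed traceparent.
    public static (string TraceId, string ParentSpanId, bool Sampled)? Parse(string traceparent)
    {
        if (!ActivityContext.TryParse(traceparent, null, out var ctx))
        {
            return null;
        }

        var sampled = (ctx.TraceFlags & ActivityTraceFlags.Recorded) != 0;
        return (ctx.TraceId.ToString(), ctx.SpanId.ToString(), sampled);
    }
}
```

For "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01" this yields trace id 4bf92f3577b34da6a3ce929d0e0e4736 (what App Insights stores as operation_Id), parent span id 00f067aa0ba902b7, and Sampled = true.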

Good enough to ship

• traceparent flows through HttpClient + ASP.NET Core out of the box.
• Log operation_Id on every interesting line.
• Use App Insights "Transaction search" to view a trace.

Expert tier

• Stitch in tracestate for vendor-specific context.
• Carry trace context across non-HTTP boundaries (Service Bus, Storage).
• Configure baggage propagation for tenant-aware tracing.
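
Carrying trace context across a non-HTTP boundary comes down to writing the current span's id into the message and using it as the parent on the other side. The sketch below assumes the default W3C id format and uses a plain dictionary as a stand-in for message application properties; MessageTracing, the "MyApp.Messaging" source name and the "process-message" span name are hypothetical.

```csharp
using System.Collections.Generic;
using System.Diagnostics;

public static class MessageTracing
{
    // Hypothetical source name; your tracing pipeline must subscribe to it.
    private static readonly ActivitySource Source = new ActivitySource("MyApp.Messaging");

    // Producer side: copy the current span's W3C id into the message headers.
    public static void Inject(IDictionary<string, string> headers)
    {
        var id = Activity.Current?.Id; // "00-<trace id>-<span id>-<flags>" in W3C format
        if (id is not null)
        {
            headers["traceparent"] = id;
        }
    }

    // Consumer side: start a span whose parent is the producer's span.
    // Returns null when nothing is listening to the source.
    public static Activity? StartConsumer(IReadOnlyDictionary<string, string> headers)
    {
        ActivityContext parent = default;
        if (headers.TryGetValue("traceparent", out var traceparent))
        {
            ActivityContext.TryParse(traceparent, null, out parent);
        }
        return Source.StartActivity("process-message", ActivityKind.Consumer, parent);
    }
}
```

The producer calls Inject just before sending; the consumer wraps its handler in using var activity = MessageTracing.StartConsumer(headers); so both sides share one trace id. Check whether your messaging SDK already propagates context before adding this by hand.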

ITelemetryInitializer and ITelemetryProcessor

App Insights gives you two pipeline hooks:

ITelemetryInitializer — runs before sampling, on every telemetry item, and can add or modify properties. Use it to attach context that should appear on every record (current user id, tenant id, deployment slot).
ITelemetryProcessor — runs after initializers, in a chain, and can drop or transform items. Use it to filter (e.g. drop health-check requests, drop one chatty dependency), or to redact (replace PII with hashes).

public class TenantInitializer(IHttpContextAccessor ctx) : ITelemetryInitializer
{
    public void Initialize(ITelemetry t)
    {
        var tid = ctx.HttpContext?.User.FindFirstValue("tid");
        if (tid is not null) t.Context.GlobalProperties["TenantId"] = tid;
    }
}
 
public class HealthCheckFilter(ITelemetryProcessor next) : ITelemetryProcessor
{
    public void Process(ITelemetry t)
    {
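        // Returning without calling next drops the item from the pipeline.
        if (t is RequestTelemetry r && r.Url?.AbsolutePath.StartsWith("/health/", StringComparison.OrdinalIgnoreCase) == true) return;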
        next.Process(t);
    }
}
 
builder.Services.AddSingleton<ITelemetryInitializer, TenantInitializer>();
builder.Services.AddApplicationInsightsTelemetryProcessor<HealthCheckFilter>();

Use an initializer for "always add this property"; a processor for "drop this whole class of records".
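
For the redaction case, the processor body usually comes down to replacing a value with a stable pseudonym before calling next.Process. A minimal sketch of such a helper, assuming a salted SHA-256 is acceptable to your compliance team (a salted hash is pseudonymisation, not anonymisation); Redaction, Pseudonymize and the salt parameter are hypothetical.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class Redaction
{
    // Replaces an identifier with a stable 16-hex-character pseudonym.
    // The same input and salt always give the same output, so you can
    // still group by user without storing the raw value.
    public static string Pseudonymize(string value, string salt)
    {
        var normalized = value.Trim().ToLowerInvariant();
        var bytes = Encoding.UTF8.GetBytes(salt + normalized);
        var hex = Convert.ToHexString(SHA256.HashData(bytes));
        return hex.Substring(0, 16).ToLowerInvariant();
    }
}
```

Inside a processor you would overwrite the property (for example an Email custom dimension) with Redaction.Pseudonymize(email, salt) and then pass the item on. Keep the salt in configuration or a secret store rather than in code.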

Adaptive sampling and blind spots

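Adaptive sampling targets a fixed item rate (default 5/s per instance) and computes a sampling percentage on the fly. Related items share one keep-or-drop decision, so if a requests row survives, the matching dependencies, traces and exceptions rows survive too. Exceptions and rarely-seen items are not exempt by default: pin them yourself, either by excluding the Exception type from sampling or by setting SamplingPercentage to 100 on the item in a telemetry initializer.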

The blind spots:

Counts are extrapolated, but only where the tooling does it for you. Each stored row carries an itemCount (10 if sampling kept 1 of every 10 requests). Portal metric charts weight by it; a Logs query does not, so count() returns the number of stored rows and you need sum(itemCount) for the estimated total. Percentile calculations approximate from the sampled rows. For high-traffic services this is fine; for rare events you can over- or under-count.
Cardinality of low-frequency dimensions is unstable. If a tenant gets one request per minute and sampling keeps 10%, you might see them and you might not.
Custom processors run before sampling, but custom logic that depends on item correlation should reason about the sampling outcome (e.g. "if I keep this request, will the dependency rows survive too?").
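
The extrapolation arithmetic is simple enough to sketch. The code assumes each stored row carries an item count equal to the number of original items it stands for (100 divided by the sampling percentage), which mirrors the itemCount column in the App Insights tables; SamplingMath and StoredRow are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SamplingMath
{
    // One row that survived sampling; ItemCount is how many originals it represents.
    public sealed record StoredRow(string Name, bool Success, int ItemCount);

    // What a plain row count sees: only the survivors.
    public static int RowCount(IEnumerable<StoredRow> rows) => rows.Count();

    // Estimated true volume: weight every survivor by its ItemCount.
    public static int EstimatedCount(IEnumerable<StoredRow> rows) => rows.Sum(r => r.ItemCount);

    // Estimated error rate in percent, weighted the same way.
    public static double EstimatedErrorRate(IEnumerable<StoredRow> rows)
    {
        var list = rows.ToList();
        var total = list.Sum(r => r.ItemCount);
        if (total == 0) return 0;
        var failed = list.Where(r => !r.Success).Sum(r => r.ItemCount);
        return 100.0 * failed / total;
    }
}
```

With three stored rows at 10% sampling (ItemCount = 10 each, one of them failed), RowCount is 3 and EstimatedCount is 30. In a Logs query the same distinction is count() (stored rows) versus sum(itemCount) (estimated total).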

For ultra-cost-sensitive paths, write a custom adaptive policy: never sample Premium tenants, always keep /checkout traffic, sample static /assets/* more aggressively.
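
One way to express such a policy is a pure function that maps a request to a keep percentage, plus a deterministic keep-or-drop decision keyed on the operation id so that all items of one operation share a verdict. This is a rule-based sketch with fixed, made-up rates (it does not adapt to load); SamplingPolicy, KeepPercentage and ShouldKeep are hypothetical names.

```csharp
using System;

public static class SamplingPolicy
{
    // Percentage of operations to keep for a request. The rates are illustrative.
    public static double KeepPercentage(string path, bool premiumTenant)
    {
        if (premiumTenant) return 100;                                                 // never sample Premium tenants
        if (path.StartsWith("/checkout", StringComparison.OrdinalIgnoreCase)) return 100; // always keep checkout
        if (path.StartsWith("/assets/", StringComparison.OrdinalIgnoreCase)) return 1;    // static assets: aggressive
        return 10;                                                                     // hypothetical default
    }

    // Deterministic decision from the operation id (FNV-1a hash), so the request,
    // its dependencies and its traces are kept or dropped together.
    public static bool ShouldKeep(string operationId, double keepPercentage)
    {
        uint hash = 2166136261u; // FNV-1a offset basis
        unchecked
        {
            foreach (var c in operationId)
            {
                hash ^= c;
                hash *= 16777619u; // FNV-1a prime
            }
        }
        // Map the hash onto 0..9999 and compare against the percentage in basis points.
        return hash % 10000 < keepPercentage * 100;
    }
}
```

A telemetry processor could call ShouldKeep(t.Context.Operation.Id, KeepPercentage(...)) and skip next.Process(t) on false. Note that this sketch does not adjust itemCount, so counts for the dropped share would not be extrapolated.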

Live Metrics stream

The Live Metrics blade shows a real-time, low-latency view of request, dependency and exception activity, unsampled, over roughly the last minute. It is invaluable during a deploy: you see the new code's behaviour in real time without waiting for the ~3 minute App Insights ingest delay.

Live Metrics also reveals what sampling is dropping — if the dashboards say "no errors" but Live Metrics shows them, your sampling policy is mis-tuned. Keep the Live Metrics tab open during any deploy.

Writing KQL against App Insights tables

The five queries you will run more than any others:

// Slowest endpoints, p95 + error rate (weighted by itemCount so sampling does not skew volumes)
requests
| where timestamp > ago(1h)
| summarize
    Count = sum(itemCount),
    p95 = percentile(duration, 95),
    p99 = percentile(duration, 99),
    ErrorRate = 100.0 * sumif(itemCount, success == false) / sum(itemCount)
  by name
| order by p95 desc | take 20

// Dependency failures by target
dependencies
| where timestamp > ago(1h)
| where success == false
| summarize Failures = count() by target, type, resultCode
| order by Failures desc

// End-to-end trace for one operation_Id
let opId = "00000000-0000-0000-0000-000000000000";
union withsource = Table requests, dependencies, traces, exceptions
| where operation_Id == opId
| project timestamp, Table, Name = coalesce(name, type, message), duration, resultCode
| order by timestamp asc

// Per-user timeline
let uid = "alice@contoso.com";
union requests, exceptions, customEvents
| where user_AuthenticatedId == uid
| order by timestamp desc | take 200

// Request × trace join for correlated diagnostics
requests
| where timestamp > ago(1h) and success == false
| join kind=leftouter (
    traces | where timestamp > ago(1h) | project operation_Id, severityLevel, message
) on operation_Id
| project timestamp, name, duration, resultCode, severityLevel, message
| order by timestamp desc

These five plus the eight KQL operators from chapter 12 carry you through 95% of incident response.

Alerts: metric vs log-search, action groups

App Insights supports two alert types:

Metric alerts — operate on platform metrics (CPU, memory, request rate) with sub-minute latency and low cost. Cheap and fast; limited to a small set of dimensions.
Log-search alerts — operate on KQL queries with a configurable evaluation period (1–60 minutes). Expressive (any KQL is valid) but slower and per-evaluation cost. Use when the alert needs joins, filters or custom thresholds.

An action group is the reusable definition of "what to do when an alert fires" — email, SMS, voice, webhook, Logic App, Function. Define them once per environment, attach to many alerts.

The six canonical alerts every production service should have:

Live failure rate — error rate > 5% for 5 minutes (catches bad deploys).
p99 latency — p99 > 2× rolling baseline for 5 minutes (catches downstream degradation).
Dependency health — downstream failure rate > 1% for 5 minutes (catches DB/Kusto wobbles).
Resource saturation — CPU > 80% for 15 minutes (autoscale masks; alert surfaces).
No-traffic detection — request rate == 0 for 10 minutes (catches dead deploys).
Cost guard — autoscale at max instances for >30 minutes (you are over-budget right now).

Tune thresholds against weeks of baseline; an alert that fires three times a day during normal operation is worse than no alert at all.

OpenTelemetry: the converging standard

OpenTelemetry (OTel) is the vendor-neutral observability standard. App Insights, Datadog, Honeycomb, Jaeger and every major backend now accept OTel data; you instrument once and export anywhere.

builder.Services.AddOpenTelemetry()
    .ConfigureResource(r => r.AddService("queries-api", serviceVersion: "1.2.3"))
    .WithTracing(t => t
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddSqlClientInstrumentation()
        .AddSource("MyApp.*")
        .AddAzureMonitorTraceExporter(o => o.ConnectionString = aiConn))
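    .WithMetrics(m => m
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddRuntimeInstrumentation()
        .AddMeter("queries-api") // custom Meters are exported only when registered by name (see Example 2)
        .AddAzureMonitorMetricExporter(o => o.ConnectionString = aiConn));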

New services should default to OTel + the Azure Monitor exporter. Old services on the classic AddApplicationInsightsTelemetry API are fine — both arrive at the same App Insights tables.
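
The AddSource("MyApp.*") line only has an effect if your code creates spans from a matching ActivitySource. A minimal sketch of a custom span, using only System.Diagnostics; CheckoutTracing, the "MyApp.Checkout" source name and the order.id tag are hypothetical.

```csharp
using System;
using System.Diagnostics;

public static class CheckoutTracing
{
    // The name falls under the "MyApp.*" wildcard registered with AddSource.
    public static readonly ActivitySource Source = new ActivitySource("MyApp.Checkout");

    // Runs `work` inside a span and marks the span as failed if it throws.
    public static T Traced<T>(string spanName, string orderId, Func<T> work)
    {
        // StartActivity returns null when no listener (such as the OTel SDK) is subscribed.
        using var activity = Source.StartActivity(spanName, ActivityKind.Internal);
        activity?.SetTag("order.id", orderId);
        try
        {
            return work();
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
    }
}
```

A call such as CheckoutTracing.Traced("price-basket", orderId, () => pricing.Compute(basket)) (pricing and basket being your own objects) would then show up as a child span of the current request once the source is registered.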

Worked examples

Example 1 — Full Serilog + App Insights wiring

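using System.Diagnostics;                           // Activity
using System.Security.Claims;                       // ClaimsPrincipal, FindFirstValue
using Microsoft.ApplicationInsights.Extensibility;  // ITelemetryInitializer
using Serilog;
using Serilog.Events;
using Serilog.Context;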
 
var builder = WebApplication.CreateBuilder(args);
 
Log.Logger = new LoggerConfiguration()
    .ReadFrom.Configuration(builder.Configuration)
    .Enrich.FromLogContext()
    .Enrich.WithMachineName()
    .Enrich.WithProperty("Service", "queries-api")
    .Enrich.WithProperty("Env", builder.Environment.EnvironmentName)
    .WriteTo.Console(new Serilog.Formatting.Compact.CompactJsonFormatter())
    .WriteTo.ApplicationInsights(
        builder.Configuration["ApplicationInsights:ConnectionString"]!,
        TelemetryConverter.Traces)
    .CreateLogger();
builder.Host.UseSerilog();
 
builder.Services.AddApplicationInsightsTelemetry();
builder.Services.AddSingleton<ITelemetryInitializer, TenantInitializer>();
builder.Services.AddApplicationInsightsTelemetryProcessor<HealthCheckFilter>();
 
var app = builder.Build();
 
app.Use(async (ctx, next) =>
{
    using var _ = LogContext.PushProperty("TraceId", Activity.Current?.TraceId.ToString());
    using var __ = LogContext.PushProperty("UserId", ctx.User.FindFirstValue("oid"));
    await next();
});
 
app.MapGet("/api/whoami", (ILogger<Program> log, ClaimsPrincipal user) =>
{
    log.LogInformation("whoami called for {Email}", user.Identity?.Name);
    return new { name = user.Identity?.Name };
});
 
app.Run();

Serilog reads its config from appsettings.json so levels can change without a redeploy.
The middleware pushes TraceId and UserId into LogContext; every downstream log line picks them up.
TenantInitializer and HealthCheckFilter shape the App Insights pipeline.

Example 2 — Custom Meter for metrics

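using System.Diagnostics;          // Stopwatch
using System.Diagnostics.Metrics;  // Meter, Counter<T>, Histogram<T>

public static class QueryMetrics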
{
    public static readonly Meter Meter = new("queries-api", "1.0");
    public static readonly Counter<long> Executed = Meter.CreateCounter<long>("queries.executed");
    public static readonly Histogram<double> Duration = Meter.CreateHistogram<double>("queries.duration_ms");
}
 
app.MapPost("/api/q", async (QueryDto dto, IQueryService svc) =>
{
    var sw = Stopwatch.StartNew();
    try
    {
        var result = await svc.RunAsync(dto);
        QueryMetrics.Executed.Add(1, new("type", dto.Type), new("status", "ok"));
        return Results.Ok(result);
    }
    catch
    {
        QueryMetrics.Executed.Add(1, new("type", dto.Type), new("status", "error"));
        throw;
    }
    finally
    {
        QueryMetrics.Duration.Record(sw.Elapsed.TotalMilliseconds, new("type", dto.Type));
    }
});

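Meter + Counter/Histogram is the .NET 6+ metrics API; works with both App Insights and OpenTelemetry. With OpenTelemetry, register the meter name via AddMeter("queries-api"), otherwise its measurements are never exported.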
Tags (new("type", dto.Type)) are the cardinality control: keep them bounded.
The histogram makes percentile aggregation cheap downstream.
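
Before wiring an exporter, it can help to confirm locally that an instrument actually emits. The sketch below attaches a MeterListener (part of System.Diagnostics.Metrics) to one meter and sums long-valued measurements per instrument; MeterProbe and Attach are hypothetical names, and the probe ignores tags for brevity.

```csharp
using System.Collections.Generic;
using System.Diagnostics.Metrics;

public static class MeterProbe
{
    // Subscribes to every instrument of one meter and sums long-valued
    // measurements by instrument name. Dispose the listener to stop observing.
    public static MeterListener Attach(string meterName, IDictionary<string, long> totals)
    {
        var listener = new MeterListener();

        // Called for each instrument that exists now or is created later.
        listener.InstrumentPublished = (instrument, l) =>
        {
            if (instrument.Meter.Name == meterName)
            {
                l.EnableMeasurementEvents(instrument);
            }
        };

        // Called synchronously on every Add/Record for long-valued instruments.
        listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
        {
            lock (totals)
            {
                totals.TryGetValue(instrument.Name, out var current);
                totals[instrument.Name] = current + value;
            }
        });

        listener.Start();
        return listener;
    }
}
```

In a unit test you might call MeterProbe.Attach("queries-api", totals), exercise the endpoint, and assert that totals["queries.executed"] went up. The double-valued histogram would need a second callback registered for double.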

Example 3 — End-to-end incident KQL

let problemOp = "POST /api/checkout";
let lookback = 1h;
let bucket = 1m;
 
let problemRequests = materialize(
    requests
    | where timestamp > ago(lookback)
    | where name == problemOp
);
 
problemRequests
| summarize Total = count(), Errors = countif(success == false), p95 = percentile(duration, 95)
  by bin(timestamp, bucket)
| render timechart;
 
let problemTraces = problemRequests
| where success == false
| project operation_Id;
 
union
    (problemTraces | join kind=inner (traces) on operation_Id | project timestamp, operation_Id, severityLevel, message),
    (problemTraces | join kind=inner (exceptions) on operation_Id | project timestamp, operation_Id, type, outerMessage)
| order by timestamp desc
| take 100

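materialize caches the first sub-query because we reference it twice. That only helps when the whole block runs as one query; if your editor treats blank lines as query separators, remove them or repeat the let statements in the second query.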
First query produces the time-series chart; second walks just the failing operations through traces and exceptions.
Both queries hit the index on name + timestamp.

Hands-on exercises

Serilog wiring. Wire Serilog + App Insights into a fresh minimal API. Log a business event with a structured template and LogContext.PushProperty. Confirm it lands in the traces table with the pushed properties as custom dimensions.
- You are done when the property appears as a column in the KQL result.
Two-service trace. Stand up two services (A and B). Service A calls B over HTTP. Confirm a single operation_Id propagates through and that the App Insights Transaction Search shows the waterfall.
- You are done when you can point to the parent-child relationship in the trace UI.
Custom Meter. Add queries.executed and queries.duration_ms per Example 2. Generate 100 requests and confirm they appear in the customMetrics table with the tags as dimensions.
- You are done when you can chart p95 by tag in KQL.
Five KQL queries. Write the five canonical queries from the "Writing KQL" section against your own telemetry. Pin them to an Azure Workbook dashboard.
- You are done when the dashboard renders in under 5 seconds.
Alert + action group. Configure a log-search alert: error rate > 5% for 5 minutes → email action group. Deliberately introduce a 500 in 10% of requests and wait for the alert to fire.
- You are done when the email arrives within the alert window.
OpenTelemetry migration. Replace AddApplicationInsightsTelemetry with the OTel block from the OTel section. Confirm telemetry continuity (the same requests, dependencies, traces rows appear).
- You are done when nothing in your dashboards visibly changes.

Self-check questions

Why message templates instead of string interpolation?
What's the difference between an ITelemetryInitializer and an ITelemetryProcessor?
Explain LogContext.PushProperty and the canonical use cases.
What does traceparent look like and which two ids does it carry?
How does adaptive sampling decide what to keep and what to drop?
How do you make sure exceptions survive App Insights' adaptive sampling?
When do you reach for a metric and when for a log?
Walk through the six canonical alerts and the failure each catches.
What is Live Metrics good for that the regular dashboards are not?
Why is OTel the default for new services?
What's the difference between a metric alert and a log-search alert? Which is faster?
Why is alert fatigue worse than missing one alert?

High-signal resources

Official docs

Books or courses

Observability Engineering — Charity Majors, Liz Fong-Jones, George Miranda. The canonical book.
Site Reliability Engineering + The Site Reliability Workbook — Google. The chapters on SLOs and monitoring are essential.

Practitioner posts

Charity Majors' blog — the foundational "high-cardinality + structured events" arguments.
Honeycomb engineering blog — practical observability deep-dives.
App Insights team blog — releases and OTel migration guides.

Weekly milestones

Day 1. Read the App Insights overview + the W3C Trace Context spec. Do exercise 1. Self-check questions 1–3.
Day 2. Distributed tracing (exercise 2). Self-check questions 4 + 9.
Day 3. Custom metrics (exercise 3). Self-check question 7.
Day 4-5. KQL queries + dashboards (exercise 4). Self-check questions 5–6 + 11.
Day 6-7. Alerts (exercise 5) and OTel migration (exercise 6). Self-check questions 8 + 10 + 12.

How it shows up in the capstone

The capstone wires Serilog + OpenTelemetry → App Insights end-to-end. Every API request carries a W3C traceparent; the trace id flows into Serilog's LogContext, the Kusto SDK's ClientRequestId, and SQL command tags. Custom Meters: queries.executed, queries.duration_ms, cache.hits. Six canonical alerts fire into a Teams channel via a single action group.

A single Azure Workbook dashboard renders the standard six tiles (live health, request rate by op, p50/p95/p99, dependency health, exception clusters, infra utilisation). Adaptive sampling targets 5 items/s per instance; the health-check filter from Example 1 keeps /health/* out of the bill. Pre-prod telemetry goes to a separate App Insights resource so capacity planning is honest.
