MS Stack Ch 19 — IaC and rollout orchestration
Bicep modules, what-if previews, environments, region rollouts. Awareness of Ev2 (Microsoft internal), Spinnaker, ArgoCD. The pattern for safely deploying infrastructure + apps across regions.
Chapter 19 of From Novice to Fluent on the Modern Microsoft Web Stack — a 22-chapter self-study plan.
Why this chapter
The pipeline from chapter 18 deploys code. Infrastructure — the App Service, the database, the storage account, the network — has to be deployable too, and from the same source-of-truth model: declarative, reviewable, version-controlled, repeatable. That is what Infrastructure-as-Code (IaC) means in practice. Inside Azure, the modern primitive is Bicep; underneath it is ARM (Azure Resource Manager) and its JSON templates.
But IaC is only half the story. The other half is rollout orchestration: how do you safely apply infrastructure changes and code deployments across regions, with bake times, approvals, and automatic rollback when health drops? Inside Microsoft, the answer is Ev2 (Express v2). Outside, the comparable tools are Spinnaker, Argo Rollouts, Octopus Deploy, and GitHub Actions environments chained with custom logic. The shapes differ; the principles are nearly identical.
Shipping-grade looks like "I can author a Bicep file that defines an App Service plan, a web app, and a Key Vault, and deploy it idempotently through a pipeline". Expert-tier looks like "I can split infra from app deploys, design a dev → preprod → canary → broad → global rollout with bake times, and recover when the canary region fails health checks".
You finish this chapter when you can describe — without notes — the full deploy plan for a multi-region web app: where IaC stops and code deploy begins, what each rollout stage proves, and how the system rolls back when a stage fails.
Concepts and depth
ARM templates and Bicep
ARM (Azure Resource Manager) is the API layer for all Azure resource operations. Every CLI call, every portal click, every Terraform apply ends up as an ARM request. ARM accepts declarative templates in JSON: a top-level object listing resources, parameters, variables, outputs. Templates are idempotent — submit the same template twice and the second submission is a no-op (assuming no drift).
JSON ARM templates are correct, complete, and miserable to write. The schema is verbose, the dependency model is manual (dependsOn arrays you have to remember), and the syntax for expressions ([concat(...)], [reference(...)]) is its own dialect. Bicep is a domain-specific language whose compiler transpiles to ARM JSON, while looking like a sensible Python-meets-TypeScript file. The semantics are the same and the authoring experience is dramatically better: among other things, Bicep infers dependencies from symbolic references (plan.id, kv.outputs.name) instead of making you declare them.
// main.bicep
param location string = resourceGroup().location
param appName string
param sku string = 'P1v3'
resource plan 'Microsoft.Web/serverfarms@2024-04-01' = {
  name: '${appName}-plan'
  location: location
  sku: { name: sku }
  properties: { reserved: true } // Linux
}
resource app 'Microsoft.Web/sites@2024-04-01' = {
  name: appName
  location: location
  properties: {
    serverFarmId: plan.id
    httpsOnly: true
    siteConfig: {
      linuxFxVersion: 'DOTNETCORE|8.0'
      minTlsVersion: '1.2'
    }
  }
}
output appUrl string = 'https://${app.properties.defaultHostName}'
Compile Bicep with bicep build main.bicep (emits main.json) or let az deployment consume Bicep directly. Modules (module x './kv.bicep' = { ... }) are the unit of reuse — your KV definition, your App Service definition, each in its own file, composed in the root template.
Deployment scopes matter: resourceGroup (most common), subscription (for things that span RGs, like assigning roles or creating RGs themselves), managementGroup, and tenant. The az deployment <scope> create command must match the template's targetScope. Mismatches yield an unhelpful error.
# Resource-group scope
az deployment group create \
--resource-group rg-myapp-prod \
--template-file main.bicep \
--parameters appName=myapi-prod
# Subscription scope (e.g., creates RGs)
az deployment sub create \
--location eastus2 \
--template-file subscription.bicep
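A small guard makes that mismatch much harder: read targetScope out of the Bicep file and choose the az deployment subcommand from it, rather than hard-coding both. The sketch below uses hypothetical helper names (bicep_target_scope, az_scope_subcommand). It assumes targetScope, when present, is declared on a single line, and it relies on Bicep's default scope being resourceGroup when no targetScope is declared.

```bash
#!/usr/bin/env bash
# Sketch: derive the "az deployment <scope> create" subcommand from a Bicep
# file's targetScope, so the CLI scope cannot silently disagree with it.
# Assumption: targetScope, when present, sits on its own line in the form
#   targetScope = '<scope>'
# When the line is absent, Bicep's default scope is resourceGroup.

# bicep_target_scope FILE: print the declared scope (default resourceGroup)
bicep_target_scope() {
  local scope
  scope="$(sed -n "s/^[[:space:]]*targetScope[[:space:]]*=[[:space:]]*'\([A-Za-z]*\)'.*/\1/p" "$1" | head -n 1)"
  echo "${scope:-resourceGroup}"
}

# az_scope_subcommand SCOPE: print the matching az deployment subcommand
az_scope_subcommand() {
  case "$1" in
    resourceGroup)   echo "group" ;;
    subscription)    echo "sub" ;;
    managementGroup) echo "mg" ;;
    tenant)          echo "tenant" ;;
    *) echo "unknown targetScope: $1" >&2; return 1 ;;
  esac
}
```

Usage, under the same assumptions: az deployment "$(az_scope_subcommand "$(bicep_target_scope main.bicep)")" create --template-file main.bicep, followed by the arguments that scope needs (--resource-group for group; --location for sub and tenant; --management-group-id and --location for mg).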
The what-if preview (az deployment group what-if ...) shows what ARM would change before you apply. It is the IaC equivalent of git diff and should be the default mode in PR review: post the what-if as a PR comment, require it to be clean (or explicitly approved) before merge.
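One way to produce that PR comment is a small wrapper around what-if. This is a sketch with placeholder names: whatif_comment is hypothetical, the resource group, template and parameter file are examples, and actually posting the body is left to whatever PR-comment mechanism your CI uses.

```bash
#!/usr/bin/env bash
# Sketch: wrap an ARM what-if preview in a Markdown body for a PR comment.
# Resource group, template and parameter file are placeholders; posting the
# body (a PR-comment REST API or a marketplace task) is deliberately left out.

# whatif_comment RG TEMPLATE PARAMS: print a comment body; non-zero if what-if fails
whatif_comment() {
  local rg="$1" template="$2" params="$3" preview
  if ! preview="$(az deployment group what-if \
      --resource-group "$rg" \
      --template-file "$template" \
      --parameters "$params" 2>&1)"; then
    # Post the error rather than an empty diff, and fail the PR check.
    printf '### what-if FAILED for %s\n\n~~~\n%s\n~~~\n' "$rg" "$preview"
    return 1
  fi
  printf '### what-if for %s\n\n~~~\n%s\n~~~\n' "$rg" "$preview"
}
```

In a PR-triggered job, something like whatif_comment rg-myapp-prod infra/main.bicep infra/prod.bicepparam > whatif.md produces the file to post. The non-zero exit on failure turns the PR check red instead of posting an empty diff that looks clean.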
- Bicep file per service; modules for cross-cutting concerns (KV, log workspace)
- az deployment group create from a pipeline with federated identity
- what-if posted on every PR
- Outputs consumed by downstream stages (URL, connection string ref)
- Bicep registry hosting shared modules with semver
- Subscription-scoped Bicep creates RGs + RBAC + policy assignments
- Drift detection on a schedule: re-run what-if against prod, alert if non-empty
- Microsoft.Authorization/policyAssignments enforces guardrails in code
Deploy infra, then deploy code
A single pipeline that deploys infra and code together feels efficient and turns out to be the wrong unit. Infrastructure changes shape (a new subnet, a new Key Vault, a new SKU); they happen weekly, sometimes monthly, and often need separate approval. Code changes happen many times a day. Coupling them means an emergency code rollback drags infra changes with it (or worse, cannot proceed because the infra step is also broken).
The pattern is two pipelines, one source-of-truth repo. The infra pipeline runs on infra/** changes, applies Bicep, produces outputs (resource IDs, URLs) as artifacts. The app pipeline runs on src/** changes, consumes the outputs (or names them by convention), deploys the app code into the existing infra. The two share variable groups and service connections but run independently.
The "infra first" rule is enforced by ordering: the app pipeline's deploy stage does not start until the infra pipeline has completed at least once successfully for that environment. New environments require an infra deploy before any app deploy can target them.
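The routing rule itself is simple enough to state as code. Azure Pipelines evaluates the real path filters; the sketch below only mimics them so you can predict locally which pipeline a change will trigger. pipelines_for_paths is a hypothetical helper, and it assumes exactly the infra/ and src/ layout described above.

```bash
#!/usr/bin/env bash
# Sketch: mimic the two-pipeline path filters to see which pipeline(s) a set
# of changed files would trigger. The real filtering is done by Azure Pipelines.

# pipelines_for_paths PATH...: print "infra" and/or "app", or nothing
pipelines_for_paths() {
  local infra=0 app=0 p
  for p in "$@"; do
    case "$p" in
      infra/*) infra=1 ;;   # infra pipeline: paths include infra/**
      src/*)   app=1 ;;     # app pipeline: paths include src/**
    esac
  done
  (( infra )) && echo "infra"
  (( app )) && echo "app"
  return 0
}
```

For example: mapfile -t changed < <(git diff --name-only origin/main...HEAD); pipelines_for_paths "${changed[@]}". A change that touches both directories triggers both pipelines, and the ordering rule above decides which deploy goes first.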
# app pipeline declares the infra pipeline as a resource
resources:
  pipelines:
  - pipeline: infra
    source: Infra.Bicep
    trigger:
      branches: { include: [ main ] }
stages:
- stage: Deploy
  jobs:
  - deployment: DeployApp
    environment: prod
    strategy:
      runOnce:
        deploy:
          steps:
          - download: infra
            artifact: outputs
          - script: cat $(Pipeline.Workspace)/infra/outputs/appUrl.txt
          - task: AzureWebApp@1
            inputs: { ... }
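The app pipeline above reads a plain appUrl.txt, so the infra side has to write one file per output. A minimal sketch, assuming the deployment kept the Azure CLI's default deployment name (the template's file name, so main for main.bicep) and that the requested names match the Bicep output declarations; export_outputs is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: write each named ARM deployment output to <OUTDIR>/<name>.txt so
# downstream pipelines can read plain values (e.g. appUrl.txt) without JSON.
# Assumes the deployment name is known; az deployment group create defaults
# it to the template's file name ("main" for main.bicep).

# export_outputs RG DEPLOYMENT OUTDIR NAME...
export_outputs() {
  local rg="$1" deployment="$2" outdir="$3" name value
  shift 3
  mkdir -p "$outdir"
  for name in "$@"; do
    value="$(az deployment group show \
      --resource-group "$rg" \
      --name "$deployment" \
      --query "properties.outputs.${name}.value" \
      --output tsv)" || return 1
    if [[ -z "$value" ]]; then
      # A typo in an output name would otherwise ship an empty file.
      echo "output '$name' is missing or empty" >&2
      return 1
    fi
    printf '%s\n' "$value" > "$outdir/$name.txt"
  done
}
```

In an inline script step of the infra pipeline, run it after az deployment group create (for example export_outputs rg-myapp-prod main "$(Build.ArtifactStagingDirectory)/outputs" appUrl) and publish that directory as the outputs artifact.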
- Separate infra and app pipelines
- infra/ and src/ in the same repo, separately triggered
- App pipeline waits for infra to succeed before deploying
- Infra and app pipelines in separate repos; app declares infra as a pipelines: resource
- Infra changes go through a heavier change-management process than code
- Per-environment promotion: infra-dev → infra-staging → infra-prod, each gated
Idempotency and convergence
The IaC promise is "the template describes desired state; apply it as many times as you want and the result is the same". That property is idempotency; the process by which each apply moves real resources toward the described state is convergence. ARM achieves convergence by treating each resource as an upsert: if the resource exists with matching properties, no-op; if it exists with different properties, update; if it does not exist, create.
The first gotcha is that not every change converges in place. A resource's name is its identity, and some properties are immutable once the resource exists. Rename a storage account in the template and ARM creates a new, empty account rather than renaming the old one; the old account and its data are left behind, or deleted if you deploy in complete mode. Changing an immutable property usually fails the deployment instead. Read the resource provider docs, and read the what-if preview, which shows these creates and deletes before you apply.
Some operations are inherently out-of-band: secret values in Key Vault, role assignments at non-managed scopes, data-plane operations like uploading a blob. These should not be in Bicep. Use scripts (or specialised tasks) to manage them, but accept that those operations are not idempotent in the same way.
The third gotcha is drift: someone clicks in the portal and changes a setting, and your Bicep is now out of sync with reality. The fix is twofold. Take standing write access to prod away from humans (RBAC, so only the pipeline's identity can change resources; Azure Policy can additionally deny non-compliant settings whoever makes them), and schedule a periodic what-if run that alerts on non-empty diffs.
- Templates are safe to re-run; verified by re-running in CI
- Destructive changes (rename, SKU change) reviewed manually
- No portal edits in prod (write access limited to the pipeline identity; policy denies non-compliant settings)
- Nightly drift scan; deltas open PRs automatically
- Out-of-band operations (secrets, blobs) clearly partitioned from Bicep
- Recreate-on-change properties flagged in code review checklist
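The nightly drift scan from the list above can be a thin wrapper around what-if's JSON output. A sketch, assuming the Azure CLI's --no-pretty-print mode (which returns the raw what-if result, including a changes array with a changeType per resource) and placeholder resource names; whatif_count and drift_scan are hypothetical helpers.

```bash
#!/usr/bin/env bash
# Sketch: nightly drift scan. Re-runs what-if against prod and exits non-zero
# when live resources no longer match the template. The resource group,
# template and parameter file are placeholders.

# whatif_count RG TEMPLATE PARAMS FILTER: count what-if changes that match a
# JMESPath filter expression (e.g. on changeType)
whatif_count() {
  az deployment group what-if \
    --resource-group "$1" \
    --template-file "$2" \
    --parameters "$3" \
    --no-pretty-print \
    --query "length(changes[?${4}])" \
    --output tsv
}

# drift_scan RG TEMPLATE PARAMS: exit 0 when clean, 1 on drift, 2 if what-if fails
drift_scan() {
  local drifted
  # NoChange: matches the template. Ignore: resources the template does not manage.
  drifted="$(whatif_count "$1" "$2" "$3" "changeType!='NoChange' && changeType!='Ignore'")" || return 2
  echo "what-if reports ${drifted} change(s) against $1"
  [[ "$drifted" == "0" ]] || return 1
}
```

The exit codes are meant for the scheduler: 0 is clean, 1 is drift (open an issue or PR), and 2 means the scan itself failed. Alert on 2 separately; a scan that silently stops working is how drift goes unnoticed.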
Region rollout patterns
A single-region prod deploy is a special case. The real shape of a safe deploy is a staged rollout across multiple regions or rings:
- Dev — your own team's environment, single region, no real users. Bake: minutes.
- Preprod — production-like data and config, internal testers only. Bake: hours.
- Canary — one production region, small fraction of real users. Bake: hours to a day.
- Broad — a wave of regions, larger user fraction. Bake: hours.
- Global — remaining regions. Bake: continuous monitoring.
Between each stage is a bake time: a window during which the deploy must not break the SLO. "Not break" is defined by quantitative health signals — error rate stays under a threshold, p95 latency stays under another, no critical alerts fire. If any signal breaches, the rollout halts and an alert wakes the on-call.
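Whatever collects the signals, the decision at the end of a bake reduces to a comparison. A minimal sketch, assuming error rates arrive as integer basis points and p95 latency as integer milliseconds; fetching them from telemetry is out of scope here, critical alerts are not modelled, and bake_gate is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: go/no-go at the end of a bake window. Error rates are integer basis
# points (1 bp = 0.01%) so plain bash arithmetic is enough; thresholds are
# inputs, not recommendations.

# bake_gate BASELINE_BP OBSERVED_BP ALLOWED_DELTA_BP P95_MS P95_LIMIT_MS
# Returns 0 (keep rolling) or 1 (halt), printing the reason either way.
bake_gate() {
  local baseline=$1 observed=$2 delta=$3 p95=$4 p95_limit=$5
  # The error rate must stay strictly under baseline + delta.
  if (( observed >= baseline + delta )); then
    echo "HALT: error rate ${observed}bp >= baseline ${baseline}bp + ${delta}bp"
    return 1
  fi
  # p95 latency must stay strictly under its limit.
  if (( p95 >= p95_limit )); then
    echo "HALT: p95 ${p95}ms >= limit ${p95_limit}ms"
    return 1
  fi
  echo "PASS: error rate ${observed}bp, p95 ${p95}ms"
}
```

For instance, bake_gate 20 45 50 310 500 passes (45 bp is below 20 + 50, and 310 ms is below 500 ms), while an observed rate of 70 bp against the same baseline halts. Working in integer basis points keeps the arithmetic in plain bash with no floating point.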
Auto-rollback is the response to a bad bake. It comes in two flavours. Slot-swap rollback, available in App Service, swaps the old version back; it takes seconds. Redeploy rollback, the general form, re-runs the deploy stage against the previous artifact; it takes as long as a normal deploy. The first is preferable; the second is the fallback.
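For App Service, the slot-swap flavour is a single CLI command in each direction. A sketch, assuming the new build was deployed to a slot named staging (an assumption; any non-production slot works) on a plan tier that supports deployment slots; slot_promote and slot_rollback are hypothetical names.

```bash
#!/usr/bin/env bash
# Sketch: slot-swap promotion and rollback for App Service.
# Assumes the new build is already deployed to a slot named "staging".

# slot_promote RG APP: swap staging into production; production's previous
# build lands in the staging slot.
slot_promote() {
  az webapp deployment slot swap \
    --resource-group "$1" --name "$2" \
    --slot staging --target-slot production
}

# slot_rollback RG APP: swapping again puts the previous build back in production.
slot_rollback() {
  echo "rolling back $2 by swapping slots again"
  slot_promote "$1" "$2"
}
```

The rollback is the same command as the promotion: after a swap, the staging slot holds what production was running, so swapping again restores it. That symmetry is why this path takes seconds, and also why it only works while the previous build is still sitting in the slot.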
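The fan-out shape is usually wave-based: canary in one region, wave 1 in two non-paired regions, wave 2 in four, and so on. Pairing matters because a region's pair is normally its disaster-recovery partner (Azure sequences its own platform updates so both halves of a pair are not updated at once); deploying to both halves in the same wave means one bad change can take down a region and its failover target together. Spreading waves across non-paired regions limits blast radius. The exact wave shape depends on your customer geography and your SLO commitments.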
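The doubling fan-out and the non-paired rule can be checked mechanically before a rollout starts. A sketch: plan_waves and region_pair are hypothetical helpers, and the pair table is a small hand-copied subset for illustration; check Azure's published region-pair list for the regions you actually deploy to.

```bash
#!/usr/bin/env bash
# Sketch: split an ordered region list into a one-region canary followed by
# doubling waves (1, 2, 4, ...), refusing any wave that holds both halves of
# a region pair. region_pair is an illustrative subset, not the full table.

region_pair() {
  case "$1" in
    eastus2) echo centralus ;;                centralus) echo eastus2 ;;
    westeurope) echo northeurope ;;           northeurope) echo westeurope ;;
    southeastasia) echo eastasia ;;           eastasia) echo southeastasia ;;
    australiaeast) echo australiasoutheast ;; australiasoutheast) echo australiaeast ;;
    uksouth) echo ukwest ;;                   ukwest) echo uksouth ;;
    *) echo "" ;;
  esac
}

# plan_waves REGION...: print one wave per line (the first line is the canary)
plan_waves() {
  local -a regions=("$@") wave=()
  local size=1 i=0 r s
  while (( i < ${#regions[@]} )); do
    wave=("${regions[@]:i:size}")          # next slice of up to $size regions
    for r in "${wave[@]}"; do
      for s in "${wave[@]}"; do
        if [[ "$(region_pair "$r")" == "$s" ]]; then
          echo "wave '${wave[*]}' contains the pair $r/$s" >&2
          return 1
        fi
      done
    done
    echo "${wave[*]}"
    (( i += size, size *= 2 ))             # advance, then double the wave size
  done
}
```

With seven regions in priority order, plan_waves eastus2 westeurope southeastasia australiaeast northeurope uksouth centralus prints three lines: the canary (eastus2), a wave of two, and a wave of four. Reordering so that westeurope and northeurope land in the same wave makes it refuse the plan.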
- Dev → preprod → canary → broad → global
- 1-hour bakes minimum at canary; longer for prod waves
- Health gate: error rate < baseline + N basis points
- Slot-swap rollback wired in
- Customer-aware rings, not just regions
- Bake times derived from traffic volume (need N requests for stat-sig)
- Auto-rollback drives a postmortem regardless of cause
- Rollout-engine signals streamed to a central dashboard
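The traffic-derived bake time in the list above can be computed rather than guessed. A sketch, assuming you have already decided how many requests a region must serve before its error rate is trustworthy (that number is an input here, not something this helper derives) and that you know the region's request rate; bake_minutes is a hypothetical helper.

```bash
#!/usr/bin/env bash
# Sketch: convert a required request count into a bake duration at a
# region's observed request rate, never going below a floor.

# bake_minutes REQUIRED_REQUESTS REQUESTS_PER_MINUTE FLOOR_MINUTES
bake_minutes() {
  local required=$1 rpm=$2 floor=$3 minutes
  if (( rpm <= 0 )); then
    echo "request rate must be positive" >&2
    return 1
  fi
  minutes=$(( (required + rpm - 1) / rpm ))   # ceiling division
  (( minutes < floor )) && minutes=$floor
  echo "$minutes"
}
```

For example, 200,000 required requests at 1,500 requests per minute gives 134 minutes, while the same requirement in a region serving 15,000 requests per minute hits a 60-minute floor instead. Quiet regions get longer bakes, which is the point: they need more wall-clock time to produce the same evidence.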
Approval workflows and change management
Production deploys in regulated industries (finance, healthcare, anything touching customer money) require a change ticket: a record stating what is changing, why, who approved it, what the rollback plan is. The pipeline integrates by querying the change-management system before the prod stage runs; if the ticket is not approved or not in the deploy window, the gate blocks.
Even outside regulated contexts, the discipline is valuable. The change record is the post-incident artifact; the approval is the second pair of eyes; the deploy window prevents Friday-evening surprises. Azure DevOps environments support REST gates that call the change-management API; ServiceNow, Jira Service Management, and most enterprise tools have webhooks for this.
The lighter-weight version is the deploy calendar + freeze windows. Code freezes are a fact of life: Black Friday, end of quarter, regulatory deadlines. The pipeline must respect a freeze — typically by checking an org-wide "deploy allowed?" service before each prod stage.
Approvals are not just "click OK". They are an opportunity for a second engineer to inspect the deploy: what changed in the diff, what's in the artifact, what's the rollback plan. Treating approvals as paperwork is how you ship the wrong thing.
- Manual approval on every prod stage
- Business-hours check enforced
- Freeze windows respected via a REST gate
- Change-management ticket required, status verified by gate
- Approver rota documented; conflict-of-interest rules
- Auto-rollback skips approvals (correct: getting back to known-good is always allowed)
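A freeze-and-hours gate like the one the list describes can be a few lines of shell that the pipeline calls before each prod stage. A sketch under stated assumptions: the freeze-file format (one START END pair of YYYY-MM-DD dates per line, inclusive, with # comments) and the 09:00 to 17:00 window are invented for illustration, and deploy_allowed is a hypothetical helper. Rollbacks bypass every check, matching the last item above.

```bash
#!/usr/bin/env bash
# Sketch: "deploy allowed?" gate combining a freeze calendar and business
# hours. File format and hours are illustrative assumptions.

# deploy_allowed MODE DATE HOUR FREEZE_FILE
#   MODE: "deploy" or "rollback"; DATE: YYYY-MM-DD; HOUR: 0-23, local time
#   Allowed window: 09:00 up to (not including) 17:00.
deploy_allowed() {
  local mode=$1 date=$2 hour=$3 freeze_file=$4 start end
  if [[ "$mode" == "rollback" ]]; then
    echo "allowed: rollback"       # getting back to known-good is always allowed
    return 0
  fi
  while read -r start end || [[ -n "$start" ]]; do
    [[ -z "$start" || "$start" == \#* ]] && continue   # skip blanks and comments
    # ISO dates compare correctly as plain strings.
    if [[ ! "$date" < "$start" && ! "$date" > "$end" ]]; then
      echo "blocked: freeze $start..$end"
      return 1
    fi
  done < "$freeze_file"
  if (( 10#$hour < 9 || 10#$hour >= 17 )); then      # 10# avoids octal for "08"
    echo "blocked: outside business hours"
    return 1
  fi
  echo "allowed"
}
```

In practice the date and hour would come from something like date +%F and date +%H in the team's time zone, and a change-ticket lookup would be one more check before the final "allowed".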
Ev2 (Microsoft-internal) in depth
Inside Microsoft, the canonical rollout orchestrator is Ev2 (Express v2). It is not in the public docs, but its concepts are general: most large cloud orgs end up with something like it, and understanding it makes the analogues clearer.
An Ev2 rollout is described by a small set of YAML and JSON files in a service group:
- ServiceModel: the declarative description of the service (which regions it lives in, which roles or sub-services it has, how they relate). The shape of the world.
- RolloutSpec: the rollout plan, covering which stages run, in what order, what each stage does, and what gates exist between them.
- ScopeBindings: the mapping from logical names (such as primaryRegion) to physical Azure subscriptions and resource groups.
- Parameters: per-environment parameter files (prod parameters, preprod parameters).
- Templates: ARM templates (.json) that each stage applies.
- Scripts: PowerShell or Bash scripts that each stage runs for non-ARM operations (warmup, drain, custom checks).
A rollout consumes a build artifact (the pipeline output from chapter 18) and a rollout spec, then walks through the stages defined in the spec, applying ARM templates per scope and running scripts per scope. Each stage has a bake time and a health gate; failure halts the rollout and (depending on config) triggers a rollback.
Two flavours exist: Classic Ev2 (older, more verbose, ARM-only) and Modern Ev2 (newer, supports container images and Helm, slightly nicer authoring). New services start on Modern. The migration story is well-trodden.
The transferable lesson is the separation between service model and rollout plan. The model says what exists; the spec says how to change it. The same model can be rolled out by different specs for different scenarios (normal release, emergency patch, regional failover). Argo Rollouts (Kubernetes) draws a related line by keeping reusable health analysis (the AnalysisTemplate CRD) separate from the Rollout CRD that references it; Spinnaker does it with its pipeline + canary-analysis stages.
- Inside MS: know ServiceModel, RolloutSpec, ScopeBindings, Parameters, Templates, Scripts
- Outside MS: pick one orchestrator (Spinnaker / Argo / Octopus) and learn it deeply
- Separate "what exists" from "how we change it" in your design
- Author both Classic + Modern Ev2 specs; understand migration tradeoffs
- Custom health checks integrated with the rollout engine
- Emergency-rollout specs that skip non-safety gates
Transferable analogues outside Microsoft
If you are not at Microsoft, the same problems are solved by:
- Spinnaker: Netflix-origin multi-cloud orchestrator. Pipelines with stages, canary analysis via Kayenta, deep AWS/GCP/Kubernetes integration. Heavy operationally; runs as a cluster of services.
- Argo Rollouts: Kubernetes-native progressive delivery. The Rollout CRD replaces Deployment and supports blue-green, canary, and traffic splitting via service meshes. Lightweight if you are already on Kubernetes.
- Octopus Deploy: commercial tool popular in .NET shops. Project-based UI, strong environment promotion model, good story for Windows + Linux + container deploys. Less programmable than Spinnaker, faster to learn.
- GitHub Actions environments: for smaller orgs or single-product setups, environments with required reviewers, deployment branches, and custom workflows do most of what you need. Less specialised than Spinnaker but plenty for a 5–20 service org.
- Flux or Argo CD: GitOps. The desired state lives in Git; the cluster pulls and converges. Rollout patterns layer on top (Argo Rollouts on top of Argo CD).
Pick one, learn it well, and remember that the principles travel: separate model from rollout, bake before broaden, gate on health, automate the rollback.
Worked examples
A Bicep module with parameters and outputs
// modules/webapp.bicep
@description('Name of the App Service.')
param appName string
@description('Azure region.')
param location string = resourceGroup().location
@description('App Service plan SKU.')
@allowed([ 'B1', 'P1v3', 'P2v3' ])
param sku string = 'P1v3'
@description('App settings to inject (key-value pairs).')
param appSettings object = {}
resource plan 'Microsoft.Web/serverfarms@2024-04-01' = {
  name: '${appName}-plan'
  location: location
  sku: { name: sku }
  properties: { reserved: true }
}
resource app 'Microsoft.Web/sites@2024-04-01' = {
  name: appName
  location: location
  identity: { type: 'SystemAssigned' }
  properties: {
    serverFarmId: plan.id
    httpsOnly: true
    siteConfig: {
      linuxFxVersion: 'DOTNETCORE|8.0'
      minTlsVersion: '1.2'
      ftpsState: 'Disabled'
      appSettings: [ for setting in items(appSettings): {
        name: setting.key
        value: setting.value
      } ]
    }
  }
}
output appUrl string = 'https://${app.properties.defaultHostName}'
output principalId string = app.identity.principalId
What to notice:
- @description and @allowed are decorators: the first documents a parameter, the second restricts it to a fixed set of values, checked by the Bicep compiler for literal arguments and by ARM at deploy time.
- identity: { type: 'SystemAssigned' } gives the app a managed identity; the principalId output lets a downstream KV module grant it access.
- appSettings is an object (key/value map), converted to ARM's name/value array shape via a for ... in items() loop.
- Two outputs: the public URL (for smoke tests) and the principal ID (for RBAC).
Composing modules in a root template
// main.bicep
targetScope = 'resourceGroup'
param env string // 'dev' | 'staging' | 'prod'
param location string = resourceGroup().location
var appName = 'myapi-${env}'
module kv './modules/keyvault.bicep' = {
  name: 'kv-deploy'
  params: {
    name: 'kv-myapp-${env}'
    location: location
  }
}
module web './modules/webapp.bicep' = {
  name: 'web-deploy'
  params: {
    appName: appName
    location: location
    sku: env == 'prod' ? 'P2v3' : 'P1v3'
    appSettings: {
      'ConnectionStrings__Default': '@Microsoft.KeyVault(SecretUri=${kv.outputs.connStrUri})'
      'ApplicationInsights__InstrumentationKey': '@Microsoft.KeyVault(SecretUri=${kv.outputs.aiKeyUri})'
    }
  }
}
// Grant the web app's managed identity Key Vault Secrets User
module rbac './modules/keyvault-rbac.bicep' = {
  name: 'kv-rbac'
  params: {
    kvName: kv.outputs.name
    principalId: web.outputs.principalId
    roleDefinitionId: '4633458b-17de-408a-b874-0445c86b69e6' // Key Vault Secrets User
  }
}
output appUrl string = web.outputs.appUrl
What to notice:
- targetScope is declared explicitly, and az deployment group create matches it.
- Conditional SKU based on env: prod gets bigger.
- App settings use Key Vault references (@Microsoft.KeyVault(...)), so the actual secret value never appears in Bicep or pipeline logs.
- The RBAC module depends on outputs from KV (the vault name) and the web app (the principal ID); Bicep infers the dependency graph from those references, with no dependsOn needed.
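Because the role assignment is the piece most often missing after a first deploy, a post-deploy check is cheap insurance. A sketch: has_kv_secrets_user is a hypothetical helper, the role name is the built-in Key Vault Secrets User role granted in the template above, and the usage comments use this chapter's example resource names.

```bash
#!/usr/bin/env bash
# Sketch: confirm a principal holds "Key Vault Secrets User" on a vault.

# has_kv_secrets_user PRINCIPAL_ID KV_RESOURCE_ID
# Exit 0 if assigned, 1 if not, 2 if the lookup itself failed.
has_kv_secrets_user() {
  local roles
  roles="$(az role assignment list \
    --assignee "$1" \
    --scope "$2" \
    --query "[].roleDefinitionName" \
    --output tsv)" || return 2
  grep -qx "Key Vault Secrets User" <<< "$roles"
}

# Usage with the chapter's example names:
#   principal="$(az webapp identity show -g rg-myapp-prod -n myapi-prod --query principalId -o tsv)"
#   kv_id="$(az keyvault show -n kv-myapp-prod --query id -o tsv)"
#   has_kv_secrets_user "$principal" "$kv_id" || echo "role assignment missing"
```

Newly created role assignments can take a few minutes to propagate, so a check run immediately after deployment may need a retry before it is treated as a failure.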
A pipeline that deploys infra then app
# infra pipeline (infra-pipelines.yml)
trigger:
  branches: { include: [ main ] }
  paths: { include: [ 'infra/**' ] }
stages:
- stage: WhatIf
  jobs:
  - job: WhatIfProd
    steps:
    - task: AzureCLI@2
      inputs:
        azureSubscription: 'svc-conn-prod-fed'
        scriptType: bash
        scriptLocation: inlineScript
        inlineScript: |
          az deployment group what-if \
            --resource-group rg-myapp-prod \
            --template-file infra/main.bicep \
            --parameters infra/prod.bicepparam
- stage: Deploy_Prod
  dependsOn: WhatIf
  jobs:
  - deployment: DeployProd
    environment: prod-infra
    strategy:
      runOnce:
        deploy:
          steps:
          - checkout: self   # deployment jobs do not check out the repo by default; infra/ lives there
          - task: AzureCLI@2
            inputs:
              azureSubscription: 'svc-conn-prod-fed'
              scriptType: bash
              scriptLocation: inlineScript
              inlineScript: |
                az deployment group create \
                  --resource-group rg-myapp-prod \
                  --template-file infra/main.bicep \
                  --parameters infra/prod.bicepparam \
                  --query properties.outputs > outputs.json
          - publish: outputs.json
            artifact: outputs
What to notice:
- The path filter limits the trigger to infra/** changes; code changes do not trigger this pipeline.
- what-if runs first as a separate stage, and the deploy stage depends on it succeeding. Succeeding only means the preview ran; a human still has to read it, which is what the prod-infra approval is for.
- The deploy stage publishes the ARM outputs as an artifact for the app pipeline to consume.
- A dedicated prod-infra environment carries its own approvals (typically tighter than app deploys).
A canary + waves rollout
- stage: Canary
  jobs:
  - deployment: CanaryEastUS2
    environment: prod-eastus2
    timeoutInMinutes: 120   # the default job timeout is 60 minutes; the bake step alone takes 60
    strategy:
      runOnce:
        deploy:
          steps:
          - checkout: self    # deployment jobs skip checkout; scripts/ lives in the repo
          - task: AzureWebApp@1
            inputs: { azureSubscription: 'svc-conn-prod-fed', appType: webAppLinux, appName: 'myapi-eastus2', package: $(Pipeline.Workspace)/api }
          - script: ./scripts/smoke.sh https://myapi-eastus2.azurewebsites.net
            displayName: Smoke
          - script: ./scripts/bake-check.sh eastus2 3600
            displayName: Bake (60min, error rate < baseline + 50bps)
- stage: Wave1
  dependsOn: Canary
  condition: succeeded()
  jobs:
  - deployment: WestEurope
    environment: prod-westeurope
    strategy: { runOnce: { deploy: { steps: [ ... ] } } }
  - deployment: SouthAsia
    environment: prod-southasia
    strategy: { runOnce: { deploy: { steps: [ ... ] } } }
- stage: Wave2
  dependsOn: Wave1
  condition: succeeded()
  jobs:
  - deployment: AustraliaEast
    environment: prod-australiaeast
    strategy: { runOnce: { deploy: { steps: [ ... ] } } }
  - deployment: WestUS3
    environment: prod-westus3
    strategy: { runOnce: { deploy: { steps: [ ... ] } } }
What to notice:
- Canary runs a smoke test, then a bake script that queries telemetry and exits non-zero if the SLO is breached.
- The regions in a wave are parallel jobs within one stage; the stage succeeds only if every region's job succeeds.
- condition: succeeded() on wave stages halts the rollout if the canary fails.
- Each region has its own environment, giving per-region deployment history.
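The canary calls ./scripts/smoke.sh, which the chapter does not show. One possible shape, as a sketch: the /healthz path, the retry count and the delay are assumptions rather than anything the app above defines, so point it at whatever health endpoint your app exposes.

```bash
#!/usr/bin/env bash
# Sketch of scripts/smoke.sh. The /healthz path, attempt count and delay
# are assumptions; adjust them to the app's real health endpoint.

# smoke BASE_URL [ATTEMPTS] [DELAY_SECONDS]
# Exit 0 as soon as the endpoint answers without an HTTP error.
smoke() {
  local url="${1%/}/healthz" attempts="${2:-10}" delay="${3:-6}" n
  for (( n = 1; n <= attempts; n++ )); do
    # -f: treat HTTP errors (>= 400) as failures; -sS: no progress meter, still show errors.
    if curl -fsS --max-time 10 -o /dev/null "$url"; then
      echo "smoke OK: $url (attempt $n)"
      return 0
    fi
    echo "smoke attempt $n/$attempts failed; retrying in ${delay}s" >&2
    sleep "$delay"
  done
  return 1
}

# When executed directly with a URL (as the pipeline step does), run the check.
if [[ "${BASH_SOURCE[0]}" == "$0" && $# -gt 0 ]]; then
  smoke "$@"
  exit $?
fi
```

Retrying matters because a freshly deployed App Service can take a little while to start answering; without retries, a slow start reads as a failed deploy and triggers a pointless rollback.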
Hands-on exercises
- Goal: Write a Bicep module for App Service + plan. Steps: (1) Author modules/webapp.bicep with parameters for name, location, SKU. (2) Compose it from a root main.bicep. (3) Deploy to a personal Azure subscription. (4) Re-deploy unchanged; verify no-op. You're done when az deployment group create runs twice with identical output.
- Goal: Use what-if in a PR workflow. Steps: (1) Make a Bicep change in a feature branch. (2) Run az deployment group what-if locally; capture the output. (3) Configure a pipeline to run it on PR and post the diff as a comment. You're done when PR comments show the diff before merge.
- Goal: Split infra and app pipelines. Steps: (1) Move infra into its own pipeline with paths: include: ['infra/**']. (2) Make the app pipeline declare the infra pipeline as a pipelines: resource. (3) Trigger an infra change; verify the app pipeline waits. You're done when an infra deploy precedes the next app deploy.
- Goal: Demonstrate drift detection. Steps: (1) Deploy infra with Bicep. (2) Manually edit a property in the portal. (3) Run what-if; observe the drift. (4) Either revert the change in the portal or update the Bicep and redeploy. You're done when drift is detected and resolved.
- Goal: Build a canary stage with bake. Steps: (1) Deploy to one region in a new stage called Canary. (2) Add a script that queries Application Insights for error rate over the last 60 minutes and exits non-zero on breach. (3) Make the next stage dependsOn: Canary with condition: succeeded(). (4) Force a failure (deploy a bad version); verify the next stage does not run. You're done when canary failures halt the rollout.
- Goal: Implement an auto-rollback. Steps: (1) Capture the previous artifact's version (e.g., from a tag or release name). (2) On smoke-test failure post-deploy, run a re-deploy of the previous artifact. (3) Optionally use slot-swap for instant rollback. You're done when a deliberately broken deploy triggers a rollback within minutes.
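For the last exercise, the control flow is the interesting part, so the sketch below abstracts the deployment away. deploy_version and smoke_test are hypothetical hooks you would implement (for example around az webapp deploy or AzureWebApp@1, and scripts/smoke.sh), and deploy_with_rollback illustrates the redeploy flavour of rollback, not slot swap.

```bash
#!/usr/bin/env bash
# Sketch: deploy a version, smoke-test it, and redeploy the previous version
# on failure. deploy_version VERSION and smoke_test VERSION are hypothetical
# hooks supplied by the caller.

# deploy_with_rollback NEW_VERSION PREVIOUS_VERSION
#   0: new version healthy
#   3: new version failed, previous version restored and healthy
#   4: rollback also failed (page the on-call)
deploy_with_rollback() {
  local new=$1 previous=$2
  if deploy_version "$new" && smoke_test "$new"; then
    echo "release $new healthy"
    return 0
  fi
  # A failed deploy may leave partial state, so it is treated like a failed smoke test.
  echo "release $new failed; redeploying $previous" >&2
  if deploy_version "$previous" && smoke_test "$previous"; then
    echo "rolled back to $previous"
    return 3
  fi
  echo "rollback to $previous also failed; page the on-call" >&2
  return 4
}
```

The distinct exit code 3 lets the pipeline report "release failed, production healthy" differently from 4, "production may be broken", which is the case that should wake someone up.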
Self-check questions
1. Why use Bicep instead of writing ARM JSON directly? Give two concrete advantages.
2. What does targetScope do, and what breaks if the CLI call's scope does not match?
3. Explain idempotency in the context of ARM deployments. Name a scenario where a Bicep apply is NOT idempotent.
4. Why split infra and app deploys into separate pipelines? Give a failure mode that occurs when they share one.
5. What is az deployment ... what-if, and where in the workflow should it run?
6. Walk through a dev → preprod → canary → broad → global rollout. What does each stage prove?
7. What is a bake time, and how do you decide its duration?
8. Compare slot-swap rollback vs redeploy rollback. When does each apply?
9. Inside Microsoft, what are ServiceModel, RolloutSpec, ScopeBindings, Parameters, Templates, and Scripts in Ev2? One sentence each.
10. Outside Microsoft, name an analogue for each of: progressive delivery on Kubernetes, multi-cloud orchestrator, .NET-shop deployment tool.
11. What is drift, and how do you detect and prevent it?
12. Why does an auto-rollback skip the manual approval gate?
High-signal resources
Official docs
- Bicep documentation — language, modules, registry.
- ARM template reference — every resource type's schema.
- What-if for deployments.
- Azure Policy — enforce guardrails declaratively.
- Argo Rollouts — progressive delivery on Kubernetes.
- Spinnaker docs — multi-cloud orchestration.
Books or courses
- Infrastructure as Code — Kief Morris. The canonical text; principles outlive tools.
- Cloud Native Patterns — Cornelia Davis. Progressive delivery and rollout patterns in depth.
Practitioner posts
- Netflix Tech Blog on Spinnaker — the origin story and continued iteration.
- Honeycomb's deploy posts — production rollout incidents.
- Microsoft Tech Community: Bicep — tips and patterns.
- SRE Workbook chapter on canarying — Google's framing, broadly applicable.
Weekly milestones
- Day 1: Read Bicep basics; author a single-resource Bicep file; deploy it. Answer self-check 1, 2, 3.
- Day 2: Refactor into modules; add Key Vault + RBAC; deploy. Answer self-check 11.
- Day 3: Wire Bicep into a pipeline with what-if on PR; deploy on merge. Answer self-check 5.
- Day 4-5: Split infra and app pipelines; chain them with a pipelines: resource. Answer self-check 4.
- Day 6-7: Build a canary + waves stage layout with bake + auto-rollback. Read about Ev2 (internally) or Argo/Spinnaker (externally). Answer self-check 6, 7, 8, 9, 10, 12.
How it shows up in the capstone
The capstone repo has an infra/ directory with Bicep modules (App Service, Key Vault, App Insights, log analytics workspace) composed from a main.bicep. A dedicated infra pipeline triggers on infra/** changes, runs what-if on PR, and applies on merge. The app pipeline declares the infra pipeline as a resource and consumes its outputs as artifacts.
The app's prod deploy is shaped as a canary + waves rollout: one region first with a 30-minute bake, then a wave of two non-paired regions, then global. Bake scripts query Application Insights for error rate and p95 latency. Any breach halts the rollout and triggers an auto-rollback via slot swap.
If you are at Microsoft, the capstone's rollout becomes a thin Ev2 spec; if you are elsewhere, it stays in Azure DevOps with environments and gates. The principles — separate model from rollout, bake before broaden, automate the rollback — are identical.
Previous chapter → Ch 18 — CI/CD with Azure DevOps
Next chapter → Ch 20 — Security baseline