Automating audio generation

From text, generate audio files and publish them to a web app

Using a Text-to-Audio Conversion Service to Publish Audio Content to a Web App

In this post, we'll walk through an approach to convert text into audio using Azure AI Speech Service and serve the generated audio files from a web application.


Method Provided by LLM

The following approach was generated with the assistance of an LLM (GitHub Copilot). The code implementation is left to the reader.

Architecture

Text Input → Azure AI Speech Service → .wav/.mp3 file → Azure Blob Storage → Web App

Approach — Step by Step

1. Provision an Azure AI Speech Resource

2. Set Up Your Python Environment

3. Write the Text-to-Audio Conversion Logic

4. Provision Azure Blob Storage

5. Upload the Generated Audio File

6. Serve the Audio in Your Web App

7. End-to-End Flow

Combine everything into a single function that:

1. Takes text input as a parameter
2. Synthesizes it to a local .wav file via Azure Speech
3. Uploads the file to Blob Storage
4. Cleans up the local file
5. Returns the public URL

Key Azure Services Used

| Service | Purpose |
| --- | --- |
| Azure AI Speech | Converts text to natural-sounding audio using neural voices |
| Azure Blob Storage | Hosts the generated audio files with public URL access |

Cost Estimate

| Service | Free Tier | Pay-as-you-go |
| --- | --- | --- |
| Azure AI Speech | 500K chars/month (F0) | ~$1 per 1M chars (S0) |
| Azure Blob Storage | 5 GB free for 12 months | ~$0.02/GB/month |

Important Notes

Cleanup

Delete the resource group when done to avoid charges:

```sh
az group delete --name rg-audio-gen --yes --no-wait
```

Background & Prerequisites — What You Need to Know Before Writing This Blog

To build a text-to-audio pipeline on Azure, you need to understand speech synthesis, cloud storage, and web serving. Below are the foundational topics.


1. Text-to-Speech (TTS) Fundamentals

Why: The core of this project is converting text to audio — you need to understand how modern TTS works.

- Concatenative TTS — Older approach: record a voice actor speaking thousands of phoneme combinations, then concatenate matching segments at runtime. Sounds robotic at boundaries.
- Neural TTS — Modern approach: deep neural networks (WaveNet, Tacotron, VITS) generate speech waveforms directly from text. Produces natural, human-like speech with proper intonation and prosody.
- Azure Neural Voices — Microsoft's neural TTS offering. 400+ voices across 140+ languages. Voices are trained on hours of recorded speech from voice actors. Styles include "cheerful," "sad," "newscast," "customer service."
- Phonemes & Prosody — Phonemes are the individual units of sound. Prosody covers rhythm, stress, and intonation. Neural TTS models learn prosody from training data but can be controlled via SSML.
- Vocoder — Converts the model's intermediate representation (a mel spectrogram) into an audio waveform. HiFi-GAN and WaveRNN are common vocoders.

2. SSML (Speech Synthesis Markup Language)

Why: For production-quality audio, plain text is not enough — SSML gives fine-grained control.

- What is SSML — An XML-based markup language that controls how text is spoken. Supported by all major TTS engines (Azure, Google, AWS Polly).
- Key tags:
  - `<speak>` — Root element
  - `<voice>` — Select a specific voice
  - `<prosody>` — Control rate, pitch, volume (e.g., `<prosody rate="slow" pitch="+10%">`)
  - `<break>` — Insert pauses (e.g., `<break time="500ms"/>`)
  - `<emphasis>` — Stress certain words
  - `<say-as>` — Control interpretation of dates, numbers, abbreviations (e.g., `<say-as interpret-as="date">2026-02-22</say-as>`)
  - `<phoneme>` — Specify exact pronunciation using IPA (International Phonetic Alphabet)
- Multi-voice conversations — Use multiple `<voice>` tags within one SSML document to create dialogue-style audio.

3. Audio Formats & Encoding

Why: Choosing the right audio format affects file size, quality, and browser compatibility.

- WAV (Waveform Audio) — Uncompressed, lossless. Large files (~10MB per minute of speech). Best quality but impractical for web serving.
- MP3 — Compressed, lossy. ~1MB per minute at 128kbps. Universal browser support. Best for web delivery.
- OGG/Opus — Open-source, excellent compression. Slightly better quality than MP3 at the same bitrate. Not universally supported (no Safari on iOS).
- Azure SDK output formats — Configurable via the `SpeechSynthesisOutputFormat` enum. Options include `Audio16Khz32KBitRateMonoMp3`, `Riff24Khz16BitMonoPcm`, etc. Choose based on the quality vs. size tradeoff.
- Sample rate — 16kHz is fine for speech. 24kHz or 48kHz for higher quality. Higher sample rate = larger file.
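The size figures above follow directly from bitrate arithmetic, which can be checked with a couple of pure-Python helpers (no Azure dependencies):

```python
def mp3_size_mb(duration_min: float, kbps: int) -> float:
    """Approximate MP3 file size: bitrate (kilobits/s) * seconds / 8 -> kB -> MB."""
    return kbps * duration_min * 60 / 8 / 1024


def pcm_size_mb(duration_min: float, sample_rate: int,
                bits: int, channels: int) -> float:
    """Uncompressed PCM (WAV) size: samples/s * bytes/sample * channels * seconds."""
    return sample_rate * (bits / 8) * channels * duration_min * 60 / (1024 * 1024)
```

At 128 kbps, MP3 comes to about 0.94 MB per minute, while 44.1 kHz 16-bit stereo PCM (CD-quality WAV) is roughly 10 MB per minute, and 16 kHz 16-bit mono speech PCM is about 1.8 MB per minute, consistent with the estimates above.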

4. Azure Blob Storage Fundamentals

Why: Audio files need to be stored and served — Blob Storage is the hosting layer.

- Storage account types — General-purpose v2 (recommended), BlobStorage (legacy). Always use GPv2.
- Blob types — Block blobs (for files like audio), append blobs (logs), page blobs (VM disks). Audio files are block blobs.
- Access tiers — Hot (frequent access, higher storage cost, lower access cost), Cool (infrequent, 30-day minimum), Archive (rare, hours to rehydrate). Audio served from a web app should be Hot.
- Access levels — Private (default, requires SAS token or auth), Blob (public read for blobs only), Container (public read for container + blobs). For a simple web app, Blob-level access enables direct URL access.
- SAS (Shared Access Signature) — Time-limited, scoped tokens for accessing private blobs. Better than making blobs public in production.
- CDN integration — Azure CDN can cache blob content at edge locations worldwide. Reduces latency for audio playback. Configure a CDN profile pointing to the storage endpoint.

5. Azure AI Speech SDK

Why: The SDK is the primary tool for implementing text-to-audio conversion.

- `SpeechConfig` — Configuration object holding the subscription key, region, voice name, and output format. Created once, reused across synthesis calls.
- `SpeechSynthesizer` — The main class. Methods: `speak_text_async()` (plain text), `speak_ssml_async()` (SSML input). Can output to a file, audio stream, or speaker.
- `AudioOutputConfig` — Controls where audio goes: file path, audio stream, or default speaker. For server-side processing, output to a file or memory stream.
- Event-driven architecture — The SDK fires events: `synthesis_started`, `synthesizing` (partial audio chunks), `synthesis_completed`, `synthesis_canceled`. Use these for progress tracking and error handling.
- Error handling — Check `SpeechSynthesisResult.reason`. Values: `ResultReason.SynthesizingAudioCompleted` (success), `ResultReason.Canceled` (failure). On cancellation, inspect `CancellationDetails` for the error code and message.
- Batch synthesis — For large documents, split into paragraphs, synthesize each, then concatenate the audio files (using pydub or ffmpeg).

6. Web App Integration

Why: The generated audio needs to be served and played in a browser.

- HTML5 `<audio>` element — Native browser audio player. Supports MP3, WAV, OGG. Use the `controls` attribute for a play/pause/seek UI. Use `preload="metadata"` to load the duration without downloading the full file.
- Streaming vs. download — For short clips (<5 min), a direct blob URL works. For longer audio, consider Azure Media Services or progressive download with range requests.
- Flask integration — Create a route that accepts text input (POST), calls the synthesis pipeline, uploads to Blob Storage, and returns the audio URL. Render the audio player in a Jinja2 template.
- Responsive design — The audio player should work on mobile. The HTML5 `<audio>` element is responsive by default but style it with CSS for consistency.
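A minimal Flask sketch of this route, with a stubbed `synthesize_and_upload` helper standing in for the Speech + Blob pipeline (the stub URL is a fake placeholder, not a real endpoint):

```python
from flask import Flask, render_template_string, request

app = Flask(__name__)

# Inline template: text form plus an HTML5 player once a URL is available.
PAGE = """
<form method="post">
  <textarea name="text" required></textarea>
  <button type="submit">Generate audio</button>
</form>
{% if audio_url %}
<audio controls preload="metadata" src="{{ audio_url }}"></audio>
{% endif %}
"""


def synthesize_and_upload(text: str) -> str:
    # Placeholder: in the real app this runs the Speech + Blob pipeline
    # from the steps above and returns the uploaded blob's URL.
    return "https://example.blob.core.windows.net/audio/demo.mp3"


@app.route("/", methods=["GET", "POST"])
def index():
    audio_url = None
    if request.method == "POST":
        audio_url = synthesize_and_upload(request.form["text"])
    return render_template_string(PAGE, audio_url=audio_url)
```

`preload="metadata"` lets the browser show the clip's duration without fetching the whole file, which matters once clips grow beyond a few seconds.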


TODO / Remaining Work


Implementation
