Using Text-to-Audio Conversion Service to Publish Audio Content to a Web App
In this post, we'll walk through an approach to convert text into audio using Azure AI Speech Service and serve the generated audio files from a web application.
Method Provided by LLM
The following approach was generated with the assistance of an LLM (GitHub Copilot). The code implementation is left to the reader.
Architecture
Text Input → Azure AI Speech Service → .wav/.mp3 file → Azure Blob Storage → Web App
Approach — Step by Step
1. Provision an Azure AI Speech Resource
- Go to the Azure Portal → Create a resource → search Speech → select Speech by Microsoft
- Choose a Resource group (or create one, e.g. `rg-audio-gen`)
- Pick a Region close to you (e.g. East US)
- Select a Pricing tier: Free F0 for testing, or Standard S0 for production
- After deployment, navigate to the resource and note down the Key and Region from the Keys and Endpoint blade
- Alternatively, use the Azure CLI: `az cognitiveservices account create` with `--kind SpeechServices`
2. Set Up Your Python Environment
- Install the Azure Speech SDK package: `azure-cognitiveservices-speech`
- Store your Speech key and region as environment variables (`AZURE_SPEECH_KEY`, `AZURE_SPEECH_REGION`) — never hard-code secrets
3. Write the Text-to-Audio Conversion Logic
- Create a `SpeechConfig` object using your key and region
- Set the voice using `speech_synthesis_voice_name` — Azure supports 400+ neural voices across 140+ languages (see the voice gallery)
- Create an `AudioOutputConfig` pointing to a local `.wav` file
- Instantiate a `SpeechSynthesizer` and call `speak_text_async()` with your input text
- Handle both the success case (`SynthesizingAudioCompleted`) and the cancellation/error case
- The SDK writes the audio directly to the specified output file
4. Provision Azure Blob Storage
- Create a Storage Account in the same resource group (e.g. via `az storage account create`)
- Create a Blob Container (e.g. `audio-files`) with public blob-level read access so audio URLs are directly accessible
- Install the `azure-storage-blob` Python package
- Store the storage connection string as an environment variable (`AZURE_STORAGE_CONNECTION_STRING`)
5. Upload the Generated Audio File
- Use `BlobServiceClient.from_connection_string()` to connect
- Get a `BlobClient` for your container and target blob name
- Open the local `.wav` file in binary mode and call `upload_blob()` with `overwrite=True`
- The blob client exposes a `.url` property — this is the public URL of your audio file
- Optionally delete the local file after upload to save disk space
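A sketch of the upload step, assuming `azure-storage-blob` is installed and `AZURE_STORAGE_CONNECTION_STRING` is set (the function name is mine):

```python
import os

from azure.storage.blob import BlobServiceClient


def upload_audio(local_path: str, blob_name: str, container: str = "audio-files") -> str:
    """Upload a local audio file to Blob Storage and return its public URL."""
    service = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    )
    blob_client = service.get_blob_client(container=container, blob=blob_name)

    # Upload the raw bytes; overwrite=True makes re-runs idempotent.
    with open(local_path, "rb") as f:
        blob_client.upload_blob(f, overwrite=True)

    # Optional: remove the local copy once the blob is the source of truth.
    os.remove(local_path)
    return blob_client.url
```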
6. Serve the Audio in Your Web App
- The uploaded blob will have a public URL like `https://<account>.blob.core.windows.net/audio-files/output.wav`
- Embed it in HTML using the `<audio controls>` element with a `<source>` tag
- In Flask (this blog's stack), you can create a route that renders the audio player with the blob URL passed as a template variable
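In Flask, a minimal page that renders the player might look like this; the route name, inline template, and hard-coded URL are illustrative (in a real app the URL would come from the upload step):

```python
from flask import Flask, render_template_string

app = Flask(__name__)

# Inline Jinja2 template; in a real app this would live in templates/.
PLAYER_TEMPLATE = """
<audio controls preload="metadata">
  <source src="{{ audio_url }}" type="audio/wav">
  Your browser does not support the audio element.
</audio>
"""


@app.route("/listen")
def listen():
    # Placeholder URL; normally returned by the blob upload step.
    audio_url = "https://<account>.blob.core.windows.net/audio-files/output.wav"
    return render_template_string(PLAYER_TEMPLATE, audio_url=audio_url)
```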
7. End-to-End Flow
Combine everything into a single function that:
1. Takes text input as a parameter
2. Synthesizes it to a local .wav file via Azure Speech
3. Uploads the file to Blob Storage
4. Cleans up the local file
5. Returns the public URL
Key Azure Services Used
| Service | Purpose |
|---|---|
| Azure AI Speech | Converts text to natural-sounding audio using neural voices |
| Azure Blob Storage | Hosts the generated audio files with public URL access |
Cost Estimate
| Service | Free Tier | Pay-as-you-go |
|---|---|---|
| Azure AI Speech | 0.5M chars/month (F0, neural) | ~$15 per 1M chars (S0, neural) |
| Azure Blob Storage | 5 GB free for 12 months | ~$0.02/GB/month |
Important Notes
- Security: Use environment variables for all keys and connection strings. Consider Azure Key Vault for production.
- Voice selection: Experiment with different `speech_synthesis_voice_name` values. Multi-lingual voices (e.g. `en-US-JennyMultilingualNeural`) can handle mixed-language text.
- Output format: The SDK defaults to WAV. For smaller files, configure MP3 output via `speech_config.set_speech_synthesis_output_format()`.
- SSML: For finer control over pronunciation, pauses, pitch, and speed, use SSML markup with `speak_ssml_async()` instead of plain text.
- Batch processing: For large volumes of text, consider splitting it into chunks and processing them in parallel.
Cleanup
Delete the resource group when done to avoid charges:
az group delete --name rg-audio-gen --yes --no-wait
Background & Prerequisites — What You Need to Know Before Writing This Blog
To build a text-to-audio pipeline on Azure, you need to understand speech synthesis, cloud storage, and web serving. Below are the foundational topics.
1. Text-to-Speech (TTS) Fundamentals
Why: The core of this project is converting text to audio — you need to understand how modern TTS works.
- Concatenative TTS — Older approach: record a voice actor speaking thousands of phoneme combinations, then concatenate matching segments at runtime. Sounds robotic at boundaries.
- Neural TTS — Modern approach: deep neural networks (WaveNet, Tacotron, VITS) generate speech waveforms directly from text. Produces natural, human-like speech with proper intonation and prosody.
- Azure Neural Voices — Microsoft's neural TTS offering: 400+ voices across 140+ languages, trained on hours of recorded speech from voice actors. Styles include "cheerful," "sad," "newscast," and "customer service."
- Phonemes & Prosody — Phonemes are the individual units of sound. Prosody covers rhythm, stress, and intonation. Neural TTS models learn prosody from training data but can be controlled via SSML.
- Vocoder — Converts the model's intermediate representation (a mel spectrogram) into an audio waveform. HiFi-GAN and WaveRNN are common vocoders.
2. SSML (Speech Synthesis Markup Language)
Why: For production-quality audio, plain text is not enough — SSML gives fine-grained control.
- What is SSML — An XML-based markup language that controls how text is spoken. Supported by all major TTS engines (Azure, Google, AWS Polly).
- Key tags —
  - `<speak>` — Root element
  - `<voice>` — Select a specific voice
  - `<prosody>` — Control rate, pitch, volume (e.g., `<prosody rate="slow" pitch="+10%">`)
  - `<break>` — Insert pauses (e.g., `<break time="500ms"/>`)
  - `<emphasis>` — Stress certain words
  - `<say-as>` — Control interpretation (dates, numbers, abbreviations: `<say-as interpret-as="date">2026-02-22</say-as>`)
  - `<phoneme>` — Specify exact pronunciation using IPA (International Phonetic Alphabet)
- Multi-voice conversations — Use multiple `<voice>` tags within one SSML document to create dialogue-style audio.
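As a small illustration, a helper (the name is mine) that wraps plain text in minimal SSML; the resulting string would be passed to `speak_ssml_async()`:

```python
def build_ssml(text: str, voice: str = "en-US-JennyNeural",
               rate: str = "medium", pause_ms: int = 500) -> str:
    """Wrap plain text in minimal SSML: a voice, a leading pause, and a prosody rate."""
    return (
        '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        'xml:lang="en-US">'
        f'<voice name="{voice}">'
        f'<break time="{pause_ms}ms"/>'
        f'<prosody rate="{rate}">{text}</prosody>'
        "</voice>"
        "</speak>"
    )
```

Note that real input text should be XML-escaped (e.g. with `xml.sax.saxutils.escape`) before being interpolated, since characters like `&` and `<` would otherwise break the document.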
3. Audio Formats & Encoding
Why: Choosing the right audio format affects file size, quality, and browser compatibility.
- WAV (Waveform Audio) — Uncompressed, lossless. Large files (a 24 kHz, 16-bit mono speech file is roughly 3 MB per minute; CD-quality stereo is closer to 10 MB). Best quality but impractical for web serving.
- MP3 — Compressed, lossy. ~1MB per minute at 128kbps. Universal browser support. Best for web delivery.
- OGG/Opus — Open-source, excellent compression. Slightly better quality than MP3 at same bitrate. Not universally supported (no Safari on iOS).
- Azure SDK output formats — Configurable via SpeechSynthesisOutputFormat enum. Options include Audio16Khz32KBitRateMonoMp3, Riff24Khz16BitMonoPcm, etc. Choose based on quality vs size tradeoff.
- Sample rate — 16kHz is fine for speech. 24kHz or 48kHz for higher quality. Higher sample rate = larger file.
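For example, switching the SDK to compact MP3 output. This is a config fragment assuming the `SpeechConfig` setup from step 3; the key and region placeholders are not real values:

```python
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(subscription="<key>", region="<region>")
# 16 kHz, 32 kbps mono MP3: a reasonable quality/size tradeoff for speech.
speech_config.set_speech_synthesis_output_format(
    speechsdk.SpeechSynthesisOutputFormat.Audio16Khz32KBitRateMonoMp3
)
# Remember to write to an .mp3 file name and serve it with type="audio/mpeg".
```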
4. Azure Blob Storage Fundamentals
Why: Audio files need to be stored and served — Blob Storage is the hosting layer.
- Storage account types — General-purpose v2 (recommended), BlobStorage (legacy). Always use GPv2.
- Blob types — Block blobs (for files like audio), Append blobs (logs), Page blobs (VM disks). Audio files are block blobs.
- Access tiers — Hot (frequent access, higher storage cost, lower access cost), Cool (infrequent, 30-day minimum), Archive (rare, hours to rehydrate). Audio served on a web app should be Hot.
- Access levels — Private (default, requires SAS token or auth), Blob (public read for blobs only), Container (public read for container + blobs). For a simple web app, Blob-level access enables direct URL access.
- SAS (Shared Access Signature) — Time-limited, scoped tokens for accessing private blobs. Better than making blobs public in production.
- CDN integration — Azure CDN can cache blob content at edge locations worldwide, reducing latency for audio playback. Configure a CDN profile pointing to the storage endpoint.
5. Azure AI Speech SDK
Why: The SDK is the primary tool for implementing text-to-audio conversion.
- SpeechConfig — Configuration object holding the subscription key, region, voice name, and output format. Created once, reused across synthesis calls.
- SpeechSynthesizer — The main class. Methods: speak_text_async() (plain text), speak_ssml_async() (SSML input). Can output to a file, audio stream, or speaker.
- AudioOutputConfig — Controls where audio goes: file path, audio stream, or default speaker. For server-side processing, output to a file or memory stream.
- Event-driven architecture — The SDK fires events: synthesis_started, synthesizing (partial audio chunks), synthesis_completed, synthesis_canceled. Use these for progress tracking and error handling.
- Error handling — Check SpeechSynthesisResult.reason. Values: ResultReason.SynthesizingAudioCompleted (success), ResultReason.Canceled (failure). On cancellation, inspect CancellationDetails for error code and message.
- Batch synthesis — For large documents, split into paragraphs, synthesize each, then concatenate audio files (using pydub or ffmpeg).
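The splitting step can be done in plain Python before any SDK call. A hypothetical helper that packs paragraphs into chunks under a character budget:

```python
def chunk_paragraphs(text: str, max_chars: int = 5000) -> list[str]:
    """Split text on blank lines, then greedily pack paragraphs into chunks
    no longer than max_chars (an oversized paragraph becomes its own chunk)."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current = ""
    for para in paragraphs:
        # +2 accounts for the blank-line separator rejoining the paragraphs.
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be synthesized independently (optionally in parallel) and the resulting audio files concatenated with pydub or ffmpeg.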
6. Web App Integration
Why: The generated audio needs to be served and played in a browser.
- HTML5 `<audio>` element — Native browser audio player. Supports MP3, WAV, OGG. Use the `controls` attribute for play/pause/seek UI. Use `preload="metadata"` to load duration without downloading the full file.
- Streaming vs Download — For short clips (<5 min), direct blob URL works. For longer audio, consider Azure Media Services or progressive download with range requests.
- Flask integration — Create a route that accepts text input (POST), calls the synthesis pipeline, uploads to Blob Storage, and returns the audio URL. Render the audio player in a Jinja2 template.
- Responsive design — The audio player should work on mobile. HTML5 `<audio>` is responsive by default, but style it with CSS for consistency.
TODO / Remaining Work
- [ ] Implement the text-to-audio conversion script using Azure Speech SDK
- [ ] Test with different neural voices and compare quality
- [ ] Implement SSML support for fine-grained control
- [ ] Set up Azure Blob Storage and implement upload logic
- [ ] Build the Flask route and audio player page
- [ ] Add batch processing for long documents
- [ ] Document cost analysis with real usage numbers
- [ ] Add architecture diagram (Mermaid) of the full pipeline
- [ ] Add screenshots of the working web app
- [ ] Change status from `workinprogress` to `published`