Using Text-to-Audio Conversion Service to Publish Audio Content to a Web App
In this post, we'll walk through an approach to convert text into audio using Azure AI Speech Service and serve the generated audio files from a web application.
Method Provided by LLM
The following approach was generated with the assistance of an LLM (GitHub Copilot). The code implementation is left to the reader.
Architecture
Text Input → Azure AI Speech Service → .wav/.mp3 file → Azure Blob Storage → Web App
Approach — Step by Step
1. Provision an Azure AI Speech Resource
- Go to the Azure Portal → Create a resource → search Speech → select Speech by Microsoft
- Choose a Resource group (or create one, e.g. `rg-audio-gen`)
- Pick a Region close to you (e.g. East US)
- Select Pricing tier: Free F0 for testing, or Standard S0 for production
- After deployment, navigate to the resource and note down the Key and Region from the Keys and Endpoint blade
- Alternatively, use the Azure CLI: `az cognitiveservices account create` with `--kind SpeechServices`
2. Set Up Your Python Environment
- Install the Azure Speech SDK package: `azure-cognitiveservices-speech`
- Store your Speech key and region as environment variables (`AZURE_SPEECH_KEY`, `AZURE_SPEECH_REGION`); never hard-code secrets. A quick sanity check follows below.
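As a minimal sketch, assuming the two environment variable names above, you can fail fast when the credentials are missing:

```python
import os

# Read the Speech credentials from the environment instead of hard-coding them.
speech_key = os.environ.get("AZURE_SPEECH_KEY")
speech_region = os.environ.get("AZURE_SPEECH_REGION")

if not speech_key or not speech_region:
    raise RuntimeError("Set AZURE_SPEECH_KEY and AZURE_SPEECH_REGION before running.")
```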
3. Write the Text-to-Audio Conversion Logic
- Create a `SpeechConfig` object using your key and region
- Set the voice using `speech_synthesis_voice_name`; Azure supports 400+ neural voices across 140+ languages (see the voice gallery)
- Create an `AudioOutputConfig` pointing to a local `.wav` file
- Instantiate a `SpeechSynthesizer` and call `speak_text_async()` with your input text
- Handle both the success case (`SynthesizingAudioCompleted`) and the cancellation/error case
- The SDK writes the audio directly to the specified output file; see the sketch after this list
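A minimal sketch of this step with the Speech SDK, reusing the environment variables from step 2; the output file name and voice choice are placeholders you can change:

```python
import os

import azure.cognitiveservices.speech as speechsdk


def synthesize_to_wav(text: str, output_path: str = "output.wav") -> str:
    """Convert text to speech and write the audio to a local .wav file."""
    speech_config = speechsdk.SpeechConfig(
        subscription=os.environ["AZURE_SPEECH_KEY"],
        region=os.environ["AZURE_SPEECH_REGION"],
    )
    # Any voice from the gallery works here; Jenny is just an example.
    speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"

    audio_config = speechsdk.audio.AudioOutputConfig(filename=output_path)
    synthesizer = speechsdk.SpeechSynthesizer(
        speech_config=speech_config, audio_config=audio_config
    )

    result = synthesizer.speak_text_async(text).get()
    if result.reason == speechsdk.ResultReason.SynthesizingAudioCompleted:
        return output_path
    if result.reason == speechsdk.ResultReason.Canceled:
        details = result.cancellation_details
        raise RuntimeError(f"Synthesis canceled: {details.reason} {details.error_details}")
    raise RuntimeError(f"Unexpected synthesis result: {result.reason}")
```

Calling `synthesize_to_wav("Hello world")` should leave an `output.wav` in the working directory.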
4. Provision Azure Blob Storage
- Create a Storage Account in the same resource group (e.g. via `az storage account create`)
- Create a Blob Container (e.g. `audio-files`) with public blob-level read access so audio URLs are directly accessible (a programmatic alternative is sketched below)
- Install the `azure-storage-blob` Python package
- Store the storage connection string as an environment variable (`AZURE_STORAGE_CONNECTION_STRING`)
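If you would rather create the container from Python than from the portal or CLI, here is a sketch using `azure-storage-blob`; it assumes the connection-string variable and container name above:

```python
import os

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient.from_connection_string(
    os.environ["AZURE_STORAGE_CONNECTION_STRING"]
)
# "blob" public access makes individual blobs readable via their URL without
# letting the container's contents be listed. Public access must also be
# permitted at the storage-account level, and this call raises an error if
# the container already exists.
service.create_container("audio-files", public_access="blob")
```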
5. Upload the Generated Audio File
- Use `BlobServiceClient.from_connection_string()` to connect
- Get a `BlobClient` for your container and target blob name
- Open the local `.wav` file in binary mode and call `upload_blob()` with `overwrite=True`
- The blob client exposes a `.url` property; this is the public URL of your audio file
- Optionally delete the local file after upload to save disk space (see the sketch below)
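A sketch of the upload step, assuming the `audio-files` container and connection-string variable from step 4; the function name is only illustrative:

```python
import os

from azure.storage.blob import BlobServiceClient


def upload_audio(local_path: str, blob_name: str) -> str:
    """Upload a local audio file to Blob Storage and return its public URL."""
    service = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    )
    blob_client = service.get_blob_client(container="audio-files", blob=blob_name)

    # Upload in binary mode, replacing any existing blob with the same name.
    with open(local_path, "rb") as data:
        blob_client.upload_blob(data, overwrite=True)

    os.remove(local_path)  # optional: free local disk space after upload
    return blob_client.url
```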
6. Serve the Audio in Your Web App
- The uploaded blob will have a public URL like: `https://<account>.blob.core.windows.net/audio-files/output.wav`
- Embed it in HTML using the `<audio controls>` element with a `<source>` tag
- In Flask (this blog's stack), you can create a route that renders the audio player with the blob URL passed as a template variable, as in the sketch below
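A minimal Flask sketch; the route, template, and hard-coded example URL are placeholders, and in practice the URL would come from the upload step:

```python
from flask import Flask, render_template_string

app = Flask(__name__)

# Simple audio player template; the blob URL is injected as a template variable.
PLAYER_TEMPLATE = """
<audio controls>
  <source src="{{ audio_url }}" type="audio/wav">
  Your browser does not support the audio element.
</audio>
"""


@app.route("/listen")
def listen():
    # Placeholder URL; replace with the value returned by the upload step.
    audio_url = "https://<account>.blob.core.windows.net/audio-files/output.wav"
    return render_template_string(PLAYER_TEMPLATE, audio_url=audio_url)
```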
7. End-to-End Flow
Combine everything into a single function (sketched after the list) that:
1. Takes text input as a parameter
2. Synthesizes it to a local .wav file via Azure Speech
3. Uploads the file to Blob Storage
4. Cleans up the local file
5. Returns the public URL
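Putting the earlier sketches together into one function; the function name, the `audio-files` container, and the voice are assumptions carried over from the previous steps:

```python
import os
import uuid

import azure.cognitiveservices.speech as speechsdk
from azure.storage.blob import BlobServiceClient


def publish_text_as_audio(text: str) -> str:
    """Synthesize text to a .wav file, upload it to Blob Storage,
    remove the local copy, and return the public URL."""
    local_path = f"{uuid.uuid4()}.wav"  # unique name to avoid collisions

    # 1. Text -> local .wav via Azure AI Speech
    speech_config = speechsdk.SpeechConfig(
        subscription=os.environ["AZURE_SPEECH_KEY"],
        region=os.environ["AZURE_SPEECH_REGION"],
    )
    speech_config.speech_synthesis_voice_name = "en-US-JennyNeural"
    audio_config = speechsdk.audio.AudioOutputConfig(filename=local_path)
    synthesizer = speechsdk.SpeechSynthesizer(
        speech_config=speech_config, audio_config=audio_config
    )
    result = synthesizer.speak_text_async(text).get()
    if result.reason != speechsdk.ResultReason.SynthesizingAudioCompleted:
        raise RuntimeError(f"Synthesis failed: {result.reason}")

    # 2. Upload the file to Blob Storage
    service = BlobServiceClient.from_connection_string(
        os.environ["AZURE_STORAGE_CONNECTION_STRING"]
    )
    blob_client = service.get_blob_client(container="audio-files", blob=local_path)
    with open(local_path, "rb") as data:
        blob_client.upload_blob(data, overwrite=True)

    # 3. Clean up the local file and return the public URL
    os.remove(local_path)
    return blob_client.url
```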
Key Azure Services Used
| Service | Purpose |
|---|---|
| Azure AI Speech | Converts text to natural-sounding audio using neural voices |
| Azure Blob Storage | Hosts the generated audio files with public URL access |
Cost Estimate
| Service | Free Tier | Pay-as-you-go |
|---|---|---|
| Azure AI Speech | 500K chars/month (F0) | ~$1 per 1M chars (S0) |
| Azure Blob Storage | 5 GB free for 12 months | ~$0.02/GB/month |
Important Notes
- Security: Use environment variables for all keys and connection strings. Consider Azure Key Vault for production.
- Voice selection: Experiment with different `speech_synthesis_voice_name` values. Multi-lingual voices (e.g. `en-US-JennyMultilingualNeural`) can handle mixed-language text.
- Output format: The SDK defaults to WAV. For smaller files, configure MP3 output via `speech_config.set_speech_synthesis_output_format()` (see the snippet after this list).
- SSML: For finer control over pronunciation, pauses, pitch, and speed, use SSML markup instead of plain text with `speak_ssml_async()`.
- Batch processing: For large volumes of text, consider splitting into chunks and processing in parallel.
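A brief sketch of the MP3 and SSML options above; the chosen output format, voice, and SSML content are only examples:

```python
import os

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["AZURE_SPEECH_KEY"],
    region=os.environ["AZURE_SPEECH_REGION"],
)
# Smaller files: switch the output from the default WAV format to MP3.
speech_config.set_speech_synthesis_output_format(
    speechsdk.SpeechSynthesisOutputFormat.Audio16Khz32KBitRateMonoMp3
)

audio_config = speechsdk.audio.AudioOutputConfig(filename="output.mp3")
synthesizer = speechsdk.SpeechSynthesizer(
    speech_config=speech_config, audio_config=audio_config
)

# SSML gives finer control over voice, pauses, pitch, and rate than plain text.
ssml = """
<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='en-US'>
  <voice name='en-US-JennyNeural'>
    Hello from Azure AI Speech.
    <break time='500ms'/>
    <prosody rate='-10%' pitch='+5%'>This part is slower and slightly higher.</prosody>
  </voice>
</speak>
"""
result = synthesizer.speak_ssml_async(ssml).get()
```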
Cleanup
Delete the resource group when done to avoid charges:
az group delete --name rg-audio-gen --yes --no-wait