Deploying a Python Script as an API on Azure
Goal: Take a local PDF metadata extraction script and deploy it as a production-ready REST API on Azure.
What This Blog Covers
flowchart LR
A["Local Python Script"] --> B["REST API<br/>(FastAPI/Flask)"]
B --> C["Docker Container"]
C --> D["Azure Deployment"]
D --> E["Auth + Monitoring"]
- Sample local script to extract content from PDF
- Available options to deploy the script as an API
- Designing the deployment architecture
- Implementation and testing
Script resources: GitHub Repo
Background & Prerequisites
1. PDF Content Extraction: The Script
| Library | Strengths | Best For |
|---|---|---|
| PyMuPDF (fitz) | Fastest, handles complex layouts | General text extraction |
| pdfplumber | Excellent table extraction | Tabular data |
| pypdf (successor to the deprecated PyPDF2) | Lightweight, basic extraction | Simple PDFs |
| Tesseract OCR | Open-source OCR for scanned PDFs | Image-based PDFs |
| Azure Doc Intelligence | Cloud-based, layout analysis | Enterprise extraction |
Key extraction targets:
- Text: Full-text extraction from each page
- Tables: Structured table data
- Metadata: Title, author, creation date, page count (PdfReader(file).metadata)
- OCR fallback: Detect whether the PDF has extractable text; if not, route it to OCR
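The extraction targets above can be sketched roughly as follows. This is a minimal sketch, not the script from the repo: it assumes the pypdf package (which provides the PdfReader class mentioned in the list), and the names ExtractionResult, needs_ocr, and extract_pdf, as well as the 20-characters-per-page threshold, are hypothetical choices made for illustration. Table extraction is left out.

```python
from dataclasses import dataclass


@dataclass
class ExtractionResult:
    """Container for the extraction targets (hypothetical shape)."""
    text: str
    metadata: dict
    page_count: int
    needs_ocr: bool


def needs_ocr(page_texts, min_chars_per_page=20):
    """Return True when the pages carry too little text to be a text-based PDF.

    The threshold is an illustrative assumption; tune it on real documents.
    """
    if not page_texts:
        return True
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars_per_page


def extract_pdf(path):
    """Extract text and metadata with pypdf (imported lazily)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    # extract_text() can return an empty string for image-only pages.
    page_texts = [(page.extract_text() or "") for page in reader.pages]
    info = reader.metadata  # None when the PDF has no info dictionary
    metadata = {
        "title": info.title if info else None,
        "author": info.author if info else None,
    }
    return ExtractionResult(
        text="\n".join(page_texts),
        metadata=metadata,
        # Page count comes from the page list, not from the metadata object.
        page_count=len(reader.pages),
        needs_ocr=needs_ocr(page_texts),
    )
```

When needs_ocr is True, the caller would hand the file to an OCR path (Tesseract or Azure Document Intelligence from the table above) instead of returning the near-empty text.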
2. API Design
sequenceDiagram
participant Client
participant API as FastAPI Server
participant Extractor as PDF Extractor
Client->>API: POST /api/extract (multipart/form-data)
API->>API: Validate file type & size
API->>Extractor: Extract text + metadata
Extractor-->>API: Structured result
API-->>Client: 200 JSON response {text, metadata, pages, time_ms}
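From the client side, the first arrow in the sequence diagram could look like the sketch below. It uses only the standard library to build the multipart/form-data body. The form field name "file", the helper names, and the base URL passed in are assumptions for illustration; the real field name depends on how the endpoint is declared.

```python
import json
import urllib.request
import uuid


def build_multipart(field_name, filename, data, content_type="application/pdf"):
    """Encode one file as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; '
        f'filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def extract_remote(base_url, pdf_path, timeout=30):
    """POST a PDF to /api/extract and return the decoded JSON response."""
    with open(pdf_path, "rb") as handle:
        body, header = build_multipart("file", "upload.pdf", handle.read())
    request = urllib.request.Request(
        f"{base_url}/api/extract",
        data=body,
        headers={"Content-Type": header},
        method="POST",
    )
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call such as extract_remote("http://localhost:8000", "sample.pdf") would then return the JSON payload as a dict, assuming the API is running locally on port 8000.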
| Aspect | Design Decision |
|---|---|
| Endpoint | POST /api/extract: upload a PDF, get extracted data |
| Input | multipart/form-data file upload (or URL download) |
| Output | JSON: {text, metadata, page_count, tables, processing_time_ms} |
| Validation | File type check, size limits, malware considerations |
| Status Codes | 200 success, 400 bad request, 413 too large, 500 error |
| Docs | OpenAPI/Swagger auto-generated (built into FastAPI) |
| Auth | API key in header or Azure Entra ID token |
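The validation and status-code rows above could be wired together along these lines. This is a sketch under stated assumptions: the 10 MiB limit, the validate_upload and create_app names, and the placeholder response values are all hypothetical, and the FastAPI part needs fastapi, uvicorn, and python-multipart installed. Checking the %PDF- signature at byte 0 is a simplification, and reading the whole upload before checking its size is acceptable only for small limits.

```python
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # assumed 10 MiB limit
PDF_MAGIC = b"%PDF-"


def validate_upload(filename, data, max_bytes=MAX_UPLOAD_BYTES):
    """Return (status_code, message); 200 means extraction may proceed."""
    if not data:
        return 400, "Empty upload"
    if len(data) > max_bytes:
        return 413, f"File exceeds {max_bytes} bytes"
    if not filename.lower().endswith(".pdf"):
        return 400, "Only .pdf files are accepted"
    # Check the file signature so a renamed non-PDF is rejected too.
    if not data.startswith(PDF_MAGIC):
        return 400, "File content is not a PDF"
    return 200, "OK"


def create_app():
    """Build the FastAPI app; third-party imports are kept local."""
    import time

    from fastapi import FastAPI, File, HTTPException, UploadFile

    app = FastAPI(title="PDF Extraction API")

    @app.get("/health")
    def health():
        return {"status": "ok"}

    @app.post("/api/extract")
    async def extract(file: UploadFile = File(...)):
        started = time.perf_counter()
        data = await file.read()
        status, message = validate_upload(file.filename or "", data)
        if status != 200:
            raise HTTPException(status_code=status, detail=message)
        # Placeholder: call the extraction script here and fill in the fields.
        elapsed_ms = round((time.perf_counter() - started) * 1000, 1)
        return {
            "text": "",
            "metadata": {},
            "page_count": 0,
            "tables": [],
            "processing_time_ms": elapsed_ms,
        }

    return app
```

Putting app = create_app() in main.py would give uvicorn a main:app target, and FastAPI would serve the generated Swagger UI at /docs. Unhandled exceptions inside the handler surface as 500 responses.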
3. Framework Choice: Flask vs FastAPI
| Feature | Flask | FastAPI |
|---|---|---|
| Style | WSGI (sync) | ASGI (async) |
| Auto docs | Extension needed | Built-in Swagger UI |
| Validation | Manual | Pydantic models |
| Performance | Good | Excellent |
| Learning curve | Minimal | Minimal |
| Verdict | Use if integrating into an existing Flask app | Recommended for new APIs |
4. Containerization with Docker
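# Example multi-stage Dockerfile
# Stage 1: install Python dependencies
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Stage 2: runtime image
FROM python:3.12-slim
WORKDIR /app
# Copy installed packages and their console scripts (the uvicorn executable lives in /usr/local/bin)
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin
COPY . .
EXPOSE 8000
# python:3.12-slim does not ship curl, so probe /health with the standard library instead
HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]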
Tips: Use multi-stage builds to reduce image size. Install system dependencies such as poppler-utils (for pdftotext). Pass secrets via environment variables; never hardcode them.
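On the application side, reading secrets from the environment could look like the sketch below. The variable names API_KEY and MAX_UPLOAD_MB, the Settings class, and both helper functions are hypothetical; the point is only that the container receives its secrets at run time and fails fast when one is missing.

```python
import hmac
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    api_key: str
    max_upload_mb: int


def load_settings(env=None):
    """Read configuration from environment variables.

    Raises RuntimeError at startup if the API key is missing, so a
    misconfigured container never serves traffic without auth.
    """
    env = os.environ if env is None else env
    api_key = env.get("API_KEY", "")
    if not api_key:
        raise RuntimeError("API_KEY environment variable is not set")
    return Settings(
        api_key=api_key,
        max_upload_mb=int(env.get("MAX_UPLOAD_MB", "10")),
    )


def is_valid_api_key(presented, settings):
    """Compare keys in constant time to avoid leaking information via timing."""
    return hmac.compare_digest(
        presented.encode("utf-8"), settings.api_key.encode("utf-8")
    )
```

Locally the value can be supplied with docker run -e API_KEY=... when starting the container; in Azure it would come from the platform's secret or environment configuration rather than from the image.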
5. Azure Deployment Options
graph TD
Script["Python Script"] --> Container["Docker Image"]
Container --> ACR["Azure Container Registry"]
ACR --> ACA["Container Apps<br/>(Recommended)"]
ACR --> AppService["App Service"]
ACR --> Functions["Azure Functions"]
ACR --> ACI["Container Instances"]
ACR --> AKS["AKS"]
| Option | Scale to Zero | Complexity | Best For | Monthly Cost |
|---|---|---|---|---|
| Container Apps (recommended) | Yes | Low | Variable-traffic APIs | Pay per use |
| App Service | No | Low | Steady-traffic APIs | ~$13+ (B1) |
| Azure Functions | Yes | Low | Infrequent calls | Free tier available |
| Container Instances | No | Minimal | Testing/one-off jobs | Pay per second |
| AKS | No | High | Multi-service architectures | $$$$ |
Recommendation: Azure Container Apps offers the best balance of simplicity, cost, and scale-to-zero capability.
6. CI/CD Pipeline
flowchart LR
Push["Git Push"] --> Build["GitHub Actions"]
Build --> Image["Build Docker Image"]
Image --> ACR["Push to ACR"]
ACR --> Deploy["Deploy to<br/>Container Apps"]
- Use GitHub Actions for automated build → push → deploy
- Store Azure credentials and API keys in GitHub Secrets
- Maintain separate staging and production environments
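One way to gate the promotion from staging to production is a small smoke check that the pipeline runs after each deploy. The sketch below polls the /health route used by the Dockerfile's HEALTHCHECK; the function name, retry counts, and delays are assumptions, and it relies only on the standard library so it can run on a plain GitHub Actions runner.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(base_url, attempts=10, delay_seconds=3.0, timeout=5):
    """Poll GET {base_url}/health until it returns HTTP 200.

    Returns True on the first 200 response, or False once all attempts
    are used up.
    """
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(
                f"{base_url}/health", timeout=timeout
            ) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Not reachable yet, e.g. the new revision is still starting
            # or the endpoint answered with an error status.
            pass
        if attempt < attempts:
            time.sleep(delay_seconds)
    return False
```

A pipeline step could call this with the staging URL and exit with a non-zero code when it returns False, which stops the workflow before the production deploy.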
TODO: Remaining Work
| # | Task | Priority |
|---|---|---|
| 1 | Write PDF extraction script (PyMuPDF: metadata + text + tables) | High |
| 2 | Wrap in FastAPI with endpoints, validation, error handling | High |
| 3 | Add OpenAPI/Swagger documentation | High |
| 4 | Write Dockerfile and test locally | High |
| 5 | Push image to Azure Container Registry | Medium |
| 6 | Deploy to Azure Container Apps | Medium |
| 7 | Set up GitHub Actions CI/CD pipeline | Medium |
| 8 | Add authentication (API key or Azure Entra ID) | Medium |
| 9 | Load test with sample PDFs and document performance | Low |
| 10 | Create full architecture diagram with all components | Low |