Deploying a Python Script as an API on Azure
Goal: Take a local PDF metadata extraction script and deploy it as a production-ready REST API on Azure.
What This Blog Covers
flowchart LR
A["Local Python Script"] --> B["REST API (FastAPI/Flask)"]
B --> C["Docker Container"]
C --> D["Azure Deployment"]
D --> E["Auth + Monitoring"]
- Sample local script to extract content from PDF
- Available options to deploy the script as an API
- Designing the deployment architecture
- Implementation and testing
Script resources: GitHub Repo
Background & Prerequisites
1. PDF Content Extraction: The Script
| Library | Strengths | Best For |
|---|---|---|
| PyMuPDF (fitz) | Fastest, handles complex layouts | General text extraction |
| pdfplumber | Excellent table extraction | Tabular data |
| PyPDF2 (now maintained as pypdf) | Lightweight, basic extraction | Simple PDFs |
| Tesseract OCR | Open-source OCR for scanned PDFs | Image-based PDFs |
| Azure Doc Intelligence | Cloud-based, layout analysis | Enterprise extraction |
Key extraction targets:
- Text: Full-text extraction from each page
- Tables: Structured table data
- Metadata: Title, author, creation date, page count (PdfReader(file).metadata)
- OCR fallback: Detect whether the PDF has extractable text; if not, route it to OCR
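The OCR fallback decision can be sketched as a small heuristic over the text already extracted from each page. This is a minimal sketch, not a tuned detector: the function name `needs_ocr` and the threshold `min_chars_per_page` are illustrative assumptions, and the page texts would come from whichever extraction library you choose.

```python
from typing import Iterable


def needs_ocr(page_texts: Iterable[str], min_chars_per_page: int = 20) -> bool:
    """Return True when the pages carry too little embedded text to be useful.

    min_chars_per_page is an illustrative threshold, not a tuned value.
    """
    texts = list(page_texts)
    if not texts:
        return True  # nothing extractable at all
    # Count non-whitespace characters so blank or whitespace-only pages score zero.
    total_chars = sum(len("".join(t.split())) for t in texts)
    return total_chars / len(texts) < min_chars_per_page


print(needs_ocr(["", "  \n", ""]))  # True
print(needs_ocr(["Invoice 2024-001 issued to Contoso Ltd for consulting services."]))  # False
```

An average over all pages is a deliberate simplification; a mixed document (some scanned pages, some digital) may need a per-page decision instead.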
2. API Design
sequenceDiagram
participant Client
participant API as FastAPI Server
participant Extractor as PDF Extractor
Client->>API: POST /api/extract (multipart/form-data)
API->>API: Validate file type & size
API->>Extractor: Extract text + metadata
Extractor-->>API: Structured result
API-->>Client: 200 JSON response {text, metadata, pages, time_ms}
| Aspect | Design Decision |
|---|---|
| Endpoint | POST /api/extract: upload PDF, get extracted data |
| Input | multipart/form-data file upload (or URL download) |
| Output | JSON: {text, metadata, page_count, tables, processing_time_ms} |
| Validation | File type check, size limits, malware considerations |
| Status Codes | 200 success, 400 bad request, 413 too large, 500 error |
| Docs | OpenAPI/Swagger auto-generated (built into FastAPI) |
| Auth | API key in header or Azure Entra ID token |
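The validation and status-code rows above can be sketched as one pure function that maps an upload to a status code before any extraction runs. The function name `validate_upload`, the 25 MB limit, and the magic-byte check are assumptions made for illustration; malware scanning is out of scope here.

```python
from typing import Optional, Tuple

MAX_BYTES = 25 * 1024 * 1024  # assumed limit (25 MB); adjust to your workload
PDF_MAGIC = b"%PDF-"          # a well-formed PDF begins with this header


def validate_upload(content_type: Optional[str], data: bytes,
                    max_bytes: int = MAX_BYTES) -> Tuple[int, str]:
    """Map an uploaded payload to an HTTP status code and a short message."""
    if content_type not in ("application/pdf", "application/octet-stream"):
        return 400, "Only PDF uploads are accepted"
    if len(data) > max_bytes:
        return 413, "File too large"
    # The declared content type is client-controlled, so also check the bytes.
    if not data.startswith(PDF_MAGIC):
        return 400, "File does not look like a PDF"
    return 200, "ok"


print(validate_upload("application/pdf", b"%PDF-1.7\n"))  # (200, 'ok')
print(validate_upload("image/png", b"\x89PNG"))           # (400, 'Only PDF uploads are accepted')
```

Keeping validation free of framework types makes it easy to unit test and to reuse if the upload path later changes to a URL download.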
3. Framework Choice: Flask vs FastAPI
| Feature | Flask | FastAPI |
|---|---|---|
| Style | WSGI (sync) | ASGI (async) |
| Auto docs | Extension needed | Built-in Swagger UI |
| Validation | Manual | Pydantic models |
| Performance | Good | Excellent |
| Learning curve | Minimal | Minimal |
| Verdict | Use if integrating into an existing Flask app | Recommended for new APIs |
4. Containerization with Docker
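# Example multi-stage Dockerfile
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

FROM python:3.12-slim
WORKDIR /app
# Copy the installed packages and their console scripts (the uvicorn launcher lives in /usr/local/bin)
COPY --from=builder /usr/local/lib/python3.12/site-packages /usr/local/lib/python3.12/site-packages
COPY --from=builder /usr/local/bin /usr/local/bin
COPY . .
EXPOSE 8000
# python:3.12-slim does not ship curl, so probe /health with the standard library instead
HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')" || exit 1
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]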
Tips: Use multi-stage builds to reduce image size. Install system deps like poppler-utils for pdftotext. Pass secrets via environment variables; never hardcode them.
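On the application side, reading secrets from the environment might look like the sketch below. `APP_VERSION` appears in the reference service; `API_KEY` and `MAX_UPLOAD_MB` are hypothetical variable names chosen for this example, as is the `Settings` class.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    api_key: str
    max_upload_mb: int
    app_version: str


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables, failing fast if the secret is missing."""
    api_key = env.get("API_KEY", "")  # hypothetical variable name
    if not api_key:
        raise RuntimeError("API_KEY environment variable is not set")
    return Settings(
        api_key=api_key,
        max_upload_mb=int(env.get("MAX_UPLOAD_MB", "25")),  # hypothetical variable name
        app_version=env.get("APP_VERSION", "dev"),
    )


settings = load_settings({"API_KEY": "example-only", "MAX_UPLOAD_MB": "10"})
print(settings.max_upload_mb, settings.app_version)  # 10 dev
```

Failing at startup when the key is absent surfaces a misconfigured container immediately instead of on the first authenticated request.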
5. Azure Deployment Options
graph TD
Script["Python Script"] --> Container["Docker Image"]
Container --> ACR["Azure Container Registry"]
ACR --> ACA["Container Apps (Recommended)"]
ACR --> AppService["App Service"]
ACR --> Functions["Azure Functions"]
ACR --> ACI["Container Instances"]
ACR --> AKS["AKS"]
| Option | Scale to Zero | Complexity | Best For | Monthly Cost |
|---|---|---|---|---|
| Container Apps (recommended) | Yes | Low | Variable-traffic APIs | Pay per use |
| App Service | No | Low | Steady-traffic APIs | ~$13+ (B1) |
| Azure Functions | Yes | Low | Infrequent calls | Free tier available |
| Container Instances | No | Minimal | Testing/one-off jobs | Pay per second |
| AKS | No | High | Multi-service architectures | |
Recommendation: Azure Container Apps offers the best balance of simplicity, cost, and scale-to-zero capability.
6. CI/CD Pipeline
flowchart LR
Push["Git Push"] --> Build["GitHub Actions"]
Build --> Image["Build Docker Image"]
Image --> ACR["Push to ACR"]
ACR --> Deploy["Deploy to Container Apps"]
- Use GitHub Actions for automated build → push → deploy
- Store Azure credentials and API keys in GitHub Secrets
- Maintain separate staging and production environments
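One way to gate promotion from staging to production is a smoke check that the pipeline runs after each deploy. The sketch below is an assumption about how such a step could look, not part of the pipeline described above: it relies on the /health endpoint returning {"status": "ok"}, and the names `http_get` and `health_ok` are illustrative.

```python
import json
import urllib.error
import urllib.request
from typing import Callable, Tuple


def http_get(url: str, timeout: float) -> Tuple[int, bytes]:
    """Fetch a URL with the standard library and return (status, body)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except urllib.error.HTTPError as err:
        return err.code, err.read()  # 4xx/5xx responses still carry a status and body


def health_ok(base_url: str,
              get: Callable[[str, float], Tuple[int, bytes]] = http_get,
              timeout: float = 5.0) -> bool:
    """Return True if GET <base_url>/health answers 200 with {"status": "ok"}."""
    url = base_url.rstrip("/") + "/health"
    try:
        status, body = get(url, timeout)
        payload = json.loads(body.decode("utf-8"))
    except (OSError, ValueError):
        return False  # unreachable host, timeout, or non-JSON body
    return status == 200 and isinstance(payload, dict) and payload.get("status") == "ok"
```

A workflow step could call `health_ok` with the staging URL and exit non-zero on False so that the production deploy job never starts. The `get` parameter exists only so the check can be tested without a live server.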
TODO: Remaining Work
| # | Task | Priority |
|---|---|---|
| 1 | Write PDF extraction script (PyMuPDF: metadata + text + tables) | High |
| 2 | Wrap in FastAPI with endpoints, validation, error handling | High |
| 3 | Add OpenAPI/Swagger documentation | High |
| 4 | Write Dockerfile and test locally | High |
| 5 | Push image to Azure Container Registry | Medium |
| 6 | Deploy to Azure Container Apps | Medium |
| 7 | Set up GitHub Actions CI/CD pipeline | Medium |
| 8 | Add authentication (API key or Azure Entra ID) | Medium |
| 9 | Load test with sample PDFs and document performance | Low |
| 10 | Create full architecture diagram with all components | Low |
Reference Implementation: FastAPI Service
A minimal, production-leaning implementation of the PDF extraction API:
# main.py
import os
import time

import fitz  # PyMuPDF
from fastapi import FastAPI, File, HTTPException, UploadFile
from fastapi.responses import JSONResponse

MAX_BYTES = 25 * 1024 * 1024  # 25 MB
app = FastAPI(title="PDF Extract API", version="1.0.0")


@app.get("/health")
def health():
    return {"status": "ok", "version": os.getenv("APP_VERSION", "dev")}


@app.post("/api/extract")
async def extract(file: UploadFile = File(...)):
    if file.content_type not in ("application/pdf", "application/octet-stream"):
        raise HTTPException(status_code=400, detail="Only PDF accepted")
    # Read one byte past the limit: enough to detect an oversized upload
    # without loading all of it into memory.
    data = await file.read(MAX_BYTES + 1)
    if len(data) > MAX_BYTES:
        raise HTTPException(status_code=413, detail="File too large")
    t0 = time.perf_counter()
    try:
        doc = fitz.open(stream=data, filetype="pdf")
    except Exception as e:
        raise HTTPException(status_code=400, detail=f"Invalid PDF: {e}")
    try:
        pages = [p.get_text() for p in doc]
        meta = doc.metadata or {}
        result = {
            "text": "\n\n".join(pages),
            "page_count": doc.page_count,
            "metadata": {
                "title": meta.get("title"),
                "author": meta.get("author"),
                "creation_date": meta.get("creationDate"),
            },
            "processing_time_ms": int((time.perf_counter() - t0) * 1000),
        }
    finally:
        doc.close()  # release the document even if extraction raises
    return JSONResponse(result)
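The service above has no authentication yet. A minimal sketch of the API-key option, kept independent of the framework, is shown below; the header name `x-api-key` and the function `is_authorized` are assumptions for illustration, and the expected key would come from an environment variable rather than source code.

```python
import hmac
from typing import Mapping, Optional


def is_authorized(headers: Mapping[str, str], expected_key: str,
                  header_name: str = "x-api-key") -> bool:
    """Check the API key header using a constant-time comparison."""
    # HTTP header names are case-insensitive, so normalise before comparing.
    supplied: Optional[str] = None
    for name, value in headers.items():
        if name.lower() == header_name:
            supplied = value
            break
    if not supplied or not expected_key:
        return False  # a missing header or an unset server key never authorizes
    return hmac.compare_digest(supplied.encode("utf-8"), expected_key.encode("utf-8"))


print(is_authorized({"X-API-Key": "s3cret"}, "s3cret"))  # True
print(is_authorized({"X-API-Key": "wrong"}, "s3cret"))   # False
```

`hmac.compare_digest` avoids leaking how many leading characters matched through response timing. In FastAPI, a check like this would typically sit in a dependency that raises an HTTPException with status 401 when it returns False.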
Deploying to Azure Container Apps
# 1. Build and push (assumes resource group rg-pdf and the registry already exist)
ACR=myblogacr
az acr build -r "$ACR" -t pdf-extract:v1 .
# 2. Create environment and deploy
az containerapp env create -g rg-pdf -n cae-pdf -l eastus
az containerapp create \
  -g rg-pdf -n ca-pdf-extract \
  --environment cae-pdf \
  --image "$ACR.azurecr.io/pdf-extract:v1" \
  --registry-server "$ACR.azurecr.io" \
  --registry-identity system \
  --target-port 8000 --ingress external \
  --min-replicas 0 --max-replicas 5 \
  --cpu 0.5 --memory 1Gi
Testing
curl -X POST -F "file=@sample.pdf" https://<app-url>/api/extract | jq '.page_count'
When all TODO items above are ticked and the /api/extract endpoint handles 100 concurrent uploads without errors, flip status: workinprogress → status: published.
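The 100-concurrent-uploads criterion can be checked with a small harness like the sketch below. It assumes a caller-supplied `send` function (hypothetical) that POSTs a sample PDF to /api/extract and returns the HTTP status code; the harness only handles fan-out and tallying.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def run_concurrent(send: Callable[[int], int], total: int = 100,
                   workers: int = 100) -> Counter:
    """Call send(i) for i in range(total) in parallel and tally the status codes."""
    def safe_send(i: int) -> int:
        try:
            return send(i)
        except Exception:
            return -1  # tally transport failures separately from HTTP statuses

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return Counter(pool.map(safe_send, range(total)))


def passed(tally: Counter, total: int = 100) -> bool:
    """The run passes only if every request returned HTTP 200."""
    return tally.get(200, 0) == total


# Stub sender standing in for a real upload; replace with an actual HTTP call.
tally = run_concurrent(lambda i: 200)
print(passed(tally))  # True
```

With `--min-replicas 0`, the first requests of a run may include cold-start latency, so it is worth recording timings alongside the status tally before treating a run as representative.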