cloud · beginner · 5 min · 2025-10-04

Deploying scripts as an API in Azure

Turning a local Python script into an API and deploying it.

Deploying a Python Script as an API on Azure

Goal: Take a local PDF metadata extraction script and deploy it as a production-ready REST API on Azure.


📋 What This Blog Covers

flowchart LR
    A["📄 Local Python Script"] --> B["🌐 REST API<br/>(FastAPI/Flask)"]
    B --> C["🐳 Docker Container"]
    C --> D["☁️ Azure Deployment"]
    D --> E["🔒 Auth + Monitoring"]

  1. Sample local script to extract content from PDF
  2. Available options to deploy the script as an API
  3. Designing the deployment architecture
  4. Implementation and testing

Script resources: GitHub Repo


🔧 Background & Prerequisites

1. PDF Content Extraction: The Script

| Library | Strengths | Best For |
|---|---|---|
| PyMuPDF (fitz) | Fastest, handles complex layouts | General text extraction |
| pdfplumber | Excellent table extraction | Tabular data |
| PyPDF2 | Lightweight, basic extraction | Simple PDFs |
| Tesseract OCR | Open-source OCR for scanned PDFs | Image-based PDFs |
| Azure Doc Intelligence | Cloud-based, layout analysis | Enterprise extraction |

Key extraction targets:

  • πŸ“ Text β€” Full-text extraction from each page
  • πŸ“Š Tables β€” Structured table data
  • 🏷️ Metadata β€” Title, author, creation date, page count (PdfReader(file).metadata)
  • πŸ” OCR fallback β€” Detect if PDF has extractable text; if not, route to OCR

2. API Design

sequenceDiagram
    participant Client
    participant API as FastAPI Server
    participant Extractor as PDF Extractor

    Client->>API: POST /api/extract<br/>(multipart/form-data)
    API->>API: Validate file type & size
    API->>Extractor: Extract text + metadata
    Extractor-->>API: Structured result
    API-->>Client: 200 JSON response<br/>{text, metadata, page_count, tables, processing_time_ms}

| Aspect | Design Decision |
|---|---|
| Endpoint | POST /api/extract: upload a PDF, get extracted data |
| Input | multipart/form-data file upload (or URL download) |
| Output | JSON: {text, metadata, page_count, tables, processing_time_ms} |
| Validation | File type check, size limits, malware considerations |
| Status Codes | 200 success, 400 bad request, 413 too large, 500 error |
| Docs | OpenAPI/Swagger auto-generated (built into FastAPI) |
| Auth | API key in header or Azure Entra ID token |
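
The validation and status-code rows can be sketched as one framework-independent function. This is a hedged sketch: `Rejection`, `validate_upload`, and the 25 MB default are hypothetical, and the strict `%PDF-` prefix check is a simplifying assumption (some readers tolerate a few bytes before the header).

```python
from dataclasses import dataclass
from typing import Optional

MAX_BYTES = 25 * 1024 * 1024  # assumed size limit (25 MB)
ALLOWED_TYPES = {"application/pdf", "application/octet-stream"}


@dataclass(frozen=True)
class Rejection:
    """The HTTP error the API would send back for a rejected upload."""
    status_code: int
    detail: str


def validate_upload(
    content_type: Optional[str],
    data: bytes,
    max_bytes: int = MAX_BYTES,
) -> Optional[Rejection]:
    """Return None when the upload looks acceptable, otherwise the rejection."""
    if content_type not in ALLOWED_TYPES:
        return Rejection(400, "Only PDF uploads are accepted")
    if len(data) > max_bytes:
        return Rejection(413, "File too large")
    # The declared content type is client-controlled, so also check the magic bytes.
    if not data.startswith(b"%PDF-"):
        return Rejection(400, "File does not start with a PDF header")
    return None
```

Keeping the checks outside the route handler makes them easy to unit test; the handler only has to turn a `Rejection` into the matching HTTP error.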

3. Framework Choice: Flask vs FastAPI

| Feature | Flask | FastAPI |
|---|---|---|
| Style | WSGI (sync) | ASGI (async) |
| Auto docs | Extension needed | Built-in Swagger UI |
| Validation | Manual | Pydantic models |
| Performance | Good | Excellent |
| Learning curve | Minimal | Minimal |
| Verdict | ✅ Use if integrating into existing Flask app | ✅ Recommended for new APIs |
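
The "Style" row is easier to see at the protocol level. The sketch below uses only the standard library and shows the two callable shapes that Flask (WSGI) and FastAPI (ASGI) build on; the helper names `call_wsgi` and `call_asgi` are hypothetical stand-ins for what a real server such as gunicorn or uvicorn does.

```python
import asyncio


def wsgi_app(environ, start_response):
    """WSGI: a plain function; one call handles one request synchronously."""
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"hello from WSGI"]


async def asgi_app(scope, receive, send):
    """ASGI: a coroutine; it can await I/O while other requests make progress."""
    await send({
        "type": "http.response.start",
        "status": 200,
        "headers": [(b"content-type", b"text/plain")],
    })
    await send({"type": "http.response.body", "body": b"hello from ASGI"})


def call_wsgi(app):
    """Drive a WSGI app once and return (status line, body)."""
    captured = {}

    def start_response(status, headers):
        captured["status"] = status

    body = b"".join(app({"REQUEST_METHOD": "GET", "PATH_INFO": "/"}, start_response))
    return captured["status"], body


def call_asgi(app):
    """Drive an ASGI app once and return (status code, body)."""
    messages = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        messages.append(message)

    asyncio.run(app({"type": "http", "method": "GET", "path": "/"}, receive, send))
    return messages[0]["status"], messages[1]["body"]
```

Both apps answer the same request; the difference is that the ASGI version can yield control at every `await`, which matters when a handler waits on uploads or downstream calls.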

4. Containerization with Docker

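# Example multi-stage Dockerfile
FROM python:3.12-slim AS builder
WORKDIR /app
COPY requirements.txt .
# Install into an isolated prefix so libraries and console scripts can be copied together
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

FROM python:3.12-slim
WORKDIR /app
# /install contains site-packages plus bin/ entry points such as uvicorn
COPY --from=builder /install /usr/local
COPY . .
EXPOSE 8000
# python:3.12-slim does not ship curl, so probe /health with the standard library
HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health', timeout=3)" || exit 1
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]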

💡 Tips: Use multi-stage builds to reduce image size. Install system deps like poppler-utils for pdftotext. Pass secrets via environment variables; never hardcode them.
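
On the application side, the "never hardcode" tip might look like the sketch below. `APP_VERSION` matches the variable read by the `/health` endpoint; `API_KEY`, `MAX_UPLOAD_MB`, and the `Settings` class are hypothetical names introduced only for this example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    api_key: str
    app_version: str
    max_upload_mb: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from the environment; fail fast if the secret is missing."""
    api_key = env.get("API_KEY", "")  # hypothetical variable name
    if not api_key:
        raise RuntimeError("API_KEY is not set; refusing to start without it")
    return Settings(
        api_key=api_key,
        app_version=env.get("APP_VERSION", "dev"),
        max_upload_mb=int(env.get("MAX_UPLOAD_MB", "25")),  # hypothetical variable
    )
```

Failing at startup, rather than on the first request, surfaces a misconfigured deployment as a crashed revision instead of a running service that rejects every call.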


5. Azure Deployment Options

graph TD
    Script["🐍 Python Script"] --> Container["🐳 Docker Image"]
    Container --> ACR["📦 Azure Container Registry"]
    ACR --> ACA["⭐ Container Apps<br/>(Recommended)"]
    ACR --> AppService["🌐 App Service"]
    ACR --> Functions["⚡ Azure Functions"]
    ACR --> ACI["📦 Container Instances"]
    ACR --> AKS["☸️ AKS"]

| Option | Scale to Zero | Complexity | Best For | Monthly Cost |
|---|---|---|---|---|
| Container Apps ⭐ | ✅ | Low | Variable-traffic APIs | Pay per use |
| App Service | ❌ | Low | Steady-traffic APIs | ~$13+ (B1) |
| Azure Functions | ✅ | Low | Infrequent calls | Free tier available |
| Container Instances | ❌ | Minimal | Testing/one-off jobs | Pay per second |
| AKS | ❌ | High | Multi-service architectures | $$$$ |

⭐ Recommendation: Azure Container Apps, which offers the best balance of simplicity, cost, and scale-to-zero capability.


6. CI/CD Pipeline

flowchart LR
    Push["📤 Git Push"] --> Build["🏗️ GitHub Actions"]
    Build --> Image["🐳 Build Docker Image"]
    Image --> ACR["📦 Push to ACR"]
    ACR --> Deploy["🚀 Deploy to<br/>Container Apps"]

  • Use GitHub Actions for automated build → push → deploy
  • Store Azure credentials and API keys in GitHub Secrets
  • Maintain separate staging and production environments
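
One way to use the staging environment is to gate promotion on a smoke test against `/health`. The sketch below is an assumption about how such a pipeline step could look, not part of the post's workflow: `fetch_status` and `wait_until_healthy` are hypothetical helpers, and the retry count and delay are illustrative defaults.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET, or 0 if the request failed."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return 0  # DNS failure, refused connection, timeout, ...


def wait_until_healthy(
    url: str,
    attempts: int = 10,
    delay_s: float = 3.0,
    fetch: Callable[[str], int] = fetch_status,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll url until it returns 200 or the attempts run out."""
    for attempt in range(1, attempts + 1):
        if fetch(url) == 200:
            return True
        if attempt < attempts:
            sleep(delay_s)  # give a cold-starting replica time to come up
    return False
```

A pipeline step could call `wait_until_healthy` with the staging URL and exit non-zero on `False`, so the production deploy job only runs after staging answers. The `fetch` and `sleep` parameters exist so the retry logic can be tested without a network.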

✅ TODO: Remaining Work

| # | Task | Priority |
|---|---|---|
| 1 | Write PDF extraction script (PyMuPDF: metadata + text + tables) | 🔴 High |
| 2 | Wrap in FastAPI with endpoints, validation, error handling | 🔴 High |
| 3 | Add OpenAPI/Swagger documentation | 🔴 High |
| 4 | Write Dockerfile and test locally | 🔴 High |
| 5 | Push image to Azure Container Registry | 🟡 Medium |
| 6 | Deploy to Azure Container Apps | 🟡 Medium |
| 7 | Set up GitHub Actions CI/CD pipeline | 🟡 Medium |
| 8 | Add authentication (API key or Azure Entra ID) | 🟡 Medium |
| 9 | Load test with sample PDFs and document performance | 🟢 Low |
| 10 | Create full architecture diagram with all components | 🟢 Low |

🧩 Reference Implementation: FastAPI Service

A minimal, production-leaning implementation of the PDF extraction API:

# main.py
# Requires: fastapi, uvicorn, pymupdf, python-multipart (needed for file uploads)
import os
import time

import fitz  # PyMuPDF
from fastapi import FastAPI, File, HTTPException, UploadFile

MAX_BYTES = 25 * 1024 * 1024  # 25 MB
app = FastAPI(title="PDF Extract API", version="1.0.0")


@app.get("/health")
def health():
    return {"status": "ok", "version": os.getenv("APP_VERSION", "dev")}


@app.post("/api/extract")
async def extract(file: UploadFile = File(...)):
    if file.content_type not in ("application/pdf", "application/octet-stream"):
        raise HTTPException(status_code=400, detail="Only PDF accepted")
    # Read at most one byte past the limit so an oversized upload is never
    # loaded into memory in full by this handler
    data = await file.read(MAX_BYTES + 1)
    if len(data) > MAX_BYTES:
        raise HTTPException(status_code=413, detail="File too large")

    t0 = time.perf_counter()
    try:
        doc = fitz.open(stream=data, filetype="pdf")
    except Exception as e:  # PyMuPDF raises on corrupt or non-PDF input
        raise HTTPException(status_code=400, detail=f"Invalid PDF: {e}") from e

    # The context manager closes the document even if extraction fails
    with doc:
        pages = [p.get_text() for p in doc]
        meta = doc.metadata or {}
        result = {
            "text": "\n\n".join(pages),
            "page_count": doc.page_count,
            "metadata": {
                "title": meta.get("title"),
                "author": meta.get("author"),
                "creation_date": meta.get("creationDate"),
            },
            "processing_time_ms": int((time.perf_counter() - t0) * 1000),
        }
    return result  # FastAPI serializes the dict to JSON
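
TODO item 8 (API key authentication) could start from a check like the one below. It is a sketch under assumptions: the function name `is_authorized` is hypothetical, and the key is assumed to reach the container as an environment variable rather than living in the code.

```python
import hmac
from typing import Optional


def is_authorized(presented_key: Optional[str], expected_key: Optional[str]) -> bool:
    """Compare the presented API key with the expected one in constant time."""
    if not presented_key or not expected_key:
        return False  # missing header or unconfigured server: deny by default
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(presented_key.encode(), expected_key.encode())
```

In FastAPI this would typically be wired in as a dependency that reads a request header (an `X-API-Key` header is one common choice, assumed here) and raises `HTTPException` with status 401 when `is_authorized` returns `False`.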

Deploying to Azure Container Apps

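# 0. One-time setup: resource group and registry (skip if they already exist)
ACR=myblogacr
az group create -n rg-pdf -l eastus
az acr create -g rg-pdf -n "$ACR" --sku Basic

# 1. Build and push (the build runs inside ACR, so no local Docker daemon is needed)
az acr build -r "$ACR" -t pdf-extract:v1 .

# 2. Create environment and deploy
az containerapp env create -g rg-pdf -n cae-pdf -l eastus
az containerapp create \
  -g rg-pdf -n ca-pdf-extract \
  --environment cae-pdf \
  --image "$ACR.azurecr.io/pdf-extract:v1" \
  --registry-server "$ACR.azurecr.io" \
  --registry-identity system \
  --target-port 8000 --ingress external \
  --min-replicas 0 --max-replicas 5 \
  --cpu 0.5 --memory 1Gi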

Testing

curl -sS -F "file=@sample.pdf" "https://<app-url>/api/extract" | jq '.page_count'

When all TODO items above are ticked and the /api/extract endpoint handles 100 concurrent uploads without errors, flip status: workinprogress → status: published.
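
The 100-concurrent-uploads check could be scripted with the standard library alone. This is a rough sketch, not a replacement for a real load-testing tool: `build_multipart`, `upload_pdf`, and `run_load_test` are hypothetical helpers, and the p95 figure is a simple nearest-rank estimate.

```python
import time
import urllib.error
import urllib.request
import uuid
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Tuple


def build_multipart(filename: str, data: bytes, field: str = "file") -> Tuple[bytes, str]:
    """Encode one file as multipart/form-data; returns (body, Content-Type header)."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode()
    tail = f"\r\n--{boundary}--\r\n".encode()
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def upload_pdf(url: str, filename: str, data: bytes, timeout: float = 60.0) -> int:
    """POST the PDF and return the HTTP status code (0 on a network failure)."""
    body, content_type = build_multipart(filename, data)
    req = urllib.request.Request(
        url, data=body, method="POST", headers={"Content-Type": content_type}
    )
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as exc:
        return exc.code
    except (urllib.error.URLError, OSError):
        return 0


def run_load_test(upload: Callable[[], int], total: int = 100, workers: int = 100) -> dict:
    """Call upload() total times across worker threads and summarize the results."""
    if total < 1:
        raise ValueError("total must be at least 1")

    def timed(_):
        t0 = time.perf_counter()
        status = upload()
        return status, (time.perf_counter() - t0) * 1000

    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(timed, range(total)))

    latencies = sorted(ms for _, ms in results)
    p95 = latencies[min(len(latencies) - 1, int(0.95 * len(latencies)))]
    return {
        "statuses": dict(Counter(status for status, _ in results)),
        "errors": sum(1 for status, _ in results if status != 200),
        "p95_ms": round(p95, 1),
    }
```

With the sample file read into `pdf_bytes`, a run might look like `run_load_test(lambda: upload_pdf("https://<app-url>/api/extract", "sample.pdf", pdf_bytes))`; the publish criterion is met when `errors` is 0. Because the app is deployed with `--min-replicas 0`, the first requests may include cold-start time, so it can be worth warming the app up before measuring.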