Deploying AI Models with FastAPI
Build production-ready AI APIs
Learn how to deploy AI and machine learning models as scalable, high-performance APIs using FastAPI and best practices for production environments.
Introduction to FastAPI for AI Deployment
FastAPI is a modern, high-performance web framework for building APIs with Python. Its combination of speed, automatic documentation, and type checking makes it an excellent choice for deploying AI models in production.
Key advantages of FastAPI for AI deployment include:
- High performance based on Starlette and Pydantic
- Automatic API documentation with Swagger UI
- Type checking and validation with Python type hints
- Asynchronous support for handling concurrent requests
- Easy integration with machine learning frameworks
Setting Up Your Development Environment
Let's start by setting up a development environment for our AI API:
# Create a virtual environment
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install required packages
pip install fastapi uvicorn pydantic python-multipart
pip install torch transformers sentence-transformers pillow
Project Structure
A well-organized project structure helps maintain and scale your API:
ai_api/
├── app/
│   ├── __init__.py
│   ├── main.py              # FastAPI application
│   ├── models/
│   │   ├── __init__.py
│   │   └── ml_models.py     # ML model loading and inference
│   ├── routers/
│   │   ├── __init__.py
│   │   ├── text.py          # Text-related endpoints
│   │   └── image.py         # Image-related endpoints
│   ├── schemas/
│   │   ├── __init__.py
│   │   └── request.py       # Pydantic models for requests/responses
│   └── utils/
│       ├── __init__.py
│       └── helpers.py       # Utility functions
├── tests/
│   ├── __init__.py
│   └── test_api.py          # API tests
├── .env                     # Environment variables
├── Dockerfile               # Docker configuration
├── requirements.txt         # Dependencies
└── README.md                # Documentation
Creating a Basic FastAPI Application
Let's create a simple FastAPI application (app/main.py):
from fastapi import FastAPI, HTTPException
from fastapi.middleware.cors import CORSMiddleware
import uvicorn
# Import routers
from app.routers import text, image
# Create FastAPI app
app = FastAPI(
    title="AI Model API",
    description="API for serving AI models for text and image processing",
    version="1.0.0"
)

# Configure CORS
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # In production, replace with specific origins
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

# Include routers
app.include_router(text.router, prefix="/api/text", tags=["Text"])
app.include_router(image.router, prefix="/api/image", tags=["Image"])

# Health check endpoint
@app.get("/health", tags=["Health"])
async def health_check():
    return {"status": "healthy"}

if __name__ == "__main__":
    uvicorn.run("app.main:app", host="0.0.0.0", port=8000, reload=True)
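Before wiring in the model endpoints, it is worth sanity-checking the application. Below is a minimal sketch of what tests/test_api.py from the project structure might contain, using FastAPI's built-in TestClient; it only covers the health endpoint defined so far, and it will run once the router modules created later in this tutorial exist (main.py imports them):

# tests/test_api.py
from fastapi.testclient import TestClient

from app.main import app

client = TestClient(app)


def test_health_check():
    # The /health endpoint should report a healthy status
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json() == {"status": "healthy"}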
Defining Request and Response Models
Using Pydantic models for request and response validation (app/schemas/request.py):
from pydantic import BaseModel, Field
from typing import List, Optional
class TextRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=5000, description="Input text for processing")
    options: Optional[dict] = Field(default={}, description="Additional options for processing")

class TextResponse(BaseModel):
    result: str = Field(..., description="Processed text result")
    confidence: float = Field(..., ge=0, le=1, description="Confidence score")
    processing_time: float = Field(..., description="Processing time in seconds")

class ImageRequest(BaseModel):
    image_url: Optional[str] = Field(default=None, description="URL of the image to process")
    # Note: For file uploads, we'll use Form and File instead of this model

class ImageResponse(BaseModel):
    results: List[dict] = Field(..., description="List of detection results")
    processing_time: float = Field(..., description="Processing time in seconds")

Loading and Serving AI Models
Let's create a module for loading and serving our AI models (app/models/ml_models.py):
import torch
from transformers import pipeline
from sentence_transformers import SentenceTransformer
import time
import os
# Singleton pattern for model loading
class ModelLoader:
    _instance = None

    def __new__(cls):
        if cls._instance is None:
            cls._instance = super(ModelLoader, cls).__new__(cls)
            cls._instance._load_models()
        return cls._instance

    def _load_models(self):
        # Text classification model
        self.sentiment_model = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
            device=0 if torch.cuda.is_available() else -1
        )
        # Text embedding model
        self.embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
        # Image classification model
        self.image_model = pipeline(
            "image-classification",
            model="google/vit-base-patch16-224",
            device=0 if torch.cuda.is_available() else -1
        )

    def get_sentiment_model(self):
        return self.sentiment_model

    def get_embedding_model(self):
        return self.embedding_model

    def get_image_model(self):
        return self.image_model

# Functions for model inference
def analyze_sentiment(text):
    start_time = time.time()
    model = ModelLoader().get_sentiment_model()
    result = model(text)[0]
    return {
        "result": result["label"],
        "confidence": result["score"],
        "processing_time": time.time() - start_time
    }

def generate_embeddings(text):
    start_time = time.time()
    model = ModelLoader().get_embedding_model()
    embedding = model.encode(text)
    return {
        "result": embedding.tolist(),
        "confidence": 1.0,  # Embeddings don't have confidence scores
        "processing_time": time.time() - start_time
    }

def classify_image(image_path):
    start_time = time.time()
    model = ModelLoader().get_image_model()
    results = model(image_path)
    return {
        "results": results,
        "processing_time": time.time() - start_time
    }

Creating API Endpoints
Now, let's create endpoints for text processing (app/routers/text.py):
from fastapi import APIRouter, HTTPException, BackgroundTasks
from app.schemas.request import TextRequest, TextResponse
from app.models.ml_models import analyze_sentiment, generate_embeddings
import time
router = APIRouter()
@router.post("/sentiment", response_model=TextResponse)
async def sentiment_analysis(request: TextRequest):
try:
result = analyze_sentiment(request.text)
return result
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
@router.post("/embeddings", response_model=dict)
async def text_embeddings(request: TextRequest):
try:
result = generate_embeddings(request.text)
return result
except Exception as e:
raise HTTPException(status_code=500, detail=str(e))
# Example of a background task
@router.post("/async-process")
async def process_text_async(request: TextRequest, background_tasks: BackgroundTasks):
# Add task to background
background_tasks.add_task(process_long_running_task, request.text)
return {"message": "Processing started in background"}
# Function to be executed in background
def process_long_running_task(text: str):
# Simulate long-running task
time.sleep(10)
# Process text and store results
result = analyze_sentiment(text)
# Here you would typically store the result in a database or cacheAnd for image processing (app/routers/image.py):
from fastapi import APIRouter, HTTPException, UploadFile, File, Form
from fastapi.responses import JSONResponse
from app.models.ml_models import classify_image
import shutil
import os
import tempfile
from typing import Optional
import aiohttp
import aiofiles
router = APIRouter()
@router.post("/classify")
async def classify_uploaded_image(
file: Optional[UploadFile] = File(None),
image_url: Optional[str] = Form(None)
):
if file is None and image_url is None:
raise HTTPException(status_code=400, detail="Either file or image_url must be provided")
try:
# Create a temporary file
with tempfile.NamedTemporaryFile(delete=False) as temp:
temp_path = temp.name
if file:
# Save uploaded file to temp location
shutil.copyfileobj(file.file, temp)
elif image_url:
# Download image from URL
async with aiohttp.ClientSession() as session:
async with session.get(image_url) as response:
if response.status != 200:
raise HTTPException(status_code=400, detail="Could not download image")
async with aiofiles.open(temp_path, 'wb') as f:
await f.write(await response.read())
# Process the image
result = classify_image(temp_path)
# Clean up
os.unlink(temp_path)
return result
except Exception as e:
# Clean up in case of error
if 'temp_path' in locals():
os.unlink(temp_path)
raise HTTPException(status_code=500, detail=str(e))Optimizing Performance
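With the router in place, you can exercise the endpoint from any HTTP client. Here is a quick sketch using the requests library; the local file name, URL, and host are placeholders, and the API is assumed to be running locally on port 8000:

import requests

# Classify a local image by uploading it as multipart/form-data
with open("cat.jpg", "rb") as image_file:
    response = requests.post(
        "http://localhost:8000/api/image/classify",
        files={"file": image_file},
    )
print(response.json())

# Alternatively, pass a publicly reachable image URL as a form field
response = requests.post(
    "http://localhost:8000/api/image/classify",
    data={"image_url": "https://example.com/cat.jpg"},
)
print(response.json())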
Optimizing Performance
To handle production workloads, we need to optimize our API:
1. Model Optimization
- Quantization: Reduce model size and increase inference speed (see the sketch after this list)
- Model Pruning: Remove unnecessary weights
- Distillation: Create smaller models that mimic larger ones
- Batching: Process multiple requests together
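As a concrete illustration of the quantization point, here is a minimal sketch of post-training dynamic quantization with PyTorch. It reuses the sentiment checkpoint loaded earlier and quantizes only the linear layers; the actual speed and accuracy trade-off depends on your model, and dynamic quantization primarily benefits CPU inference:

import torch
from transformers import AutoModelForSequenceClassification

# Load the full-precision model (same checkpoint used by the sentiment pipeline)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased-finetuned-sst-2-english"
)
model.eval()

# Dynamic quantization: linear-layer weights are stored as int8 and dequantized
# on the fly, which typically shrinks the model and speeds up CPU inference
# at a small accuracy cost
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

The quantized model can then take the place of the full-precision one inside ModelLoader for CPU-only deployments.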
2. Asynchronous Processing
FastAPI supports asynchronous request handling:
@router.post("/batch-process")
async def batch_process(requests: List[TextRequest]):
# Process multiple requests concurrently
results = await asyncio.gather(*[process_single_request(req) for req in requests])
return results
async def process_single_request(request: TextRequest):
# Use asyncio.to_thread for CPU-bound operations
result = await asyncio.to_thread(analyze_sentiment, request.text)
return result3. Caching
Implement caching to avoid redundant computations:
from fastapi_cache import FastAPICache
from fastapi_cache.backends.redis import RedisBackend
from fastapi_cache.decorator import cache
from redis import asyncio as aioredis
# Initialize cache in main.py
@app.on_event("startup")
async def startup():
    redis = aioredis.from_url("redis://localhost", encoding="utf8")
    FastAPICache.init(RedisBackend(redis), prefix="fastapi-cache:")

# Use cache decorator
@router.post("/sentiment")
@cache(expire=3600)  # Cache for 1 hour
async def sentiment_analysis(request: TextRequest):
    result = analyze_sentiment(request.text)
    return result

Containerization with Docker
Create a Dockerfile for your API:
FROM python:3.9-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y --no-install-recommends build-essential && rm -rf /var/lib/apt/lists/*

# Copy requirements and install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application code
COPY . .

# Expose port
EXPOSE 8000

# Command to run the application
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
And a docker-compose.yml file for local development:
version: '3'
services:
  api:
    build: .
    ports:
      - "8000:8000"
    volumes:
      - .:/app
    environment:
      - ENVIRONMENT=development
    command: uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
  redis:
    image: redis:alpine
    ports:
      - "6379:6379"

Scaling in Production
For production deployment, consider these scaling strategies:
1. Horizontal Scaling
Deploy multiple instances of your API behind a load balancer:
- Use Kubernetes for orchestration
- Implement auto-scaling based on CPU/memory usage
- Use a load balancer (e.g., Nginx, Traefik) to distribute traffic
2. Model Serving Platforms
Consider specialized platforms for model serving:
- TorchServe for PyTorch models
- TensorFlow Serving for TensorFlow models
- Triton Inference Server for multiple frameworks
3. Serverless Deployment
For variable workloads, serverless can be cost-effective:
- AWS Lambda with API Gateway
- Google Cloud Functions
- Azure Functions
Monitoring and Logging
Implement comprehensive monitoring for your API:
from fastapi import FastAPI, Request
import time
import logging
from prometheus_fastapi_instrumentator import Instrumentator
# Set up logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
app = FastAPI()
# Add middleware for request logging
@app.middleware("http")
async def log_requests(request: Request, call_next):
start_time = time.time()
response = await call_next(request)
process_time = time.time() - start_time
logger.info(f"Path: {request.url.path} Method: {request.method} Time: {process_time:.4f}s")
return response
# Set up Prometheus metrics
Instrumentator().instrument(app).expose(app)Security Best Practices
Implement these security measures for your API:
- Authentication: Use OAuth2 or API keys (see the sketch after this list)
- Rate Limiting: Prevent abuse with request limits
- Input Validation: Validate all inputs with Pydantic
- HTTPS: Always use TLS in production
- CORS: Configure proper Cross-Origin Resource Sharing
- Dependency Updates: Regularly update dependencies
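As a minimal illustration of the authentication point, here is a sketch of API key checking with a FastAPI dependency. The header name, key value, and protected route are placeholders; in practice the key would come from an environment variable or a secrets manager rather than being hard-coded:

from fastapi import Depends, FastAPI, HTTPException, Security
from fastapi.security.api_key import APIKeyHeader

app = FastAPI()

API_KEY = "change-me"  # Placeholder; load from configuration in production
api_key_header = APIKeyHeader(name="X-API-Key", auto_error=False)

async def verify_api_key(api_key: str = Security(api_key_header)):
    # Reject requests that are missing the header or carry the wrong key
    if api_key != API_KEY:
        raise HTTPException(status_code=401, detail="Invalid or missing API key")
    return api_key

@app.get("/api/protected", dependencies=[Depends(verify_api_key)])
async def protected_route():
    return {"message": "Authenticated"}

The same dependency can be attached to an entire router via include_router(..., dependencies=[Depends(verify_api_key)]) so that every model endpoint requires a key.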
Conclusion
FastAPI provides an excellent framework for deploying AI models in production. By following the best practices outlined in this tutorial, you can build scalable, secure, and high-performance APIs that serve your machine learning models effectively.
Remember that deploying AI in production is an ongoing process that requires monitoring, maintenance, and continuous improvement. As your application grows, you may need to adapt your architecture to meet changing requirements and scale to handle increased traffic.