Deployment & monitoring

A trained model is not a product

You’ve fit a Random Forest. It scores 0.92 F1 on the test set. Now what? A model that lives in a notebook helps nobody. The last 20% of an ML project is getting it into production and keeping it alive — and it is what separates a science experiment from a useful tool.

flowchart TB
  A["<b>1. Trained model</b><br/>in memory / pickle"] --> B["<b>2. Package</b><br/>joblib / ONNX"]
  B --> C["<b>3. Serve</b><br/>API / batch / streaming"]
  C --> D["<b>4. Monitor</b><br/>drift, latency, errors"]
  D -->|"drift or decay detected"| R["<b>5. Retrain</b>"]
  R -->|"new artefact"| A
  classDef stage fill:#dbeafe,stroke:#2563eb
  A:::stage
  B:::stage
  C:::stage
  D:::stage
  R:::stage

The deployment loop. The work doesn't stop at training — it starts there.

1. Save and load the model

import joblib

joblib.dump(pipeline, 'model_v1.joblib')         # save
loaded = joblib.load('model_v1.joblib')          # load
loaded.predict(X_new)

Save the whole pipeline, not just the estimator. That includes the scaler, the encoder, every preprocessing step — otherwise your production code has to redo them by hand and they will drift apart.
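To make that concrete, here is a minimal sketch of a pipeline whose preprocessing travels with the model. The toy data and column names (age, income, country) are assumptions chosen for illustration, and the temporary directory stands in for wherever you actually store artefacts.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data; the column names are illustrative assumptions.
X = pd.DataFrame({
    "age": [25, 40, 31, 58, 46, 22],
    "income": [30_000.0, 72_000.0, 45_000.0, 90_000.0, 61_000.0, 18_000.0],
    "country": ["FR", "DE", "FR", "US", "DE", "US"],
})
y = np.array([0, 1, 0, 1, 1, 0])

# Preprocessing and estimator live in one object.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=20, random_state=0)),
])
pipeline.fit(X, y)

# Round-trip through disk, as production code would.
path = os.path.join(tempfile.mkdtemp(), "model_v1.joblib")
joblib.dump(pipeline, path)
loaded = joblib.load(path)

# Raw rows go in; scaling and encoding happen inside the loaded pipeline.
# "BR" was never seen in training and is handled by handle_unknown="ignore".
X_new = pd.DataFrame({"age": [35], "income": [52_000.0], "country": ["BR"]})
same = np.array_equal(pipeline.predict(X), loaded.predict(X))
new_pred = loaded.predict(X_new)
```

The serving code never touches a scaler or an encoder: it hands raw columns to the loaded object and gets a prediction back. One caveat worth knowing: joblib files are pickles, so load them only from sources you trust and with library versions compatible with the ones used at training time.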

For cross-language / cross-framework portability, export to ONNX:

from skl2onnx import convert_sklearn
onx = convert_sklearn(pipeline, initial_types=[...])
with open('model.onnx', 'wb') as f:
    f.write(onx.SerializeToString())

ONNX files can be loaded from C++, Rust, JavaScript, mobile, etc.

2. Serving — three patterns

flowchart TB
  M["Model"]
  M --> API["REST / gRPC API<br/>(synchronous)"]
  M --> Batch["Batch job<br/>(nightly Spark / Airflow)"]
  M --> Stream["Streaming<br/>(Kafka / Flink)"]
  API -->|"low-latency<br/>per-request"| U1["Web app, mobile"]
  Batch -->|"high-throughput<br/>not real-time"| U2["Reports, ETL"]
  Stream -->|"event-driven"| U3["Fraud, recommendations"]

Three serving patterns — pick by latency and throughput needs.

REST API (the most common)

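# FastAPI example (save as app.py)
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import pandas as pd

app = FastAPI()
model = joblib.load('model_v1.joblib')  # load once at startup, not per request

class Input(BaseModel):
    age: int
    income: float
    country: str

@app.post('/predict')
def predict(x: Input):
    # One-row DataFrame with the column names the pipeline was trained on.
    # model_dump() is Pydantic v2; on Pydantic v1 use x.dict() instead.
    df = pd.DataFrame([x.model_dump()])
    return {'prediction': int(model.predict(df)[0])}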

Run with uvicorn app:app (assuming the code is saved as app.py), then deploy it in a container (Docker → Kubernetes / Cloud Run / Fly.io).

Batch

Many problems don’t need real-time predictions. Re-score all customers every night with one Spark or Airflow job — much simpler than serving an API.
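Stripped of the scheduler, a nightly re-scoring job is little more than a loop over chunks of rows. The sketch below assumes a hypothetical ThresholdModel standing in for the loaded pipeline and a hypothetical customer_id field; in a real job the rows would come from your warehouse and the results would be written back to a table.

```python
from datetime import datetime, timezone
from typing import Iterator, List, Sequence


class ThresholdModel:
    """Stand-in for a loaded pipeline: anything with a predict() method works."""

    def predict(self, rows: Sequence[dict]) -> List[int]:
        return [int(row["income"] > 50_000) for row in rows]


def chunks(rows: Sequence[dict], size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of at most `size` rows."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def score_all(model, rows: Sequence[dict], batch_size: int = 1000) -> List[dict]:
    """Score every row in fixed-size chunks and stamp each result."""
    scored_at = datetime.now(timezone.utc).isoformat()
    results = []
    for batch in chunks(rows, batch_size):
        for row, pred in zip(batch, model.predict(batch)):
            results.append({
                "customer_id": row["customer_id"],
                "prediction": pred,
                "scored_at": scored_at,
            })
    return results


# Five toy customers with incomes 20k, 40k, 60k, 80k, 100k.
customers = [{"customer_id": i, "income": 20_000 * i} for i in range(1, 6)]
scores = score_all(ThresholdModel(), customers, batch_size=2)
```

Chunking keeps memory bounded, and stamping every row with the same scored_at makes it easy to tell later which run produced a given prediction.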

Streaming

For fraud, recommendations or anomalies, predictions must happen on each event. Stream them through Kafka + Flink (or AWS Kinesis, Pub/Sub) with the model embedded in a worker.

3. Monitoring — four things to watch

A model that scored 0.92 on the test set will not score 0.92 forever. The world changes; your model doesn’t. You need continuous monitoring:

What	Why it matters	How to detect
Input drift	Customer base shifts	Compare feature distributions (KS test, PSI)
Prediction drift	Output distribution shifts	Track `% positive` over time
Performance	Ground truth eventually arrives	Recompute metric on the new labels
Latency / errors	Service health	Standard APM (Datadog, Grafana)

Drift example

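from scipy.stats import ks_2samp

# Two-sample KS test: has the production 'age' distribution moved away from training?
stat, p = ks_2samp(X_train['age'], X_prod['age'])
if p < 0.01:
    # alert() is a placeholder for your own paging / notification hook.
    # On large samples even tiny shifts are "significant", so check stat (the effect size) too.
    alert(f'Age distribution drifted (KS={stat:.3f}, p={p:.2g})')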
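The table also mentions PSI (Population Stability Index). Here is a minimal NumPy sketch of one common way to compute it, assuming quantile bins taken from the training sample and a small floor for empty bins. Both choices are assumptions of this sketch; other implementations bin differently. The formula is PSI = Σ (aᵢ − eᵢ) · ln(aᵢ / eᵢ), where eᵢ and aᵢ are the fractions of training and production values in bin i.

```python
import numpy as np


def psi(expected: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """Population Stability Index between a reference and a new sample."""
    # Bin edges are quantiles of the reference sample, so each bin holds
    # roughly the same share of it.
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    # Use interior edges only: values beyond the reference range land in
    # the first or last bin instead of being dropped.
    inner = edges[1:-1]
    e_counts = np.bincount(np.searchsorted(inner, expected), minlength=n_bins)
    a_counts = np.bincount(np.searchsorted(inner, actual), minlength=n_bins)
    # A small floor avoids log(0) and division by zero for empty bins.
    eps = 1e-6
    e_frac = np.clip(e_counts / e_counts.sum(), eps, None)
    a_frac = np.clip(a_counts / a_counts.sum(), eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))


rng = np.random.default_rng(0)
train_age = rng.normal(40, 10, size=5_000)
same_age = rng.normal(40, 10, size=5_000)      # same distribution
shifted_age = rng.normal(50, 10, size=5_000)   # mean moved by one std

psi_same = psi(train_age, same_age)
psi_shifted = psi(train_age, shifted_age)
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable, 0.1 to 0.25 as worth a look, and above 0.25 as a significant shift. Treat those cut-offs as a convention to tune for your own data, not a law.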

Tools that handle much of this for you: Evidently AI, Arize, WhyLabs, Fiddler.

4. The retraining loop

When drift or performance decay crosses a threshold, retrain. Two patterns:

Pattern	When	Pros	Cons
Scheduled (e.g. every Monday)	Slow-drifting domains	Simple, predictable	Stale between retrains
Trigger-based (drift alert)	Volatile domains	Responsive	More plumbing

Either way, the new model goes through the same lifecycle, from data collection (step 2) through evaluation, before it replaces the old one. Never push a retrained model straight to prod.
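The two patterns are not mutually exclusive. A small decision function can check the schedule and the triggers together, as in the sketch below. The RetrainPolicy name and its default thresholds (7 days, PSI 0.25, F1 0.85) are hypothetical values for illustration, not recommendations.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RetrainPolicy:
    max_age: timedelta = timedelta(days=7)   # scheduled: retrain at least weekly
    psi_threshold: float = 0.25              # trigger: input drift
    min_f1: float = 0.85                     # trigger: performance decay


def should_retrain(policy, last_trained, today, psi, f1=None):
    """Return the reasons to retrain; an empty list means keep the current model."""
    reasons = []
    if today - last_trained >= policy.max_age:
        reasons.append("schedule")
    if psi >= policy.psi_threshold:
        reasons.append("input drift")
    # f1 stays None until ground-truth labels have arrived.
    if f1 is not None and f1 < policy.min_f1:
        reasons.append("performance decay")
    return reasons


policy = RetrainPolicy()
trained = date(2024, 1, 1)
quiet = should_retrain(policy, trained, date(2024, 1, 3), psi=0.02, f1=0.91)
noisy = should_retrain(policy, trained, date(2024, 1, 3), psi=0.40, f1=0.80)
stale = should_retrain(policy, trained, date(2024, 1, 9), psi=0.02)
```

Returning the list of reasons instead of a bare boolean gives you something to put in the retraining job's log, which helps when you later ask why a model was replaced.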

Shadow mode is your friend: deploy the new model alongside the old one, log both predictions, compare for a week. Then switch.
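A rough sketch of what shadow mode looks like in code, assuming both models can be called as plain functions and that an in-memory list stands in for your real prediction log. All names here are hypothetical.

```python
from collections import Counter


def shadow_predict(current_model, candidate_model, row, log):
    """Serve the current model's answer; record the candidate's for later."""
    served = current_model(row)
    try:
        shadow = candidate_model(row)
    except Exception:
        # A failing candidate must never break live serving.
        shadow = None
    log.append({"input": row, "served": served, "shadow": shadow})
    return served


def agreement_report(log):
    """Summarise how often the two models agreed over the logged calls."""
    counts = Counter()
    for entry in log:
        if entry["shadow"] is None:
            counts["candidate_error"] += 1
        elif entry["shadow"] == entry["served"]:
            counts["agree"] += 1
        else:
            counts["disagree"] += 1
    total = max(len(log), 1)  # avoid dividing by zero on an empty log
    return {k: counts[k] / total for k in ("agree", "disagree", "candidate_error")}


# Hypothetical models: the candidate uses a stricter income cut-off.
def current(row):
    return int(row["income"] > 50_000)


def candidate(row):
    return int(row["income"] > 60_000)


log = []
for income in (30_000, 55_000, 70_000, 90_000):
    shadow_predict(current, candidate, {"income": income}, log)
report = agreement_report(log)
```

Users only ever see the current model's output. The disagreements are the interesting rows: read a sample of them before you decide to switch.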

5. The version trio you must track

For every model in production, log:

Code version — git commit of the training script.
Data version — snapshot or DVC hash of the training data.
Model artefact — the .joblib / .onnx file with a unique version tag.

Without all three, you can’t reproduce the model that made a bad prediction, which turns debugging into guesswork. Tools that help: MLflow, DVC, Weights & Biases.
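If you are not ready to adopt one of those tools, a small JSON manifest written next to each artefact already captures the trio. The sketch below uses only the standard library. The file names and manifest keys are assumptions, and the placeholder files stand in for your real training data and model.

```python
import hashlib
import json
import os
import subprocess
import tempfile


def sha256_of(path):
    """Content hash of a file, read in chunks so large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def git_commit():
    """Current commit hash, or 'unknown' outside a git checkout."""
    try:
        out = subprocess.run(["git", "rev-parse", "HEAD"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def write_manifest(data_path, model_path, manifest_path):
    """Record code, data, and artefact versions in one JSON file."""
    manifest = {
        "code_version": git_commit(),
        "data_sha256": sha256_of(data_path),
        "model_sha256": sha256_of(model_path),
        "model_file": os.path.basename(model_path),
    }
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest


# Demo with placeholder files standing in for the real data and artefact.
workdir = tempfile.mkdtemp()
data_path = os.path.join(workdir, "train.csv")
model_path = os.path.join(workdir, "model_v1.joblib")
with open(data_path, "w") as f:
    f.write("age,income,country\n35,52000,FR\n")
with open(model_path, "wb") as f:
    f.write(b"placeholder bytes")
manifest_path = os.path.join(workdir, "model_v1.json")
manifest = write_manifest(data_path, model_path, manifest_path)
```

Hashing the data file is a crude substitute for a proper data version, but it is enough to tell you whether two models were trained on the same bytes.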

6. Minimal MLOps mindset

You don’t need a full MLOps stack to ship a model. The minimum viable production setup is:

The pipeline (preprocessing + model) saved as one artefact.
A REST API or batch job that loads the artefact and serves it.
Logging: input + prediction + timestamp for every call.
A weekly job that recomputes the metric on freshly-labelled data.
An alert when the metric drops below a threshold.

Five pieces. Anything beyond that is gravy.
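Pieces three to five fit in a few dozen lines. The sketch below assumes JSON-lines logging, a binary F1 metric, and a hypothetical threshold of 0.85. The in-memory stream stands in for a real log file, and the returned alert flag is where you would hook in your own notification.

```python
import io
import json
from datetime import datetime, timezone


def log_prediction(stream, features, prediction):
    """Append one JSON line per call: input, prediction, timestamp."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "input": features,
        "prediction": prediction,
    }
    stream.write(json.dumps(record) + "\n")


def f1_score(y_true, y_pred):
    """F1 for binary labels, written out to keep the sketch dependency-free."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def weekly_check(y_true, y_pred, threshold=0.85):
    """Recompute the metric on fresh labels and flag a drop below threshold."""
    score = f1_score(y_true, y_pred)
    return {"f1": score, "alert": score < threshold}


log_file = io.StringIO()   # stands in for a real file or log sink
log_prediction(log_file, {"age": 35, "income": 52_000.0, "country": "FR"}, 1)
logged = json.loads(log_file.getvalue())

labels = [1, 0, 1, 1, 0, 1]
healthy = weekly_check(labels, [1, 0, 1, 1, 0, 1])
degraded = weekly_check(labels, [1, 1, 0, 0, 0, 1])
```

Logging the raw input alongside the prediction is what makes the weekly check possible: once labels arrive, you join them to the log and recompute the metric without re-running the model.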

Closing the loop on the lifecycle

We’ve gone all the way around the 17 steps. The model is now alive and being watched. When monitoring flags a problem, you’ll loop back to step 2 (data collection) — and the cycle starts again.

That’s the thing to remember about supervised ML: a model is never finished. It is maintained, like a garden.

Key takeaways

Save the whole pipeline, not just the estimator (joblib / ONNX).
Three serving patterns: API, batch, streaming. Pick by latency and throughput needs.
Monitor input drift, prediction drift, and performance (once labels arrive).
Retrain on schedule or on trigger; always go through the full lifecycle before promoting.
Track code + data + artefact for every model. Reproducibility = debuggability.
A model is maintained, not finished.

That closes Part 2. Up next: Part 3 — NLP basics — how machines read text before the Transformer era.