Deployment & monitoring

You’ve fit a Random Forest. It scores 0.92 F1 on the test set. Now what? A model that lives in a notebook helps nobody. The last 20% of an ML project is getting it into production and keeping it alive — and it is what separates a science experiment from a useful tool.

flowchart TB
  A["<b>1. Trained model</b><br/>in memory / pickle"] --> B["<b>2. Package</b><br/>joblib / ONNX"]
  B --> C["<b>3. Serve</b><br/>API / batch / streaming"]
  C --> D["<b>4. Monitor</b><br/>drift, latency, errors"]
  D -->|"drift or decay detected"| R["<b>5. Retrain</b>"]
  R -->|"new artefact"| A
  classDef stage fill:#dbeafe,stroke:#2563eb
  A:::stage
  B:::stage
  C:::stage
  D:::stage
  R:::stage
The deployment loop. The work doesn't stop at training; it starts there.

import joblib

joblib.dump(pipeline, 'model_v1.joblib')   # save the fitted pipeline to disk
loaded = joblib.load('model_v1.joblib')    # load it back, e.g. in the serving process
loaded.predict(X_new)                      # predict on new, raw rows

Save the whole pipeline, not just the estimator. That includes the scaler, the encoder, and every other preprocessing step. Otherwise your production code has to reimplement them by hand, and the training and production versions will drift apart.
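To make the "whole pipeline" point concrete, here is a minimal sketch, assuming a scikit-learn workflow. The column names, the toy data, and the file location are illustrative assumptions, not taken from a real project.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy training data; column names and values are made up for illustration.
X_train = pd.DataFrame({
    "age": [25, 40, 31, 58, 46, 22],
    "income": [30_000.0, 82_000.0, 45_000.0, 120_000.0, 67_000.0, 21_000.0],
    "country": ["FR", "DE", "FR", "US", "DE", "US"],
})
y_train = [0, 1, 0, 1, 1, 0]

# Preprocessing lives inside the pipeline, so it is saved with the model.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["country"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", RandomForestClassifier(n_estimators=20, random_state=0)),
])
pipeline.fit(X_train, y_train)

# Round trip through disk: the loaded object accepts raw, unscaled rows.
path = os.path.join(tempfile.mkdtemp(), "model_v1.joblib")
joblib.dump(pipeline, path)
loaded = joblib.load(path)

# "ES" never appeared in training; handle_unknown="ignore" copes with it.
X_new = pd.DataFrame({"age": [35], "income": [50_000.0], "country": ["ES"]})
same = (loaded.predict(X_train) == pipeline.predict(X_train)).all()
print(same, loaded.predict(X_new).shape)
```

Because scaling and encoding are steps of the saved object, the serving code only ever calls `loaded.predict` on raw rows and never reimplements preprocessing.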

For cross-language / cross-framework portability, export to ONNX:

from skl2onnx import convert_sklearn

# initial_types declares the name and tensor type of each model input
onx = convert_sklearn(pipeline, initial_types=[...])
with open('model.onnx', 'wb') as f:
    f.write(onx.SerializeToString())

ONNX files can be loaded from C++, Rust, JavaScript, mobile, etc.

flowchart TB
  M["Model"]
  M --> API["REST / gRPC API<br/>(synchronous)"]
  M --> Batch["Batch job<br/>(nightly Spark / Airflow)"]
  M --> Stream["Streaming<br/>(Kafka / Flink)"]
  API -->|"low-latency<br/>per-request"| U1["Web app, mobile"]
  Batch -->|"high-throughput<br/>not real-time"| U2["Reports, ETL"]
  Stream -->|"event-driven"| U3["Fraud, recommendations"]
Three serving patterns: pick by latency and throughput needs.

# FastAPI example
from fastapi import FastAPI
from pydantic import BaseModel
import joblib
import pandas as pd

app = FastAPI()
model = joblib.load('model_v1.joblib')  # load once at start-up, not per request


class Input(BaseModel):
    age: int
    income: float
    country: str


@app.post('/predict')
def predict(x: Input):
    # One-row DataFrame so the pipeline sees the column names it was trained on
    df = pd.DataFrame([x.dict()])  # on Pydantic v2, prefer x.model_dump()
    return {'prediction': int(model.predict(df)[0])}

Run it with uvicorn app:app (this assumes the code lives in app.py), then deploy it in a container (Docker → Kubernetes / Cloud Run / Fly.io).

Many problems don’t need real-time predictions. Re-score all customers every night with one Spark or Airflow job — much simpler than serving an API.
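A nightly re-scoring job can be as small as the sketch below. It is a minimal illustration, assuming the customers fit in a pandas DataFrame; `ThresholdModel`, the column names, and the batch size are hypothetical stand-ins (in practice the model would be the pipeline loaded with joblib, and the scheduler would be cron, Airflow, or similar).

```python
import pandas as pd


class ThresholdModel:
    """Hypothetical stand-in for a loaded pipeline: anything with predict(df)."""

    def predict(self, df):
        return (df["income"] > 50_000).astype(int).to_numpy()


def score_in_batches(model, customers, batch_size=2):
    """Score every row in fixed-size chunks and return one prediction table."""
    scored = []
    for start in range(0, len(customers), batch_size):
        chunk = customers.iloc[start:start + batch_size]
        scored.append(pd.DataFrame({
            "customer_id": chunk["customer_id"].to_numpy(),
            "prediction": model.predict(chunk),
        }))
    if not scored:  # nothing to score tonight
        return pd.DataFrame(columns=["customer_id", "prediction"])
    return pd.concat(scored, ignore_index=True)


customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "income": [30_000.0, 82_000.0, 45_000.0, 120_000.0, 67_000.0],
})
predictions = score_in_batches(ThresholdModel(), customers)
print(predictions)
```

Chunking keeps memory bounded when the table is large; the resulting table is then written wherever downstream reports read from.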

For fraud detection, recommendations, or anomaly detection, a prediction must be made on each event as it arrives. Stream the events through Kafka + Flink (or AWS Kinesis, Pub/Sub) with the model embedded in a worker.
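The shape of such a worker can be sketched without any streaming infrastructure. In this illustration a plain list stands in for the consumer iterator, and `load_model` and the event fields are hypothetical; a real worker would read from a Kafka (or similar) client instead.

```python
import json


def load_model():
    """Hypothetical stand-in for joblib.load(...); flags large amounts."""
    return lambda event: int(event["amount"] > 1_000)


def run_worker(messages, model):
    """Score each incoming message as it arrives and yield the result."""
    for raw in messages:
        event = json.loads(raw)
        yield {"event_id": event["event_id"], "prediction": model(event)}


# The list stands in for a consumer iterator (e.g. messages from a topic).
incoming = [
    '{"event_id": "a1", "amount": 25.0}',
    '{"event_id": "a2", "amount": 4800.0}',
]
model = load_model()  # loaded once at start-up, not once per event
results = list(run_worker(incoming, model))
print(results)
```

The key design point is that the model is loaded once when the worker starts, so each event costs only a single `predict` call.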

A model that scored 0.92 on the test set will not score 0.92 forever. The world changes; your model doesn’t. You need continuous monitoring:

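| What | Why it matters | How to detect |
| --- | --- | --- |
| Input drift | Customer base shifts | Compare feature distributions (KS test, PSI) |
| Prediction drift | Output distribution shifts | Track % positive over time |
| Performance | Ground truth eventually arrives | Recompute metric on the new labels |
| Latency / errors | Service health | Standard APM (Datadog, Grafana) |

from scipy.stats import ks_2samp

# Two-sample Kolmogorov-Smirnov test: training vs. production values of one feature
stat, p = ks_2samp(X_train['age'], X_prod['age'])
# With very large samples even tiny shifts give small p-values, so check stat as well
if p < 0.01:
    alert('Age distribution drifted significantly')  # alert() stands in for your notification hook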
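The table also lists PSI (Population Stability Index) next to the KS test. A minimal NumPy sketch of one common formulation follows; the bin count, the epsilon clip, and the synthetic data are assumptions made for illustration.

```python
import numpy as np


def _bin_fractions(values, inner_edges, n_bins):
    """Fraction of values falling into each of n_bins bins."""
    idx = np.searchsorted(inner_edges, values, side="right")
    return np.bincount(idx, minlength=n_bins) / len(values)


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between a reference and a new sample."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Bin edges come from quantiles of the reference (training) sample;
    # the outer bins are open-ended, so out-of-range values still count.
    inner_edges = np.quantile(expected, np.linspace(0, 1, bins + 1))[1:-1]
    e_frac = np.clip(_bin_fractions(expected, inner_edges, bins), eps, None)
    a_frac = np.clip(_bin_fractions(actual, inner_edges, bins), eps, None)
    # PSI = sum over bins of (actual - expected) * ln(actual / expected)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))


rng = np.random.default_rng(0)
train_age = rng.normal(40, 10, 5_000)     # reference sample
prod_same = rng.normal(40, 10, 5_000)     # same distribution
prod_shifted = rng.normal(50, 10, 5_000)  # mean moved by one standard deviation

psi_same = psi(train_age, prod_same)
psi_shifted = psi(train_age, prod_shifted)
print(f"no shift: {psi_same:.3f}  shifted: {psi_shifted:.3f}")
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those cut-offs are conventions and worth tuning per feature.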

Tools that handle all of this for you: Evidently AI, Arize, WhyLabs, Fiddler.

When drift or performance decay crosses a threshold, retrain. Two patterns:

| Pattern | When | Pros | Cons |
| --- | --- | --- | --- |
| Scheduled (e.g. every Monday) | Slow-drifting domains | Simple, predictable | Stale between retrains |
| Trigger-based (drift alert) | Volatile domains | Responsive | More plumbing |
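The two patterns are not mutually exclusive. The sketch below combines them in one decision function; `should_retrain`, the seven-day maximum age, and the 0.25 drift threshold are hypothetical choices for illustration, and `drift_score` could be any drift measure you already compute.

```python
from datetime import date, timedelta


def should_retrain(today, last_trained, drift_score, *,
                   max_age=timedelta(days=7), drift_threshold=0.25):
    """Return (decision, reason) combining a schedule with a drift trigger."""
    if drift_score >= drift_threshold:
        return True, "drift"      # trigger-based: react immediately
    if today - last_trained >= max_age:
        return True, "schedule"   # scheduled: the model is simply too old
    return False, "fresh"


print(should_retrain(date(2024, 1, 10), date(2024, 1, 8), drift_score=0.40))
print(should_retrain(date(2024, 1, 20), date(2024, 1, 8), drift_score=0.05))
print(should_retrain(date(2024, 1, 10), date(2024, 1, 8), drift_score=0.05))
```

Returning the reason alongside the decision makes it easy to log why each retrain happened.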

Either way, the new model goes through the same lifecycle, from step 2 (data collection) through evaluation, before it replaces the old one. Never push a retrained model straight to prod.

Shadow mode is your friend: deploy the new model alongside the old one, log both predictions, compare for a week. Then switch.
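Shadow mode can be sketched as a thin wrapper around the two models. This is an illustration under assumptions: `ShadowRouter`, the in-memory log, and the two toy models are hypothetical, and a real deployment would write the log to durable storage.

```python
class ShadowRouter:
    """Serve the current model; run the candidate silently and log both."""

    def __init__(self, current, candidate):
        self.current = current
        self.candidate = candidate
        self.log = []  # in production this would be a durable store

    def predict(self, features):
        served = self.current(features)
        try:
            shadow = self.candidate(features)
        except Exception:  # a broken candidate must never affect users
            shadow = None
        self.log.append({"features": features, "served": served, "shadow": shadow})
        return served  # callers only ever see the current model's answer

    def agreement_rate(self):
        """Share of logged calls where both models gave the same answer."""
        pairs = [(r["served"], r["shadow"]) for r in self.log
                 if r["shadow"] is not None]
        if not pairs:
            return None
        return sum(s == c for s, c in pairs) / len(pairs)


# Toy models: the candidate uses a stricter income cut-off than the current one.
router = ShadowRouter(
    current=lambda f: int(f["income"] > 50_000),
    candidate=lambda f: int(f["income"] > 60_000),
)
served = [router.predict({"income": x}) for x in (30_000, 55_000, 82_000, 120_000)]
print(served, router.agreement_rate())
```

The agreement rate is only a first check; once labels arrive, the logged pairs also let you compare the two models on the real metric before switching.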

For every model in production, log:

  1. Code version — git commit of the training script.
  2. Data version — snapshot or DVC hash of the training data.
  3. Model artefact — the .joblib / .onnx file with a unique version tag.

Without these three, debugging a bad prediction is impossible — you can’t reproduce what happened. Tools that help: MLflow, DVC, Weights & Biases.
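If you are not yet using one of those tools, the three items can be captured with the standard library alone. The sketch below is a minimal illustration; the function names and record fields are hypothetical, and it assumes training runs inside a git checkout (falling back to "unknown" otherwise).

```python
import hashlib
import subprocess
from datetime import datetime, timezone


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so large datasets need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def current_git_commit():
    """Commit hash of the working tree, or 'unknown' outside a repository."""
    try:
        out = subprocess.run(["git", "rev-parse", "HEAD"],
                             capture_output=True, text=True, check=True)
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def build_model_record(data_path, artefact_path, version_tag):
    """One record tying together code, data, and artefact versions."""
    return {
        "code_version": current_git_commit(),
        "data_version": file_sha256(data_path),
        "artefact": artefact_path,
        "artefact_sha256": file_sha256(artefact_path),
        "version_tag": version_tag,
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
```

Calling `build_model_record("train.csv", "model_v1.joblib", "v1")` at the end of training and storing the result next to the artefact is enough to answer "which code and data produced this model?" later.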

You don’t need a full MLOps stack to ship a model. The minimum viable production setup is:

  1. The pipeline (preprocessing + model) saved as one artefact.
  2. A REST API or batch job that loads the artefact and serves it.
  3. Logging: input + prediction + timestamp for every call.
  4. A weekly job that recomputes the metric on freshly-labelled data.
  5. An alert when the metric drops below a threshold.

Five pieces. Anything beyond that is gravy.
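Pieces 3 to 5 fit in a few dozen lines. The sketch below is one possible shape, not a prescribed design: the JSON-lines file, the `request_id` field (added here so that late-arriving labels can be joined back to predictions), and the 0.8 threshold are all assumptions.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def log_prediction(log_path, request_id, features, prediction):
    """Append one JSON line: input + prediction + timestamp."""
    record = {
        "request_id": request_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input": features,
        "prediction": prediction,
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def weekly_f1(log_path, labels):
    """Recompute binary F1 on logged calls whose true label has arrived."""
    tp = fp = fn = 0
    with open(log_path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            truth = labels.get(record["request_id"])
            if truth is None:
                continue  # label not available yet
            pred = record["prediction"]
            tp += int(pred == 1 and truth == 1)
            fp += int(pred == 1 and truth == 0)
            fn += int(pred == 0 and truth == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else None


# Demo with four logged calls, three of which have received labels.
log_path = os.path.join(tempfile.mkdtemp(), "predictions.jsonl")
for request_id, prediction in [("r1", 1), ("r2", 0), ("r3", 1), ("r4", 1)]:
    log_prediction(log_path, request_id, {"age": 30}, prediction)

labels = {"r1": 1, "r2": 1, "r3": 0}  # r4 has no label yet
f1 = weekly_f1(log_path, labels)
needs_alert = f1 is not None and f1 < 0.8  # threshold is an assumption
print(f1, needs_alert)
```

In a real setup `log_prediction` is called from the API or batch job, and the weekly job sends `needs_alert` to whatever notification channel you already use.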

We’ve gone all the way around the 17 steps. The model is now alive and being watched. When monitoring flags a problem, you’ll loop back to step 2 (data collection) — and the cycle starts again.

That’s the thing to remember about supervised ML: a model is never finished. It is maintained, like a garden.

  • Save the whole pipeline, not just the estimator (joblib / ONNX).
  • Three serving patterns: API, batch, streaming. Pick by latency.
  • Monitor input drift, prediction drift, and performance (once labels arrive).
  • Retrain on schedule or on trigger; always go through the full lifecycle before promoting.
  • Track code + data + artefact for every model. Reproducibility = debuggability.
  • A model is maintained, not finished.

That closes Part 2. Up next: Part 3 — NLP basics — how machines read text before the Transformer era.