Deployment & monitoring
A trained model is not a product
You’ve fit a Random Forest. It scores 0.92 F1 on the test set. Now what? A model that lives in a notebook helps nobody. The last 20% of an ML project is getting the model into production and keeping it alive, and that is what separates a science experiment from a useful tool.
    flowchart TB
        A["<b>1. Trained model</b><br/>in memory / pickle"] --> B["<b>2. Package</b><br/>joblib / ONNX"]
        B --> C["<b>3. Serve</b><br/>API / batch / streaming"]
        C --> D["<b>4. Monitor</b><br/>drift, latency, errors"]
        D -->|"drift or decay detected"| R["<b>5. Retrain</b>"]
        R -->|"new artefact"| A
        classDef stage fill:#dbeafe,stroke:#2563eb
        A:::stage
        B:::stage
        C:::stage
        D:::stage
        R:::stage
1. Save and load the model
    import joblib

    joblib.dump(pipeline, 'model_v1.joblib')   # save
    loaded = joblib.load('model_v1.joblib')    # load
    loaded.predict(X_new)

Save the whole pipeline, not just the estimator. That includes the scaler, the encoder, and every other preprocessing step. Otherwise your production code has to redo those steps by hand, and the training and serving versions will drift apart.
For cross-language / cross-framework portability, export to ONNX:
    from skl2onnx import convert_sklearn

    onx = convert_sklearn(pipeline, initial_types=[...])
    with open('model.onnx', 'wb') as f:
        f.write(onx.SerializeToString())

ONNX files can be loaded from C++, Rust, JavaScript, mobile, etc.
2. Serving — three patterns
    flowchart TB
        M["Model"]
        M --> API["REST / gRPC API<br/>(synchronous)"]
        M --> Batch["Batch job<br/>(nightly Spark / Airflow)"]
        M --> Stream["Streaming<br/>(Kafka / Flink)"]
        API -->|"low-latency<br/>per-request"| U1["Web app, mobile"]
        Batch -->|"high-throughput<br/>not real-time"| U2["Reports, ETL"]
        Stream -->|"event-driven"| U3["Fraud, recommendations"]
REST API (the most common)
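    # FastAPI example
    from fastapi import FastAPI
    from pydantic import BaseModel
    import joblib
    import pandas as pd  # needed for the DataFrame built in predict()

    app = FastAPI()
    model = joblib.load('model_v1.joblib')  # loaded once at import time, not per request

    class Input(BaseModel):
        age: int
        income: float
        country: str

    @app.post('/predict')
    def predict(x: Input):
        # One-row DataFrame whose columns match the Input fields.
        # On Pydantic v2, x.model_dump() is the non-deprecated spelling of x.dict().
        df = pd.DataFrame([x.dict()])
        return {'prediction': int(model.predict(df)[0])}

Run with uvicorn app:app (assuming the file is named app.py), then deploy in a container (Docker → Kubernetes / Cloud Run / Fly.io).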
Batch
Many problems don’t need real-time predictions. Re-scoring all customers every night with one Spark or Airflow job is much simpler than serving an API.
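As a rough illustration of the batch pattern, the sketch below scores a small table in one pass. It is a plain-Python stand-in, not a Spark or Airflow job: `score_row` is a hypothetical placeholder for `model.predict`, and the column names and the income rule are invented for the example.

```python
import csv
import io
from typing import Callable, Iterable, Iterator


def score_row(row: dict) -> int:
    """Hypothetical stand-in for model.predict on a single row."""
    return int(float(row["income"]) > 50_000)


def batch_score(rows: Iterable[dict], predict: Callable[[dict], int]) -> Iterator[dict]:
    """Yield one output record per input row: the id and its prediction."""
    for row in rows:
        yield {"customer_id": row["customer_id"], "prediction": predict(row)}


# In a real job the input would be last night's table or file;
# here it is a tiny in-memory CSV with made-up values.
raw = "customer_id,age,income\n1,34,42000\n2,51,88000\n3,29,61000\n"
scored = list(batch_score(csv.DictReader(io.StringIO(raw)), score_row))
print(scored)
```

The scorer takes the prediction function as an argument, so the same loop can be pointed at a reloaded pipeline without changing the job itself.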
Streaming
For fraud, recommendations or anomalies, predictions must happen on each event. Stream the events through Kafka + Flink (or AWS Kinesis, Pub/Sub) with the model embedded in a worker.
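The "model embedded in a worker" idea can be sketched without any streaming infrastructure. In the example below a standard-library `queue.Queue` stands in for the Kafka / Kinesis / Pub/Sub topic, and `score_event` is a hypothetical placeholder for the real model; the event fields and the threshold are assumptions made for illustration.

```python
import queue


def score_event(event: dict) -> float:
    """Hypothetical stand-in for a model's fraud score on one event."""
    return 0.9 if event["amount"] > 1_000 else 0.1


def run_worker(events: "queue.Queue", threshold: float = 0.5) -> list:
    """Score events one at a time until a None sentinel arrives.

    Returns the ids of the events whose score reached the threshold.
    """
    flagged = []
    while True:
        event = events.get()
        if event is None:  # sentinel: stop the worker
            break
        if score_event(event) >= threshold:
            flagged.append(event["id"])
    return flagged


# A Queue stands in for the message topic here.
topic = queue.Queue()
for e in [{"id": "t1", "amount": 20}, {"id": "t2", "amount": 5_000}, None]:
    topic.put(e)

flagged = run_worker(topic)
print(flagged)
```

The key property is that the model is loaded once inside the worker and each event is scored as it arrives, rather than being collected for a later batch.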
3. Monitoring — three things to watch
A model that scored 0.92 on the test set will not score 0.92 forever. The world changes; your model doesn’t. You need continuous monitoring of three model-level signals, plus ordinary service health:
| What | Why it matters | How to detect |
|---|---|---|
| Input drift | Customer base shifts | Compare feature distributions (KS test, PSI) |
| Prediction drift | Output distribution shifts | Track % positive over time |
| Performance | Ground truth eventually arrives | Recompute metric on the new labels |
| Latency / errors | Service health | Standard APM (Datadog, Grafana) |
Drift example
    from scipy.stats import ks_2samp

    # Two-sample Kolmogorov-Smirnov test: training ages vs. production ages
    stat, p = ks_2samp(X_train['age'], X_prod['age'])
    if p < 0.01:
        # alert() is a placeholder for your own notification hook
        alert('Age distribution drifted significantly')

Tools that handle all of this for you: Evidently AI, Arize, WhyLabs, Fiddler.
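The monitoring table also names PSI (Population Stability Index) as a way to compare feature distributions. A minimal pure-Python sketch follows, assuming equal-width bins taken from the training range and a small epsilon for empty bins; both are illustrative choices, and the sample data is synthetic. A commonly quoted rule of thumb treats values above roughly 0.25 as a significant shift, but the right threshold depends on the feature.

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a reference sample and a new sample.

    Bin edges are equal-width over the reference range; values outside
    that range are counted in the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width when all values are equal

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic ages, for illustration only
train_age = [20 + (i % 40) for i in range(400)]    # ages 20..59, evenly spread
prod_same = [20 + (i % 40) for i in range(200)]    # same shape as training
prod_older = [40 + (i % 40) for i in range(200)]   # shifted 20 years up

print(psi(train_age, prod_same))
print(psi(train_age, prod_older))
```

Unlike the KS p-value, PSI does not shrink toward significance just because the production sample is large, which is one reason it is often used alongside the KS test.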
4. The retraining loop
When drift or performance decay crosses a threshold, retrain. Two patterns:
| Pattern | When | Pros | Cons |
|---|---|---|---|
| Scheduled (e.g. every Monday) | Slow-drifting domains | Simple, predictable | Stale between retrains |
| Trigger-based (drift alert) | Volatile domains | Responsive | More plumbing |
Either way, the new model goes through the same lifecycle, from step 2 (data collection) through evaluation, before it replaces the old one. Never push a retrained model straight to prod.
Shadow mode is your friend: deploy the new model alongside the old one, log both predictions, compare for a week. Then switch.
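Shadow mode can be sketched as a request handler that always returns the old model's answer while recording what the new model would have said. Everything named below (`ShadowRecord`, `handle_request`, the two toy models and their thresholds) is hypothetical and only illustrates the logging-and-compare idea.

```python
from dataclasses import dataclass


@dataclass
class ShadowRecord:
    request_id: str
    old_prediction: int
    new_prediction: int


def old_model(features: dict) -> int:
    """Hypothetical current production model."""
    return int(features["income"] > 50_000)


def new_model(features: dict) -> int:
    """Hypothetical retrained candidate."""
    return int(features["income"] > 45_000)


def handle_request(request_id: str, features: dict, log: list) -> int:
    """Serve the old model's answer; record the new model's answer on the side."""
    old_pred = old_model(features)
    new_pred = new_model(features)  # computed and logged, never returned to the caller
    log.append(ShadowRecord(request_id, old_pred, new_pred))
    return old_pred


def agreement_rate(log: list) -> float:
    """Fraction of logged requests on which both models agree."""
    if not log:
        raise ValueError("no shadow records logged yet")
    return sum(r.old_prediction == r.new_prediction for r in log) / len(log)


shadow_log = []
requests = [("r1", 30_000), ("r2", 48_000), ("r3", 70_000), ("r4", 90_000)]
served = [handle_request(rid, {"income": inc}, shadow_log) for rid, inc in requests]

rate = agreement_rate(shadow_log)
disagreed = [r.request_id for r in shadow_log if r.old_prediction != r.new_prediction]
print(rate, disagreed)
```

The disagreement list is usually the more useful output: those are the requests worth inspecting by hand before switching traffic.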
5. The version trio you must track
For every model in production, log:
- Code version: git commit of the training script.
- Data version: snapshot or DVC hash of the training data.
- Model artefact: the .joblib / .onnx file with a unique version tag.
Without these three, debugging a bad prediction is impossible — you can’t reproduce what happened. Tools that help: MLflow, DVC, Weights & Biases.
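One lightweight way to record the trio is a small JSON document written next to the artefact. The sketch below is an assumption-laden stand-in, not how MLflow or DVC store things: the field names are invented, the byte strings are placeholders for real files, and the commit hash is passed in as a string (for example the output of `git rev-parse HEAD`).

```python
import hashlib
import json
from datetime import datetime, timezone


def sha256_of(data: bytes) -> str:
    """Hex digest used here as a content-based version identifier."""
    return hashlib.sha256(data).hexdigest()


def build_model_record(git_commit: str, training_data: bytes,
                       artefact: bytes, version_tag: str) -> dict:
    """Bundle code version, data version and artefact identity in one record."""
    return {
        "version_tag": version_tag,
        "code_version": git_commit,
        "data_version": sha256_of(training_data),
        "artefact_sha256": sha256_of(artefact),
        "created_at": datetime.now(timezone.utc).isoformat(),
    }


record = build_model_record(
    git_commit="<git commit hash>",  # placeholder, e.g. from `git rev-parse HEAD`
    training_data=b"age,income\n34,42000\n",  # placeholder for the training snapshot
    artefact=b"placeholder bytes standing in for model_v1.joblib",
    version_tag="v1",
)
print(json.dumps(record, indent=2))
```

Hashing the data and the artefact means two records with the same hashes are guaranteed to refer to the same bytes, whatever the files are called.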
6. Minimal MLOps mindset
You don’t need a full MLOps stack to ship a model. The minimum viable production setup is:
- The pipeline (preprocessing + model) saved as one artefact.
- A REST API or batch job that loads the artefact and serves it.
- Logging: input + prediction + timestamp for every call.
- A weekly job that recomputes the metric on freshly-labelled data.
- An alert when the metric drops below a threshold.
Five pieces. Anything beyond that is gravy.
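The logging, weekly-metric and alert pieces of that list fit in a few functions. The sketch below assumes binary labels, F1 as the metric (as in the 0.92 example), an in-memory list instead of a real log store, and a made-up threshold of 0.85; all names are hypothetical.

```python
import json
from datetime import datetime, timezone


def log_call(log: list, features: dict, prediction: int) -> None:
    """Append one JSON line per prediction: input, output, timestamp."""
    log.append(json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "input": features,
        "prediction": prediction,
    }))


def f1_score(y_true: list, y_pred: list) -> float:
    """Binary F1 for labels in {0, 1}; returns 0.0 when there are no true positives."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    return 2 * tp / (2 * tp + fp + fn)


def check_metric(y_true: list, y_pred: list, threshold: float):
    """Return (score, alert); alert is True when the score fell below the threshold."""
    score = f1_score(y_true, y_pred)
    return score, score < threshold


calls = []
log_call(calls, {"age": 34, "income": 42000.0, "country": "FR"}, 1)

# Freshly labelled data for the week (made-up values)
y_true = [1, 0, 1, 1, 0, 1]
y_pred = [1, 0, 0, 1, 1, 1]
score, alert = check_metric(y_true, y_pred, threshold=0.85)
print(score, alert)
```

In practice the logged predictions are joined with labels once they arrive, and `check_metric` runs on that joined set.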
Closing the loop on the lifecycle
We’ve gone all the way around the 17 steps. The model is now alive and being watched. When monitoring flags a problem, you’ll loop back to step 2 (data collection), and the cycle starts again.
That’s the thing to remember about supervised ML: a model is never finished. It is maintained, like a garden.
Key takeaways
- Save the whole pipeline, not just the estimator (joblib / ONNX).
- Three serving patterns: API, batch, streaming. Pick by latency.
- Monitor input drift, prediction drift, and performance (once labels arrive).
- Retrain on schedule or on trigger; always go through the full lifecycle before promoting.
- Track code + data + artefact for every model. Reproducibility = debuggability.
- A model is maintained, not finished.
That closes Part 2. Up next: Part 3 — NLP basics — how machines read text before the Transformer era.