The big picture

A real ML project is not “train a model”

When beginners discover scikit-learn, they often assume the whole job is model.fit(X, y). That’s the easy part. As a rule of thumb, a real project is roughly 80% data work, 10% training, and 10% deployment and monitoring. This is the well-known ML iceberg.

flowchart TB
  A["Visible part:<br/>model.fit() · model.predict()"]
  B["Hidden part:<br/>data work<br/>feature engineering<br/>evaluation<br/>deployment<br/>monitoring"]
  A --> B
  classDef visible fill:#fde68a,stroke:#c2410c,color:#451a03
  classDef hidden fill:#cbd5e1,stroke:#475569,color:#0f172a
  A:::visible
  B:::hidden
The ML iceberg — the model is what people show in tutorials, the rest is what actually ships projects.

Here is the full supervised-learning lifecycle. Don’t memorise it — just notice the shape: it’s a loop, not a line.

flowchart TB
  S1["1. Business problem"]
  S2["2. Data collection"]
  S3["3. Data exploration (EDA)"]
  S4["4. Data cleaning"]
  S5["5. Choose target y"]
  S6["6. Choose features X"]
  S7["7. Encode categoricals"]
  S8["8. Normalise / standardise"]
  S9["9. Train / test split"]
  S10["10. Choose model"]
  S11["11. Train model"]
  S12["12. Evaluate with metrics"]
  S13["13. Tune hyperparameters"]
  S14["14. Compare models"]
  S15["15. Interpret results"]
  S16["16. Deploy"]
  S17["17. Monitor + retrain"]
  S1 --> S2 --> S3 --> S4 --> S5 --> S6 --> S7 --> S8 --> S9 --> S10 --> S11 --> S12 --> S13 --> S14 --> S15 --> S16 --> S17
  S17 -. drift detected .-> S2
  S12 -. bad score .-> S6
  S15 -. wrong question .-> S1
17 steps and three feedback loops. Most projects iterate the inner ones (12 → 6, 15 → 1) dozens of times.

If 17 is too many to remember, use this 10-step compression for daily work:

| # | Step | Tool / question |
| --- | --- | --- |
| 1 | Business problem | “What decision does this model help?” |
| 2 | Data collection | SQL, CSV, API, scraping |
| 3 | Data cleaning | Missing values, duplicates, outliers, typos |
| 4 | Feature selection | Pick y (target) + X (inputs) |
| 5 | Encoding & scaling | One-hot encoding, StandardScaler, MinMaxScaler |
| 6 | Train / test split | train_test_split, stratified |
| 7 | Model training | model.fit(X_train, y_train) |
| 8 | Model evaluation | Right metric for the problem |
| 9 | Hyperparameter tuning | Grid search, random search, Bayesian optimisation |
| 10 | Deployment + monitoring | API, batch, drift alerts |

This is what you’ll do for every supervised ML project, regardless of the domain.
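To make steps 5 to 9 of the table concrete, here is a minimal sketch using scikit-learn and pandas. The toy dataset, its column names (age, income, plan), the choice of LogisticRegression, the F1 metric, and the grid of C values are all assumptions made for illustration; a real project would pick them to fit its own data and business problem.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data: two numeric columns, one categorical, binary target.
rng = np.random.default_rng(0)
n = 400
X = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "income": rng.normal(40_000, 12_000, size=n),
    "plan": rng.choice(["free", "basic", "pro"], size=n),
})
# The target depends on the features plus noise, so there is something to learn.
signal = 0.04 * (X["age"] - 40) + (X["plan"] == "pro") * 1.5
y = (signal + rng.normal(0, 1, size=n) > 0).astype(int)

# Step 6: split before any fitting, stratified on the target.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 5: scale numeric columns, one-hot encode the categorical one.
# Wrapping this in a Pipeline means the scaler and encoder are fitted
# on training folds only, never on held-out data.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "income"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])

# Steps 7 and 9: train while searching a small hyperparameter grid
# with 5-fold cross-validation on the training set.
search = GridSearchCV(pipeline, {"model__C": [0.1, 1.0, 10.0]}, scoring="f1", cv=5)
search.fit(X_train, y_train)

# Step 8: report the metric once, on the untouched test set.
test_f1 = f1_score(y_test, search.predict(X_test))
print("best C:", search.best_params_["model__C"])
print("test F1:", round(test_f1, 3))
```

Putting the preprocessing inside the pipeline is a design choice, not a requirement: it keeps the split, the transformations, and the model in one object, which makes it harder to leak test data into training by accident.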

  1. Garbage in → garbage out. No fancy model fixes bad data. If you only have time for one thing, fix the data.
  2. Always evaluate on data the model has never seen. The number you publish is the score on the test set, never on the training set.
  3. Production is a moving target. A model that scores 95% today will degrade. Plan monitoring before you deploy, not after.
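Rule 2 is easy to see in a small experiment. The sketch below assumes a synthetic dataset from scikit-learn's make_classification with deliberately noisy labels and an unconstrained DecisionTreeClassifier; both are illustrative choices, picked because such a tree can memorise its training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data; flip_y=0.2 randomises about 20% of the labels (noise).
X, y = make_classification(n_samples=1000, flip_y=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# An unconstrained tree keeps splitting until it fits every training row.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)

train_acc = model.score(X_train, y_train)  # score on data the model has seen
test_acc = model.score(X_test, y_test)     # score on data it has never seen

print("train accuracy:", round(train_acc, 3))
print("test accuracy: ", round(test_acc, 3))
```

The training accuracy looks perfect because the tree has memorised the noise; only the test accuracy says anything about how the model would behave on new data.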

Notice the three dotted loops in the diagram above:

  • 12 → 6 — bad evaluation score? Go back to features, not to a fancier model.
  • 15 → 1 — interpretation reveals you answered the wrong business question? Reframe.
  • 17 → 2 — production drift? Refresh the data and retrain.

ML projects that don’t loop are projects that don’t ship. We’ll see each loop in detail across the next lessons.
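As a taste of the 17 → 2 loop, here is one possible drift check for a single numeric feature, written with the standard library only. It uses the Population Stability Index (PSI), one common drift statistic among many. The function name psi, the simulated data, and the 0.1 / 0.25 thresholds (a widely quoted rule of thumb, not a standard) are assumptions for illustration.

```python
import math
import random


def psi(reference, current, n_bins=10):
    """Population Stability Index between two samples of one numeric feature."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins

    def proportions(values):
        counts = [0] * n_bins
        for v in values:
            # Values outside the reference range are clamped into the edge bins.
            idx = min(max(int((v - lo) / width), 0), n_bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


random.seed(0)
training_feature = [random.gauss(0.0, 1.0) for _ in range(5000)]  # reference
stable_batch = [random.gauss(0.0, 1.0) for _ in range(5000)]      # same distribution
drifted_batch = [random.gauss(1.0, 1.0) for _ in range(5000)]     # mean has shifted

psi_stable = psi(training_feature, stable_batch)
psi_drifted = psi(training_feature, drifted_batch)

# Assumed rule of thumb: below 0.1 is stable, above 0.25 warrants action.
for name, value in [("stable", psi_stable), ("drifted", psi_drifted)]:
    action = "refresh data and retrain" if value > 0.25 else "no action"
    print(f"{name}: PSI={value:.3f} -> {action}")
```

In a real deployment the reference sample would come from the training data and the current sample from recent production inputs, with one check per monitored feature.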

  • A real ML project is 80% data work, not modelling.
  • The full lifecycle is 17 steps; the daily-work shortcut is 10.
  • Three core rules: clean data, honest evaluation, monitor in production.
  • Loops are not failure — they are the normal mode of operation.

Next: Data — collect, explore, clean — the unglamorous 80% of the job.