The big picture

A real ML project is not “train a model”

When beginners discover scikit-learn, they often assume the job is `model.fit(X, y)`. That’s the easy part. As a rule of thumb, a real project is roughly 80% data work, 10% training, and 10% deployment and monitoring. This is often called the ML iceberg.

flowchart TB
  A["Visible part:<br/>model.fit() · model.predict()"]
  B["Hidden part:<br/>data work<br/>feature engineering<br/>evaluation<br/>deployment<br/>monitoring"]
  A --> B
  classDef visible fill:#fde68a,stroke:#c2410c,color:#451a03
  classDef hidden fill:#cbd5e1,stroke:#475569,color:#0f172a
  A:::visible
  B:::hidden

The ML iceberg: the model is what tutorials show; the rest is what actually ships projects.

The 17 steps, end to end

Here is the full supervised-learning lifecycle. Don’t memorise it — just notice the shape: it’s a loop, not a line.

flowchart TB
  S1["1. Business problem"]
  S2["2. Data collection"]
  S3["3. Data exploration (EDA)"]
  S4["4. Data cleaning"]
  S5["5. Choose target y"]
  S6["6. Choose features X"]
  S7["7. Encode categoricals"]
  S8["8. Normalise / standardise"]
  S9["9. Train / test split"]
  S10["10. Choose model"]
  S11["11. Train model"]
  S12["12. Evaluate with metrics"]
  S13["13. Tune hyperparameters"]
  S14["14. Compare models"]
  S15["15. Interpret results"]
  S16["16. Deploy"]
  S17["17. Monitor + retrain"]
  S1 --> S2 --> S3 --> S4 --> S5 --> S6 --> S7 --> S8 --> S9 --> S10 --> S11 --> S12 --> S13 --> S14 --> S15 --> S16 --> S17
  S17 -. drift detected .-> S2
  S12 -. bad score .-> S6
  S15 -. wrong question .-> S1

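17 steps and three feedback loops. Most projects iterate the inner ones (12 → 6, 15 → 1) dozens of times. One caveat on ordering: in code, do the train / test split (step 9) before fitting the encoder and scaler of steps 7 and 8, so that no test-set statistics leak into training.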
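As a rough illustration, steps 5 to 12 can be sketched with a scikit-learn `Pipeline`. This is a minimal sketch, not a reference implementation: the toy DataFrame, the column names (`age`, `plan`, `churned`) and the choice of `LogisticRegression` are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for steps 1-4 (collected and cleaned).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "plan": rng.choice(["free", "pro", "team"], size=n),
})
# Made-up rule so the example has something to learn.
df["churned"] = ((df["age"] < 35) & (df["plan"] == "free")).astype(int)

# Steps 5-6: choose the target y and the features X.
X = df[["age", "plan"]]
y = df["churned"]

# Step 9: split before any transformer is fitted, so encoding and
# scaling statistics are computed from the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Steps 7-8: one-hot encode the categorical column, standardise the numeric one.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])

# Steps 10-11: choose a model and train it together with the preprocessing.
model = Pipeline([("prep", preprocess), ("clf", LogisticRegression())])
model.fit(X_train, y_train)

# Step 12: evaluate on the held-out rows.
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {test_accuracy:.2f}")
```

Bundling the preprocessing and the model in one pipeline means `model.predict` applies exactly the transformations learned during `fit`, which also simplifies deployment later.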

The short version (10 steps)

If 17 is too many to remember, use this 10-step compression for daily work:

#	Step	Tool / question
1	Business problem	“What decision does this model help?”
2	Data collection	SQL, CSV, API, scraping
3	Data cleaning	Missing, duplicates, outliers, typos
4	Feature selection	Pick `y` (target) + `X` (inputs)
5	Encoding & scaling	One-hot, `StandardScaler`, `MinMaxScaler` (fitted on the training split only)
6	Train / test split	`train_test_split`, stratified
7	Model training	`model.fit(X_train, y_train)`
8	Model evaluation	Right metric for the problem
9	Hyperparameter tuning	`GridSearchCV`, `RandomizedSearchCV`, Bayesian optimisation
10	Deployment + monitoring	API, batch, drift alerts

This is what you’ll do for every supervised ML project, regardless of the domain.
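Rows 5 to 9 of the table can be sketched together, since tuning wraps scaling, training and evaluation. The example below is only a sketch under stated assumptions: the data is synthetic (`make_classification`), and the model, the grid of `C` values and the 5-fold setting are illustrative choices, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a cleaned dataset (rows 1-4 of the table).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Row 6: stratified train / test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Rows 5 and 7: scaling and the model live in one pipeline, so the scaler
# is re-fitted on the training part of every cross-validation fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Row 9: grid search over the regularisation strength C (illustrative grid).
search = GridSearchCV(
    pipe,
    param_grid={"clf__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Row 8: the test set is used once, after tuning has finished.
test_score = search.score(X_test, y_test)
print("best params:", search.best_params_)
print(f"cross-validated accuracy: {search.best_score_:.2f}")
print(f"test accuracy: {test_score:.2f}")
```

Tuning uses cross-validation on the training split only; the test set stays untouched until the final score, so that number remains an honest estimate.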

Three rules that save projects

Garbage in → garbage out. No fancy model fixes bad data. If you only have time for one thing, fix the data.
Always evaluate on data the model has never seen. The number you publish is the score on the test set, never on the training set.
Production is a moving target. A model that scores 95% today will degrade as the data it sees in production shifts. Plan monitoring before you deploy, not after.
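The second rule is easy to see in code. The sketch below assumes synthetic, deliberately noisy data and an unconstrained decision tree, both chosen only to make the gap between training and test scores visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y), an assumption for illustration.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# A tree with no depth limit can memorise the training rows, noise included.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # score on rows the model has seen
test_acc = tree.score(X_test, y_test)     # score on rows it has never seen

print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

The training score looks perfect because the tree has memorised its inputs; only the test score says anything about new data, which is why it is the number to report.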

Iteration, not waterfall

Notice the three dotted loops in the diagram above:

12 → 6 — bad evaluation score? Go back to features, not to a fancier model.
15 → 1 — interpretation reveals you answered the wrong business question? Reframe.
17 → 2 — production drift? Refresh the data and retrain.

ML projects that don’t loop are projects that don’t ship. We’ll see each loop in detail across the next lessons.
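The 17 → 2 loop needs some signal that production data has moved away from the training data. One very simple signal, sketched below, is how far the mean of a feature in live data has shifted from its training mean, measured in training standard deviations. The function name `mean_shift`, the simulated `age` values and the threshold of 0.5 are hypothetical; real monitoring usually compares full distributions rather than means.

```python
import random
import statistics


def mean_shift(reference, live):
    """Absolute shift of the live mean from the reference mean,
    expressed in reference standard deviations."""
    ref_mean = statistics.fmean(reference)
    ref_std = statistics.stdev(reference)
    return abs(statistics.fmean(live) - ref_mean) / ref_std


# Simulated feature values: training data, a stable live batch, a drifted one.
random.seed(0)
training_ages = [random.gauss(40, 10) for _ in range(1000)]
live_stable = [random.gauss(40, 10) for _ in range(1000)]
live_drifted = [random.gauss(55, 10) for _ in range(1000)]

DRIFT_THRESHOLD = 0.5  # hypothetical alert level, to be tuned per feature

for name, batch in [("stable", live_stable), ("drifted", live_drifted)]:
    shift = mean_shift(training_ages, batch)
    status = "DRIFT: refresh data and retrain" if shift > DRIFT_THRESHOLD else "ok"
    print(f"{name}: shift = {shift:.2f} std -> {status}")
```

A check like this would run on a schedule against each important feature, and an alert would trigger the return to step 2.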

Key takeaways

A real ML project is mostly data work (roughly 80%), not modelling.
The full lifecycle is 17 steps; the daily-work shortcut is 10.
Three core rules: clean data, honest evaluation, monitor in production.
Loops are not failure — they are the normal mode of operation.

Next: Data — collect, explore, clean — the unglamorous 80% of the job.