The big picture
A real ML project is not “train a model”
When beginners discover scikit-learn, they think the job is model.fit(X, y). That’s the easy part. As a rule of thumb, a real project is roughly 80% data work, 10% training, and 10% deployment and monitoring. This is the famous iceberg of ML.
flowchart TB
    A["Visible part:<br/>model.fit() · model.predict()"]
    B["Hidden part:<br/>data work<br/>feature engineering<br/>evaluation<br/>deployment<br/>monitoring"]
    A --> B
    classDef visible fill:#fde68a,stroke:#c2410c,color:#451a03
    classDef hidden fill:#cbd5e1,stroke:#475569,color:#0f172a
    A:::visible
    B:::hidden
The 17 steps, end to end
Here is the full supervised-learning lifecycle. Don’t memorise it. Just notice the shape: it’s a loop, not a line.
flowchart TB
    S1["1. Business problem"]
    S2["2. Data collection"]
    S3["3. Data exploration (EDA)"]
    S4["4. Data cleaning"]
    S5["5. Choose target y"]
    S6["6. Choose features X"]
    S7["7. Encode categoricals"]
    S8["8. Normalise / standardise"]
    S9["9. Train / test split"]
    S10["10. Choose model"]
    S11["11. Train model"]
    S12["12. Evaluate with metrics"]
    S13["13. Tune hyperparameters"]
    S14["14. Compare models"]
    S15["15. Interpret results"]
    S16["16. Deploy"]
    S17["17. Monitor + retrain"]
    S1 --> S2 --> S3 --> S4 --> S5 --> S6 --> S7 --> S8 --> S9 --> S10 --> S11 --> S12 --> S13 --> S14 --> S15 --> S16 --> S17
    S17 -. drift detected .-> S2
    S12 -. bad score .-> S6
    S15 -. wrong question .-> S1
The short version (10 steps)
If 17 is too many to remember, use this 10-step compression for daily work:
| # | Step | Tool / question |
|---|---|---|
| 1 | Business problem | “What decision does this model help?” |
| 2 | Data collection | SQL, CSV, API, scraping |
| 3 | Data cleaning | Missing, duplicates, outliers, typos |
| 4 | Feature selection | Pick y (target) + X (inputs) |
| 5 | Encoding & scaling | One-hot encoding, StandardScaler, MinMaxScaler |
| 6 | Train / test split | train_test_split, stratified |
| 7 | Model training | model.fit(X_train, y_train) |
| 8 | Model evaluation | Right metric for the problem |
| 9 | Hyperparameter tuning | GridSearchCV, RandomizedSearchCV, Bayesian optimisation |
| 10 | Deployment + monitoring | API, batch, drift alerts |
This is what you’ll do for every supervised ML project, regardless of the domain.
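To make the table concrete, here is a minimal sketch of steps 2 to 9 with scikit-learn. Everything about the data is an assumption made up for illustration: the column names (age, monthly_spend, plan, churned), the synthetic rows, and the choice of LogisticRegression and F1 as model and metric. One deliberate choice: encoding and scaling live inside a Pipeline that is fitted after the split, so the scaler only ever sees training rows.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Step 2: synthetic stand-in for collected data (hypothetical columns).
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n).astype(float),
    "monthly_spend": rng.normal(50.0, 15.0, size=n),
    "plan": rng.choice(["free", "pro", "team"], size=n),
})
df["churned"] = ((df["age"] < 35) & (df["plan"] == "free")).astype(int)

# Step 3: basic cleaning (a no-op on this synthetic data, shown for shape).
df = df.drop_duplicates().dropna()

# Step 4: pick the target y and the inputs X.
X = df[["age", "monthly_spend", "plan"]]
y = df["churned"]

# Step 6: stratified split, done before any statistics are computed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Step 5: encoding and scaling, wrapped so they are fitted on training data only.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["age", "monthly_spend"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["plan"]),
])
model = Pipeline([
    ("preprocess", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Steps 7 and 9: train while searching a small hyperparameter grid.
search = GridSearchCV(model, {"clf__C": [0.1, 1.0, 10.0]}, scoring="f1", cv=5)
search.fit(X_train, y_train)

# Step 8: evaluate once on the held-out test set.
test_f1 = f1_score(y_test, search.predict(X_test))
print(f"best C: {search.best_params_['clf__C']}, test F1: {test_f1:.2f}")
```

Steps 1 and 10 have no code here on purpose: the business question comes before any import, and deployment depends on your infrastructure.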
Three rules that save projects
- Garbage in → garbage out. No fancy model fixes bad data. If you only have time for one thing, fix the data.
- Always evaluate on data the model has never seen. The number you publish is the score on the test set, never on the training set.
- Production is a moving target. A model that scores 95% today will degrade. Plan monitoring before you deploy, not after.
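The second rule is easy to see in code. The sketch below is an illustration under assumptions, not a benchmark: the dataset is synthetic (make_classification with 20% of labels flipped), and the unconstrained DecisionTreeClassifier is chosen only because it memorises its training set.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, noisy data: flip_y randomly reassigns a share of the labels.
X, y = make_classification(
    n_samples=500, n_features=10, flip_y=0.2, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# A tree with no depth limit can memorise every training row.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)  # score on data the model has seen
test_acc = tree.score(X_test, y_test)     # score on data it has never seen

print(f"train accuracy: {train_acc:.2f}")
print(f"test accuracy:  {test_acc:.2f}")
```

The training score flatters the model because it includes the noise the tree memorised. The test score is the one to report.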
Iteration, not waterfall
Notice the three dotted loops in the diagram above:
- 12 → 6 — bad evaluation score? Go back to features, not to a fancier model.
- 15 → 1 — interpretation reveals you answered the wrong business question? Reframe.
- 17 → 2 — production drift? Refresh the data and retrain.
ML projects that don’t loop are projects that don’t ship. We’ll see each loop in detail across the next lessons.
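As a rough illustration of the 17 → 2 loop, the sketch below flags drift when a feature's mean in production moves too far from its mean in the training data. The function name has_drifted, the 0.5 standard-deviation threshold, and the sample values are all assumptions invented for this example. Real monitoring usually relies on more robust statistical tests.

```python
from statistics import mean, stdev


def has_drifted(reference, live, threshold=0.5):
    """Return True when the live mean has moved more than `threshold`
    reference standard deviations away from the reference mean."""
    ref_std = stdev(reference)
    if ref_std == 0:
        # Constant reference feature: any change in the mean counts as drift.
        return mean(live) != mean(reference)
    shift = abs(mean(live) - mean(reference)) / ref_std
    return shift > threshold


# Hypothetical values of one feature (customer age) at training time...
training_ages = [25, 31, 38, 42, 47, 53, 29, 35]
# ...and in two later production batches.
stable_batch = [27, 33, 40, 44, 45, 51, 30, 36]
shifted_batch = [55, 61, 58, 64, 70, 66, 59, 62]

for name, batch in [("stable", stable_batch), ("shifted", shifted_batch)]:
    if has_drifted(training_ages, batch):
        print(f"{name}: drift detected, go back to step 2 and retrain")
    else:
        print(f"{name}: no drift, keep monitoring")
```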
Key takeaways
- A real ML project is roughly 80% data work, not modelling.
- The full lifecycle is 17 steps; the daily-work shortcut is 10.
- Three core rules: clean data, honest evaluation, monitor in production.
- Loops are not failure — they are the normal mode of operation.
Next: Data — collect, explore, clean — the unglamorous 80% of the job.