Data: collect, explore, clean
Why data work eats the schedule
No matter the model, linear regression or Transformer, the data is the ceiling. A perfect algorithm on garbage data gives you a confident liar; an average algorithm on clean data gives you a useful tool.
Workflow: 1. Collect → 2. Explore (EDA) → 3. Clean, with cleaning sending "new questions" back to exploration.
1. Data collection — where it comes from
Common sources, each with a rough rating of how much pain it causes:
| Source | Typical use | Pain level |
|---|---|---|
| Internal database (SQL) | Customer data, transactions | Low |
| CSV / Excel export | One-off analyses | Low |
| REST API | Third-party data (weather, finance) | Medium |
| Web scraping | Public sites without API | High (legal + brittle) |
| Sensor / IoT | Hardware telemetry | High (volume, noise) |
| Public datasets | Kaggle, UCI, Hugging Face | Low (usually already cleaned) |
Golden rule: write down where each column came from. Three months later you will not remember.
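One lightweight way to follow this rule is to keep a small data dictionary next to the code and check it against the table's columns. The sketch below is only an illustration: the column names, sources, dates, and the helper `undocumented_columns` are all hypothetical.

```python
# Hypothetical data dictionary: every name, source and date below is illustrative.
PROVENANCE = {
    "customer_id": {"source": "internal SQL, table customers", "retrieved": "2024-01-15"},
    "order_total": {"source": "internal SQL, table orders", "retrieved": "2024-01-15"},
    "temperature": {"source": "third-party weather REST API", "retrieved": "2024-01-16"},
}


def undocumented_columns(columns, provenance):
    """Return the columns that have no provenance entry, in their original order."""
    return [col for col in columns if col not in provenance]


# With pandas you would pass list(df.columns); a plain list keeps the sketch self-contained.
columns = ["customer_id", "order_total", "temperature", "loyalty_score"]
missing = undocumented_columns(columns, PROVENANCE)
print(missing)  # ['loyalty_score']
```

Running a check like this whenever a new column appears keeps the record from drifting out of date.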
2. Data exploration (EDA)
Before touching a model, look at the data. Five questions cover most of a first-pass EDA:
- Shape — how many rows, how many columns?
- Types — what’s numeric, what’s text, what’s a date?
- Missing — how much is missing, and where?
- Distributions — what does each numeric column look like (histogram)?
- Correlations — which features move together?
Minimal pandas/seaborn snippet:
    import pandas as pd
    import seaborn as sns

    df = pd.read_csv("data.csv")

    print(df.shape)                  # rows, columns
    print(df.dtypes)                 # types
    print(df.isna().mean() * 100)    # % missing per column
    df.hist(figsize=(12, 8))         # distributions
    # numeric_only=True skips text/date columns, which would otherwise raise in recent pandas
    sns.heatmap(df.corr(numeric_only=True))  # correlation matrix

3. Data cleaning — the four big enemies
Raw data has to get past four enemies before it counts as a clean dataset: missing values, duplicates, outliers, and typos and inconsistencies.
Missing values
Three strategies, in order of seriousness:
| Strategy | When | Code |
|---|---|---|
| Drop the rows | Few rows affected, and dropping them would not bias the sample | df.dropna() |
| Drop the column | A column is >50% empty | df.drop(columns=['col']) |
| Impute | Need to keep the data | df['age'].fillna(df['age'].median()) |
The honest rule: never impute the target y — better to drop those rows.
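The three strategies and the rule about the target can be combined in one pass. The following is a minimal sketch on a toy table; the column names `age`, `notes`, and `y` are assumptions made for illustration, and the 50% threshold is the one from the table above.

```python
import pandas as pd

# Toy frame; "age", "notes" and the target "y" are hypothetical column names.
df = pd.DataFrame({
    "age":   [25.0, None, 40.0, 31.0, None],
    "notes": [None, None, None, "vip", None],  # mostly empty
    "y":     [1.0, 0.0, None, 1.0, 0.0],       # target
})

# 1. Never impute the target: drop the rows where y is missing.
df = df.dropna(subset=["y"])

# 2. Drop columns that are more than 50% empty.
missing_share = df.isna().mean()
df = df.drop(columns=missing_share[missing_share > 0.5].index)

# 3. Impute what remains with the column median.
df = df.assign(age=df["age"].fillna(df["age"].median()))

print(df)
```

On this toy data the row with the missing target disappears, `notes` is dropped as mostly empty, and the two missing ages are filled with 28.0, the median of the remaining values.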
Duplicates
    df = df.drop_duplicates()

Watch for near-duplicates too (same customer, slightly different name). Standardise text (lowercase, strip spaces) before deduping.
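To make the near-duplicate case concrete, here is a small sketch that dedupes on a normalised key instead of the raw text. The customer table and the `name_key` helper column are hypothetical.

```python
import pandas as pd

# Toy customer table; the column names are hypothetical.
df = pd.DataFrame({
    "name":  ["Ana Lopez", " ana lopez", "ANA LOPEZ ", "Bo Chen"],
    "email": ["ana@example.com", "ana@example.com", "ana@example.com", "bo@example.com"],
})

# An exact-match dedupe misses the variants: all four rows survive.
print(len(df.drop_duplicates()))  # 4

# Build a normalised key (lowercase, trimmed, inner whitespace collapsed) and dedupe on it.
df["name_key"] = (
    df["name"].str.lower().str.strip().str.replace(r"\s+", " ", regex=True)
)
deduped = df.drop_duplicates(subset=["name_key", "email"], keep="first")
print(deduped[["name", "email"]])
```

Normalising only catches formatting variants. Genuine misspellings would need fuzzy matching, which this sketch does not attempt.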
Outliers
Decision: are they errors (drop) or rare-but-real (keep)?
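One way to support that decision is to flag candidate outliers first and review them before deleting anything. The sketch below uses the common 1.5 × IQR fences on a toy `price` column; the data and the `is_outlier` column name are assumptions.

```python
import pandas as pd

# Toy prices; 9999.0 stands in for a suspicious value.
df = pd.DataFrame({"price": [10.0, 12.0, 11.0, 13.0, 12.5, 11.5, 9999.0]})

q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag instead of deleting, so the suspicious rows can be reviewed first.
df["is_outlier"] = ~df["price"].between(lower, upper)
print(df[df["is_outlier"]])
```

Only the 9999.0 row is flagged here. Whether it is a data-entry error or a real luxury item is still a human call.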
    q1, q3 = df['price'].quantile([0.25, 0.75])
    iqr = q3 - q1  # interquartile range
    # keep rows within 1.5 * IQR of the quartiles
    mask = (df['price'] >= q1 - 1.5*iqr) & (df['price'] <= q3 + 1.5*iqr)
    df = df[mask]  # only if you decided they are errors

Typos and inconsistencies
"USA", "U.S.A.", "united states", "Etats-Unis" are the same country. Normalise:
    df['country'] = df['country'].str.lower().str.strip()
    mapping = {"u.s.a.": "usa", "united states": "usa", "etats-unis": "usa"}
    df['country'] = df['country'].replace(mapping)

The “smell test”
Before moving on, do a final eyeball:
    df.sample(20)    # 20 random rows
    df.describe()    # min, max, mean: anything obviously wrong?
    df.dtypes        # everything in the type you expect?

If anything looks suspicious, go back and clean again. It is much cheaper than a model that ships nonsense.
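Part of the eyeball check can also be written down as explicit checks, so it can be rerun after every cleaning pass. This is only a sketch: the `smell_test` helper, the column names, and the allowed ranges are all hypothetical and would need adapting to the real table.

```python
import pandas as pd

# Toy cleaned frame; column names and the allowed ranges are hypothetical.
df = pd.DataFrame({
    "age":     [25, 28, 31],
    "price":   [10.0, 12.5, 11.0],
    "country": ["usa", "france", "usa"],
})


def smell_test(frame):
    """Return a list of human-readable problems; an empty list means nothing obvious."""
    problems = []
    if frame.isna().any().any():
        problems.append("missing values remain")
    if frame.duplicated().any():
        problems.append("duplicate rows remain")
    if not frame["age"].between(0, 120).all():
        problems.append("age outside 0-120")
    if (frame["price"] < 0).any():
        problems.append("negative price")
    if not frame["country"].str.islower().all():
        problems.append("country not normalised to lowercase")
    return problems


print(smell_test(df))  # []
```

Checks like these complement the random sample rather than replace it: they only catch the problems someone already thought to test for.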
Key takeaways
- Data is the ceiling: the model can only be as good as what you feed it.
- EDA = 5 questions: shape, types, missing, distributions, correlations.
- Four enemies of cleanliness: missing, duplicates, outliers, typos.
- Never impute the target. Better to drop those rows.
- Always do a final “smell test” — random sample + describe.
Next: Features — target, encoding, scaling — turning a clean table into something a model can eat.