Skip to content

Data: collect, explore, clean

No matter the model — linear regression or Transformer — the data is the ceiling. A perfect algorithm on garbage data gives you a confident liar; an average algorithm on clean data gives you a useful tool.

flowchart LR
  A["1. Collect"] --> B["2. Explore (EDA)"] --> C["3. Clean"]
  C -.->|"new questions"| B
  classDef step fill:#dbeafe,stroke:#2563eb,color:#0c4a6e
  A:::step
  B:::step
  C:::step
Collect, explore, clean — and almost always loop back to explore once you've cleaned a column.

1. Data collection — where it comes from

Section titled “1. Data collection — where it comes from”

Common sources, ranked roughly by how much pain they cause:

SourceTypical usePain level
Internal database (SQL)Customer data, transactionsLow
CSV / Excel exportOne-off analysesLow
REST APIThird-party data (weather, finance)Medium
Web scrapingPublic sites without APIHigh (legal + brittle)
Sensor / IoTHardware telemetryHigh (volume, noise)
Public datasetsKaggle, UCI, HuggingFaceLow (already cleaned)

Golden rule: write down where each column came from. Three months later you will not remember.

Before touching a model, look at the data. Five questions cover 90% of EDA:

  1. Shape — how many rows, how many columns?
  2. Types — what’s numeric, what’s text, what’s a date?
  3. Missing — how much is missing, and where?
  4. Distributions — what does each numeric column look like (histogram)?
  5. Correlations — which features move together?

Minimal pandas/seaborn snippet:

import pandas as pd
import seaborn as sns
df = pd.read_csv("data.csv")
print(df.shape) # rows, columns
print(df.dtypes) # types
print(df.isna().mean() * 100) # % missing per column
df.hist(figsize=(12, 8)) # distributions
sns.heatmap(df.corr()) # correlation matrix
flowchart TB
  D["Raw data"] --> M["Missing<br/>values"]
  D --> P["Duplicates"]
  D --> O["Outliers"]
  D --> T["Typos &<br/>inconsistencies"]
  M --> C["Clean dataset"]
  P --> C
  O --> C
  T --> C
The four cleanliness problems you will face on every dataset.

Three strategies, in order of seriousness:

StrategyWhenCode
Drop the rowsFew missing rows, not biaseddf.dropna()
Drop the columnA column is >50% emptydf.drop(columns=['col'])
ImputeNeed to keep the datadf['age'].fillna(df['age'].median())

The honest rule: never impute the target y — better to drop those rows.

df = df.drop_duplicates()

Watch for near-duplicates too (same customer, slightly different name). Standardise text (lowercase, strip spaces) before deduping.

Decision: are they errors (drop) or rare-but-real (keep)?

import numpy as np
q1, q3 = df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df['price'] >= q1 - 1.5*iqr) & (df['price'] <= q3 + 1.5*iqr)
df = df[mask] # only if you decided they are errors

"USA", "U.S.A.", "united states", "Etats-Unis" are the same country. Normalise:

df['country'] = df['country'].str.lower().str.strip()
mapping = {"u.s.a.": "usa", "united states": "usa", "etats-unis": "usa"}
df['country'] = df['country'].replace(mapping)

Before moving on, do a final eyeball:

df.sample(20) # 20 random rows
df.describe() # min, max, mean — anything obviously wrong?
df.dtypes # everything in the type you expect?

If anything looks suspicious — go back and clean again. It is much cheaper than a model that ships nonsense.

  • Data is the ceiling — the model can only be as good as what you feed it.
  • EDA = 5 questions: shape, types, missing, distributions, correlations.
  • Four enemies of cleanliness: missing, duplicates, outliers, typos.
  • Never impute the target. Better to drop those rows.
  • Always do a final “smell test” — random sample + describe.

Next: Features — target, encoding, scaling — turning a clean table into something a model can eat.