Data: collect, explore, clean

Why data work eats the schedule

No matter the model — linear regression or Transformer — the data is the ceiling. A perfect algorithm on garbage data gives you a confident liar; an average algorithm on clean data gives you a useful tool.

flowchart LR
  A["1. Collect"] --> B["2. Explore (EDA)"] --> C["3. Clean"]
  C -.->|"new questions"| B
  classDef step fill:#dbeafe,stroke:#2563eb,color:#0c4a6e
  A:::step
  B:::step
  C:::step

Collect, explore, clean — and almost always loop back to explore once you've cleaned a column.

1. Data collection — where it comes from

Common sources, ranked roughly by how much pain they cause:

Source	Typical use	Pain level
Internal database (SQL)	Customer data, transactions	Low
CSV / Excel export	One-off analyses	Low
REST API	Third-party data (weather, finance)	Medium
Web scraping	Public sites without API	High (legal + brittle)
Sensor / IoT	Hardware telemetry	High (volume, noise)
Public datasets	Kaggle, UCI, HuggingFace	Low (already cleaned)

Golden rule: write down where each column came from. Three months later you will not remember.

2. Data exploration (EDA)

Before touching a model, look at the data. Five questions cover 90% of EDA:

Shape — how many rows, how many columns?
Types — what’s numeric, what’s text, what’s a date?
Missing — how much is missing, and where?
Distributions — what does each numeric column look like (histogram)?
Correlations — which features move together?

Minimal pandas/seaborn snippet:

import pandas as pd
import seaborn as sns

df = pd.read_csv("data.csv")

print(df.shape)               # rows, columns
print(df.dtypes)              # types
print(df.isna().mean() * 100) # % missing per column
df.hist(figsize=(12, 8))      # distributions
sns.heatmap(df.corr())        # correlation matrix

3. Data cleaning — the four big enemies

flowchart TB
  D["Raw data"] --> M["Missing<br/>values"]
  D --> P["Duplicates"]
  D --> O["Outliers"]
  D --> T["Typos &<br/>inconsistencies"]
  M --> C["Clean dataset"]
  P --> C
  O --> C
  T --> C

The four cleanliness problems you will face on every dataset.

Missing values

Three strategies, in order of seriousness:

Strategy	When	Code
Drop the rows	Few missing rows, not biased	`df.dropna()`
Drop the column	A column is >50% empty	`df.drop(columns=['col'])`
Impute	Need to keep the data	`df['age'].fillna(df['age'].median())`

The honest rule: never impute the target y — better to drop those rows.

Duplicates

df = df.drop_duplicates()

Watch for near-duplicates too (same customer, slightly different name). Standardise text (lowercase, strip spaces) before deduping.

Outliers

Decision: are they errors (drop) or rare-but-real (keep)?

import numpy as np
q1, q3 = df['price'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df['price'] >= q1 - 1.5*iqr) & (df['price'] <= q3 + 1.5*iqr)
df = df[mask]   # only if you decided they are errors

Typos and inconsistencies

"USA", "U.S.A.", "united states", "Etats-Unis" are the same country. Normalise:

df['country'] = df['country'].str.lower().str.strip()
mapping = {"u.s.a.": "usa", "united states": "usa", "etats-unis": "usa"}
df['country'] = df['country'].replace(mapping)

The “smell test”

Before moving on, do a final eyeball:

df.sample(20)        # 20 random rows
df.describe()        # min, max, mean — anything obviously wrong?
df.dtypes            # everything in the type you expect?

If anything looks suspicious — go back and clean again. It is much cheaper than a model that ships nonsense.

Key takeaways

Data is the ceiling — the model can only be as good as what you feed it.
EDA = 5 questions: shape, types, missing, distributions, correlations.
Four enemies of cleanliness: missing, duplicates, outliers, typos.
Never impute the target. Better to drop those rows.
Always do a final “smell test” — random sample + describe.

Next: Features — target, encoding, scaling — turning a clean table into something a model can eat.