Data cleaning transforms raw tables into analysis-ready datasets: correct types, handle missing values, remove duplicates, fix units, and document every rule so teammates can reproduce your work.
Cleaning in the lifecycle
Cleaning sits after initial EDA and before modeling splits:
- Inventory — data dictionary, sources, refresh cadence
- Validate — types, ranges, primary keys
- Transform — impute, encode, scale (as needed)
- Verify — row counts, spot checks, unit tests on pipelines
- Version — save cleaned snapshot or pipeline commit hash
SQL vs Python
- SQL — dedupe joins, filter bad rows in warehouse, cast types in views
- Python — complex string parsing, ML-oriented encodings, notebook experiments
Production pipelines often clean in SQL then validate in Python—both skills matter.
Cleaning log template
Keep a running log:
- Rule: “Drop rows where signup_date is null” — 1,204 rows removed
- Rule: “Impute age median within country” — applied before split
Leakage reminder
Do not use target or post-outcome columns to impute features. Fit imputers on training data only after train/test split.
Important interview questions and answers
- Q: Cleaning vs EDA?
A: EDA discovers issues; cleaning applies agreed fixes with documented rules. - Q: Why version cleaned data?
A: Reproducibility and debugging when metrics shift after a pipeline change.
Self-check
- List five cleaning workflow steps.
- When might SQL handle cleaning before Python?
- Why log each cleaning rule?
Tip: Clean on train only—fit imputers without test leakage.
Interview prep
- Fit on train?
Learn imputation/scaling from training only to avoid leakage.