After auditing missingness, choose a strategy per column: drop, impute, or model missingness explicitly. The right choice depends on how much is missing and why.
When to drop rows
- Very few rows are affected and the missingness shows no pattern tied to the target
- Critical identifier missing (cannot join or attribute)
Dropping many rows can bias results if missingness is not random.
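The two drop rules above could be sketched roughly as follows, using plain Python dicts with None marking a missing value. The rows, the column names (customer_id, age), and the 5% threshold are illustrative assumptions, not values from this lesson.

```python
# Hypothetical rows; None marks a missing value.
rows = [
    {"customer_id": "c1", "age": 34},
    {"customer_id": None, "age": 51},   # missing identifier: cannot join
    {"customer_id": "c3", "age": None},
    {"customer_id": "c4", "age": 29},
]

def missing_fraction(rows, column):
    """Share of rows where `column` is missing."""
    return sum(r[column] is None for r in rows) / len(rows)

def drop_missing(rows, column):
    """Keep only rows where `column` is present."""
    return [r for r in rows if r[column] is not None]

# Rule 1: a missing critical identifier means the row cannot be used.
kept = drop_missing(rows, "customer_id")

# Rule 2: for other columns, drop only when very few rows are affected.
MAX_DROP_FRACTION = 0.05  # assumed threshold, tune per project
age_missing = missing_fraction(kept, "age")
if age_missing <= MAX_DROP_FRACTION:
    kept = drop_missing(kept, "age")
# Here a third of the remaining rows lack age, so they are kept for imputation.
```

The threshold only covers the "how much" part. Whether missingness is related to the target still has to be checked separately before dropping.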
Imputation options
- Numeric — median (robust), mean, or group-wise median by category
- Categorical — mode (most frequent) or explicit “unknown”
- Advanced — model-based imputation (use with care, fit on train only)
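As a minimal sketch of the simpler options above (median, group-wise median, mode, explicit "unknown"), the helpers below work on plain lists with None for missing values. The function names and sample data are hypothetical, and model-based imputation is left out. Each helper computes its fill value from whatever list it is given, so in practice that list should be training data only.

```python
import statistics
from collections import Counter, defaultdict

def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median(observed)
    return [fill if v is None else v for v in values]

def impute_group_median(values, groups):
    """Replace None with the median of the row's group.

    Falls back to the overall median for groups with no observed values.
    """
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        if v is not None:
            by_group[g].append(v)
    overall = statistics.median([v for v in values if v is not None])
    medians = {g: statistics.median(vs) for g, vs in by_group.items()}
    return [medians.get(g, overall) if v is None else v
            for v, g in zip(values, groups)]

def impute_mode(values):
    """Replace None with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]

def impute_unknown(values, label="unknown"):
    """Replace None with an explicit placeholder category."""
    return [label if v is None else v for v in values]

# Illustrative data: one extreme income, two missing values.
incomes = [30, None, 50, 1000, None]
plans = ["a", "a", "a", "b", "b"]
colors = ["red", None, "blue", "red"]

print(impute_median(incomes))               # [30, 50, 50, 1000, 50]
print(impute_group_median(incomes, plans))  # [30, 40.0, 50, 1000, 1000]
print(impute_mode(colors))                  # ['red', 'red', 'blue', 'red']
print(impute_unknown(colors))               # ['red', 'unknown', 'blue', 'red']
```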
Indicators
Add flag columns such as revenue_was_missing when missingness may carry signal (for example, optional survey questions or partial form completion).
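One way to build such a flag, sketched with plain dicts: record the indicator before filling, otherwise the information about which rows were missing is lost. The sample rows and the choice of median as the fill value are assumptions for illustration.

```python
import statistics

# Hypothetical rows; None marks a missing revenue value.
rows = [
    {"revenue": 120.0},
    {"revenue": None},
    {"revenue": 80.0},
]

# Fill value from the observed rows (in practice: training rows only).
fill = statistics.median([r["revenue"] for r in rows if r["revenue"] is not None])

for r in rows:
    # Record the flag first, then overwrite the missing value.
    r["revenue_was_missing"] = r["revenue"] is None
    if r["revenue_was_missing"]:
        r["revenue"] = fill
```

After this loop the second row holds the median of the observed revenues, and its revenue_was_missing flag is True, so a model can still tell imputed values from observed ones.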
Train-only fitting
# Conceptual pattern (after split):
# median_age = statistics.median(train_ages)
# for row in train: impute with median_age
# for row in test: use same median_age from train
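The conceptual pattern above might look like this as runnable code. The helper names (fit_median, apply_fill) and the sample ages are hypothetical; the point is only that the statistic is computed once from the training split and reused unchanged on the test split.

```python
import statistics

def fit_median(values):
    """Compute the fill value from observed (non-missing) values."""
    return statistics.median([v for v in values if v is not None])

def apply_fill(values, fill):
    """Replace None with a fill value that was computed elsewhere."""
    return [fill if v is None else v for v in values]

# Illustrative split; None marks a missing age.
train_ages = [22, None, 35, 41, 90]
test_ages = [None, 30]

# Fit on the training split only.
median_age = fit_median(train_ages)

# Apply the same train-derived value to both splits.
train_filled = apply_fill(train_ages, median_age)
test_filled = apply_fill(test_ages, median_age)
```

Keeping "fit" and "apply" as separate steps makes it hard to recompute the median on test data by accident.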
Important interview questions and answers
- Q: Why use the median for skewed numeric columns?
A: It is pulled less by outliers than the mean, which makes it a common default for imputation.
- Q: What is the risk of imputing on the full dataset?
A: Test information leaks into training through the global statistics, which inflates metrics.
Self-check
- Name two imputation strategies for categoricals.
- When is dropping rows reasonable?
- Why fit imputation on training data only?
Tip: Document the imputation strategy in the README.
Interview prep
- Q: Drop or impute?
A: Drop when only a few rows are affected; otherwise impute with care and document the choice.