Filter invalid rows, impute missing numeric values with a median, and keep a cleaning summary, all in standard-library Python on a list of dicts, before you graduate to pandas pipelines.
Scenario
User signup records: drop rows without a country, normalize country codes to uppercase, and impute missing ages with a median. In production the median would come from the training set; here it is the median of the valid ages in the batch.
Pipeline steps in code
- Filter rows missing required fields
- Compute median age from remaining valid ages
- Fill missing ages with that median
- Print before/after counts
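The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: the sample records, the field names (user_id, country, age), and the clean function are all assumptions made for the example. It treats a blank or None country as missing and copies each row so the input batch is left unchanged.

```python
from statistics import median

# Hypothetical sample batch; field names and values are illustrative only.
records = [
    {"user_id": 1, "country": "in", "age": 34},
    {"user_id": 2, "country": "US", "age": None},
    {"user_id": 3, "country": None, "age": 28},
    {"user_id": 4, "country": "de", "age": 41},
    {"user_id": 5, "country": "", "age": 19},
    {"user_id": 6, "country": "In", "age": 25},
]


def clean(rows):
    """Return (cleaned_rows, summary) without mutating the input rows."""
    # Step 1: keep only rows whose country is present and non-blank.
    kept = [dict(r) for r in rows if (r.get("country") or "").strip()]

    # Normalize country codes so "in", "In", and "IN" become one key.
    for r in kept:
        r["country"] = r["country"].strip().upper()

    # Step 2: median of the valid ages among the remaining rows.
    valid_ages = [r["age"] for r in kept if r.get("age") is not None]
    fill = median(valid_ages) if valid_ages else None

    # Step 3: fill missing ages with that median (skipped if no valid ages).
    imputed = 0
    for r in kept:
        if r.get("age") is None and fill is not None:
            r["age"] = fill
            imputed += 1

    summary = {
        "rows_in": len(rows),
        "rows_dropped": len(rows) - len(kept),
        "rows_out": len(kept),
        "ages_imputed": imputed,
        "median_age": fill,
    }
    return kept, summary


cleaned, summary = clean(records)

# Step 4: print before/after counts and the rest of the summary.
for key, value in summary.items():
    print(f"{key}: {value}")
```

With this sample batch, two rows are dropped for a missing country, and the single missing age is filled with 34, the median of 34, 41, and 25. Note that the median is computed after filtering, so ages on dropped rows do not influence it.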
Production note
In production jobs, persist cleaning rules in SQL views or in Python transforms that are tested in CI. Notebooks alone are not a pipeline.
Important interview questions and answers
- Q: Why uppercase country?
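A: Consistent keys prevent duplicate categories such as "IN" and "in".
- Q: Why impute with the median in this preview?
A: It demonstrates a robust default, since the median is less sensitive to outliers than the mean. In production, compute and store the median from the training split only.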
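To make the train-split point concrete, here is a small sketch that computes the median once from training rows and reuses that stored value on a later batch. The function names fit_age_median and impute_age, and the sample rows, are hypothetical and chosen only for this illustration.

```python
from statistics import median


def fit_age_median(train_rows):
    """Compute the fill value from the training split only."""
    ages = [r["age"] for r in train_rows if r.get("age") is not None]
    if not ages:
        raise ValueError("no valid ages in the training split")
    return median(ages)


def impute_age(rows, fill_value):
    """Return copies of rows with missing ages replaced by the stored value."""
    return [
        {**r, "age": fill_value if r.get("age") is None else r["age"]}
        for r in rows
    ]


# Hypothetical data: the new batch never influences the fill value.
train = [{"age": 22}, {"age": None}, {"age": 30}, {"age": 40}, {"age": 58}]
new_batch = [{"age": None}, {"age": 71}]

stored_median = fit_age_median(train)
scored = impute_age(new_batch, stored_median)
print(stored_median, scored)
```

Keeping the fit and apply steps separate means the stored value can be saved alongside the model and reapplied unchanged, rather than being recomputed from whatever data arrives later.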
Self-check
- What rows does the filter remove?
- Which statistic imputes missing age?
- Why normalize country strings?
Tip: Compare row counts before and after filters.
Interview prep
- Q: How do you filter rows?
A: In small examples, a list comprehension keeps only the records that pass the validity check.