A repeatable loop: define question → audit data dictionary → explore → clean → split train/test → baseline → iterate → document → ship recommendation.
Document the question
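Before touching the data, write down the answers to three questions: Who decides? What action changes based on the result? How will we measure success? Naming the decision maker ties the analysis to a choice someone will actually make. Vague goals produce vague analysis.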
Audit before plotting
- Row count, column types, missing %
- Unit of observation (user vs session vs order)
- Time range and known collection bugs
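The audit checks above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a full profiling tool: the sample rows, the column names (order_id, user_id, amount, order_date), and the helper names audit and looks_numeric are all hypothetical, and the type check only distinguishes "numeric" from "text".

```python
from datetime import date

# Hypothetical sample rows; in practice these would be loaded from your dataset.
rows = [
    {"order_id": "A1", "user_id": "u1", "amount": "19.99", "order_date": "2024-01-03"},
    {"order_id": "A2", "user_id": "u1", "amount": "", "order_date": "2024-01-05"},
    {"order_id": "A3", "user_id": "u2", "amount": "5.00", "order_date": "2024-02-10"},
    {"order_id": "A3", "user_id": "u2", "amount": "5.00", "order_date": "2024-02-10"},
]


def looks_numeric(values):
    """Return True if there is at least one value and every value parses as a float."""
    if not values:
        return False
    try:
        for v in values:
            float(v)
    except ValueError:
        return False
    return True


def audit(rows, key, date_col):
    """Collect row count, rough column types, missing %, key uniqueness, and time range."""
    n = len(rows)
    columns = list(rows[0].keys())
    missing_pct = {}
    column_types = {}
    for col in columns:
        present = [r[col] for r in rows if r[col] not in ("", None)]
        missing_pct[col] = 100.0 * (n - len(present)) / n
        column_types[col] = "numeric" if looks_numeric(present) else "text"
    # If the assumed unit of observation is one row per key, keys must be unique.
    key_is_unique = len({r[key] for r in rows}) == n
    dates = [date.fromisoformat(r[date_col]) for r in rows if r[date_col]]
    return {
        "row_count": n,
        "column_types": column_types,
        "missing_pct": missing_pct,
        "key_is_unique": key_is_unique,
        "time_range": (min(dates), max(dates)),
    }


report = audit(rows, key="order_id", date_col="order_date")
for name, value in report.items():
    print(name, value)
```

In this made-up data the repeated order_id makes key_is_unique come out False, which is the kind of signal that the assumed unit of observation (one row per order) does not hold and should be investigated before any plotting.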
Reproducibility habits
import random
random.seed(42)
print('Seed set — same random split in reruns')
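To see what the seed buys you, here is a small sketch of a seeded train/test split using only the standard library. The function name split_train_test and the 25% test fraction are illustrative assumptions, not part of the lesson code; later labs may use a different splitting tool.

```python
import random


def split_train_test(items, test_fraction=0.25, seed=42):
    """Shuffle a copy of items with a seeded generator, then slice off the test set."""
    rng = random.Random(seed)  # local generator; the global random state is untouched
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


ids = list(range(20))
train_a, test_a = split_train_test(ids, seed=42)
train_b, test_b = split_train_test(ids, seed=42)
train_c, test_c = split_train_test(ids, seed=7)

print(test_a == test_b)  # True: same seed, same split
print(test_a == test_c)  # a different seed will almost certainly give a different split
```

Using a local random.Random(seed) instead of the global random.seed is a design choice in this sketch: it keeps the split repeatable even if other code draws random numbers in between.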
Important interview questions and answers
- Q: Unit of observation?
A: The grain of one row. Mixing users and sessions causes wrong aggregates.
- Q: Why set random seed?
A: Makes train/test splits repeatable for debugging and audits.
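A small worked example may help with the unit-of-observation answer. The data below is invented for illustration: four session-level rows belonging to two users. Averaging the rows directly answers a per-session question, while a per-user question needs a roll-up to the user grain first.

```python
# Hypothetical session-level rows: one row per session, not per user.
sessions = [
    {"user_id": "u1", "revenue": 10.0},
    {"user_id": "u1", "revenue": 20.0},
    {"user_id": "u1", "revenue": 30.0},
    {"user_id": "u2", "revenue": 40.0},
]

# Mean at the session grain: total revenue divided by number of sessions.
revenue_per_session = sum(s["revenue"] for s in sessions) / len(sessions)

# Roll up to the user grain first, then average across users.
revenue_by_user = {}
for s in sessions:
    revenue_by_user[s["user_id"]] = revenue_by_user.get(s["user_id"], 0.0) + s["revenue"]
revenue_per_user = sum(revenue_by_user.values()) / len(revenue_by_user)

print(revenue_per_session)  # 25.0
print(revenue_per_user)     # 50.0
```

Reporting 25.0 as "average revenue per user" would understate the user-level figure by half here, which is the kind of wrong aggregate that mixing grains produces.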
Self-check
- List three audit checks before modeling.
- Why document the business decision maker?
Challenge
Set a random seed
- Run the workflow lesson code.
- Change the random.seed value and observe split differences in later labs.
Done when: you see seed output and understand why reproducibility matters.
Interview prep
- Random seed?
Makes stochastic steps reproducible for audits.
- Unit of observation?
Grain of one row—must match the business question.