A train/test split reserves held-out data to estimate how well a model generalizes. Evaluating on rows the model never saw during training exposes overfitting to quirks that would not appear in production.
Typical ratios
- 80/20 or 70/30: common starting points for medium-sized datasets
- Small data: use cross-validation instead of a single split (covered in the next modeling lessons)
Stratified split
For classification, keep class proportions similar in train and test—important when labels are imbalanced.
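One way to keep class proportions aligned is to split each class separately and then combine the pieces. The sketch below uses only the standard library; the function name `stratified_split`, its parameters, and the toy labels are illustrative assumptions, not part of any library.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_ratio=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split at roughly test_ratio."""
    rng = random.Random(seed)  # local generator, leaves the global random state alone
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):  # fixed class order keeps the result reproducible
        indices = by_class[label]
        rng.shuffle(indices)
        # Assumption: every class contributes at least one test row.
        n_test = max(1, round(len(indices) * test_ratio))
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Toy imbalanced labels (made-up data): 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
train_idx, test_idx = stratified_split(labels)
print(sum(labels[i] for i in test_idx), "positives out of", len(test_idx), "test rows")
```

With these toy labels the test set holds 2 positives out of 20 rows, the same 10% rate as the full data. A plain random split of the same size could easily end up with zero positives.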
Time-based splits
For forecasting, a random shuffle leaks future information into the training set. Split by time: train on older dates, test on the most recent ones.
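A minimal sketch of a time-based split, assuming each row is a dict with a `date` field. The helper `time_based_split`, the cutoff date, and the sales records are hypothetical and only illustrate the idea.

```python
from datetime import date


def time_based_split(rows, cutoff):
    """Rows dated before cutoff go to train; rows on or after it go to test."""
    train = [r for r in rows if r["date"] < cutoff]
    test = [r for r in rows if r["date"] >= cutoff]
    return train, test


# Made-up sales records, deliberately out of order.
rows = [
    {"date": date(2024, 3, 1), "sales": 130},
    {"date": date(2024, 1, 1), "sales": 100},
    {"date": date(2024, 4, 1), "sales": 125},
    {"date": date(2024, 2, 1), "sales": 110},
]
train, test = time_based_split(rows, cutoff=date(2024, 3, 1))

# Every training date precedes every test date, so nothing from the future leaks in.
assert max(r["date"] for r in train) < min(r["date"] for r in test)
```

Filtering on a cutoff rather than slicing by position means the split stays correct even when the rows are not sorted by date.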
Random seed
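import random
random.seed(42)  # fixed seed so the shuffle is reproducible
indices = list(range(100))  # row indices for an example 100-row dataset
random.shuffle(indices)
split = int(0.8 * len(indices))  # 80/20 split point
train_idx, test_idx = indices[:split], indices[split:]
Same seed → same split for debugging. Document the seed in experiment logs.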
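The reproducibility claim can be checked directly. The sketch below wraps the shuffle in a hypothetical helper, `shuffled_split`, and uses a private `random.Random(seed)` generator instead of the global `random.seed`, so unrelated code that also draws random numbers cannot change the split. That is a design choice made for this example, not a requirement.

```python
import random


def shuffled_split(n_rows, test_ratio=0.2, seed=42):
    """Shuffle row indices with a seeded generator and cut at the split point."""
    rng = random.Random(seed)  # private generator, independent of global random.* calls
    indices = list(range(n_rows))
    rng.shuffle(indices)
    n_test = round(n_rows * test_ratio)
    cut = n_rows - n_test
    return indices[:cut], indices[cut:]


first = shuffled_split(100, seed=42)
second = shuffled_split(100, seed=42)

print(first == second)  # True: same seed, same split
print(len(first[0]), len(first[1]))  # 80 20
```

Passing the seed as an argument also makes it easy to record in an experiment log next to the resulting metrics.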
Important interview questions and answers
- Q: Why hold out test data?
A: Unbiased estimate of performance on unseen rows; tune on train/validation only.
- Q: Data leakage in split?
A: Duplicates across train/test (same user twice) inflate metrics—dedupe first.
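When the repeats are several rows from the same user rather than exact copies, one option is to assign whole users to one side of the split. Note that this is a group-wise split, a related technique rather than literal deduplication. The helper `group_split`, the `user_id` field, and the rows are assumptions made for illustration.

```python
import random


def group_split(rows, key, test_ratio=0.2, seed=42):
    """Assign whole groups (e.g. all rows of one user) to either train or test."""
    rng = random.Random(seed)
    groups = sorted({row[key] for row in rows})  # sorted for a reproducible starting order
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_ratio))
    test_groups = set(groups[:n_test])
    train = [row for row in rows if row[key] not in test_groups]
    test = [row for row in rows if row[key] in test_groups]
    return train, test


# Made-up rows: each of 5 users appears twice.
rows = [{"user_id": u, "visit": v} for u in range(5) for v in range(2)]
train, test = group_split(rows, key="user_id")

train_users = {row["user_id"] for row in train}
test_users = {row["user_id"] for row in test}
print(train_users & test_users)  # set(): no user appears on both sides
```

The ratio here applies to users, not rows, so the row-level split can drift from 80/20 when some users have many more rows than others.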
Self-check
- What is the purpose of a test set?
- When should you split by time instead of randomly?
- Why set random.seed before splitting?
Pitfall: Random split on time-series data causes leakage—use time-based split.
Interview prep
- Q: Test set purpose?
A: An estimate of generalization on unseen data. It is not for tuning.