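Split the data three ways: fit parameters on the training set, tune on the validation set, and report final performance on a held-out test set that is evaluated only once. Random splits give misleading estimates when the data has time or group structure, because information from the future or from the same user leaks across splits.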
Three sets
- Train — fit model weights
- Validation — pick hyperparameters, early stopping
- Test — unbiased estimate before launch (use sparingly)
Split strategies
- Random — IID rows (rare in production)
- Time-based — train on past, validate on future
- Group — all rows from one user in one split only
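The time-based and group strategies above can be sketched in plain Python. This is a minimal illustration, not a production splitter: the toy rows, the cutoff date, the choice of validation users, and the helper names `time_split` and `group_split` are all hypothetical.

```python
from datetime import date

# Hypothetical toy rows: (user_id, event_date, label). Invented for illustration.
rows = [
    ("u1", date(2024, 1, 5), 0),
    ("u2", date(2024, 1, 9), 1),
    ("u1", date(2024, 2, 2), 1),
    ("u3", date(2024, 2, 20), 0),
    ("u2", date(2024, 3, 1), 0),
    ("u4", date(2024, 3, 15), 1),
]

def time_split(rows, cutoff):
    """Train on rows strictly before the cutoff date, validate on the rest."""
    train = [r for r in rows if r[1] < cutoff]
    val = [r for r in rows if r[1] >= cutoff]
    return train, val

def group_split(rows, val_users):
    """Send every row of a given user to exactly one side, keyed on user_id."""
    train = [r for r in rows if r[0] not in val_users]
    val = [r for r in rows if r[0] in val_users]
    return train, val

t_train, t_val = time_split(rows, cutoff=date(2024, 3, 1))
g_train, g_val = group_split(rows, val_users={"u2"})

print("time split:", len(t_train), "train /", len(t_val), "val")
print("group split:", len(g_train), "train /", len(g_val), "val")
```

In the time split every training date precedes every validation date, and in the group split no user appears on both sides. Those are the two properties a random split cannot guarantee.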
Split pseudocode
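# 70/15/15 split: compute the index boundaries for n rows
n = 1000
train_end = int(n * 0.70)  # rows [0, train_end) go to train
val_end = int(n * 0.85)    # rows [train_end, val_end) go to validation, the rest to test
print("train:", train_end, "val:", val_end - train_end, "test:", n - val_end)
Practice: Optional snippets use pandas-style pseudocode; run them locally with pandas if you want hands-on practice.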
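The boundaries from the 70/15/15 snippet only become a split once they are applied to row positions. A minimal sketch of a random split, assuming IID rows and using a fixed seed (an arbitrary choice here) so the result is reproducible:

```python
import random

n = 1000
train_end = int(n * 0.70)
val_end = int(n * 0.85)

# Shuffle row positions once, then slice them at the boundaries.
indices = list(range(n))
random.Random(0).shuffle(indices)  # seed 0 is arbitrary; it only fixes the order

train_idx = indices[:train_end]
val_idx = indices[train_end:val_end]
test_idx = indices[val_end:]

print(len(train_idx), len(val_idx), len(test_idx))
```

Slicing one shuffled list guarantees the three sets are disjoint and together cover every row. For time-ordered or grouped data, skip the shuffle and split on the timestamp or group key instead.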
Important interview questions and answers
- Q: Why not tune on the test set?
A: The test set then acts as a validation set, so the final metrics are optimistically biased.
- Q: When should you use a time-based split?
A: When user behavior drifts over time; future data must not appear in training features.
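The optimistic bias from tuning on the test set can be seen in a small simulation. This is a contrived sketch: the 50 "candidates" are hypothetical stand-ins for hyperparameter settings, and each one simply guesses at random, so every candidate has a true accuracy of 0.5.

```python
import random

rng = random.Random(0)  # fixed seed, chosen arbitrarily, for reproducibility

def random_labels(k):
    """Return k coin-flip labels (also used as random-guess predictions)."""
    return [rng.randint(0, 1) for _ in range(k)]

def accuracy(preds, labels):
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

test_labels = random_labels(100)

# Score 50 random-guess candidates on the same 100-row test set
# and keep the best score, as if tuning on test.
candidate_scores = [accuracy(random_labels(100), test_labels) for _ in range(50)]
selected_test_acc = max(candidate_scores)

# The selected candidate is still a coin flip on data it was not chosen on.
fresh_labels = random_labels(10_000)
fresh_acc = accuracy(random_labels(10_000), fresh_labels)

print("accuracy reported after tuning on test:", selected_test_acc)
print("accuracy on fresh data:", round(fresh_acc, 3))
```

The score picked on the test set comes out above 0.5 even though no candidate is better than chance, while accuracy on fresh data stays near 0.5. Selecting on the validation set instead leaves the test set free to expose that gap.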
Self-check
- What is each split used for?
- When should you use a group split instead of a random one?
Tip: Use time-based splits when user behavior drifts seasonally.
Interview prep
- Q: What is the validation set for?
A: Tuning hyperparameters and deciding early stopping without touching the test set.
- Q: When is a time-based split appropriate?
A: When there is temporal drift: train on the past and validate on future periods.