Feature engineering transforms raw records into columns the model can use. Labels define what you predict: they must align with the product decision and be measurable without leakage from the future.
Feature examples
- Tabular: age, tenure_days, avg_order_value
- Text: token counts, embeddings from pretrained encoders
- Time: hour_of_day, days_since_last_login
- Categorical: one-hot or learned embeddings
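A minimal sketch of how a few of the feature types above might be derived from one raw record, using only the Python standard library. The record layout, the field names (`signup_at`, `last_login_at`, `order_values`, `plan`), the plan list, and the dates are assumptions made up for illustration, not a real schema.

```python
from datetime import datetime

# Hypothetical raw record for one subscriber; every field name is illustrative.
raw = {
    "signup_at": datetime(2024, 1, 1, 8, 0),
    "last_login_at": datetime(2024, 4, 25, 22, 15),
    "order_values": [20.0, 35.0, 50.0],
    "plan": "pro",
}
snapshot = datetime(2024, 4, 30, 12, 0)  # the moment we pretend to predict at
PLANS = ["free", "basic", "pro"]  # assumed closed set of categories


def build_features(record, snapshot_at):
    """Turn one raw record into a flat dict of numeric features."""
    orders = record["order_values"]
    features = {
        # Tabular / time features measured relative to the snapshot
        "tenure_days": (snapshot_at - record["signup_at"]).days,
        "days_since_last_login": (snapshot_at - record["last_login_at"]).days,
        "last_login_hour": record["last_login_at"].hour,
        "avg_order_value": sum(orders) / len(orders) if orders else 0.0,
    }
    # Categorical feature as one-hot columns
    for plan in PLANS:
        features[f"plan_{plan}"] = int(record["plan"] == plan)
    return features


features = build_features(raw, snapshot)
print(features)
```

Every time-based feature is measured against the snapshot rather than "now", so the same function gives the same answer whenever it is rerun on historical data.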
Label design
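For churn: label = 1 if the customer canceled within 30 days after the snapshot date, else 0. Bad setup: features computed from events after the snapshot, or a label that counts cancellations from before it. Either way the training data reveals the outcome (leakage). Always ask: could this feature exist at prediction time?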
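The churn label rule can be sketched as a small function. The helper names (`churn_label`, `events_visible_at`), the event layout, and the choice to drop customers who had already canceled by the snapshot are assumptions for illustration; a real pipeline may define eligibility differently.

```python
from datetime import date, timedelta

HORIZON_DAYS = 30  # label window from the text


def churn_label(snapshot, canceled_on):
    """Return 1 if canceled within the horizon after the snapshot, 0 if not,
    and None if the customer had already canceled by the snapshot date."""
    if canceled_on is None:
        return 0
    if canceled_on <= snapshot:
        return None  # not a valid example: nothing left to predict
    return int(canceled_on <= snapshot + timedelta(days=HORIZON_DAYS))


def events_visible_at(events, snapshot):
    """Keep only events dated on or before the snapshot (leakage guard)."""
    return [e for e in events if e["on"] <= snapshot]


snapshot = date(2024, 4, 30)
orders = [{"on": date(2024, 4, 10)}, {"on": date(2024, 5, 5)}]

label_soon = churn_label(snapshot, date(2024, 5, 20))   # inside the 30-day window
label_late = churn_label(snapshot, date(2024, 7, 1))    # canceled, but after the window
label_none = churn_label(snapshot, None)                # never canceled
label_gone = churn_label(snapshot, date(2024, 4, 1))    # canceled before the snapshot
visible_orders = events_visible_at(orders, snapshot)    # the May order is dropped
```

Filtering events through `events_visible_at` before computing any feature is one way to make the "could this exist at prediction time?" question mechanical instead of a manual check.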
Feature matrix preview
# Conceptual feature row
feature_row = {
    "tenure_days": 120,
    "orders_last_30d": 3,
    "label_churn_30d": 0,  # 0 = stayed, 1 = churned
}
print(feature_row.keys())
Practice: Optional snippets use pandas-style pseudocode; run them with pandas locally if you want hands-on practice.
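The conceptual row keeps features and label in one dict. Before training, the label is usually split off so it cannot be used as an input by accident. A rough sketch in plain Python, where the second row and the name `LABEL` are made up for illustration:

```python
# Two conceptual rows; the second one is invented for illustration.
rows = [
    {"tenure_days": 120, "orders_last_30d": 3, "label_churn_30d": 0},
    {"tenure_days": 15, "orders_last_30d": 0, "label_churn_30d": 1},
]

LABEL = "label_churn_30d"

# Fix the column order once so every row is laid out the same way.
feature_names = sorted(name for name in rows[0] if name != LABEL)

X = [[row[name] for name in feature_names] for row in rows]  # feature matrix
y = [row[LABEL] for row in rows]                             # label vector

print(feature_names)
print(X)
print(y)
```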
Important interview questions and answers
- Q: Label leakage?
A: Feature or label uses information unavailable at inference time, which inflates offline metrics.
- Q: Embeddings as features?
A: Dense vectors capturing semantic similarity; common in search and Gen AI pipelines.
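To make "semantic similarity" concrete, here is a toy sketch of cosine similarity between embedding vectors. The three-dimensional vectors are hand-written stand-ins, not output from any real encoder, and the item names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy "embeddings": real ones come from a pretrained encoder and have
# hundreds of dimensions.
cancel_plan = [0.9, 0.1, 0.0]
end_subscription = [0.8, 0.2, 0.1]
pizza_recipe = [0.0, 0.1, 0.9]

sim_related = cosine_similarity(cancel_plan, end_subscription)
sim_unrelated = cosine_similarity(cancel_plan, pizza_recipe)
print(sim_related > sim_unrelated)
```

Vectors that point in similar directions score close to 1, which is why embeddings can feed a model directly or drive nearest-neighbor search.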
Self-check
- Define leakage in one sentence.
- Name two feature types for a subscription product.
Pitfall: Label leakage from the future. For every feature, ask: "Is this available at prediction time?"
Interview prep
- Label leakage?
- Labels or features use future information unavailable at inference time.
- Embeddings as features?
- Dense vectors capturing semantic similarity for search and NLP.