SQL is the lingua franca of warehouses: filter, join, aggregate, and materialize features at scale before Python modeling on samples or exports.
Typical pipeline
- Ingest — events, CRM, logs into warehouse tables
- Transform (SQL) — dbt/SQL views: clean keys, daily rollups, cohort flags
- Export — training slice to parquet or pandas
- Model (Python) — train, evaluate, register artifact
- Score — batch SQL + UDF or online API
What belongs in SQL
- Row-level filters, joins across large fact tables
- Aggregations (DAU, revenue by segment)
- Feature snapshots versioned as tables
Review SQL intro for SELECT, JOIN, GROUP BY—essential for analysts.
What belongs in Python
- Complex text parsing, custom ML features
- Model training and cross-validation
- Prototyping before promoting logic to SQL
Example pattern
-- Warehouse: daily user features
SELECT user_id,
DATE(event_time) AS day,
COUNT(*) AS events
FROM app_events
GROUP BY user_id, DATE(event_time);Then pd.read_sql() or export to parquet for modeling locally.
Important interview questions and answers
- Q: Why SQL first at scale?
A: Engines push compute to data—Python loops over billions of rows fail. - Q: Feature store idea?
A: Centralized, versioned features for train and serve consistency—SQL often backs batch features.
Self-check
- Sketch five pipeline stages from ingest to score.
- What aggregations fit naturally in SQL?
- Why export a sample to Python for modeling?
Tip: Push heavy aggregation to SQL; model on curated extracts.
Interview prep
- SQL first?
Aggregate in warehouse; model on smaller curated extract.