SQL in the data pipeline

Last reviewed May 28, 2026 Content v20260528

Track mode

server_script

Means

Server runner

Reading

~2 min

Level

intermediate

This lesson

This lesson teaches SQL in the data pipeline: the data science mindset, methods, and communication habits behind evidence-based decisions.

Warehouses aggregate at scale—SQL skills from /sql/intro keep pipelines efficient before pandas modeling.

You will apply SQL in the data pipeline in contexts like: ETL from warehouses, BI tools, and feature stores feeding Python notebooks.

Read the narrative, run Python in the playground (stdlib snippets now; install Jupyter, pandas, and scikit-learn locally for full notebooks), and complete MCQs to lock in vocabulary.

Toward the end—consolidate before NumPy/Pandas tracks, interview prep, and production checklist.

SQL is the lingua franca of warehouses: filter, join, aggregate, and materialize features at scale before Python modeling on samples or exports.

Typical pipeline

Ingest — events, CRM, logs into warehouse tables
Transform (SQL) — dbt/SQL views: clean keys, daily rollups, cohort flags
Export — training slice to parquet or pandas
Model (Python) — train, evaluate, register artifact
Score — batch SQL + UDF or online API

What belongs in SQL

Row-level filters, joins across large fact tables
Aggregations (DAU, revenue by segment)
Feature snapshots versioned as tables

Review SQL intro for SELECT, JOIN, GROUP BY—essential for analysts.

What belongs in Python

Complex text parsing, custom ML features
Model training and cross-validation
Prototyping before promoting logic to SQL

Example pattern

-- Warehouse: daily user features
SELECT user_id,
       DATE(event_time) AS day,
       COUNT(*) AS events
FROM app_events
GROUP BY user_id, DATE(event_time);

Then pd.read_sql() or export to parquet for modeling locally.

Important interview questions and answers

Q: Why SQL first at scale?
A: Engines push compute to data—Python loops over billions of rows fail.
Q: Feature store idea?
A: Centralized, versioned features for train and serve consistency—SQL often backs batch features.

Self-check

Sketch five pipeline stages from ingest to score.
What aggregations fit naturally in SQL?
Why export a sample to Python for modeling?

Tip: Push heavy aggregation to SQL; model on curated extracts.

Interview prep

SQL first?: Aggregate in warehouse; model on smaller curated extract.

Playground

Runs on the configured server runner (dev: npm run runner with LEARNING_RUNNER_ENABLED=true). Output appears below the editor.

Code runner not available

Server runner is disabled. Set LEARNING_RUNNER_ENABLED=true and LEARNING_RUNNER_URL in .env (see .env.example).

Discussion

Past discussion is visible to everyone. Only logged-in users can post comments and replies.

Starter discussion topics

SQL before pandas?
Aggregate in warehouse?

No discussion yet. Be the first to ask a question.