scikit-learn and Pandas preview

Last reviewed May 28, 2026 Content v20260528

Reading: ~1 min
Level: intermediate

This lesson

This lesson previews how Pandas and scikit-learn fit together: the Pandas table-handling habits (indexing, dtypes, reshaping, and analysis of real-world tables) that prepare data for scikit-learn models.

This track orients workflow; NumPy/Pandas tracks teach the tools you will use daily in notebooks.

Typical use: building train/test feature matrices from wrangled DataFrames.

Read the narrative, run `import pandas as pd` snippets with in-memory DataFrames (install pandas and numpy with pip if needed), inspect `.head()`, `.dtypes`, and complete MCQs.

This lesson comes toward the end of the track: use it to consolidate before moving on to SciPy, sklearn-heavy projects, and interview prep.

scikit-learn estimators accept NumPy arrays, and Pandas DataFrames also work when every column is numeric. Export with `to_numpy()`, keep feature names aligned with the columns (for example through a ColumnTransformer pipeline), and never let test-set statistics leak into training features.
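To see why the "all columns numeric" condition matters, here is a minimal sketch using a made-up three-row table (the `city` column and all values are illustrative assumptions, not part of the lesson data). It compares the array dtype you get from numeric columns only with the dtype you get when a string column is included.

```python
import pandas as pd

# Hypothetical table with two numeric columns and one string column
df = pd.DataFrame({
    "age": [25, 30, 41],
    "income": [50000.0, 60000.0, 72000.0],
    "city": ["Pune", "Delhi", "Pune"],
})

# Keep only numeric columns and remember their names in column order
numeric_df = df.select_dtypes(include="number")
feature_names = list(numeric_df.columns)

X = numeric_df.to_numpy()   # numeric array, shape (n_samples, n_features)
X_mixed = df.to_numpy()     # falls back to a generic object array

print(feature_names, X.shape, X.dtype)
print(X_mixed.dtype)
```

An object-dtype array is usually a sign that a non-numeric column slipped into the feature matrix and still needs encoding.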

Feature matrix convention

import pandas as pd
import numpy as np

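# Two samples (rows) and two numeric features (columns)
df = pd.DataFrame({'age': [25, 30], 'income': [50000, 60000]})
feature_names = list(df.columns)  # keep names parallel to X columns
X = df.to_numpy()  # shape (n_samples, n_features)
print('X shape:', X.shape)  # X shape: (2, 2)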

Pipeline pattern

Split into train and test sets before calling fit
Fit scalers and encoders on the train set only
Use ColumnTransformer for mixed numeric and categorical columns
Keep column names in a list parallel to the columns of X

Common pitfall

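Fitting StandardScaler on the full dataset before splitting leaks test-set statistics (the mean and standard deviation) into training. Always fit on the train split, then transform both train and test with that fitted scaler.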
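A small sketch of the difference, using synthetic random data (the distribution, sizes, and variable names are assumptions chosen for illustration). The scaler fitted on the full dataset learns a mean that includes the test rows, while the scaler fitted on the train split does not.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic single-feature data, for illustration only
rng = np.random.default_rng(0)
X = rng.normal(loc=50.0, scale=10.0, size=(100, 1))

X_train, X_test = train_test_split(X, test_size=0.2, random_state=0)

# Leaky: the statistics are computed from train AND test rows
leaky_scaler = StandardScaler().fit(X)

# Correct: the statistics come from the train rows only
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)  # reuses the train mean and std

print("mean learned from full data: ", leaky_scaler.mean_[0])
print("mean learned from train only:", scaler.mean_[0])
```

After correct scaling the train column has mean zero by construction; the test column generally does not, and that is expected.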

Important interview questions and answers

Q: What shape should X have?
A: (n_samples, n_features): rows are observations, columns are features.
Q: How do you handle categorical columns?
A: One-hot encode or ordinal encode them before passing them to scikit-learn; most estimators do not accept raw strings.
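As a minimal sketch of the one-hot answer, the example below encodes a hypothetical `city` column (the values are made up). The encoder is fitted on the train rows only, and `handle_unknown="ignore"` is one way to cope with categories that appear only at test time.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train and test tables with a single string column
train = pd.DataFrame({"city": ["Pune", "Delhi", "Pune"]})
test = pd.DataFrame({"city": ["Delhi", "Mumbai"]})  # "Mumbai" is unseen in train

# Fit on train only; unseen test categories become an all-zero row
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train[["city"]]).toarray()
test_encoded = encoder.transform(test[["city"]]).toarray()

categories = list(encoder.categories_[0])  # learned categories, sorted
print(categories)
print(train_encoded)
print(test_encoded)
```

Each learned category becomes one output column, so the encoded matrix keeps the (n_samples, n_features) convention.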

Self-check

Export a numeric DataFrame to an X matrix.
Why fit the scaler on the train split only?


Interview prep

What shape should X have? (n_samples, n_features): rows are observations, columns are features.
How do you avoid leakage? Never fit a scaler or encoder on the full dataset before the train/test split.

Interview tip

Can you explain this lesson in 30 seconds without reading notes?
