scikit-learn estimators accept NumPy arrays; pandas DataFrames also work when all columns are numeric. Export with to_numpy(), keep feature names when building ColumnTransformer pipelines, and never let test-set statistics leak into training features.
Feature matrix convention
import pandas as pd
import numpy as np
df = pd.DataFrame({'age': [25, 30], 'income': [50000, 60000]})
X = df.to_numpy() # shape (n_samples, n_features)
print('X shape:', X.shape)
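When a DataFrame also holds non-numeric columns, one way to export only the numeric part and keep the names aligned with the columns of X is sketched below. The `city` column and the variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical frame: two numeric columns plus one string column.
df = pd.DataFrame({
    "age": [25, 30],
    "income": [50000, 60000],
    "city": ["A", "B"],  # non-numeric, excluded from X below
})

# Keep only numeric columns, then record their names in column order.
numeric_df = df.select_dtypes(include="number")
feature_names = numeric_df.columns.tolist()

# Export to a float array of shape (n_samples, n_features).
X = numeric_df.to_numpy(dtype=float)

# feature_names[i] describes X[:, i], so a column can be looked up by name.
col = feature_names.index("income")
print(feature_names, X.shape, X[:, col])
```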
Pipeline pattern
- Split train/test before fit
- Fit scalers/encoders on train only
- Use ColumnTransformer for mixed numeric/categorical columns
- Keep column names in a list parallel to X columns
Common pitfall
Fitting StandardScaler on the full dataset before the split leaks test-set statistics (mean and variance) into training. Always fit on train, then transform both train and test.
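A small sketch of the difference, using made-up single-feature data with one extreme value. The scaler fitted on all rows learns a different mean from the one fitted on the train split, because it has also seen the test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical data: one feature, one extreme value.
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0], [7.0], [100.0]])

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Leaky: statistics computed over every row, test rows included.
leaky_scaler = StandardScaler().fit(X)

# Correct: statistics come from the train split and are reused for both splits.
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print("mean learned from full data: ", leaky_scaler.mean_[0])
print("mean learned from train only:", scaler.mean_[0])
```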
Important interview questions and answers
- Q: X shape?
A: (n_samples, n_features): rows are observations, columns are features.
- Q: Categorical columns?
A: One-hot encode or ordinal encode them before passing them to sklearn; most estimators do not accept raw strings.
Self-check
- Export a numeric DataFrame to an X matrix.
- Why fit scaler on train only?
Pitfall: Fit scalers and encoders on train split only—never on full data before split.
Interview prep
- X shape?
(n_samples, n_features)—rows observations, columns features.
- Leakage?
Never fit scaler/encoder on full dataset before train/test split.