Modeling means fitting a mathematical or algorithmic mapping from features (inputs) to targets (outputs) for prediction, ranking, or grouping. In data science, modeling comes after data cleaning and exploratory data analysis (EDA); it is not a substitute for understanding the business question.
Inputs and outputs
- Features (X) — columns available at prediction time
- Target (y) — what you want to predict or explain
- Baseline — simple rule to beat (majority class, mean value)
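The two baselines named above can be sketched in plain Python. This is a minimal illustration, not a prescribed implementation: the toy labels and values are made up, and the helper names (majority_class_baseline, mean_baseline, accuracy) are hypothetical.

```python
from collections import Counter
from statistics import mean


def majority_class_baseline(y_train):
    """Return the most frequent label seen in the training targets."""
    return Counter(y_train).most_common(1)[0][0]


def mean_baseline(y_train):
    """Return the mean of the training targets (regression baseline)."""
    return mean(y_train)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data, invented for illustration only.
y_train_labels = ["no", "no", "yes", "no", "yes", "no"]
y_test_labels = ["no", "yes", "no", "no"]

# Classification baseline: always predict the most common training class.
majority = majority_class_baseline(y_train_labels)
baseline_preds = [majority] * len(y_test_labels)
baseline_accuracy = accuracy(y_test_labels, baseline_preds)

# Regression baseline: always predict the training mean.
y_train_values = [10.0, 12.0, 14.0, 16.0]
mean_prediction = mean_baseline(y_train_values)

print(majority, baseline_accuracy, mean_prediction)
```

Any model that cannot beat these numbers on the chosen metric is not yet adding value.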
Model families (preview)
- Linear models — fast, interpretable coefficients
- Tree ensembles — strong tabular performance (random forest, gradient boosting)
- Neural networks — images, text, large unstructured data
Install scikit-learn locally; this track teaches concepts before deep library APIs.
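As a rough preview of how these families look in scikit-learn, the sketch below fits a majority-class baseline, a linear model, and a tree ensemble on synthetic data. The dataset sizes, the hyperparameters, and the use of accuracy as the metric are assumptions chosen only to keep the example small.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic tabular data (assumption: stands in for a real, cleaned dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    # Baseline: always predicts the most frequent training class.
    "majority_baseline": DummyClassifier(strategy="most_frequent"),
    # Linear model: fast, with interpretable coefficients.
    "logistic_regression": LogisticRegression(max_iter=1000),
    # Tree ensemble: often strong on tabular data.
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # mean accuracy

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

Neural networks are left out here because they usually target images, text, and other unstructured data rather than small tabular sets.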
Experiment discipline
- Define a metric tied to the business goal (precision at k, RMSE, calibration)
- Split data into training, validation, and test sets; tune on the validation set
- Report test metrics once at the end
- Document features, seed, and data snapshot
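The checklist above can be sketched end to end in plain Python. Everything specific here is an assumption made for illustration: the synthetic data, the seed value, the 60/20/20 split, the tiny k-nearest-neighbours regressor, and the hash used as a data snapshot identifier.

```python
import hashlib
import math
import random

SEED = 42  # hypothetical seed, recorded so the run can be reproduced


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def three_way_split(rows, seed, val_frac=0.2, test_frac=0.2):
    """Shuffle with a fixed seed, then cut into train / validation / test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


def knn_predict(train, x, k):
    """Average the targets of the k training rows whose feature is closest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    return sum(y for _, y in nearest) / k


# Synthetic (feature, target) rows: y is roughly 2x plus noise.
rng = random.Random(SEED)
rows = [(x, 2.0 * x + rng.gauss(0, 1)) for x in range(100)]

train, val, test = three_way_split(rows, SEED)

# Tune the hyperparameter k on the validation set only.
val_scores = {}
for k in (1, 3, 5, 9):
    preds = [knn_predict(train, x, k) for x, _ in val]
    val_scores[k] = rmse([y for _, y in val], preds)
best_k = min(val_scores, key=val_scores.get)

# Touch the test set once, at the end, with the chosen setting.
test_preds = [knn_predict(train, x, best_k) for x, _ in test]
test_rmse = rmse([y for _, y in test], test_preds)

# Record what is needed to reproduce the run.
run_record = {
    "features": ["x"],
    "metric": "rmse",
    "seed": SEED,
    "data_snapshot": hashlib.sha256(repr(rows).encode("utf-8")).hexdigest()[:12],
    "best_k": best_k,
    "test_rmse": test_rmse,
}
print(run_record)
```

Because the test rows play no part in choosing k, the reported test RMSE is not biased by the tuning step.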
Python foundation
Models may be trained in Python or exported from other tools, but sound evaluation and ethical reasoning apply regardless of the stack.
Important interview questions and answers
- Q: What is a baseline?
A: A naive predictor (for example, always predicting the most common class) that sets the minimum performance a more complex model must beat.
- Q: Features vs target?
A: Features are the inputs; they must be available at scoring time and must not leak the target. The target is what you predict.
Self-check
- Define features and target.
- Why establish a baseline?
- Name two model families for tabular data.
Tip: Start with logistic/linear baselines before ensembles.
Interview prep
- Q: What is supervised learning?
A: Learning from labeled outcomes: the model is trained to predict the target from features.