Machine learning (ML) finds a function that maps inputs (features) to outputs (labels or scores) by minimizing error on training examples. The learned function should generalize to new, unseen data rather than memorize the training rows.
Core vocabulary
- Features (X) — measurable inputs
- Labels (y) — targets for supervised learning
- Training — adjust parameters to reduce loss
- Inference — apply trained model to new inputs
- Overfitting — great on train, poor on new data
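To make "training", "loss", and "inference" from the vocabulary above concrete, here is a minimal sketch in plain Python. It assumes a one-parameter model y = w * x, a mean-squared-error loss, and made-up toy data. The names `mse_loss` and `train`, the learning rate, and the step count are illustrative choices, not part of this course.

```python
# Hypothetical toy data: the underlying relationship is y = 2 * x.
xs = [1.0, 2.0, 3.0]   # features (X)
ys = [2.0, 4.0, 6.0]   # labels (y)


def mse_loss(w, xs, ys):
    """Mean squared error of the model y = w * x on the given data."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def train(xs, ys, learning_rate=0.05, steps=50):
    """Training: repeatedly nudge w in the direction that reduces the loss."""
    w = 0.0  # initial parameter guess
    for _ in range(steps):
        # Gradient of the MSE loss with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad
    return w


initial_loss = mse_loss(0.0, xs, ys)
w = train(xs, ys)
final_loss = mse_loss(w, xs, ys)

# Inference: apply the trained parameter to an input not seen in training.
prediction = w * 4.0

print(f"loss before training: {initial_loss:.4f}")
print(f"loss after training:  {final_loss:.4f}")
print(f"learned w = {w:.4f}, prediction for x = 4: {prediction:.4f}")
```

The loss shrinks as `w` approaches 2, which is all "adjust parameters to reduce loss" means in this simplified setting. Real libraries use the same idea with many more parameters.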
Toy supervised example
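# Pseudocode: predict house price from size (nothing is trained here)
houses = [{"sqft": 1200, "price": 250000}, {"sqft": 1800, "price": 340000}]
X = [[h["sqft"]] for h in houses]  # features: one column per input, here only sqft
y = [h["price"] for h in houses]  # labels: the price to predict
# Real ML (scikit-learn style): model.fit(X, y), then model.predict([[1500]])
print("Features: sqft | Label: price")
Practice: Optional pseudocode only. Run it locally in Jupyter if helpful. No model training is required for this literacy track.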
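For readers who want to see what the fit and predict steps do with the two houses above, the following is one possible sketch in plain Python. It assumes a straight-line model (price = slope * sqft + intercept) fitted by ordinary least squares. The functions `fit` and `predict` are hypothetical helpers written for this illustration, not a library API.

```python
houses = [{"sqft": 1200, "price": 250000}, {"sqft": 1800, "price": 340000}]

X = [h["sqft"] for h in houses]   # feature: size in square feet
y = [h["price"] for h in houses]  # label: sale price


def fit(X, y):
    """Training: closed-form least squares for price = slope * sqft + intercept."""
    mean_x = sum(X) / len(X)
    mean_y = sum(y) / len(y)
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(X, y))
    variance = sum((xi - mean_x) ** 2 for xi in X)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(params, sqft):
    """Inference: apply the learned parameters to a new input."""
    slope, intercept = params
    return slope * sqft + intercept


params = fit(X, y)
estimate = predict(params, 1500)
print(f"slope={params[0]:.1f}, intercept={params[1]:.1f}, estimate={estimate:.1f}")
# prints: slope=150.0, intercept=70000.0, estimate=295000.0
```

With only two training rows the line passes through both points exactly, so the zero training error here says nothing about how well the model generalizes.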
Algorithm families (preview)
- Linear / logistic regression — interpretable baselines
- Tree ensembles (random forest, gradient boosting) — strong on tabular data
- Neural networks — flexible for vision, language, audio
Implementation depth: SciPy and dedicated ML courses; here we focus on concepts.
Important interview questions and answers
- Q: Overfitting sign?
A: Training accuracy is high, but validation/test accuracy is much lower.
- Q: Inference vs training?
A: Training learns the parameters; inference applies the learned parameters to new inputs at serving time.
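The overfitting sign described above can be reproduced with a small sketch. It assumes a model that simply memorizes the training rows (a 1-nearest-neighbor lookup) and made-up data in which two training labels are deliberately noisy. The data and the helper names `predict_1nn` and `accuracy` are hypothetical.

```python
# Hypothetical toy data: (x, label). The underlying rule is label = 1 when x >= 50.
# Two training labels are deliberately noisy: (30, 1) and (70, 0).
train = [(10, 0), (20, 0), (30, 1), (40, 0), (60, 1), (70, 0), (80, 1), (90, 1)]
validation = [(15, 0), (28, 0), (33, 0), (45, 0), (55, 1), (68, 1), (72, 1), (85, 1)]


def predict_1nn(x, memory):
    """Return the label of the stored training point closest to x."""
    nearest = min(memory, key=lambda point: abs(point[0] - x))
    return nearest[1]


def accuracy(data, memory):
    """Fraction of examples in data that the memorizing model labels correctly."""
    correct = sum(1 for x, label in data if predict_1nn(x, memory) == label)
    return correct / len(data)


train_acc = accuracy(train, train)        # evaluated on rows the model has stored
val_acc = accuracy(validation, train)     # evaluated on held-out rows
print(f"train accuracy:      {train_acc:.2f}")
print(f"validation accuracy: {val_acc:.2f}")
# prints 1.00 for train and 0.50 for validation
```

The model is perfect on the rows it memorized, noise included, and much worse on held-out rows. That gap is only visible because the validation rows were kept out of training.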
Self-check
- Define features and labels.
- Why hold out data not used in training?
Pitfall: Chasing complex models before a simple baseline—prove ML beats rules first.
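One way to check whether a learned model beats a rule is to score both on the same held-out rows. The sketch below assumes made-up house data, a hypothetical rule of thumb (180 per square foot), and a straight line fitted by least squares. None of these numbers come from a real dataset.

```python
# Hypothetical (sqft, price) rows, split into training and held-out sets.
train = [(1000, 210000), (1400, 280000), (1800, 340000), (2200, 410000)]
holdout = [(1200, 250000), (2000, 370000)]


def rule_of_thumb(sqft):
    """Hand-written baseline: a fixed price per square foot (assumed value)."""
    return 180 * sqft


def fit_line(rows):
    """Least-squares fit of price = slope * sqft + intercept on the training rows."""
    mean_x = sum(x for x, _ in rows) / len(rows)
    mean_y = sum(y for _, y in rows) / len(rows)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    return slope, mean_y - slope * mean_x


def mean_abs_error(predict, rows):
    """Average absolute difference between predictions and true prices."""
    return sum(abs(predict(x) - y) for x, y in rows) / len(rows)


slope, intercept = fit_line(train)
rule_mae = mean_abs_error(rule_of_thumb, holdout)
model_mae = mean_abs_error(lambda sqft: slope * sqft + intercept, holdout)
print(f"rule MAE:  {rule_mae:.0f}")
print(f"model MAE: {model_mae:.0f}")
# prints 22000 for the rule and 6000 for the fitted line
```

On this toy data the fitted line wins, but the point is the comparison itself: if the rule had scored as well, the extra complexity of a model would not be justified.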
Interview prep
- Q: Overfitting?
A: Model memorizes training data; poor generalization to new examples.
- Q: Features vs labels?
A: Features are inputs X; labels are supervised targets y.