Reinforcement learning (RL) trains an agent to take actions in an environment to maximize cumulative reward. Unlike supervised learning, correct actions are not labeled directly—feedback arrives over time.
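
One common way to make "cumulative reward" concrete is the discounted return, where later rewards count slightly less than earlier ones. The sketch below is illustrative only: the discount factor `gamma` and the function name `discounted_return` are assumptions, not something these notes define.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum rewards over time, weighting the reward k steps ahead by gamma**k.

    `gamma` is an assumed discount factor in [0, 1]; gamma=1.0 gives a plain sum.
    """
    total = 0.0
    # Work backwards so each step adds its reward plus the discounted future.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # A reward that only arrives at the end is worth less when discounted.
    print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))
```

With `gamma` below 1, a reward that arrives late contributes less than the same reward arriving early, which is one way to express a preference for reaching the goal sooner.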
RL loop
- Agent observes state
- Chooses action
- Environment returns reward and next state
- Policy updates to favor higher long-term reward
Examples: game playing, robotics, ad bidding, recommendation exploration.
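
The four steps of the loop can be sketched with tabular Q-learning on a toy problem. Everything here is a hypothetical illustration: the `Corridor` environment, the `train` function, and the hyperparameters (`alpha`, `gamma`, `epsilon`) are assumptions chosen for readability, and Q-learning is only one of many ways to update a policy.

```python
import random


class Corridor:
    """Hypothetical toy environment: states 0..4, reward 1.0 on reaching state 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 moves left, action 1 moves right; walls clamp the position.
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def train(episodes=200, alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    env = Corridor()
    # q[state][action] estimates the long-term reward of taking that action.
    q = [[0.0, 0.0] for _ in range(env.length)]
    for _ in range(episodes):
        state = env.reset()  # 1. agent observes state
        done = False
        while not done:
            # 2. choose action: random when exploring or when estimates tie
            if rng.random() < epsilon or q[state][0] == q[state][1]:
                action = rng.randrange(2)
            else:
                action = 0 if q[state][0] > q[state][1] else 1
            # 3. environment returns reward and next state
            next_state, reward, done = env.step(action)
            # 4. update estimates toward reward plus discounted future value
            future = 0.0 if done else gamma * max(q[next_state])
            q[state][action] += alpha * (reward + future - q[state][action])
            state = next_state
    return q


if __name__ == "__main__":
    for state, values in enumerate(train()):
        print(state, values)
```

Note that the only reward arrives at the far end of the corridor, so the value of moving right in earlier states is learned indirectly through the update in step 4.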
Exploration vs exploitation
The agent must try new actions (explore) to discover better strategies while still using actions it already knows are good (exploit). Too much exploration wastes reward; too little misses improvements.
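
A minimal sketch of this trade-off is an epsilon-greedy policy on a multi-armed bandit. The function `run_bandit`, the arm payout probabilities in `true_means`, and the step count are all made-up values for illustration; epsilon-greedy is one simple exploration strategy among several.

```python
import random


def run_bandit(epsilon, true_means=(0.2, 0.5, 0.8), steps=5000, seed=0):
    """Play a Bernoulli bandit with epsilon-greedy action selection.

    Returns the average reward per step and the final per-arm estimates.
    """
    rng = random.Random(seed)
    n_arms = len(true_means)
    counts = [0] * n_arms
    estimates = [0.0] * n_arms
    total = 0.0
    for _ in range(steps):
        if rng.random() < epsilon:
            # Explore: pick any arm uniformly at random.
            arm = rng.randrange(n_arms)
        else:
            # Exploit: pick the arm with the best estimate (ties go to the lowest index).
            arm = max(range(n_arms), key=lambda a: estimates[a])
        reward = 1.0 if rng.random() < true_means[arm] else 0.0
        counts[arm] += 1
        # Incremental running mean of the rewards observed for this arm.
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
        total += reward
    return total / steps, estimates


if __name__ == "__main__":
    for eps in (0.0, 0.1, 1.0):
        average, _ = run_bandit(eps)
        print(f"epsilon={eps}: average reward {average:.3f}")
```

In this setup, `epsilon=0.0` never leaves the first arm because ties resolve to arm 0 and the other estimates stay at zero, while `epsilon=1.0` ignores what it has learned and pulls arms uniformly. A small nonzero epsilon sits between the two extremes.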
Product caution
- RL needs simulators or safe live experiments—mistakes can be costly
- Reward hacking: the agent optimizes a proxy metric in ways that harm the real goal
- Often hybrid: supervised warm-start + bandits for online learning
Important interview questions and answers
- Q: RL vs supervised?
A: Supervised has correct output per example; RL learns from delayed scalar rewards.
- Q: Reward hacking?
A: The agent maximizes the reward metric without achieving the intended business outcome.
Self-check
- Name the four parts of the RL loop.
- Why is exploration necessary?
Pitfall: Proxy rewards can look good in simulation but harm real users; validate on business KPIs.
Interview prep
- Q: RL feedback shape?
A: Delayed scalar rewards, not per-example correct labels.
- Q: Exploration vs exploitation?
A: Balance trying new actions vs using known good policies.