pd.merge combines DataFrames on key columns—like SQL JOIN. Specify how='inner'|'left'|'right'|'outer' and validate row counts after merge to catch duplicate keys.
Basic merge
import pandas as pd
left = pd.DataFrame({'id': [1, 2], 'name': ['A', 'B']})
right = pd.DataFrame({'id': [1, 3], 'score': [90, 85]})
inner = pd.merge(left, right, on='id', how='inner')
print(inner)
Join types
| how | SQL equivalent |
|---|---|
| inner | INNER JOIN |
| left | LEFT JOIN |
| right | RIGHT JOIN |
| outer | FULL OUTER JOIN |
Duplicate keys
If keys repeat, merge produces Cartesian expansion—always check len(result) vs expected. Use validate='one_to_one' to catch mistakes early.
Important interview questions and answers
- Q: on vs left_on/right_on?
A: Use when key column names differ between DataFrames. - Q: merge vs join?
A: df.join is index-based; merge is column-key based—merge is more common.
Self-check
- Perform a left merge keeping all left rows.
- What happens with duplicate keys in both tables?
Pitfall: Check len(merged) after join—duplicate keys multiply rows silently.
Interview prep
- Inner vs left?
Inner keeps matches only; left keeps all left rows.
- Duplicate keys?
Cartesian expansion—inflate row count; use validate= to catch.