Merge and join

Last reviewed May 28, 2026 Content v20260528

Reading: ~1 min
Level: intermediate

This lesson

This lesson teaches merge and join operations in pandas, as part of tabular manipulation: indexing, dtypes, reshaping, and analysis habits for real-world tables.

A merge on a key that is not unique on the side you assumed duplicates rows without raising any warning. Analysts and ML engineers run into this bug regularly.

You will apply merge and join in contexts such as customer 360 tables, experiment cohort joins, and feature-store enrichment.

Read the narrative, run `import pandas as pd` snippets with in-memory DataFrames (install pandas and numpy with pip if needed), inspect `.head()`, `.dtypes`, and complete MCQs. Also verify row counts before and after joins or aggregations.

Start this lesson when you can explain the previous lesson's ideas in your own words.

pd.merge combines DataFrames on key columns, much like a SQL JOIN. Choose the join type with how='inner', 'left', 'right', or 'outer' (the default is 'inner'), and compare row counts before and after the merge to catch duplicate keys.

Basic merge

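import pandas as pd

# Two small in-memory tables that share the key column 'id'
left = pd.DataFrame({'id': [1, 2], 'name': ['A', 'B']})
right = pd.DataFrame({'id': [1, 3], 'score': [90, 85]})

# An inner join keeps only ids present in both tables (here: id 1)
inner = pd.merge(left, right, on='id', how='inner')
print(inner)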

Join types

how	SQL equivalent
inner	INNER JOIN
left	LEFT JOIN
right	RIGHT JOIN
outer	FULL OUTER JOIN
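
To make the table concrete, the sketch below reuses the two small DataFrames from the basic merge and runs all four join types, recording the row count for each. The `row_counts` dictionary is only an illustrative name, not part of pandas.

```python
import pandas as pd

left = pd.DataFrame({"id": [1, 2], "name": ["A", "B"]})
right = pd.DataFrame({"id": [1, 3], "score": [90, 85]})

# Run each join type and record how many rows it returns.
row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(left, right, on="id", how=how)
    row_counts[how] = len(merged)
    print(how, sorted(merged["id"].tolist()))

print(row_counts)
```

Only id 1 exists in both tables, so the inner join returns one row, the left and right joins return two rows each, and the outer join returns three. Columns from the side with no match are filled with NaN.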

Duplicate keys

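If a key value repeats in both tables, merge returns every pairing of the matching rows: a key that appears m times on the left and n times on the right yields m × n rows. Always compare len(result) with the row count you expect. Pass validate='one_to_one' (or 'one_to_many' / 'many_to_one') so that pandas raises a MergeError when keys are not unique where you expected them to be.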
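
A minimal sketch of the problem, using made-up `orders` and `customers` tables (the table and column names are assumptions for illustration). The lookup table contains one customer twice, so a plain left merge inflates the row count, while the same merge with `validate='many_to_one'` raises an error instead.

```python
import pandas as pd

orders = pd.DataFrame({"cust_id": [1, 1, 2], "amount": [10, 20, 30]})
# Hypothetical lookup table in which cust_id 1 was entered twice by mistake.
customers = pd.DataFrame(
    {"cust_id": [1, 1, 2], "segment": ["gold", "gold", "basic"]}
)

# Without validation, the duplicate key multiplies rows with no warning.
merged = pd.merge(orders, customers, on="cust_id", how="left")
print(len(orders), len(merged))  # 3 rows in, 5 rows out

# With validation, pandas checks that right-side keys are unique and raises.
try:
    pd.merge(orders, customers, on="cust_id", how="left", validate="many_to_one")
    caught = False
except pd.errors.MergeError as exc:
    caught = True
    print("validate caught it:", exc)
```

Customer 1 has two orders and two lookup rows, giving 2 × 2 = 4 rows, plus one row for customer 2. Picking the `validate` value that matches the relationship you expect turns a silent data bug into an immediate failure.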

Important interview questions and answers

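Q: on vs left_on/right_on?
A: Use on when the key column has the same name in both DataFrames; use left_on/right_on when the key column names differ.
Q: merge vs join?
A: df.join aligns on the index by default; merge matches on key columns by default (and can also use indexes via left_index/right_index). merge is the more common choice.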
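
Both answers can be illustrated with one small sketch. The `users` and `scores` tables and their column names are invented for this example; the first merge matches differently named key columns, and the second moves the keys into the index and uses `DataFrame.join`.

```python
import pandas as pd

users = pd.DataFrame({"user_id": [1, 2], "name": ["A", "B"]})
scores = pd.DataFrame({"uid": [1, 2], "score": [90, 85]})

# Key columns have different names, so name each side explicitly.
by_column = pd.merge(users, scores, left_on="user_id", right_on="uid", how="inner")
# Both key columns survive in the result; drop the redundant one.
by_column = by_column.drop(columns="uid")

# DataFrame.join aligns on the index by default (a left join unless told otherwise).
by_index = users.set_index("user_id").join(scores.set_index("uid"))

print(by_column)
print(by_index)
```

The two results carry the same name and score values; they differ in whether the key lives in a column or in the index.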

Self-check

Perform a left merge keeping all left rows.
What happens with duplicate keys in both tables?

Pitfall: Check len(merged) after join—duplicate keys multiply rows silently.
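
One way to make this check routine is a small wrapper that fails loudly when a left merge changes the row count. `left_merge_checked` is a hypothetical helper written for this lesson, not a pandas function; it also passes `indicator=True`, which adds a `_merge` column showing where each row found a match.

```python
import pandas as pd

def left_merge_checked(left, right, key):
    """Left-merge and raise if the row count changes (hypothetical helper)."""
    merged = pd.merge(left, right, on=key, how="left", indicator=True)
    if len(merged) != len(left):
        raise ValueError(f"row count changed: {len(left)} -> {len(merged)}")
    return merged

left = pd.DataFrame({"id": [1, 2], "name": ["A", "B"]})
right = pd.DataFrame({"id": [1, 3], "score": [90, 85]})

result = left_merge_checked(left, right, "id")
# '_merge' is 'both' for matched rows and 'left_only' for unmatched left rows.
print(result[["id", "_merge"]])
```

Here the row count stays at two, and the `_merge` column shows that id 1 matched while id 2 exists only in the left table.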

Interview prep

Inner vs left?: Inner keeps matches only; left keeps all left rows.
Duplicate keys?: Cartesian expansion inflates the row count; use validate= to catch it.
