Chunking splits documents into retrieval units. Bad chunks make the retriever return irrelevant text, so answers sound confident but are wrong.
Strategies
- Fixed size — e.g. 500 tokens with 50 overlap
- Structure-aware — headings, markdown sections, HTML blocks
- Semantic — split when embedding similarity drops
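The semantic strategy is the least obvious of the three, so here is a minimal sketch of it. It assumes the document is already split into sentences and uses a toy bag-of-words vector in place of a real embedding model. The names `embed`, `cosine`, `semantic_chunks`, and the 0.2 threshold are illustrative assumptions, not part of any library.

```python
import math
import re
from collections import Counter


def embed(sentence: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Start a new chunk when similarity to the previous sentence drops below threshold."""
    chunks: list[list[str]] = []
    for sentence in sentences:
        if chunks and cosine(embed(chunks[-1][-1]), embed(sentence)) >= threshold:
            chunks[-1].append(sentence)
        else:
            chunks.append([sentence])
    return chunks


sentences = [
    "Refunds are issued within 14 days.",
    "Refunds are sent to the original payment method.",
    "Parking permits cost 20 dollars per month.",
]
chunks = semantic_chunks(sentences)
# The two refund sentences share a chunk; the parking sentence starts a new one.
```

With a real embedding model the threshold would need tuning on your own documents; the value here only suits the toy vectors.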
Overlap
Small overlap preserves sentences cut at boundaries. Too much overlap bloats storage and duplicates hits.
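To make the trade-off concrete, here is a small sketch of fixed-size chunking with overlap. It assumes whitespace-separated words stand in for tokens (a real pipeline would use the model's tokenizer), and `fixed_size_chunks` is a hypothetical helper name.

```python
def fixed_size_chunks(tokens: list[str], size: int = 500, overlap: int = 50) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping by `size - overlap` each time."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    stride = size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reached the end of the text
    return chunks


# Assumption: whitespace words approximate tokens for this demo.
tokens = "the refund window closes thirty days after the delivery date".split()
chunks = fixed_size_chunks(tokens, size=4, overlap=1)

# Each chunk begins with the last token of the previous chunk.
# Stored tokens exceed the original count because overlapped tokens are duplicated.
stored = sum(len(chunk) for chunk in chunks)
```

Here 10 input tokens become 12 stored tokens across 3 chunks. Raising the overlap shrinks the stride, so both storage and the chance of near-duplicate hits grow.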
Metadata
chunk_meta = {
    "source": "handbook-v3.pdf",
    "page": 42,
    "section": "Refunds",
    "updated_at": "2026-01-15",
}
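A short sketch of how a record like this might be used downstream. The helpers `format_citation` and `is_fresh`, and the 365-day cutoff, are assumptions for illustration. ACL filtering would need an extra field (for example a list of allowed groups) that the record above does not contain, so it is left out.

```python
from datetime import date

chunk_meta = {
    "source": "handbook-v3.pdf",
    "page": 42,
    "section": "Refunds",
    "updated_at": "2026-01-15",
}


def format_citation(meta: dict) -> str:
    """Build a human-readable citation from the chunk's metadata."""
    return f'{meta["source"]}, p. {meta["page"]} ({meta["section"]})'


def is_fresh(meta: dict, today: date, max_age_days: int = 365) -> bool:
    """Return True if the chunk was updated within max_age_days of `today`."""
    updated = date.fromisoformat(meta["updated_at"])
    return (today - updated).days <= max_age_days


citation = format_citation(chunk_meta)
fresh = is_fresh(chunk_meta, today=date(2026, 6, 1))
```

Passing `today` explicitly keeps the freshness check deterministic, which makes it easier to test than calling `date.today()` inside the function.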
Important interview questions and answers
- Q: Why attach metadata?
A: Enables citations, ACL filtering, and freshness checks in the UI.
Self-check
- Name three chunking strategies.
- Why use overlap?
Tip: Prefer heading-based chunks for policies and docs. Fixed 500-token chunks often split tables mid-row.
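A minimal sketch of heading-based chunking, assuming the document is markdown with `#`-style headings. `heading_chunks` is a hypothetical helper; it does not handle `#` lines inside fenced code blocks, and very long sections would still need a secondary split.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.+)$")


def heading_chunks(markdown: str) -> list[dict]:
    """Split markdown into one chunk per heading, keeping each section body whole."""
    chunks: list[dict] = []
    section, body = None, []

    def flush() -> None:
        # Emit the section collected so far, skipping empty bodies.
        text = "\n".join(body).strip()
        if text:
            chunks.append({"section": section, "text": text})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            section, body = match.group(1).strip(), []
        else:
            body.append(line)
    flush()
    return chunks


doc = """# Refunds
| Plan | Window |
|------|--------|
| Basic | 14 days |
| Pro | 30 days |

# Shipping
Orders ship within 2 business days.
"""
chunks = heading_chunks(doc)
# The refund table stays inside a single chunk instead of being cut mid-row.
```

The section title captured here can also populate the "section" field of the chunk's metadata.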
Interview prep
- Q: Why use overlap?
A: It keeps sentences that are cut at a chunk boundary from losing their meaning.