The context window is the maximum tokens the model can consider in one request (prompt + completion). Exceeding it fails or truncates.
What fills the window
- System instructions
- Chat history
- RAG retrieved passages
- Tool outputs (JSON, logs)
- The answer being generated
Strategies when you hit limits
- Summarize older turns into a rolling memory block
- Retrieve fewer, higher-quality chunks
- Use a cheaper model to compress context first
- Split tasks across multiple calls with structured handoff
Lost in the middle
Research shows models may under-use information buried in the middle of very long contexts—put key facts near the start or repeat them in the user message.
Important interview questions and answers
- Q: Does a 128k window mean you should fill it?
A: No—cost, latency, and attention quality often favor shorter focused context.
Self-check
- List four things that consume context.
- One strategy to free tokens?
Tip: Summarize old chat turns instead of stuffing full history—quality often improves.
Interview prep
- Lost in the middle?
Models may under-use facts buried mid-context—repeat critical facts near start.
- Truncation strategy?
Summarize history, retrieve fewer chunks, or split tasks—not silent chop.