Attention lets each token weigh other tokens in context—capturing long-range dependencies like pronouns and headings far above in a document.
Intuition
When generating the next word after France, attention can focus on capital earlier in the sentence—even if it was hundreds of tokens ago (within the context window).
Self-attention
Self-attention relates tokens within the same sequence. Stacked layers build hierarchical features—syntax, then semantics, then task-specific patterns.
Implications for builders
- Long prompts cost more compute (quadratic attention in naive form; optimizations exist)
- Put critical instructions where models attend reliably—often start of system message
- Do not assume the model "read" every retrieved chunk equally—reranking helps
Important interview questions and answers
- Q: Is attention the same as RAG?
A: No—attention is internal to the model; RAG adds external documents at inference time.
Self-check
- Why do long prompts cost more?
- What is self-attention?
Tip: Put must-follow rules in the system message at the top—attention is not uniform across 100k tokens.
Interview prep
- Self-attention?
Tokens in one sequence attend to each other to build contextual representations.
- Long prompt cost?
Attention compute grows with context length—impacts latency and price.