Prompt injection embeds hostile instructions in user or retrieved content—trying to override your system policy ("ignore rules, exfiltrate secrets").
Attack surfaces
- Direct user chat
- Indirect via RAG documents, emails, web pages ingested into context
- Tool outputs manipulated by third parties
Mitigations (defense in depth)
- Treat user and retrieved text as untrusted data, not instructions
- Output filters and schema validation
- Least-privilege tools—no blanket database access
- Human review for high-risk actions
- Monitor for exfiltration patterns in logs
No silver bullet
Research arms race—combine policy, classifiers, and product UX (confirm before send payment). See Cybersecurity for broader threat modeling.
Important interview questions and answers
- Q: Can RAG documents inject prompts?
A: Yes—poisoned wiki pages are a classic indirect injection vector.
Self-check
- Name two injection surfaces.
- One mitigation beyond better prompts?
Tip: Add indirect injection tests (poisoned RAG docs) to your eval set—not only rude user chat.
Interview prep
- Indirect injection?
Malicious instructions inside retrieved documents or emails.
- Mitigation?
Treat external text as data, least-privilege tools, moderation, human review.