A repeatable builder workflow keeps Gen AI experiments from reaching production as unmonitored chat toys.
Seven stages
- Problem — user job, success metric, harm metric
- Data — what may enter prompts; retention rules
- Baseline — templates, search-only, or smaller model
- Prototype — prompts + optional RAG in staging
- Evaluate — golden sets, human rubrics, regression tests
- Guard — moderation, PII filters, rate limits
- Ship + monitor — cost, latency, drift, incidents
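To make the Guard stage concrete, here is a minimal sketch of a PII filter and a per-user rate limit. The regex patterns, placeholder strings, and the `RateLimiter` class are illustrative assumptions, not a complete PII detector or a production limiter; moderation is left out.

```python
import re
import time
from collections import defaultdict, deque
from typing import Optional

# Illustrative patterns only (assumption); real PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def redact_pii(text: str) -> str:
    """Replace email addresses and phone-like numbers before text enters a prompt."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


class RateLimiter:
    """Sliding-window limiter: at most max_calls per user within window_seconds."""

    def __init__(self, max_calls: int, window_seconds: float):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self._calls = defaultdict(deque)  # user_id -> timestamps of accepted calls

    def allow(self, user_id: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        calls = self._calls[user_id]
        # Drop timestamps that have fallen out of the window.
        while calls and now - calls[0] >= self.window_seconds:
            calls.popleft()
        if len(calls) >= self.max_calls:
            return False
        calls.append(now)
        return True
```

A request handler would call `limiter.allow(user_id)` first, then pass `redact_pii(user_text)` into the prompt.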
Artifacts to maintain
- Prompt templates versioned in git
- Retrieval corpus with source-of-truth owners
- Evaluation notebook or CI job with fixed seeds
- Runbook for model outage (fallback copy)
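Two of these artifacts can be sketched together: a prompt template that carries a version tag, and the fallback copy a runbook might prescribe when the model is down. `PROMPT_VERSION`, `FALLBACK_COPY`, and the `call_model` callable are hypothetical names standing in for whatever client and wording a team actually uses.

```python
from string import Template
from typing import Callable, Dict

# Hypothetical version tag; in practice it would track the git revision of the template.
PROMPT_VERSION = "v3"
PROMPT_TEMPLATE = Template(
    "Answer using only the context.\nContext: $context\nQuestion: $question"
)
# Hypothetical fallback copy taken from the outage runbook.
FALLBACK_COPY = "The assistant is temporarily unavailable. Please try again later."


def answer(question: str, context: str, call_model: Callable[[str], str]) -> Dict[str, object]:
    """Render the versioned prompt and fall back to static copy if the model call fails."""
    prompt = PROMPT_TEMPLATE.substitute(context=context, question=question)
    try:
        text = call_model(prompt)
        used_fallback = False
    except Exception:
        # Any model failure is treated as an outage; the caller still gets a response.
        text = FALLBACK_COPY
        used_fallback = True
    return {"text": text, "prompt_version": PROMPT_VERSION, "fallback": used_fallback}
```

Logging `prompt_version` with every response makes it possible to tie a regression or incident back to a specific template change.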
Link to data science habits
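The train/validation leakage lessons from data science apply to RAG eval sets: do not tune prompts on the same queries you report as final scores. Tune on a development set and report on a held-out set, otherwise the reported scores overstate real performance.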
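One way to enforce this is to split the eval queries once, with a fixed seed, into a development set for prompt tuning and a held-out set for reporting. The sketch below is an assumption about how such a split could look; `split_queries`, the 30% test fraction, and the seed value are illustrative choices.

```python
import random
from typing import Iterable, List, Tuple


def split_queries(
    queries: Iterable[str], test_fraction: float = 0.3, seed: int = 13
) -> Tuple[List[str], List[str]]:
    """Return (dev, test) query lists with no overlap, reproducible for a given seed."""
    # Normalize and de-duplicate so near-identical queries cannot land in both sets.
    unique = sorted({q.strip().lower() for q in queries})
    rng = random.Random(seed)  # local generator; does not touch global random state
    rng.shuffle(unique)
    n_test = max(1, round(len(unique) * test_fraction))
    test, dev = unique[:n_test], unique[n_test:]
    return dev, test
```

Tune prompts only against `dev`, and run `test` once per release candidate so the reported score stays honest.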
Important interview questions and answers
- Q: What is a harm metric?
A: A measure of bad outcomes—toxic output, privacy leak, wrong medical advice—not only user satisfaction.
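As a rough illustration of why a harm metric is tracked separately from satisfaction, the sketch below computes both over a set of reviewed responses. The `ReviewedResponse` record and its label names are hypothetical; real labels would come from a human rubric or a moderation pipeline.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class ReviewedResponse:
    thumbs_up: bool                     # user satisfaction signal
    harm_labels: Tuple[str, ...] = ()   # e.g. ("privacy_leak",); empty means no harm found


def harm_rate(responses: Sequence[ReviewedResponse]) -> float:
    """Fraction of responses carrying at least one harm label."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if r.harm_labels) / len(responses)


def satisfaction_rate(responses: Sequence[ReviewedResponse]) -> float:
    """Fraction of responses the user rated positively."""
    if not responses:
        return 0.0
    return sum(1 for r in responses if r.thumbs_up) / len(responses)
```

A response can be rated thumbs-up and still carry a harm label, which is why the two numbers are reported side by side.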
Self-check
- List the seven workflow stages.
- Why version prompts in git?
Challenge
Map one assistant you use
- Pick a real Gen AI product.
- Label each of the seven workflow stages on it.
- Write one harm metric its team should track.
Done when: you can point to problem, data, eval, and guard stages on a real product.
Interview prep
- Harm metric?
Measures bad outcomes—leaks, toxicity, wrong policy advice—not only thumbs-up.
- Why a baseline?
Shows whether Gen AI actually beats templates or search before you accept its cost and risk.
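A baseline check can be sketched as a comparison on the same golden set. Everything here is an assumption for illustration: the exact-match scoring, the `min_uplift` threshold, and the `predict` callables that stand in for a template or search baseline and a Gen AI candidate.

```python
from typing import Callable, Sequence, Tuple

GoldenSet = Sequence[Tuple[str, str]]  # (question, expected answer) pairs


def accuracy(predict: Callable[[str], str], golden: GoldenSet) -> float:
    """Case-insensitive exact-match accuracy of a system over the golden set."""
    if not golden:
        raise ValueError("golden set must not be empty")
    hits = sum(
        1 for question, expected in golden
        if predict(question).strip().lower() == expected.strip().lower()
    )
    return hits / len(golden)


def beats_baseline(
    candidate: Callable[[str], str],
    baseline: Callable[[str], str],
    golden: GoldenSet,
    min_uplift: float = 0.05,
) -> bool:
    """True only if the candidate improves on the baseline by at least min_uplift."""
    return accuracy(candidate, golden) - accuracy(baseline, golden) >= min_uplift
```

Exact match is a crude score for free-text answers; a human rubric would usually replace it, but the decision rule stays the same.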