Streaming sends tokens as they are generated, which improves the experience for long answers. It requires async handlers and careful handling of errors that occur mid-stream.
Server-sent events
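Clients append each chunk to the UI as it arrives. Handle disconnects explicitly: when the connection drops, the user may have only a partial answer, and that partial answer should be saved.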
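To make the wire format concrete, here is a minimal sketch of SSE framing and of a client that appends chunks. The helper names `format_sse` and `parse_sse` are hypothetical, and the parser only handles complete frames with `event` and `data` fields. A browser client would normally rely on `EventSource` rather than parsing by hand.

```python
def format_sse(data: str, event: str = "message") -> str:
    """Encode one payload as a Server-Sent Events frame (hypothetical helper)."""
    lines = []
    if event != "message":
        # "message" is the default event type, so it can be left implicit.
        lines.append(f"event: {event}")
    # Each line of the payload needs its own "data:" field.
    for part in data.split("\n"):
        lines.append(f"data: {part}")
    # A blank line terminates the frame.
    return "\n".join(lines) + "\n\n"


def parse_sse(wire: str) -> list:
    """Decode complete frames into (event, data) pairs (simplified parser)."""
    events = []
    for frame in wire.split("\n\n"):
        if not frame.strip():
            continue
        event, data_lines = "message", []
        for line in frame.split("\n"):
            if line.startswith("event:"):
                event = line[len("event:"):].strip()
            elif line.startswith("data:"):
                value = line[len("data:"):]
                # Only one leading space is stripped, so tokens that
                # begin with a space survive the round trip.
                if value.startswith(" "):
                    value = value[1:]
                data_lines.append(value)
        events.append((event, "\n".join(data_lines)))
    return events
```

A client would concatenate the `data` of each `message` event to rebuild the answer, and treat a named event such as `error` as a signal that the stream failed part-way through.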
Async Python sketch
# async def stream_answer():
#     stream = await client.chat.completions.create(..., stream=True)
#     async for chunk in stream:
#         if chunk.choices:  # some chunks (e.g. usage-only) carry no choices
#             yield chunk.choices[0].delta.content or ""
Backpressure
Rate-limit the number of concurrent streams per user, and send long-running jobs, such as PDF summarization, to a queue served by workers.
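One way to sketch both ideas with only the standard library is shown below. The `StreamLimiter` class, the limit of 2 streams, the queue size, and the `summarize_pdf` placeholder are all assumptions made for illustration. A multi-process deployment would need a shared store and a real job queue instead of in-process state.

```python
import asyncio
from collections import defaultdict


class StreamLimiter:
    """Caps concurrent streams per user (in-process only; hypothetical helper)."""

    def __init__(self, max_per_user: int):
        self._max = max_per_user
        self._active = defaultdict(int)

    def try_acquire(self, user_id: str) -> bool:
        # Reject instead of waiting, so the caller can answer with e.g. HTTP 429.
        if self._active[user_id] >= self._max:
            return False
        self._active[user_id] += 1
        return True

    def release(self, user_id: str) -> None:
        self._active[user_id] -= 1
        if self._active[user_id] <= 0:
            del self._active[user_id]


async def summarize_pdf(path: str) -> str:
    """Placeholder for a slow job; a real version would call the model."""
    await asyncio.sleep(0)
    return f"summary of {path}"


async def worker(queue: asyncio.Queue, results: dict) -> None:
    # Workers drain the queue at their own pace, which bounds the load.
    while True:
        path = await queue.get()
        try:
            results[path] = await summarize_pdf(path)
        finally:
            queue.task_done()


async def demo():
    limiter = StreamLimiter(max_per_user=2)  # assumed limit
    admitted = [limiter.try_acquire("alice") for _ in range(3)]

    queue = asyncio.Queue(maxsize=100)  # bounded: put() waits when full
    results = {}
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(2)]
    for path in ("a.pdf", "b.pdf", "c.pdf"):
        await queue.put(path)
    await queue.join()  # wait until every queued job has been processed
    for task in workers:
        task.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return admitted, results
```

Running `asyncio.run(demo())` admits the first two streams for the same user, rejects the third, and processes all three queued jobs. The bounded queue is what provides backpressure: producers wait in `put()` once it is full.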
Important interview questions and answers
- Q: Streaming benefit?
A: Lower perceived latency; users start reading before full completion.
Self-check
- Why stream to the UI?
- What if the client disconnects mid-stream?
Tip: Save partial streams on disconnect—users still need the prefix for support tickets.
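The tip can be sketched as an async generator wrapper that records what was sent and saves it in a `finally` block, which runs whether the stream finishes, raises, or is closed early. The names `stream_and_persist` and `fake_model_stream`, and the dict used as a store, are hypothetical stand-ins. How a disconnect surfaces (generator close or task cancellation) depends on the web framework, so treat this as an illustration rather than a drop-in handler.

```python
import asyncio


async def fake_model_stream():
    """Stand-in for a model stream; yields a fixed list of tokens."""
    for token in ["Stream", "ing ", "is ", "incremental."]:
        await asyncio.sleep(0)
        yield token


async def stream_and_persist(request_id, source, store):
    """Relay chunks from source and save the prefix when the stream ends."""
    parts = []
    completed = False
    try:
        async for chunk in source:
            # Recorded before yielding, so the saved prefix may include a
            # final chunk the client never actually received.
            parts.append(chunk)
            yield chunk
        completed = True
    finally:
        # Runs on normal completion, on errors, and on early close.
        store[request_id] = {"text": "".join(parts), "complete": completed}


async def demo():
    store = {}  # a real application would write to a database
    gen = stream_and_persist("req-1", fake_model_stream(), store)
    received = []
    async for chunk in gen:
        received.append(chunk)
        if len(received) == 2:
            break  # simulate the client going away mid-stream
    await gen.aclose()  # closing the generator triggers the finally block
    return received, store
```

The `complete` flag lets a later request distinguish a finished answer from a saved prefix, which is the information needed to offer a retry or to attach the partial text to a support ticket.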
Interview prep
- Q: Streaming UX?
A: Tokens arrive incrementally: lower perceived latency.
- Q: Disconnect?
A: Persist partial output; allow resume or retry gracefully.