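Modern LLMs are built on the Transformer architecture (2017). It replaces slow, step-by-step recurrent loops with attention that is computed in parallel across all positions. That parallelism is a key reason training on web-scale text became practical.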
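To make "parallel attention" concrete, here is a minimal, dependency-free sketch of scaled dot-product attention, softmax(Q K^T / sqrt(d)) V. It assumes Q, K and V are already given as plain lists of vectors. A real Transformer produces them with learned projections and runs the whole computation as batched matrix multiplications on accelerators. The point to notice is that each output row depends only on its own query, so all rows can be computed at once. A recurrent network, by contrast, must finish step t before it can start step t+1.

```python
import math
from typing import List

Matrix = List[List[float]]

def softmax(xs: List[float]) -> List[float]:
    # Subtract the max before exponentiating for numerical stability.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Scaled dot-product attention, one output row per query.

    No output row depends on another output row, so a real implementation
    computes them all at once with matrix multiplications.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        row = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        outputs.append(row)
    return outputs

print(attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]]))
```

When all keys score equally, the output is simply the mean of the values. When one key matches the query much better than the others, its value dominates.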
Encoder vs decoder
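- Encoder-only (BERT-style): bidirectional attention over the whole input; strong for embeddings and classification
- Decoder-only (GPT-style): causal attention over earlier tokens only; autoregressive text generation
- Encoder–decoder (T5-style): an encoder reads the input and a decoder generates output conditioned on it; common for translation and summarization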
Chat LLMs you integrate are usually decoder-only.
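The practical difference between these families shows up in the attention mask, which controls which positions each token may look at. The sketch below builds both kinds of mask as boolean grids, where `mask[i][j]` is True when position i may attend to position j. The function names are illustrative only. An encoder–decoder model combines both patterns: its encoder uses the bidirectional mask, and its decoder uses the causal one plus cross-attention to the encoder output.

```python
from typing import List

def attention_mask(n: int, causal: bool) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Encoder-style (bidirectional): every position sees every position.
    Decoder-style (causal): position i sees only positions 0..i, so it
    cannot peek at tokens that have not been generated yet.
    """
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]

def show(mask: List[List[bool]]) -> str:
    # Render allowed positions as "x" and blocked positions as ".".
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)

print(show(attention_mask(4, causal=False)))
print()
print(show(attention_mask(4, causal=True)))
```

The causal mask is lower-triangular. That triangle is what lets a decoder-only model be trained on every next-token prediction in a sequence at once while still generating one token at a time.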
Autoregressive generation
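# Conceptual next-token loop; TOY_MODEL is a hypothetical stand-in for a real network
TOY_MODEL = {"The capital of France is": {" Paris": 0.9, " Lyon": 0.1}}
context = "The capital of France is"
for _ in range(5):  # max tokens
    probs = TOY_MODEL.get(context, {"<eos>": 1.0})  # model outputs a distribution over the vocab
    token = max(probs, key=probs.get)  # pick a token greedily (or sample from probs instead)
    if token == "<eos>": break  # stop token ends generation
    context += token  # append the token and repeat until stop or max tokens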
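The loop above can be expanded into a small runnable sketch that contrasts greedy decoding with temperature sampling. Everything model-related here is an assumption. `TOY_LOGITS` is a hypothetical hand-written table standing in for a trained network, and tokens are whitespace-separated words rather than subword tokens. A real LLM conditions on the full context, but to keep the table small this toy looks only at the previous word.

```python
import math
import random
from typing import Dict, Optional

# Hypothetical toy "model": next-token logits keyed by the previous word only.
TOY_LOGITS: Dict[str, Dict[str, float]] = {
    "is": {"Paris": 3.0, "Lyon": 1.0, "big": 0.5},
    "Paris": {".": 2.0, "and": 0.5},
    "and": {"Lyon": 1.0},
    "Lyon": {".": 2.0},
    "big": {".": 2.0},
    ".": {"<eos>": 5.0},
}

def softmax(logits: Dict[str, float], temperature: float = 1.0) -> Dict[str, float]:
    # Temperature < 1 sharpens the distribution; > 1 flattens it.
    scaled = {tok: v / temperature for tok, v in logits.items()}
    m = max(scaled.values())
    exps = {tok: math.exp(v - m) for tok, v in scaled.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

def generate(prompt: str, max_tokens: int = 10, greedy: bool = True,
             temperature: float = 1.0, rng: Optional[random.Random] = None) -> str:
    if rng is None:
        rng = random.Random(0)  # fixed seed so sampled runs are reproducible
    tokens = prompt.split()  # assumes a non-empty prompt
    for _ in range(max_tokens):
        # Unknown context: the toy model just emits the stop token.
        probs = softmax(TOY_LOGITS.get(tokens[-1], {"<eos>": 0.0}), temperature)
        if greedy:
            nxt = max(probs, key=probs.get)  # most likely token
        else:
            nxt = rng.choices(list(probs), weights=list(probs.values()), k=1)[0]
        if nxt == "<eos>":
            break
        tokens.append(nxt)  # append, then feed the longer context back in
    return " ".join(tokens)

print(generate("The capital of France is"))
print(generate("The capital of France is", greedy=False, temperature=1.5))
```

Greedy decoding always returns the same continuation. Sampling can wander onto lower-probability paths such as "Lyon", and raising the temperature makes that more likely. Production APIs expose similar knobs, but their exact parameter names vary by provider.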
Why scale matters
More parameters and more training data improve fluency and reasoning on many benchmarks, but they also increase cost, latency, and misuse potential. The best model for a product is not always the biggest one.
Important interview questions and answers
- Q: Which stack powers ChatGPT-style apps?
A: Decoder-only autoregressive transformers.
Self-check
- Encoder vs decoder-only use case?
- What does autoregressive mean?
Tip: Chat LLMs are typically decoder-only. Encoder-only models such as BERT are used for embeddings and classification, not open-ended chat.
Interview prep
- Decoder-only?
Autoregressive chat models predict the next token from the tokens before it; GPT-style decoder-only stacks dominate assistants.
- Autoregressive?
Each new token is conditioned on all prior tokens in context.
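The chain rule of probability makes "conditioned on all prior tokens" precise. A decoder-only model factorizes the probability of a sequence into next-token predictions, and generation draws each token from the conditional distribution, or takes its most likely value under greedy decoding:

```latex
P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})

x_t \sim P(\,\cdot \mid x_1, \dots, x_{t-1}) \quad \text{(sampling)}, \qquad
x_t = \arg\max_{v} P(v \mid x_1, \dots, x_{t-1}) \quad \text{(greedy)}
```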