Tokens and Tokenization

Last reviewed May 28, 2026 Content v20260528

Track mode: none
Means: Read / quiz
Reading: ~1 min
Level: beginner

This lesson

This lesson teaches tokens and tokenization as part of generative AI patterns: LLMs, prompting, retrieval, safety, and integration habits for real assistants and copilots.

Token economics drive product margins, so measure prompts before launch.

You will apply tokens and tokenization in contexts such as chat products, code assistants, search augmentation, and internal knowledge tools.

Study explanations, case studies, and MCQs—this topic is read/quiz focused without a code runner.

Start this lesson when you can explain the previous lesson's ideas in your own words.

LLMs consume tokens—subword pieces from a vocabulary—not whole words. Billing, context limits, and prompt sizing are all token-based.

Examples

# Illustrative — real counts come from the model tokenizer
text = "unbelievable"
# might split into ["un", "believ", "able"] depending on tokenizer
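
To make the split above concrete, here is a minimal sketch of a greedy longest-match subword splitter. The vocabulary `TOY_VOCAB` and the function `toy_tokenize` are hypothetical and exist only for illustration; production tokenizers typically learn their vocabulary from data (for example with byte-pair encoding) and will split the same text differently.

```python
# Toy vocabulary: hypothetical, chosen only to illustrate the idea.
TOY_VOCAB = {"un", "believ", "able", "token", "ize", "s"}


def toy_tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split; unmatched characters become one-char pieces."""
    pieces = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocab entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                pieces.append(text[i:j])
                i = j
                break
        else:
            # No vocab entry starts here: fall back to a single character.
            pieces.append(text[i])
            i += 1
    return pieces


print(toy_tokenize("unbelievable", TOY_VOCAB))  # ['un', 'believ', 'able']
print(toy_tokenize("tokenizes", TOY_VOCAB))     # ['token', 'ize', 's']
```

The single-character fallback is why a small vocabulary can still represent any input: a string that matches nothing simply costs more pieces.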

Why subwords

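Handles rare words without a million-entry dictionary
Shares common pieces (prefixes, suffixes, roots) across related words and languages
Code and JSON, full of unusual identifiers and punctuation, can fall back to short or character-level pieces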

Practical rules

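Use the provider's tokenizer library (for example, OpenAI's tiktoken) or its API token counter before production. English averages roughly 4 characters per token; code and non-Latin scripts usually produce more tokens for the same amount of text.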
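
When no tokenizer is at hand, the 4-characters-per-token average gives a quick first estimate. The sketch below is only that heuristic; the name `estimate_tokens` is hypothetical, and the result can be far off for code or non-Latin text, so treat it as a sanity check rather than a substitute for the real counter.

```python
import math

CHARS_PER_TOKEN = 4  # rough English average; not exact for any real tokenizer


def estimate_tokens(text: str, chars_per_token: float = CHARS_PER_TOKEN) -> int:
    """Rough token estimate from character count, rounded up."""
    if not text:
        return 0
    return math.ceil(len(text) / chars_per_token)


prompt = "Summarize the following support ticket in two sentences."
print(estimate_tokens(prompt))
```

Rounding up keeps the estimate on the conservative side, which is the safer direction when checking against a context limit.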

Truncation strategy: drop the oldest chat turns, summarize history, or retrieve only the top-k chunks. Never cut silently in the middle of a word.
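
The first option, dropping the oldest chat turns, might look like the sketch below. The helper names `count_tokens` and `trim_history` are hypothetical, and `count_tokens` is a placeholder that uses the 4-characters-per-token estimate; in practice you would pass in the model's real token counter.

```python
import math


def count_tokens(text: str) -> int:
    """Placeholder counter (~4 chars per token); swap in the real tokenizer."""
    return math.ceil(len(text) / 4)


def trim_history(turns: list[str], budget: int, counter=count_tokens) -> list[str]:
    """Keep the newest turns whose combined token count fits the budget."""
    kept = []
    used = 0
    # Walk from newest to oldest so the most recent context survives.
    for turn in reversed(turns):
        cost = counter(turn)
        if used + cost > budget:
            break  # drop this turn and everything older; never cut mid-turn
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = ["a" * 40, "b" * 40, "c" * 20]  # 10, 10, and 5 estimated tokens
print(len(trim_history(history, budget=15)))  # 2
```

Whole turns are either kept or dropped, so the model never sees a message that ends mid-sentence or mid-word.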

Important interview questions and answers

Q: Why isn't one word always one token?
A: Subword tokenization splits rare or compound strings into several pieces, while common words often map to a single token.

Self-check

What unit do providers bill on?
Why measure prompts before launch?

Pitfall: Pricing surprises. Count tokens on the longest realistic prompt before budgeting.
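
A back-of-the-envelope budget can be sketched as follows. The function `estimate_cost`, the prices, and the traffic figure are all hypothetical, and the sketch assumes prompt and completion tokens are priced separately per 1,000 tokens; check the provider's current price sheet for the actual scheme.

```python
def estimate_cost(prompt_tokens: int, completion_tokens: int,
                  prompt_price_per_1k: float, completion_price_per_1k: float) -> float:
    """Cost of one request when prompt and completion tokens are priced separately."""
    prompt_cost = (prompt_tokens / 1000) * prompt_price_per_1k
    completion_cost = (completion_tokens / 1000) * completion_price_per_1k
    return prompt_cost + completion_cost


# Hypothetical prices and traffic, for illustration only.
per_request = estimate_cost(3000, 500,
                            prompt_price_per_1k=0.50,
                            completion_price_per_1k=1.50)
monthly = per_request * 100_000  # assumed requests per month
print(f"{per_request:.2f} per request, {monthly:.2f} per month")
```

Running the numbers on the longest realistic prompt, rather than the average one, shows the worst case the budget has to absorb.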

Interview prep

Why subwords? A compact vocabulary that still handles rare words, morphology, and code fragments.
Billing unit? Providers bill per token for both prompt and completion, so measure before launch.
