Text Generation
What
Autoregressive generation: predict one token at a time, feed it back, repeat.
Sampling strategies
| Method | What | Tradeoff |
|---|---|---|
| Greedy | Always pick highest probability token | Repetitive, boring |
| Temperature | Scale logits before softmax. Low = confident, high = creative | Controls randomness |
| Top-k | Sample from top k tokens | Filters low-prob tokens |
| Top-p (nucleus) | Sample from smallest set with cumulative prob ≥ p | Adaptive filtering |
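The table's strategies can be sketched directly on raw logits. A minimal sketch (the function name and numpy-based implementation are my own, not from any library):

```python
import numpy as np

def sample_next(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Sample a token id from raw logits using temperature, top-k, and top-p."""
    rng = rng or np.random.default_rng()
    logits = np.asarray(logits, dtype=np.float64) / temperature  # temperature scaling
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    if top_k is not None:
        # Zero out everything outside the k most probable tokens.
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:
        # Keep the smallest set of tokens whose cumulative prob >= p.
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = probs[keep]
        probs = mask
    probs /= probs.sum()  # renormalize over the surviving tokens
    return int(rng.choice(len(probs), p=probs))
```

Greedy decoding is the degenerate case: top_k=1 (or temperature near zero) always returns the argmax.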
```python
from transformers import pipeline

gen = pipeline("text-generation", model="gpt2")
# do_sample=True is required for temperature/top_p to take effect
gen("The meaning of life is", max_new_tokens=50, do_sample=True, temperature=0.7, top_p=0.9)
```
Beam search
Instead of picking one token at a time (greedy), keep track of the top-k partial sequences (beams) and expand them all. Pick the highest-scoring complete sequence at the end.
- Good for tasks with a “correct” answer (translation, summarization)
- Less useful for open-ended generation (tends to be generic)
num_beams=5 is a common default
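The expand-and-prune loop above can be shown with a toy model, where step_logprobs is a stand-in for a real model's next-token log-probabilities (all names here are illustrative, not a library API):

```python
import math

def beam_search(step_logprobs, num_beams=2, length=3):
    """Toy beam search: step_logprobs(seq) -> {token: logprob} for the next token."""
    beams = [((), 0.0)]  # (partial sequence, cumulative log-prob)
    for _ in range(length):
        candidates = []
        for seq, score in beams:
            # Expand every beam by every possible next token.
            for tok, lp in step_logprobs(seq).items():
                candidates.append((seq + (tok,), score + lp))
        # Prune back down to the top-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:num_beams]
    return beams[0]
```

Real implementations add length normalization and early stopping on end-of-sequence tokens, which this sketch omits.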
Repetition penalties
Models love repeating themselves. Fixes:
- Repetition penalty: scale down logits of already-generated tokens (typical value: 1.1-1.3)
- No-repeat n-gram: block exact n-gram repeats (no_repeat_ngram_size=3)
- Frequency/presence penalty: OpenAI-style; penalize tokens based on how often they have already appeared
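The repetition penalty is just a rescaling of logits for already-seen tokens. A minimal sketch of the CTRL-style rule that transformers' repetition_penalty implements (the function itself is mine):

```python
import numpy as np

def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Scale down logits of tokens that already appeared (CTRL-style penalty)."""
    logits = np.array(logits, dtype=np.float64)
    for tok in set(generated_ids):
        # Positive logits are divided, negative logits multiplied,
        # so repeated tokens become less likely in both cases.
        if logits[tok] > 0:
            logits[tok] /= penalty
        else:
            logits[tok] *= penalty
    return logits
```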
KV cache
During autoregressive generation, each new token needs attention over all previous tokens. The KV cache stores the key/value matrices from previous steps so you don’t recompute them. This turns generation from O(n²) to O(n) per token. Every production inference engine uses this.
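A single-head toy version makes the mechanism concrete: each step appends one new key/value row instead of recomputing all of them, then attends over the whole cache. This is a conceptual sketch, not how any real engine lays out its cache:

```python
import numpy as np

class KVCache:
    """Single-head attention with a growing key/value cache (toy sketch)."""
    def __init__(self, d):
        self.keys = np.empty((0, d))
        self.values = np.empty((0, d))

    def step(self, q, k, v):
        # Append this step's key/value; earlier rows are reused, not recomputed.
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])
        scores = self.keys @ q / np.sqrt(len(q))  # attend over all cached keys
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        return weights @ self.values
```

With only one entry in the cache, the output is exactly that entry's value vector; each later step costs one dot product per cached position, which is the O(n)-per-token behavior described above.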
Key concepts
- Autoregressive: each token depends on all previous tokens
- Context window: max tokens the model can see (512 to 1M+)
- Prompt: the input text that conditions generation
- Stop tokens: signal the model to stop generating
Structured generation
Sometimes you need output in a specific format, not freeform text:
- JSON mode: constrain output to valid JSON (OpenAI, vLLM support this)
- Grammar-based sampling: define a formal grammar (GBNF), model can only generate tokens that match it
- Outlines / guidance: libraries that mask invalid tokens at each step, guaranteeing valid structured output
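The core trick behind these libraries is simple: at each step, set the logits of every token the grammar disallows to minus infinity before sampling. A minimal sketch (greedy pick for simplicity; the function is illustrative, not any library's API):

```python
import numpy as np

def constrained_sample(logits, allowed_ids):
    """Mask every token the grammar disallows at this step, then pick greedily."""
    allowed = list(allowed_ids)
    masked = np.full(len(logits), -np.inf)   # disallowed tokens get -inf
    masked[allowed] = np.asarray(logits)[allowed]
    return int(np.argmax(masked))
```

Because disallowed tokens can never be sampled, the output is valid by construction, no retries or post-hoc parsing needed.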
Modern techniques
- Speculative decoding: use a small fast model to draft tokens, then verify with the large model in one forward pass. Speeds up inference 2-3x with no quality loss
- Guided generation: constrain the model to follow schemas, regex patterns, or tool-call formats at the token level
- Parallel decoding: methods like Medusa add extra prediction heads to generate multiple tokens per step
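The speculative decoding loop can be sketched in a heavily simplified form: a draft model proposes k tokens, and the target keeps them only as long as it agrees. Real implementations verify all drafts in one batched forward pass and use a probabilistic acceptance rule that preserves the target distribution; this greedy toy (all names mine) only shows the control flow:

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    """Simplified speculative decoding: draft k tokens, keep them while the
    target model's greedy choice agrees; on disagreement, take the target's token."""
    seq = list(prefix)
    drafted = []
    for _ in range(k):
        tok = draft_next(tuple(seq))  # cheap draft model proposes
        drafted.append(tok)
        seq.append(tok)
    accepted = list(prefix)
    for tok in drafted:
        if target_next(tuple(accepted)) == tok:
            accepted.append(tok)       # draft verified, token accepted for free
        else:
            accepted.append(target_next(tuple(accepted)))  # target overrides
            break
    return accepted
```

When the draft model agrees often, most steps emit several tokens for one (batched) target pass, which is where the 2-3x speedup comes from.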
Links
- Language Models
- Transformers
- Prompt Engineering
- Fine-Tuning LLMs
- Tokenization — how text becomes tokens before generation