Text Generation

What

Autoregressive generation: predict one token at a time, feed it back, repeat.
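The loop can be sketched with a stand-in "model" (a toy function that just cycles through the alphabet; `fake_next_token` and `generate` are made-up names for illustration, not a real LM API):

```python
# Toy autoregressive loop: predict one token, append it, feed it back.
def fake_next_token(context):
    # Stand-in "model": returns the letter after the last one seen.
    return chr((ord(context[-1]) - ord("a") + 1) % 26 + ord("a"))

def generate(prompt, n_tokens):
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(fake_next_token(out))  # the new token becomes part of the context
    return "".join(out)

print(generate("a", 5))  # → "abcdef"
```

A real model replaces `fake_next_token` with a forward pass plus a sampling strategy, but the feed-back structure is exactly this.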

Sampling strategies

| Method | What | Tradeoff |
|---|---|---|
| Greedy | Always pick the highest-probability token | Repetitive, boring |
| Temperature | Scale logits before softmax; low = confident, high = creative | Controls randomness |
| Top-k | Sample from the top k tokens | Filters low-probability tokens |
| Top-p (nucleus) | Sample from the smallest set with cumulative prob ≥ p | Adaptive filtering |
from transformers import pipeline

gen = pipeline("text-generation", model="gpt2")
# do_sample=True is required — otherwise generation is greedy and
# temperature/top_p are ignored.
gen("The meaning of life is", max_new_tokens=50, do_sample=True, temperature=0.7, top_p=0.9)
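Temperature and nucleus sampling are simple enough to sketch by hand. A minimal stdlib-only version (function name `sample` is made up; real implementations work on tensors, but the math is the same):

```python
import math
import random

def sample(logits, temperature=1.0, top_p=1.0):
    # Temperature: divide logits before softmax. Low T sharpens the
    # distribution toward the argmax; high T flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p (nucleus): keep the smallest set of tokens whose cumulative
    # probability reaches p, then sample from that renormalized set.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    r = random.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

# Very low temperature + small top_p collapses to greedy decoding:
print(sample([1.0, 5.0, 2.0], temperature=0.1, top_p=0.5))  # → 1 (the argmax)
```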

Beam search

Instead of picking one token at a time (greedy), keep the top-k partial sequences (beams) and expand each of them at every step. Pick the highest-scoring complete sequence at the end.

  • Good for tasks with a “correct” answer (translation, summarization)
  • Less useful for open-ended generation (tends to be generic)
  • num_beams=5 is a common default
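A toy sketch of the idea (the bigram table and the `beam_search` / `next_probs` names are made up for illustration). Note how greedy would commit to the locally best first token and miss the globally best sequence:

```python
import math

def beam_search(next_probs, start, n_steps, num_beams=2):
    # next_probs(seq) -> {token: prob} for a toy "model".
    beams = [(0.0, [start])]  # (cumulative log-prob, sequence)
    for _ in range(n_steps):
        candidates = []
        for score, seq in beams:
            for tok, p in next_probs(seq).items():
                candidates.append((score + math.log(p), seq + [tok]))
        beams = sorted(candidates, reverse=True)[:num_beams]  # keep top beams
    return beams[0][1]

def next_probs(seq):
    # Hypothetical bigram table: "b" is the greedy pick after "a",
    # but the path through "c" scores higher overall.
    table = {
        "a": {"b": 0.6, "c": 0.4},
        "b": {"x": 0.1, "y": 0.1},
        "c": {"z": 0.9},
    }
    return table.get(seq[-1], {})

print(beam_search(next_probs, "a", n_steps=2))  # → ['a', 'c', 'z']
```

Greedy would take "b" (0.6 > 0.4) and end up with log-prob log(0.6 · 0.1) ≈ −2.8, while the beam keeps "c" alive and finds the ≈ −1.0 path.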

Repetition penalties

Models love repeating themselves. Fixes:

  • Repetition penalty: scale down logits of already-generated tokens (typical value: 1.1-1.3)
  • No-repeat n-gram: block exact n-gram repeats (no_repeat_ngram_size=3)
  • Frequency/presence penalty: OpenAI-style — penalize based on how often a token appeared

KV cache

During autoregressive generation, each new token needs attention over all previous tokens. The KV cache stores the key/value matrices from previous steps so you don’t recompute them. This turns generation from O(n²) to O(n) per token. Every production inference engine uses this.
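The caching pattern is easy to see in a single-head sketch (random weights, fake token embeddings — nothing here is a real model, just the cache mechanics):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    # One query token attending over all cached keys/values.
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

K_cache, V_cache = [], []
for x in rng.standard_normal((6, d)):  # 6 fake token embeddings
    # Only the NEW token's key/value are computed each step;
    # everything else is read from the cache.
    K_cache.append(Wk @ x)
    V_cache.append(Wv @ x)
    out = attend(Wq @ x, np.array(K_cache), np.array(V_cache))

print(len(K_cache))  # → 6: one cached k/v pair per token processed
```

Without the cache, every step would recompute `Wk @ x` and `Wv @ x` for the whole prefix; with it, each step does one new projection plus one attention pass.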

Key concepts

  • Autoregressive: each token depends on all previous tokens
  • Context window: max tokens the model can see (512 to 1M+)
  • Prompt: the input text that conditions generation
  • Stop tokens: signal the model to stop generating

Structured generation

Sometimes you need output in a specific format, not freeform text:

  • JSON mode: constrain output to valid JSON (OpenAI, vLLM support this)
  • Grammar-based sampling: define a formal grammar (GBNF), model can only generate tokens that match it
  • Outlines / guidance: libraries that mask invalid tokens at each step, guaranteeing valid structured output
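The core trick behind these libraries is masking: before sampling each token, set the logit of every token the grammar doesn't allow to −∞. A toy sketch (the `constrained_step` name and tiny vocab are made up):

```python
def constrained_step(logits, vocab, allowed):
    # Mask every token the grammar/schema forbids, then pick the best
    # of what's left. With the mask applied, invalid output is impossible.
    masked = [l if vocab[i] in allowed else float("-inf")
              for i, l in enumerate(logits)]
    return vocab[max(range(len(masked)), key=masked.__getitem__)]

vocab = ['{', '}', '"', 'hello', '42']
# At the start of a JSON object, only '{' is valid — even though the model
# assigns its highest logit to 'hello'.
print(constrained_step([0.1, 2.0, 1.5, 3.0, 0.2], vocab, allowed={'{'}))  # → '{'
```

Real implementations derive the `allowed` set each step from a grammar or JSON schema state machine rather than a hand-written set.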

Modern techniques

  • Speculative decoding: use a small fast model to draft several tokens, then verify them with the large model in one forward pass. Typically speeds up inference 2-3x without changing the output distribution
  • Guided generation: constrain the model to follow schemas, regex patterns, or tool-call formats at the token level
  • Parallel decoding: methods like Medusa add extra prediction heads to generate multiple tokens per step
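The speculative draft-then-verify loop can be sketched with toy models (everything here is made up for illustration; a real implementation accepts/rejects probabilistically so the large model's exact output distribution is preserved — this sketch uses a simple greedy-agreement check instead):

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    # 1. The small draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. The large target model checks the proposals (conceptually one
    #    forward pass over all k positions). Accept until the first
    #    disagreement, then substitute the target's own token and stop.
    accepted, ctx = [], list(prefix)
    for t in proposed:
        want = target_next(ctx)
        if want == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(want)
            break
    return accepted

# Toy "models" over characters: the draft agrees for a while, then drifts.
target = lambda ctx: "abcdef"[len(ctx) % 6]
draft = lambda ctx: "abcxyz"[len(ctx) % 6]

print(speculative_step(draft, target, [], k=4))  # → ['a', 'b', 'c', 'd']
```

One verification pass yielded four tokens here instead of one; that amortization is where the speedup comes from.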