Text Generation

What

Autoregressive generation: predict one token at a time, feed it back, repeat.
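The loop can be sketched with a stand-in "model" (a toy function that just cycles through the alphabet; `fake_next_token` and `generate` are made-up names for illustration, not a real LM API):

```python
# Toy autoregressive loop: predict one token, append it, feed it back.
def fake_next_token(context):
    # Stand-in "model": returns the letter after the last one seen.
    return chr((ord(context[-1]) - ord("a") + 1) % 26 + ord("a"))

def generate(prompt, n_tokens):
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(fake_next_token(out))  # the new token becomes part of the context
    return "".join(out)

print(generate("a", 5))  # → "abcdef"
```

A real model replaces `fake_next_token` with a forward pass plus a sampling strategy, but the feed-back structure is exactly this.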

Sampling strategies

| Method | What | Tradeoff |
|---|---|---|
| Greedy | Always pick the highest-probability token | Repetitive, boring |
| Temperature | Scale logits before softmax; low = confident, high = creative | Controls randomness |
| Top-k | Sample from the top k tokens | Filters low-probability tokens |
| Top-p (nucleus) | Sample from the smallest set with cumulative prob ≥ p | Adaptive filtering |
from transformers import pipeline

gen = pipeline("text-generation", model="gpt2")
# do_sample=True is required — otherwise generation is greedy and
# temperature/top_p are ignored.
gen("The meaning of life is", max_new_tokens=50, do_sample=True, temperature=0.7, top_p=0.9)
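Temperature and nucleus sampling are simple enough to sketch by hand. A minimal stdlib-only version (function name `sample` is made up; real implementations work on tensors, but the math is the same):

```python
import math
import random

def sample(logits, temperature=1.0, top_p=1.0):
    # Temperature: divide logits before softmax. Low T sharpens the
    # distribution toward the argmax; high T flattens it.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(l - m) for l in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Top-p (nucleus): keep the smallest set of tokens whose cumulative
    # probability reaches p, then sample from that renormalized set.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= top_p:
            break

    mass = sum(probs[i] for i in kept)
    r = random.random() * mass
    for i in kept:
        r -= probs[i]
        if r <= 0:
            return i
    return kept[-1]

# Very low temperature + small top_p collapses to greedy decoding:
print(sample([1.0, 5.0, 2.0], temperature=0.1, top_p=0.5))  # → 1 (the argmax)
```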

Beam search

Instead of picking one token at a time (greedy), keep the top-k partial sequences (beams) and expand each of them at every step. Pick the highest-scoring complete sequence at the end.

  • Good for tasks with a “correct” answer (translation, summarization)
  • Less useful for open-ended generation (tends to be generic)
  • num_beams=5 is a common default
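A toy sketch of the idea (the bigram table and the `beam_search` / `next_probs` names are made up for illustration). Note how greedy would commit to the locally best first token and miss the globally best sequence:

```python
import math

def beam_search(next_probs, start, n_steps, num_beams=2):
    # next_probs(seq) -> {token: prob} for a toy "model".
    beams = [(0.0, [start])]  # (cumulative log-prob, sequence)
    for _ in range(n_steps):
        candidates = []
        for score, seq in beams:
            for tok, p in next_probs(seq).items():
                candidates.append((score + math.log(p), seq + [tok]))
        beams = sorted(candidates, reverse=True)[:num_beams]  # keep top beams
    return beams[0][1]

def next_probs(seq):
    # Hypothetical bigram table: "b" is the greedy pick after "a",
    # but the path through "c" scores higher overall.
    table = {
        "a": {"b": 0.6, "c": 0.4},
        "b": {"x": 0.1, "y": 0.1},
        "c": {"z": 0.9},
    }
    return table.get(seq[-1], {})

print(beam_search(next_probs, "a", n_steps=2))  # → ['a', 'c', 'z']
```

Greedy would take "b" (0.6 > 0.4) and end up with log-prob log(0.6 · 0.1) ≈ −2.8, while the beam keeps "c" alive and finds the ≈ −1.0 path.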

Repetition penalties

Models love repeating themselves. Fixes:

  • Repetition penalty: scale down logits of already-generated tokens (typical value: 1.1-1.3)
  • No-repeat n-gram: block exact n-gram repeats (no_repeat_ngram_size=3)
  • Frequency/presence penalty: OpenAI-style — penalize based on how often a token appeared

KV cache

During autoregressive generation, each new token needs attention over all previous tokens. The KV cache stores the key/value matrices from previous steps so you don’t recompute them. This turns generation from O(n²) to O(n) per token. Every production inference engine uses this.
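The caching pattern is easy to see in a single-head sketch (random weights, fake token embeddings — nothing here is a real model, just the cache mechanics):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))

def attend(q, K, V):
    # One query token attending over all cached keys/values.
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

K_cache, V_cache = [], []
for x in rng.standard_normal((6, d)):  # 6 fake token embeddings
    # Only the NEW token's key/value are computed each step;
    # everything else is read from the cache.
    K_cache.append(Wk @ x)
    V_cache.append(Wv @ x)
    out = attend(Wq @ x, np.array(K_cache), np.array(V_cache))

print(len(K_cache))  # → 6: one cached k/v pair per token processed
```

Without the cache, every step would recompute `Wk @ x` and `Wv @ x` for the whole prefix; with it, each step does one new projection plus one attention pass.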

Key concepts

  • Autoregressive: each token depends on all previous tokens
  • Context window: max tokens the model can see (512 to 1M+)
  • Prompt: the input text that conditions generation
  • Stop tokens: signal the model to stop generating

Structured generation

Sometimes you need output in a specific format, not freeform text:

  • JSON mode: constrain output to valid JSON (OpenAI, vLLM support this)
  • Grammar-based sampling: define a formal grammar (GBNF), model can only generate tokens that match it
  • Outlines / guidance: libraries that mask invalid tokens at each step, guaranteeing valid structured output
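The core trick behind these libraries is masking: before sampling each token, set the logit of every token the grammar doesn't allow to −∞. A toy sketch (the `constrained_step` name and tiny vocab are made up):

```python
def constrained_step(logits, vocab, allowed):
    # Mask every token the grammar/schema forbids, then pick the best
    # of what's left. With the mask applied, invalid output is impossible.
    masked = [l if vocab[i] in allowed else float("-inf")
              for i, l in enumerate(logits)]
    return vocab[max(range(len(masked)), key=masked.__getitem__)]

vocab = ['{', '}', '"', 'hello', '42']
# At the start of a JSON object, only '{' is valid — even though the model
# assigns its highest logit to 'hello'.
print(constrained_step([0.1, 2.0, 1.5, 3.0, 0.2], vocab, allowed={'{'}))  # → '{'
```

Real implementations derive the `allowed` set each step from a grammar or JSON schema state machine rather than a hand-written set.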

Modern techniques

  • Speculative decoding: use a small fast model to draft several tokens, then verify them with the large model in one forward pass. Typically speeds up inference 2-3x without changing the output distribution
  • Guided generation: constrain the model to follow schemas, regex patterns, or tool-call formats at the token level
  • Parallel decoding: methods like Medusa add extra prediction heads to generate multiple tokens per step
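The speculative draft-then-verify loop can be sketched with toy models (everything here is made up for illustration; a real implementation accepts/rejects probabilistically so the large model's exact output distribution is preserved — this sketch uses a simple greedy-agreement check instead):

```python
def speculative_step(draft_next, target_next, prefix, k=4):
    # 1. The small draft model proposes k tokens autoregressively (cheap).
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposed.append(t)
        ctx.append(t)

    # 2. The large target model checks the proposals (conceptually one
    #    forward pass over all k positions). Accept until the first
    #    disagreement, then substitute the target's own token and stop.
    accepted, ctx = [], list(prefix)
    for t in proposed:
        want = target_next(ctx)
        if want == t:
            accepted.append(t)
            ctx.append(t)
        else:
            accepted.append(want)
            break
    return accepted

# Toy "models" over characters: the draft agrees for a while, then drifts.
target = lambda ctx: "abcdef"[len(ctx) % 6]
draft = lambda ctx: "abcxyz"[len(ctx) % 6]

print(speculative_step(draft, target, [], k=4))  # → ['a', 'b', 'c', 'd']
```

One verification pass yielded four tokens here instead of one; that amortization is where the speedup comes from.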