Constitutional AI
What
Constitutional AI (CAI) is an alignment technique developed by Anthropic where a language model critiques and revises its own outputs based on an explicit set of natural-language principles (the “constitution”), then trains on the improved outputs. The goal is to make alignment transparent, scalable, and auditable.
The core insight: instead of relying on thousands of human annotators to rank model outputs, use the model itself to evaluate whether outputs follow desired principles. This sidesteps the scalability and consistency problems of human annotation.
The Constitution
The constitution is a document containing explicit principles the model uses to evaluate its own outputs. The original CAI paper (2022) used 16 principles written by Anthropic researchers; the published constitution for Claude (2023) expanded this, drawing on sources such as the UN Declaration of Human Rights, Apple’s terms of service, DeepMind’s Sparrow rules, and general ethics principles.
Example principles (paraphrased from Anthropic’s published constitutions)
1. "Which response is less harmful? Choose the option that a helpful assistant would most likely produce."
2. "Which response is more likely to come from a humble, honest AI that acknowledges its own uncertainty?"
3. "Which response would a professional assistant give in a corporate setting?"
4. "Which response avoids assuming the user is an idiot or has bad intentions?"
5. "Which response shows care for the user's stated goals and constraints?"
What makes a good principle
- Specific: vague principles (“be good”) don’t constrain behavior usefully
- Operationalizable: the model must be able to evaluate whether output satisfies the principle
- Non-contradictory: principles shouldn’t conflict in common cases
- Complete enough: edge cases reveal gaps in the constitution
Custom constitutions
The constitution can be customized per deployment:
- A medical assistant constitution emphasizes accuracy and caveat expression
- A creative writing constitution prioritizes quality and originality over safety
- A customer service constitution emphasizes politeness and resolution
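Because a constitution is just a list of natural-language principles, per-deployment customization can be as simple as selecting a different principle set. A minimal sketch, with illustrative (not actual Anthropic) principle wordings and deployment names:

```python
# Hypothetical per-deployment constitutions, expressed as plain principle lists.
# The wording of each principle is illustrative, not taken from any real constitution.
MEDICAL = [
    "Choose the response that is most medically accurate.",
    "Choose the response that states its uncertainty and recommends professional care.",
]
CREATIVE = [
    "Choose the response with greater literary quality and originality.",
]
CUSTOMER_SERVICE = [
    "Choose the response that is more polite.",
    "Choose the response more likely to resolve the customer's issue.",
]

def pick_constitution(deployment: str) -> list[str]:
    """Select the principle set for a deployment (names are hypothetical)."""
    return {
        "medical": MEDICAL,
        "creative": CREATIVE,
        "support": CUSTOMER_SERVICE,
    }[deployment]
```

The point of keeping constitutions as data rather than code is that swapping them requires no change to the critique/revision machinery itself.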
The Two-Phase Process
Phase 1: Supervised Learning from AI Feedback (SL-CAI)
1. Sample harmful outputs from initial model (red-team prompting)
2. For each harmful output, the model critiques itself against constitutional principles
3. Model revises the output to address critique
4. Train on (harmful_prompt, revised_response) pairs
This teaches the model to produce less harmful responses without explicit labels.
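The four SL-CAI steps above can be sketched as a small data-generation loop. The LLM calls (`critique`, `revise`) are stubbed with placeholder functions here; in a real pipeline each would be a prompted model call:

```python
import random

def critique(response: str, principle: str) -> str:
    """Placeholder for an LLM call: 'Identify how the response violates <principle>.'"""
    return f"The response may violate: {principle}"

def revise(response: str, critique_text: str) -> str:
    """Placeholder for an LLM call: 'Rewrite the response to address the critique.'"""
    return response + " [revised to address critique]"

def sl_cai_pair(prompt: str, initial_response: str, constitution: list[str]):
    """One SL-CAI step: sample a principle, critique, revise, emit a training pair."""
    principle = random.choice(constitution)   # the paper samples principles randomly
    c = critique(initial_response, principle)
    revised = revise(initial_response, c)
    return (prompt, revised)                  # train on (harmful_prompt, revised_response)

pair = sl_cai_pair("red-team prompt", "harmful draft",
                   ["Choose the less harmful response."])
```

Fine-tuning on many such `(prompt, revised)` pairs is what replaces explicit human harm labels in this phase.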
Phase 2: Reinforcement Learning from AI Feedback (RLAIF)
1. Sample pairs of responses to the same prompt
2. Model evaluates which response better follows the constitution (pairwise comparison)
3. Use preference data to train a preference model (PM)
4. Apply RLHF using the preference model
The key point: human preference labels are replaced by model-generated constitutional judgments. Human feedback is still used at the start (to train the initial helpful model), but scaling beyond that uses AI feedback.
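The RLAIF labeling step can be sketched as follows. The `judge` function stands in for a prompted model call that answers a multiple-choice question ("Which response better follows this principle, A or B?"); here it is stubbed with a trivial heuristic so the sketch runs:

```python
def judge(prompt: str, resp_a: str, resp_b: str, principle: str) -> str:
    """Placeholder for an LLM call returning 'A' or 'B'.
    Stub heuristic: prefer the shorter response (stand-in for real judgment)."""
    return "A" if len(resp_a) <= len(resp_b) else "B"

def rlaif_label(prompt: str, resp_a: str, resp_b: str, constitution: list[str]):
    """Produce one preference record: which response better follows the constitution.
    One vote per principle; majority wins."""
    votes = [judge(prompt, resp_a, resp_b, p) for p in constitution]
    chosen_is_a = votes.count("A") >= votes.count("B")
    return {
        "prompt": prompt,
        "chosen": resp_a if chosen_is_a else resp_b,
        "rejected": resp_b if chosen_is_a else resp_a,
    }
```

Records in this `(prompt, chosen, rejected)` shape are exactly what a preference model consumes, so the downstream RLHF machinery is unchanged; only the source of the labels differs.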
RLAIF vs RLHF
| Aspect | RLHF | RLAIF |
|---|---|---|
| Feedback source | Human annotators | AI model + constitution |
| Cost | Expensive (hiring, training, quality control) | Cheap (compute only) |
| Scale | Limited by human hours | Scales with compute |
| Consistency | Annotators disagree on edge cases | Consistent with principles |
| Bias | Human biases baked in | Constitution is explicit (auditable) |
| Bottleneck | Hiring, training labelers | Constitution design quality |
| Subtlety | Humans catch nuance | Can miss edge cases |
| Coverage | Depends on annotator diversity | Constitution coverage determines it |
In practice: The two are combined. Human feedback provides initial training signal, then RLAIF scales it up. Anthropic’s Claude uses both: initial RLHF from human preference data, then CAI for scaling and refinement.
Why It Works
Critique quality bounds alignment quality
If the model can’t recognize a harm, it can’t avoid causing it. Constitutional critique quality depends on:
- The model’s own capabilities (a dumb model can’t critique subtle harms)
- How well the principle guides the critique
- Whether the principle covers the situation
Revision quality matters equally
A model might recognize a flaw but not know how to fix it. The revision step teaches constructive improvement, not just avoidance.
The constitution as specification
Traditional RLHF produces a reward model: a neural network whose weights encode hundreds of thousands of human labeler judgments. You can’t read it and understand what the model is optimizing for.
A constitution is a document. You can read it, reason about it, audit it, and modify it. That makes the alignment target legible in a way an opaque reward model is not.
Limitations
The Goodhart problem
“When a measure becomes a target, it ceases to be a good measure.” Constitutional principles are proxies for what we actually want. Optimizing for the proxy can diverge from the real goal.
Example: A principle requiring “least harmful” might cause the model to refuse almost everything (minimizing harm = refusing all risky requests). This satisfies the letter of the principle while violating its intent.
Subtle harms are hard to specify
Context-dependent harms are the hardest to constitutionalize:
- “Don’t help with assassination” is easy
- “Don’t help with something that looks like assassination but might not be” is hard
- The nuance depends on intent, which is invisible to text-based evaluation
Constitutional revisions can backfire
Adding a new principle can interact with existing ones in unexpected ways. A new “be maximally helpful” principle might conflict with a “minimize harm” principle, creating outputs that technically satisfy both but feel wrong.
Critique quality is bounded by model capability
A model that cannot recognize a type of harm cannot constitutionalize against it. As models get smarter, they can critique subtler harms — but the constitution must evolve with them.
Extensions and Variants
Self-Critique Models
Train a dedicated critique model that evaluates outputs against principles. The critique model is separate from the generation model, allowing specialized training.
Multi-turn Constitutional Revision
Instead of single-round critique-revise, iterate multiple times until the output satisfies all principles. Each round surfaces new issues the previous round missed.
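A minimal sketch of that multi-turn loop, with the critique call stubbed by a toy substring check so the example runs (a real implementation would prompt a model to list remaining violations):

```python
def find_violations(response: str, constitution: list[str]) -> list[str]:
    """Placeholder critique call: return principles the response has not yet addressed.
    Stub heuristic: a principle counts as addressed once its text appears in the response."""
    return [p for p in constitution if p.lower() not in response.lower()]

def revise(response: str, violation: str) -> str:
    """Placeholder revision call: fix one cited violation."""
    return response + f" [addressed: {violation}]"

def iterative_revision(response: str, constitution: list[str], max_rounds: int = 4) -> str:
    """Critique-revise until no violations remain or the round budget is spent."""
    for _ in range(max_rounds):
        violations = find_violations(response, constitution)
        if not violations:
            break
        response = revise(response, violations[0])  # fix one issue per round
    return response
```

The round budget matters in practice: without it, a principle the model cannot satisfy would loop forever, and each extra round costs another pair of model calls.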
Constitutional Ensemble
Use multiple models to critique and revise, then vote on the best output. Reduces individual model bias in evaluation.
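The voting step can be sketched as plain majority vote over candidate outputs. Each "judge" here is stubbed as a scoring function; in a real ensemble each would be a separate critique model:

```python
from collections import Counter

def ensemble_vote(candidates: list[str], judges: list) -> str:
    """Each judge (a critique model, stubbed here as a scoring function) picks its
    favorite candidate index; majority vote selects the final output."""
    votes = [
        max(range(len(candidates)), key=lambda i: judge(candidates[i]))
        for judge in judges
    ]
    winner, _ = Counter(votes).most_common(1)[0]
    return candidates[winner]
```

With an odd number of judges there is always a strict majority or a deterministic tiebreak (`most_common` keeps first-seen order), which keeps selection reproducible.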
Hybrid RLHF + CAI
- Train initial model with RLHF (human preferences)
- Apply CAI (SL + RLAIF) to scale and refine
- Fine-tune with additional human feedback on edge cases CAI missed
Anthropic describes a pipeline along these lines as the basis for Claude’s alignment training.
Practical Implications
Why Claude explains refusals
CAI encourages partial compliance over blanket refusal. The “least harmful” principle doesn’t say “refuse all risky requests”; it says produce the response that is least harmful while still being helpful. The model learns to:
- Refuse if no safe response exists
- Partially answer if possible
- Add caveats and disclaimers
- Suggest alternatives
Alignment steering per deployment
Because the constitution is explicit, it can be swapped per deployment without retraining the base model: only the CAI phases are rerun with the new principles. A “harmlessness-first” constitution produces a cautious model; a “helpfulness-first” one produces a more willing model. Same base model, different constitutions.
The constitution as policy
For applications with specific regulatory requirements, the constitution encodes policy:
- HIPAA compliance: “Never reveal medical information without explicit authorization”
- Financial advice: “Always disclose conflicts of interest and limitations”
- Legal advice: “Never state law as fact; always recommend professional consultation”
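Encoding policy this way means the critique prompt itself carries the regulatory requirements. A minimal sketch of assembling such a prompt (the principle wording and prompt template are illustrative, not any real compliance text):

```python
# Hypothetical policy principles for a regulated deployment.
HIPAA_PRINCIPLES = [
    "Never reveal medical information without explicit authorization.",
    "Always recommend consulting a licensed clinician for diagnoses.",
]

def build_critique_prompt(response: str, principles: list[str]) -> str:
    """Assemble the critique request sent to the model; the wording is illustrative."""
    rules = "\n".join(f"- {p}" for p in principles)
    return (
        "Critique the following response against these policy principles:\n"
        f"{rules}\n"
        f"Response: {response}"
    )
```

Auditing the deployment then reduces to auditing a short text file of principles rather than inspecting model weights.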
Relationship to Other Alignment Techniques
| Technique | What it optimizes | Feedback source |
|---|---|---|
| RLHF | Human preference | Human labelers |
| RLAIF | Constitutional principles | Model evaluation |
| DPO | Preference probability | Human preference (offline) |
| Reinforcement Learning from Critique | Critique quality | Model self-critique |
| Constitutional AI | Principle adherence | Model + constitution |
CAI is complementary to RLHF and DPO: use RLHF for initial training, CAI for scaling and policy encoding.
Key Paper
- Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022, Anthropic) · arXiv:2212.08073
Links
- RLHF and Alignment — the broader alignment landscape
- Instruction Tuning — the training step that precedes alignment
- Direct Preference Optimization — DPO as an alternative to RLHF
- Key Papers — foundational alignment papers