Constitutional AI
What
Constitutional AI (CAI) is an alignment technique developed by Anthropic where a language model critiques and revises its own outputs based on an explicit set of natural-language principles (the “constitution”), then trains on the improved outputs. The goal is to make alignment transparent, scalable, and auditable.
The core insight: instead of relying on thousands of human annotators to rank model outputs, use the model itself to evaluate whether outputs follow desired principles. This sidesteps the scalability and consistency problems of human annotation.
The Constitution
The constitution is a document containing explicit principles the model uses to evaluate its own outputs. The original CAI paper (2022) used 16 principles written by Anthropic researchers; the published constitution for Claude (2023) expanded this, drawing on sources such as the UN Declaration of Human Rights, Apple’s terms of service, DeepMind’s Sparrow rules, and general ethics principles.
Example principles (paraphrased from Anthropic’s published constitutions)
1. "Which response is less harmful? Choose the option that a helpful assistant would most likely produce."
2. "Which response is more likely to come from a humble, honest AI that acknowledges its own uncertainty?"
3. "Which response would a professional assistant give in a corporate setting?"
4. "Which response avoids assuming the user is an idiot or has bad intentions?"
5. "Which response shows care for the user's stated goals and constraints?"
What makes a good principle
- Specific: vague principles (“be good”) don’t constrain behavior usefully
- Operationalizable: the model must be able to evaluate whether output satisfies the principle
- Non-contradictory: principles shouldn’t conflict in common cases
- Complete enough: edge cases reveal gaps in the constitution
Custom constitutions
The constitution can be customized per deployment:
- A medical assistant constitution emphasizes accuracy and caveat expression
- A creative writing constitution prioritizes quality and originality over safety
- A customer service constitution emphasizes politeness and resolution
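Because a constitution is just a list of natural-language principles, per-deployment customization can be as simple as selecting a different principle set. A minimal sketch, with illustrative (not actual Anthropic) principle wordings and deployment names:

```python
# Hypothetical per-deployment constitutions, expressed as plain principle lists.
# The wording of each principle is illustrative, not taken from any real constitution.
MEDICAL = [
    "Choose the response that is most medically accurate.",
    "Choose the response that states its uncertainty and recommends professional care.",
]
CREATIVE = [
    "Choose the response with greater literary quality and originality.",
]
CUSTOMER_SERVICE = [
    "Choose the response that is more polite.",
    "Choose the response more likely to resolve the customer's issue.",
]

def pick_constitution(deployment: str) -> list[str]:
    """Select the principle set for a deployment (names are hypothetical)."""
    return {
        "medical": MEDICAL,
        "creative": CREATIVE,
        "support": CUSTOMER_SERVICE,
    }[deployment]
```

The point of keeping constitutions as data rather than code is that swapping them requires no change to the critique/revision machinery itself.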
The Two-Phase Process
Phase 1: Supervised Learning from AI Feedback (SL-CAI)
1. Sample harmful outputs from initial model (red-team prompting)
2. For each harmful output, the model critiques itself against constitutional principles
3. Model revises the output to address critique
4. Train on (harmful_prompt, revised_response) pairs
This teaches the model to produce less harmful responses without explicit labels.
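The four SL-CAI steps above can be sketched as a small data-generation loop. The LLM calls (`critique`, `revise`) are stubbed with placeholder functions here; in a real pipeline each would be a prompted model call:

```python
import random

def critique(response: str, principle: str) -> str:
    """Placeholder for an LLM call: 'Identify how the response violates <principle>.'"""
    return f"The response may violate: {principle}"

def revise(response: str, critique_text: str) -> str:
    """Placeholder for an LLM call: 'Rewrite the response to address the critique.'"""
    return response + " [revised to address critique]"

def sl_cai_pair(prompt: str, initial_response: str, constitution: list[str]):
    """One SL-CAI step: sample a principle, critique, revise, emit a training pair."""
    principle = random.choice(constitution)   # the paper samples principles randomly
    c = critique(initial_response, principle)
    revised = revise(initial_response, c)
    return (prompt, revised)                  # train on (harmful_prompt, revised_response)

pair = sl_cai_pair("red-team prompt", "harmful draft",
                   ["Choose the less harmful response."])
```

Fine-tuning on many such `(prompt, revised)` pairs is what replaces explicit human harm labels in this phase.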
Phase 2: Reinforcement Learning from AI Feedback (RLAIF)
1. Sample pairs of responses to the same prompt
2. Model evaluates which response better follows the constitution (pairwise comparison)
3. Use preference data to train a preference model (PM)
4. Apply RLHF using the preference model
The key point: human preference labels are replaced by model-generated constitutional judgments. Human feedback is still used at the start (to train the initial helpful model), but scaling beyond that uses AI feedback.
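The RLAIF labeling step can be sketched as follows. The `judge` function stands in for a prompted model call that answers a multiple-choice question ("Which response better follows this principle, A or B?"); here it is stubbed with a trivial heuristic so the sketch runs:

```python
def judge(prompt: str, resp_a: str, resp_b: str, principle: str) -> str:
    """Placeholder for an LLM call returning 'A' or 'B'.
    Stub heuristic: prefer the shorter response (stand-in for real judgment)."""
    return "A" if len(resp_a) <= len(resp_b) else "B"

def rlaif_label(prompt: str, resp_a: str, resp_b: str, constitution: list[str]):
    """Produce one preference record: which response better follows the constitution.
    One vote per principle; majority wins."""
    votes = [judge(prompt, resp_a, resp_b, p) for p in constitution]
    chosen_is_a = votes.count("A") >= votes.count("B")
    return {
        "prompt": prompt,
        "chosen": resp_a if chosen_is_a else resp_b,
        "rejected": resp_b if chosen_is_a else resp_a,
    }
```

Records in this `(prompt, chosen, rejected)` shape are exactly what a preference model consumes, so the downstream RLHF machinery is unchanged; only the source of the labels differs.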
RLAIF vs RLHF
| Aspect | RLHF | RLAIF |
|---|---|---|
| Feedback source | Human annotators | AI model + constitution |
| Cost | Expensive (hiring, training, quality control) | Cheap (compute only) |
| Scale | Limited by human hours | Scales with compute |
| Consistency | Annotators disagree on edge cases | Consistent with principles |
| Bias | Human biases baked in | Constitution is explicit (auditable) |
| Bottleneck | Hiring, training labelers | Constitution design quality |
| Subtlety | Humans catch nuance | Can miss edge cases |
| Coverage | Depends on annotator diversity | Constitution coverage determines it |
In practice: The two are combined. Human feedback provides initial training signal, then RLAIF scales it up. Anthropic’s Claude uses both: initial RLHF from human preference data, then CAI for scaling and refinement.
Why It Works
Critique quality bounds alignment quality
If the model can’t recognize a harm, it can’t avoid causing it. Constitutional critique quality depends on:
- The model’s own capabilities (a dumb model can’t critique subtle harms)
- How well the principle guides the critique
- Whether the principle covers the situation
Revision quality matters equally
A model might recognize a flaw but not know how to fix it. The revision step teaches constructive improvement, not just avoidance.
The constitution as specification
Traditional RLHF produces a reward model: a neural network whose weights encode hundreds of thousands of human labeler judgments. You can’t read it and understand what the model is optimizing for.
A constitution is a document. You can read it, reason about it, audit it, and modify it. That makes the alignment target legible in a way an opaque reward model is not.
Limitations
The Goodhart problem
“When a measure becomes a target, it ceases to be a good measure.” Constitutional principles are proxies for what we actually want. Optimizing for the proxy can diverge from the real goal.
Example: A principle requiring “least harmful” might cause the model to refuse almost everything (minimizing harm = refusing all risky requests). This satisfies the letter of the principle while violating its intent.
Subtle harms are hard to specify
Context-dependent harms are the hardest to constitutionalize:
- “Don’t help with assassination” is easy
- “Don’t help with something that looks like assassination but might not be” is hard
- The nuance depends on intent, which is invisible to text-based evaluation
Constitutional revisions can backfire
Adding a new principle can interact with existing ones in unexpected ways. A new “be maximally helpful” principle might conflict with a “minimize harm” principle, creating outputs that technically satisfy both but feel wrong.
Critique quality is bounded by model capability
A model that cannot recognize a type of harm cannot constitutionalize against it. As models get smarter, they can critique subtler harms — but the constitution must evolve with them.
Extensions and Variants
Self-Critique Models
Train a dedicated critique model that evaluates outputs against principles. The critique model is separate from the generation model, allowing specialized training.
Multi-turn Constitutional Revision
Instead of single-round critique-revise, iterate multiple times until the output satisfies all principles. Each round surfaces new issues the previous round missed.
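A minimal sketch of that multi-turn loop, with the critique call stubbed by a toy substring check so the example runs (a real implementation would prompt a model to list remaining violations):

```python
def find_violations(response: str, constitution: list[str]) -> list[str]:
    """Placeholder critique call: return principles the response has not yet addressed.
    Stub heuristic: a principle counts as addressed once its text appears in the response."""
    return [p for p in constitution if p.lower() not in response.lower()]

def revise(response: str, violation: str) -> str:
    """Placeholder revision call: fix one cited violation."""
    return response + f" [addressed: {violation}]"

def iterative_revision(response: str, constitution: list[str], max_rounds: int = 4) -> str:
    """Critique-revise until no violations remain or the round budget is spent."""
    for _ in range(max_rounds):
        violations = find_violations(response, constitution)
        if not violations:
            break
        response = revise(response, violations[0])  # fix one issue per round
    return response
```

The round budget matters in practice: without it, a principle the model cannot satisfy would loop forever, and each extra round costs another pair of model calls.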
Constitutional Ensemble
Use multiple models to critique and revise, then vote on the best output. Reduces individual model bias in evaluation.
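The voting step can be sketched as plain majority vote over candidate outputs. Each "judge" here is stubbed as a scoring function; in a real ensemble each would be a separate critique model:

```python
from collections import Counter

def ensemble_vote(candidates: list[str], judges: list) -> str:
    """Each judge (a critique model, stubbed here as a scoring function) picks its
    favorite candidate index; majority vote selects the final output."""
    votes = [
        max(range(len(candidates)), key=lambda i: judge(candidates[i]))
        for judge in judges
    ]
    winner, _ = Counter(votes).most_common(1)[0]
    return candidates[winner]
```

With an odd number of judges there is always a strict majority or a deterministic tiebreak (`most_common` keeps first-seen order), which keeps selection reproducible.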
Hybrid RLHF + CAI
- Train initial model with RLHF (human preferences)
- Apply CAI (SL + RLAIF) to scale and refine
- Fine-tune with additional human feedback on edge cases CAI missed
Anthropic describes a pipeline along these lines as the basis for Claude’s alignment training.
Practical Implications
Why Claude explains refusals
CAI encourages partial compliance over blanket refusal. The “least harmful” principle doesn’t say “refuse all risky requests”; it says produce the response that is least harmful while still being helpful. The model learns to:
- Refuse if no safe response exists
- Partially answer if possible
- Add caveats and disclaimers
- Suggest alternatives
Alignment steering per deployment
Because the constitution is explicit, it can be swapped per deployment without retraining the base model: only the CAI phases are rerun with the new principles. A “harmlessness-first” constitution produces a cautious model; a “helpfulness-first” one produces a more willing model. Same base model, different constitutions.
The constitution as policy
For applications with specific regulatory requirements, the constitution encodes policy:
- HIPAA compliance: “Never reveal medical information without explicit authorization”
- Financial advice: “Always disclose conflicts of interest and limitations”
- Legal advice: “Never state law as fact; always recommend professional consultation”
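Encoding policy this way means the critique prompt itself carries the regulatory requirements. A minimal sketch of assembling such a prompt (the principle wording and prompt template are illustrative, not any real compliance text):

```python
# Hypothetical policy principles for a regulated deployment.
HIPAA_PRINCIPLES = [
    "Never reveal medical information without explicit authorization.",
    "Always recommend consulting a licensed clinician for diagnoses.",
]

def build_critique_prompt(response: str, principles: list[str]) -> str:
    """Assemble the critique request sent to the model; the wording is illustrative."""
    rules = "\n".join(f"- {p}" for p in principles)
    return (
        "Critique the following response against these policy principles:\n"
        f"{rules}\n"
        f"Response: {response}"
    )
```

Auditing the deployment then reduces to auditing a short text file of principles rather than inspecting model weights.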
Relationship to Other Alignment Techniques
| Technique | What it optimizes | Feedback source |
|---|---|---|
| RLHF | Human preference | Human labelers |
| RLAIF | Constitutional principles | Model evaluation |
| DPO | Preference probability | Human preference (offline) |
| Reinforcement Learning from Critique | Critique quality | Model self-critique |
| Constitutional AI | Principle adherence | Model + constitution |
CAI is complementary to RLHF and DPO: use RLHF for initial training, CAI for scaling and policy encoding.
Key Paper
- Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022, Anthropic) · arXiv:2212.08073
Links
- RLHF and Alignment — the broader alignment landscape
- Instruction Tuning — the training step that precedes alignment
- Direct Preference Optimization — DPO as an alternative to RLHF
- Key Papers — foundational alignment papers