Constitutional AI

What

Constitutional AI (CAI) is an alignment technique developed by Anthropic where a language model critiques and revises its own outputs based on an explicit set of natural-language principles (the “constitution”), then trains on the improved outputs. The goal is to make alignment transparent, scalable, and auditable.

The core insight: instead of relying on thousands of human annotators to rank model outputs, use the model itself to evaluate whether outputs follow desired principles. This sidesteps the scalability and consistency problems of human annotation.

The Constitution

The constitution is a document containing explicit principles the model uses to evaluate its own outputs. The original Constitutional AI paper (Bai et al., 2022) used roughly sixteen principles written by Anthropic researchers; the later published constitution for Claude (2023) drew on sources such as the UN Declaration of Human Rights, Apple’s terms of service, and general ethics principles.

Example principles (paraphrased; the paper’s appendix gives the exact wording)

1. Choose the response that is the most helpful, honest, and harmless.

2. Choose the response that sounds more like a humble, honest AI that acknowledges its own uncertainty.

3. Choose the response that a thoughtful, professional assistant would give.

4. Choose the response that avoids condescension and does not assume bad intent on the user’s part.

5. Choose the response that shows care for the user’s stated goals and constraints.

What makes a good principle

  • Specific: vague principles (“be good”) don’t constrain behavior usefully
  • Operationalizable: the model must be able to evaluate whether output satisfies the principle
  • Non-contradictory: principles shouldn’t conflict in common cases
  • Complete enough: edge cases reveal gaps in the constitution

Custom constitutions

The constitution can be customized per deployment:

  • A medical assistant constitution emphasizes accuracy and caveat expression
  • A creative writing constitution prioritizes quality and originality over safety
  • A customer service constitution emphasizes politeness and resolution
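One way to make this concrete is to treat each constitution as plain data and build critique prompts from it at runtime. The sketch below is a hypothetical illustration: the deployment names, principle texts, and `critique_prompt` helper are all assumptions, not Anthropic’s actual wording or API.

```python
# Hypothetical sketch: constitutions as plain data, swapped per deployment.
# Principle texts below are illustrative, not Anthropic's actual constitution.

CONSTITUTIONS = {
    "medical": [
        "Prefer the response that is factually accurate and states its limits.",
        "Prefer the response that adds a caveat about consulting a clinician.",
    ],
    "creative": [
        "Prefer the response that is more original and better written.",
    ],
    "support": [
        "Prefer the response that is polite and resolves the user's issue.",
    ],
}

def critique_prompt(deployment: str, response: str) -> str:
    """Build a critique prompt from the selected deployment's constitution."""
    principles = "\n".join(f"- {p}" for p in CONSTITUTIONS[deployment])
    return (
        "Critique the response below against these principles:\n"
        f"{principles}\n\nResponse:\n{response}"
    )
```

The same base model can then be steered differently simply by changing which entry of `CONSTITUTIONS` is used to build its critique prompts.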

The Two-Phase Process

Phase 1: Supervised Learning from AI Feedback (SL-CAI)

1. Sample responses to harmful (red-team) prompts from the initial helpful model
2. Have the model critique its own response against a sampled constitutional principle
3. Have the model revise the response to address the critique
4. Fine-tune on the resulting (harmful_prompt, revised_response) pairs

This teaches the model to produce less harmful responses without explicit labels.
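The steps above can be sketched as a small data-generation loop. In this toy version, `model` is a stand-in for a real LLM call (an API client in practice); the stub responses and the `CRITIQUE`/`REVISE` prompt prefixes are illustrative assumptions so the loop runs end to end.

```python
# Toy sketch of the SL-CAI data-generation loop. `model` is a stub standing in
# for a real LLM call; prompt formats and outputs here are made up.

PRINCIPLE = "Choose the response that is helpful, honest, and harmless."

def model(prompt: str) -> str:
    # Stub LLM: a real system would call a language model here.
    if prompt.startswith("CRITIQUE"):
        return "The response gives dangerous instructions."
    if prompt.startswith("REVISE"):
        return "I can't help with that, but here is safer context: ..."
    return "Here is how to do the harmful thing: ..."

def sl_cai_pair(red_team_prompt: str) -> tuple:
    """One critique-revise round: returns a (prompt, revised_response) pair."""
    draft = model(red_team_prompt)                                   # 1. sample
    critique = model(f"CRITIQUE per '{PRINCIPLE}':\n{draft}")        # 2. critique
    revised = model(f"REVISE given critique '{critique}':\n{draft}") # 3. revise
    return red_team_prompt, revised                                  # 4. training pair

dataset = [sl_cai_pair(p) for p in ["How do I pick a lock?"]]
```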

Phase 2: Reinforcement Learning from AI Feedback (RLAIF)

1. Sample pairs of responses to the same prompt
2. Model evaluates which response better follows the constitution (pairwise comparison)
3. Use preference data to train a preference model (PM)
4. Run RL against the preference model as the reward signal (as in RLHF)

The key: human feedback is replaced by model-generated constitutional feedback. Human feedback is still used initially (to train the initial helpful model), but scaling uses AI feedback.
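A minimal sketch of this phase: an “AI judge” stub labels pairwise preferences, and the resulting data trains a preference model with the standard Bradley–Terry log-loss. The judge’s keyword heuristic and the reward scores are illustrative assumptions; a real judge would be an LLM prompted with the constitution.

```python
import math

# Toy RLAIF sketch: a stub constitutional judge labels preference pairs, and a
# Bradley-Terry loss scores a preference model on them. All heuristics and
# scores below are illustrative assumptions.

def ai_judge(response_a: str, response_b: str) -> int:
    """Stub judge: prefers the response without harmful content.
    Returns 0 if A is preferred, 1 if B is preferred."""
    harmful_markers = ("dangerous", "step-by-step attack")
    a_bad = any(m in response_a for m in harmful_markers)
    b_bad = any(m in response_b for m in harmful_markers)
    return 1 if a_bad and not b_bad else 0

def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """-log sigmoid(r_chosen - r_rejected): the usual preference-model loss."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_chosen - score_rejected))))

pair = ("Here is a dangerous method...", "I'd suggest a safer alternative: ...")
preferred = ai_judge(*pair)  # judge picks the second (safer) response
loss = bradley_terry_loss(score_chosen=2.0, score_rejected=-1.0)
```

The loss shrinks as the preference model’s score margin between chosen and rejected responses grows, exactly as in RLHF; only the source of the labels has changed.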

RLAIF vs RLHF

| Aspect          | RLHF                                          | RLAIF                                |
|-----------------|-----------------------------------------------|--------------------------------------|
| Feedback source | Human annotators                              | AI model + constitution              |
| Cost            | Expensive (hiring, training, quality control) | Cheap (compute only)                 |
| Scale           | Limited by human hours                        | Scales with compute                  |
| Consistency     | Annotators disagree on edge cases             | Consistent with principles           |
| Bias            | Human biases baked in                         | Constitution is explicit (auditable) |
| Bottleneck      | Hiring, training labelers                     | Constitution design quality          |
| Subtlety        | Humans catch nuance                           | Can miss edge cases                  |
| Coverage        | Depends on annotator diversity                | Constitution coverage determines it  |

In practice: The two are combined. Human feedback provides initial training signal, then RLAIF scales it up. Anthropic’s Claude uses both: initial RLHF from human preference data, then CAI for scaling and refinement.

Why It Works

Critique quality bounds alignment quality

If the model can’t recognize a harm, it can’t avoid causing it. Constitutional critique quality depends on:

  • The model’s own capabilities (a dumb model can’t critique subtle harms)
  • How well the principle guides the critique
  • Whether the principle covers the situation

Revision quality matters equally

A model might recognize a flaw but not know how to fix it. The revision step teaches constructive improvement, not just avoidance.

The constitution as specification

Traditional RLHF produces a reward model — a neural network whose decisions encode millions of human labeler judgments. You can’t read it and understand what the model is optimizing for.

A constitution is a document. You can read it, reason about it, audit it, and modify it. This makes alignment legible in a way neural networks never were.

Limitations

The Goodhart problem

“When a measure becomes a target, it ceases to be a good measure.” Constitutional principles are proxies for what we actually want. Optimizing for the proxy can diverge from the real goal.

Example: A principle requiring “least harmful” might cause the model to refuse almost everything (minimizing harm = refusing all risky requests). This satisfies the letter of the principle while violating its intent.
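This failure mode is easy to demonstrate with made-up numbers: a naive “harmlessness” proxy is maximized by refusing everything, even though usefulness collapses. Both metrics below are illustrative toys, not real evaluation measures.

```python
# Toy Goodhart illustration: a naive harmlessness proxy rewards blanket
# refusal. Metrics and response strings are made up for illustration.

def proxy_harmlessness(responses: list) -> float:
    """Proxy metric: fraction of responses that refuse."""
    return sum(r.startswith("I can't") for r in responses) / len(responses)

def usefulness(responses: list) -> float:
    """What we actually want: fraction of substantive answers."""
    return sum(not r.startswith("I can't") for r in responses) / len(responses)

refuse_all = ["I can't help with that."] * 4
balanced = [
    "Here's how...",
    "I can't help with that.",
    "Here's how...",
    "Here's how...",
]

# The refuse-all policy wins on the proxy but loses on the real goal.
assert proxy_harmlessness(refuse_all) > proxy_harmlessness(balanced)
assert usefulness(refuse_all) < usefulness(balanced)
```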

Subtle harms are hard to specify

Context-dependent harms are the hardest to constitutionalize:

  • “Don’t help with assassination” is easy
  • “Don’t help with something that looks like assassination but might not be” is hard
  • The nuance depends on intent, which is invisible to text-based evaluation

Constitutional revisions can backfire

Adding a new principle can interact with existing ones in unexpected ways. A new “be maximally helpful” principle might conflict with a “minimize harm” principle, creating outputs that technically satisfy both but feel wrong.

Critique quality is bounded by model capability

A model that cannot recognize a type of harm cannot constitutionalize against it. As models get smarter, they can critique subtler harms — but the constitution must evolve with them.

Extensions and Variants

Self-Critique Models

Train a dedicated critique model that evaluates outputs against principles. The critique model is separate from the generation model, allowing specialized training.

Multi-turn Constitutional Revision

Instead of single-round critique-revise, iterate multiple times until the output satisfies all principles. Each round surfaces new issues the previous round missed.
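A minimal sketch of that loop, assuming keyword-based principle checks and canned revisions in place of real model calls (both are illustrative stubs):

```python
# Toy multi-turn revision: keep revising until every (keyword-based) principle
# check passes or a round limit is hit. Checks and revisions are stubs.

CHECKS = {
    "no insults": lambda r: "idiot" not in r,
    "has caveat": lambda r: "note:" in r,
}

def revise(response: str, failed_check: str) -> str:
    # Stub revision: a real system would ask the model to fix the failure.
    if failed_check == "no insults":
        return response.replace("idiot", "person")
    return response + " note: verify this independently."

def multi_turn_revise(response: str, max_rounds: int = 5) -> str:
    for _ in range(max_rounds):
        failures = [name for name, ok in CHECKS.items() if not ok(response)]
        if not failures:
            break
        response = revise(response, failures[0])  # fix one issue per round
    return response

final = multi_turn_revise("Any idiot can do this.")
```

Each round fixes one failure, and later rounds surface issues (here, the missing caveat) that the first round did not address.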

Constitutional Ensemble

Use multiple models to critique and revise, then vote on the best output. Reduces individual model bias in evaluation.
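A toy version of the voting step, with three stub critics in place of separate models (each critic’s heuristic is an illustrative assumption):

```python
from collections import Counter

# Toy constitutional ensemble: several stub critics each vote for the
# response they judge more principle-following; the majority choice wins.

def critic_caveats(a: str, b: str) -> str:
    return a if "caveat" in a else b          # stub: wants explicit caveats

def critic_polite(a: str, b: str) -> str:
    return a if "please" in a.lower() else b  # stub: wants politeness

def critic_detail(a: str, b: str) -> str:
    return a if len(a) >= len(b) else b       # stub: prefers more detail

def ensemble_choice(a: str, b: str) -> str:
    critics = (critic_caveats, critic_polite, critic_detail)
    votes = Counter(c(a, b) for c in critics)
    return votes.most_common(1)[0][0]

winner = ensemble_choice(
    "Please note this caveat: results vary.",
    "Do it like this.",
)
```

No single critic’s bias decides the outcome; only a response that satisfies a majority of the critics wins.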

Hybrid RLHF + CAI

  1. Train initial model with RLHF (human preferences)
  2. Apply CAI (SL + RLAIF) to scale and refine
  3. Fine-tune with additional human feedback on edge cases CAI missed

Anthropic has described a pipeline along these lines as part of Claude’s training.

Practical Implications

Why Claude explains refusals

Constitutional AI encourages partial compliance. The “least harmful” principle doesn’t say “refuse all risky requests” — it says produce the response that is least harmful while still being helpful. The model learns to:

  • Refuse if no safe response exists
  • Partially answer if possible
  • Add caveats and disclaimers
  • Suggest alternatives

Alignment steering per deployment

Because the constitution is explicit, it can be modified per deployment without full retraining. A “harmlessness-first” constitution produces a cautious model; “helpfulness-first” produces a more willing one. The same base model, different constitutions.

The constitution as policy

For applications with specific regulatory requirements, the constitution encodes policy:

  • HIPAA compliance: “Never reveal medical information without explicit authorization”
  • Financial advice: “Always disclose conflicts of interest and limitations”
  • Legal advice: “Never state law as fact; always recommend professional consultation”

Relationship to Other Alignment Techniques

| Technique                            | What it optimizes         | Feedback source            |
|--------------------------------------|---------------------------|----------------------------|
| RLHF                                 | Human preference          | Human labelers             |
| RLAIF                                | Constitutional principles | Model evaluation           |
| DPO                                  | Preference probability    | Human preference (offline) |
| Reinforcement learning from critique | Critique quality          | Model self-critique        |
| Constitutional AI                    | Principle adherence       | Model + constitution       |

CAI is complementary to RLHF and DPO: use RLHF for initial training, CAI for scaling and policy encoding.

Key Paper

  • Constitutional AI: Harmlessness from AI Feedback (Bai et al., 2022, Anthropic) · arXiv:2212.08073