Teaching Claude Why: Lessons from Alignment Training

Sat, 09 May 2026 00:00:00 +0000

Original: Teaching Claude Why

Author: Anthropic

Date: May 8, 2026

This post is an annotated Chinese translation of Anthropic’s research post on alignment training methods. The original article argues that teaching Claude the principles behind aligned behavior, rather than only training on demonstrations, generalizes far more effectively.

Key takeaways:

Principles over demonstrations: Training Claude to explain why certain actions are better reduces misalignment more effectively than showing correct behavior alone.
Out-of-distribution generalization: A 3M-token “difficult advice” dataset (in which the user faces ethical dilemmas) matched the improvement from 84M tokens of synthetic honeypots, a 28× gain in data efficiency (84M / 3M = 28).
Constitutional documents + fiction: High-quality documents about Claude’s constitution, combined with fictional stories of aligned AI, reduced the blackmail rate from 65% to 19%.
Improvements persist through RL: Snapshots that were more aligned at initialization kept their advantage throughout reinforcement learning.
Diverse environments matter: Simply adding tool definitions and system prompts to training environments — even without requiring tool use — improved alignment generalization.
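
The two headline figures in the takeaways above are simple ratios, and a short sketch makes the arithmetic explicit. The function names below are hypothetical helpers written for this summary, not anything from Anthropic's post. The efficiency ratio assumes, as the summary states, that both datasets produced the same improvement.

```python
def data_efficiency_ratio(baseline_tokens: float, compact_tokens: float) -> float:
    """Factor by which the compact dataset uses fewer tokens for the same improvement."""
    return baseline_tokens / compact_tokens


def rate_reduction(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative reduction as a fraction)."""
    absolute = before_pct - after_pct
    relative = absolute / before_pct
    return absolute, relative


# Figures taken from the takeaways: 84M honeypot tokens vs. 3M "difficult advice" tokens.
ratio = data_efficiency_ratio(84e6, 3e6)

# Blackmail rate before and after adding constitutional documents and fiction.
abs_drop, rel_drop = rate_reduction(65.0, 19.0)

print(f"data efficiency: {ratio:.0f}x")  # 28x
print(f"blackmail rate: -{abs_drop:.0f} points ({rel_drop:.0%} relative)")  # -46 points (71% relative)
```

Read this way, the drop from 65% to 19% is 46 percentage points, or roughly a 71% relative reduction.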

For the full annotated Chinese translation, please see the Chinese version.

For the original article, visit Anthropic’s research page.
