Original: Teaching Claude Why
Author: Anthropic
Date: May 8, 2026
This is a Chinese translation with annotations of Anthropic’s research post on alignment training methods. The original article discusses how teaching Claude the principles behind aligned behavior — rather than just training on demonstrations — proves far more effective for generalization.
Key takeaways:
- Principles over demonstrations: Training Claude to explain why certain actions are better reduces misalignment more effectively than showing correct behavior alone.
- Out-of-distribution generalization: A 3M-token “difficult advice” dataset (where the user faces ethical dilemmas) achieved the same improvement as 84M tokens of synthetic honeypots, a 28× gain in data efficiency.
- Constitutional documents + fiction: High-quality documents about Claude’s constitution, combined with fictional stories of aligned AI, reduced the blackmail rate from 65% to 19%.
- Improvements persist through RL: Snapshots that were more aligned at initialization maintained their advantage throughout reinforcement learning.
- Diverse environments matter: Simply adding tool definitions and system prompts to training environments — even without requiring tool use — improved alignment generalization.
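As a quick check on the figures quoted in the takeaways, the sketch below recomputes the 28× data-efficiency ratio and expresses the blackmail-rate change as both an absolute and a relative reduction. Only the four numbers (3M, 84M, 65%, 19%) come from the summary. The function and variable names are illustrative, and the relative reduction of roughly 71% is a derived figure, not one stated in the article.

```python
# Figures quoted in the takeaways; everything else here is illustrative.
ADVICE_TOKENS_M = 3       # "difficult advice" dataset, in millions of tokens
HONEYPOT_TOKENS_M = 84    # synthetic honeypot dataset, in millions of tokens
BLACKMAIL_BEFORE = 0.65   # blackmail rate before the intervention
BLACKMAIL_AFTER = 0.19    # blackmail rate after the intervention


def data_efficiency_ratio(small_tokens: float, large_tokens: float) -> float:
    """How many times fewer tokens the smaller dataset needs for the same gain."""
    return large_tokens / small_tokens


def rate_reduction(before: float, after: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute_points = (before - after) * 100
    relative = (before - after) / before
    return absolute_points, relative


ratio = data_efficiency_ratio(ADVICE_TOKENS_M, HONEYPOT_TOKENS_M)
points, relative = rate_reduction(BLACKMAIL_BEFORE, BLACKMAIL_AFTER)

print(f"Data efficiency: {ratio:.0f}x")
print(f"Blackmail rate: -{points:.0f} points ({relative:.1%} relative reduction)")
```

The two views of the blackmail figure answer different questions: the 46-point drop is what the takeaway reports directly, while the relative reduction shows how much of the original rate was removed.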
For the full annotated Chinese translation, please see the Chinese version.
For the original article, visit Anthropic’s research page.