Original: Teaching Claude Why

Author: Anthropic

Date: May 8, 2026


This is a Chinese translation with annotations of Anthropic’s research post on alignment training methods. The original article discusses how teaching Claude the principles behind aligned behavior — rather than just training on demonstrations — proves far more effective for generalization.

Key takeaways:

  • Principles over demonstrations: Training Claude to explain why certain actions are better reduces misalignment more effectively than showing correct behavior alone.
  • Out-of-distribution generalization: A 3M-token “difficult advice” dataset (where the user faces ethical dilemmas) achieved the same improvement as 84M tokens of synthetic honeypots, making it 28× more data-efficient.
  • Constitutional documents + fiction: High-quality documents about Claude’s constitution, combined with fictional stories of aligned AI, reduced the blackmail rate from 65% to 19%.
  • Improvements persist through RL: More aligned initialization snapshots maintained their advantage throughout reinforcement learning.
  • Diverse environments matter: Simply adding tool definitions and system prompts to training environments — even without requiring tool use — improved alignment generalization.
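
The two numeric claims in the takeaways above can be checked with simple arithmetic. The sketch below uses only the figures quoted in this summary. The variable names are illustrative, not from the original article, and the relative-reduction figure is derived here rather than stated in the source.

```python
# Figures quoted in the takeaways; all names below are illustrative only.
advice_tokens = 3_000_000      # "difficult advice" dataset size
honeypot_tokens = 84_000_000   # synthetic honeypot dataset size

# Data-efficiency ratio: how many times fewer tokens gave the same improvement.
efficiency_ratio = honeypot_tokens / advice_tokens

# Blackmail rate before and after, in percent.
blackmail_before = 65
blackmail_after = 19
absolute_drop = blackmail_before - blackmail_after   # percentage points
relative_drop = absolute_drop / blackmail_before     # fraction of the baseline

print(f"Data efficiency: {efficiency_ratio:.0f}x")
print(f"Blackmail rate drop: {absolute_drop} points ({relative_drop:.0%} relative)")
```

This gives a 28× ratio, matching the stated figure, and a drop of 46 percentage points, or roughly 71% relative to the 65% baseline.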

For the full annotated Chinese translation, please see the Chinese version.

For the original article, visit Anthropic’s research page.