Teaching Claude Why: Lessons from Alignment Training

Original: Teaching Claude Why
Author: Anthropic
Date: May 8, 2026

This is a Chinese translation with annotations of Anthropic’s research post on alignment training methods. The original article discusses how teaching Claude the principles behind aligned behavior, rather than just training on demonstrations, proves far more effective for generalization.

Key takeaways:

Principles over demonstrations: Training Claude to explain why certain actions are better reduces misalignment more effectively than showing correct behavior alone.
Out-of-distribution generalization: A 3M-token “difficult advice” dataset (where the user faces ethical dilemmas) achieved the same improvement as 84M tokens of synthetic honeypots, a 28× gain in data efficiency.
Constitutional documents + fiction: High-quality documents about Claude’s constitution, combined with fictional stories of aligned AI, reduced the blackmail rate from 65% to 19%.
Improvements persist through RL: More aligned initialization snapshots maintained their advantage throughout reinforcement learning.
Diverse environments matter: Simply adding tool definitions and system prompts to training environments, even without requiring tool use, improved alignment generalization.

For the full annotated Chinese translation, please see the Chinese version. ...

May 9, 2026 · 1 min · 182 words · Chunhao Zhang

Natural Language Autoencoders: Turning Claude's Thoughts into Text

Original post: Natural Language Autoencoders
Full paper: transformer-circuits.pub/2026/nla
Code: github.com/kitft/natural_language_autoencoders
Interactive demo: neuronpedia.org/nla

Summary

Anthropic introduces Natural Language Autoencoders (NLAs), a method for converting a language model’s internal activations into human-readable natural language explanations. The approach trains two model components jointly: an Activation Verbalizer that translates activations into text, and an Activation Reconstructor that recovers the original activation from the text alone. The quality of explanations is measured by how accurately the activation can be reconstructed. ...

May 8, 2026 · 2 min · 218 words · Chunhao Zhang