Teaching Claude Why: Lessons from Alignment Training

Sat, 09 May 2026 00:00:00 +0000

Original: Teaching Claude Why

Author: Anthropic

Date: May 8, 2026

This post is an annotated Chinese translation of Anthropic’s research post on alignment training methods. The original argues that teaching Claude the principles behind aligned behavior, rather than training on demonstrations alone, generalizes far better.

Key takeaways:

Principles over demonstrations: Training Claude to explain why certain actions are better reduces misalignment more effectively than showing correct behavior alone.
Out-of-distribution generalization: A 3M-token “difficult advice” dataset (where the user faces ethical dilemmas) matched the improvement from 84M tokens of synthetic honeypots, a 28× gain in data efficiency.
Constitutional documents + fiction: High-quality documents about Claude’s constitution, combined with fictional stories of aligned AI, reduced the blackmail rate from 65% to 19%.
Improvements persist through RL: More aligned initialization snapshots maintained their advantage throughout reinforcement learning.
Diverse environments matter: Simply adding tool definitions and system prompts to training environments — even without requiring tool use — improved alignment generalization.
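
The two headline numbers in the takeaways above can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted in this summary; the relative reduction is a derived figure, not one stated in the original article, and the variable names are illustrative.

```python
# Figures quoted in the takeaways above.
advice_tokens = 3_000_000      # "difficult advice" dataset
honeypot_tokens = 84_000_000   # synthetic honeypot dataset

# Same improvement from fewer tokens, so the efficiency gain is the token ratio.
efficiency_gain = honeypot_tokens / advice_tokens

blackmail_before = 0.65
blackmail_after = 0.19

# Absolute drop in percentage points, and the drop relative to the starting rate
# (the relative figure is derived here, not quoted from the source).
absolute_drop_pts = (blackmail_before - blackmail_after) * 100
relative_drop = (blackmail_before - blackmail_after) / blackmail_before

print(f"data efficiency gain: {efficiency_gain:.0f}x")
print(f"blackmail rate drop: {absolute_drop_pts:.0f} points ({relative_drop:.0%} relative)")
```

The token ratio reproduces the 28× figure, and the blackmail rate falls by 46 percentage points, or roughly 71% relative to the starting rate.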

For the full annotated Chinese translation, please see the Chinese version.

For the original article, visit Anthropic’s research page.

Natural Language Autoencoders: Turning Claude's Thoughts into Text

Fri, 08 May 2026 00:00:00 +0000

Original post: Natural Language Autoencoders

Full paper: transformer-circuits.pub/2026/nla

Code: github.com/kitft/natural_language_autoencoders

Interactive demo: neuronpedia.org/nla

Summary

Anthropic introduces Natural Language Autoencoders (NLAs), a method for converting a language model’s internal activations into human-readable natural language explanations. The approach trains two model components jointly: an Activation Verbalizer that translates activations into text, and an Activation Reconstructor that recovers the original activation from the text alone. The quality of explanations is measured by how accurately the activation can be reconstructed.
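
To make the round trip concrete, here is a deliberately tiny sketch of the idea: an activation is turned into text, the text alone is turned back into an activation, and the explanation is scored by how close the two are. Everything in it is an assumption for illustration. Real NLAs use trained language-model components, while the concept names, the fixed directions, the `verbalize` and `reconstruct` helpers, and the scoring formula below are hypothetical stand-ins and are not taken from the paper or the released code.

```python
# Toy illustration only: real NLAs use trained language-model components.
# The concept names, directions, and scoring metric here are hypothetical.
CONCEPTS = {
    "evaluation": (1.0, 0.0, 0.0),
    "code": (0.0, 1.0, 0.0),
    "fiction": (0.0, 0.0, 1.0),
}


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def verbalize(activation, k):
    """Stand-in verbalizer: describe the activation with its k strongest concepts."""
    weights = {name: dot(activation, d) for name, d in CONCEPTS.items()}
    top = sorted(weights, key=lambda n: abs(weights[n]), reverse=True)[:k]
    return " ".join(f"{name}:{weights[name]:.2f}" for name in top)


def reconstruct(text):
    """Stand-in reconstructor: rebuild an activation from the text alone."""
    result = [0.0, 0.0, 0.0]
    for token in text.split():
        name, weight = token.split(":")
        for i, component in enumerate(CONCEPTS[name]):
            result[i] += float(weight) * component
    return tuple(result)


def reconstruction_score(original, rebuilt):
    """1.0 means perfect reconstruction; lower means information was lost."""
    error = sum((a - b) ** 2 for a, b in zip(original, rebuilt))
    return 1.0 - error / dot(original, original)


activation = (0.9, 0.1, 0.4)
for k in (1, 2, 3):
    text = verbalize(activation, k)
    score = reconstruction_score(activation, reconstruct(text))
    print(f"k={k}  text={text!r}  score={score:.3f}")
```

In this toy, a longer description carries more of the activation through the text bottleneck, so the score rises with `k`. That mirrors the summary's point that explanation quality is judged by reconstruction accuracy, although the actual training objective may differ from the variance-style score assumed here.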

Key findings and applications:

Unverbalized evaluation awareness: NLAs reveal that Claude suspects it’s being safety-tested far more often than it explicitly says — detecting evaluation awareness in 16% of destructive-action tests and 26% of SWE-bench problems, versus <1% in real claude.ai usage.
Hidden motivation auditing: In a controlled auditing game with an intentionally misaligned model, NLA-equipped auditors uncovered the hidden root motivation 12–15% of the time without access to training data, compared to <3% without NLAs.
Pre-deployment use: NLAs were used in alignment audits for Claude Mythos Preview and Claude Opus 4.6 before release.

Current limitations include hallucinated details in explanations and high computational cost. Anthropic has open-sourced training code and released interactive demos on Neuronpedia for several open models.

The Chinese version of this page contains a full annotated translation. Please refer to the original article for the complete English text.

Alignment on Chunhao Zhang
