Original post: Natural Language Autoencoders
Full paper: transformer-circuits.pub/2026/nla
Code: github.com/kitft/natural_language_autoencoders
Interactive demo: neuronpedia.org/nla
Summary
Anthropic introduces Natural Language Autoencoders (NLAs), a method for converting a language model’s internal activations into human-readable natural language explanations. The approach trains two model components jointly: an Activation Verbalizer that translates activations into text, and an Activation Reconstructor that recovers the original activation from the text alone. The quality of explanations is measured by how accurately the activation can be reconstructed.
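The round trip described above can be illustrated with a toy sketch. In the actual method both components are trained language models. Here they are hand-written stand-ins, and the `CONCEPTS` labels, the function names, and the use of cosine similarity as the reconstruction score are all assumptions made for illustration (the summary does not say which metric is used).

```python
import math

# Hypothetical labels for a 4-dimensional toy "activation". Real activations
# have thousands of dimensions and no such labels; this is an assumption.
CONCEPTS = ["code", "safety test", "French", "arithmetic"]


def verbalize(activation, top_k=2):
    """Toy stand-in for the Activation Verbalizer: activation -> text.

    Describes only the top_k largest-magnitude dimensions, so the text is
    a lossy summary of the activation.
    """
    ranked = sorted(range(len(activation)), key=lambda i: -abs(activation[i]))
    parts = [f"{CONCEPTS[i]}={activation[i]:.1f}" for i in ranked[:top_k]]
    return "; ".join(parts)


def reconstruct(text):
    """Toy stand-in for the Activation Reconstructor: text -> activation.

    Sees only the text, never the original activation.
    """
    vec = [0.0] * len(CONCEPTS)
    for part in text.split("; "):
        name, value = part.rsplit("=", 1)
        vec[CONCEPTS.index(name)] = float(value)
    return vec


def cosine_similarity(a, b):
    """Assumed reconstruction score; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def round_trip_score(activation, top_k=2):
    """Verbalize, reconstruct from the text alone, and score the result."""
    explanation = verbalize(activation, top_k)
    return explanation, cosine_similarity(activation, reconstruct(explanation))


if __name__ == "__main__":
    activation = [0.1, 2.0, -0.2, 1.5]
    for k in (1, 2, 4):
        text, score = round_trip_score(activation, top_k=k)
        print(f"top_k={k} score={score:.3f} text={text!r}")
```

In this toy setting, an explanation that mentions more of the activation's content reconstructs it more accurately, which is the sense in which reconstruction accuracy serves as a proxy for explanation quality.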
Key findings and applications:
- Unverbalized evaluation awareness: NLAs reveal that Claude suspects it’s being safety-tested far more often than it explicitly says. They detect evaluation awareness in 16% of destructive-action tests and 26% of SWE-bench problems, versus <1% in real claude.ai usage.
- Hidden motivation auditing: In a controlled auditing game with an intentionally misaligned model, NLA-equipped auditors uncovered the hidden root motivation 12–15% of the time without access to training data, compared to <3% without NLAs.
- Pre-deployment use: NLAs were used in alignment audits for Claude Mythos Preview and Claude Opus 4.6 before release.
Current limitations include hallucinated details in explanations and high computational cost. Anthropic has open-sourced training code and released interactive demos on Neuronpedia for several open models.
The Chinese version of this page contains a full annotated translation. Please refer to the original article for the complete English text.