Interpretability

Original post: Natural Language Autoencoders
Full paper: transformer-circuits.pub/2026/nla
Code: github.com/kitft/natural_language_autoencoders
Interactive demo: neuronpedia.org/nla

Summary

Anthropic introduces Natural Language Autoencoders (NLAs), a method for converting a language model’s internal activations into human-readable natural language explanations. The approach trains two model components jointly: an Activation Verbalizer that translates activations into text, and an Activation Reconstructor that recovers the original activation from the text alone. The quality of explanations is measured by how accurately the activation can be reconstructed. ...
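
The verbalize-then-reconstruct round trip can be illustrated with toy stand-ins. This is a minimal sketch, not the paper's implementation: the actual verbalizer and reconstructor are trained model components, whereas `verbalize`, `reconstruct`, and `reconstruction_score` below are hypothetical names with hand-written logic. The score used here (one minus normalized squared error) is also an assumption chosen for illustration; the summary does not say which reconstruction metric the paper uses.

```python
import re


def verbalize(activation, top_k=2):
    """Toy stand-in for an Activation Verbalizer (hypothetical).

    Describes only the top_k largest-magnitude dimensions as text,
    so the explanation is lossy by construction.
    """
    ranked = sorted(range(len(activation)),
                    key=lambda i: abs(activation[i]), reverse=True)
    parts = [f"dim {i} = {activation[i]:.2f}" for i in sorted(ranked[:top_k])]
    return f"size {len(activation)}; " + ", ".join(parts)


def reconstruct(text):
    """Toy stand-in for an Activation Reconstructor (hypothetical).

    Rebuilds a vector from the text alone; unmentioned dimensions are 0.
    """
    size = int(re.search(r"size (\d+)", text).group(1))
    vec = [0.0] * size
    for idx, val in re.findall(r"dim (\d+) = (-?\d+\.\d+)", text):
        vec[int(idx)] = float(val)
    return vec


def reconstruction_score(original, recovered):
    """Assumed metric: 1 - normalized squared error (1.0 is a perfect match)."""
    norm = sum(o ** 2 for o in original)
    if norm == 0:
        raise ValueError("score is undefined for an all-zero activation")
    err = sum((o - r) ** 2 for o, r in zip(original, recovered))
    return 1.0 - err / norm


activation = [0.1, -2.0, 0.0, 3.0]
explanation = verbalize(activation)      # activation -> text
recovered = reconstruct(explanation)     # text -> activation
score = reconstruction_score(activation, recovered)
print(explanation)
print(recovered, round(score, 4))
```

In this toy setup, raising `top_k` makes the explanation longer and the score higher, which mirrors the idea that an explanation is good to the extent that the activation can be recovered from it.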