Readings

Teaching Claude Why: Lessons from Alignment Training

Original: Teaching Claude Why
Author: Anthropic
Date: May 8, 2026

This is a Chinese translation with annotations of Anthropic’s research post on alignment training methods. The original article discusses how teaching Claude the principles behind aligned behavior, rather than just training on demonstrations, proves far more effective for generalization.

Key takeaways:

- Principles over demonstrations: Training Claude to explain why certain actions are better reduces misalignment more effectively than showing correct behavior alone.
- Out-of-distribution generalization: A 3M-token “difficult advice” dataset (where the user faces ethical dilemmas) achieved the same improvement as 84M tokens of synthetic honeypots, a 28× gain in data efficiency.
- Constitutional documents + fiction: High-quality documents about Claude’s constitution, combined with fictional stories of aligned AI, reduced the blackmail rate from 65% to 19%.
- Improvements persist through RL: More aligned initialization snapshots maintained their advantage throughout reinforcement learning.
- Diverse environments matter: Simply adding tool definitions and system prompts to training environments, even without requiring tool use, improved alignment generalization.

For the full annotated Chinese translation, please see the Chinese version. ...
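The two headline numbers in the takeaways above can be checked with a few lines of arithmetic. This is only a sanity check on the quoted figures; the variable names are illustrative, and the relative reduction in blackmail rate is derived here rather than stated in the original post.

```python
# Figures quoted in the summary above.
honeypot_tokens = 84_000_000  # synthetic honeypot dataset
advice_tokens = 3_000_000     # "difficult advice" dataset

# Same improvement from fewer tokens, so the efficiency gain is the token ratio.
efficiency_ratio = honeypot_tokens / advice_tokens

blackmail_before = 0.65
blackmail_after = 0.19
absolute_drop = blackmail_before - blackmail_after  # in rate units
relative_drop = absolute_drop / blackmail_before    # derived, not from the post

print(f"data efficiency: {efficiency_ratio:.0f}x")
print(f"blackmail rate: down {absolute_drop * 100:.0f} points "
      f"({relative_drop:.0%} relative)")
```

The ratio reproduces the 28× figure, and the drop from 65% to 19% corresponds to 46 percentage points.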

Natural Language Autoencoders: Turning Claude's Thoughts into Text

Original post: Natural Language Autoencoders
Full paper: transformer-circuits.pub/2026/nla
Code: github.com/kitft/natural_language_autoencoders
Interactive demo: neuronpedia.org/nla

Summary

Anthropic introduces Natural Language Autoencoders (NLAs), a method for converting a language model’s internal activations into human-readable natural language explanations. The approach trains two model components jointly: an Activation Verbalizer that translates activations into text, and an Activation Reconstructor that recovers the original activation from the text alone. The quality of explanations is measured by how accurately the activation can be reconstructed. ...
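To make the reconstruction-based scoring idea concrete, here is a minimal sketch of how a reconstructed activation might be compared with the original. The summary does not say which metric the paper uses, so the two metrics below (normalized squared error and cosine similarity) are assumptions chosen for illustration, and the toy vectors stand in for real activations and for the output of a hypothetical reconstructor.

```python
import math

def normalized_mse(original, reconstructed):
    """Squared reconstruction error divided by the original's squared norm.
    Assumed metric: 0.0 means perfect reconstruction."""
    err = sum((x - y) ** 2 for x, y in zip(original, reconstructed))
    norm = sum(x * x for x in original)
    return err / norm

def cosine_similarity(a, b):
    """Assumed metric: 1.0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy data: one "activation" and two hypothetical reconstructions,
# as if produced from a good and a poor text explanation.
activation = [0.5, -1.0, 2.0, 0.0]
from_good_explanation = [0.4, -0.9, 2.1, 0.1]
from_poor_explanation = [-0.5, 1.0, 0.0, 2.0]

for name, recon in [("good", from_good_explanation),
                    ("poor", from_poor_explanation)]:
    print(name,
          round(normalized_mse(activation, recon), 4),
          round(cosine_similarity(activation, recon), 4))
```

Under either metric, the explanation whose reconstruction lands closer to the original activation scores better, which is the property the summary describes.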

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

Original paper: DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
Authors: DeepSeek-AI
Model checkpoints: https://huggingface.co/collections/deepseek-ai/deepseek-v4

Summary

DeepSeek-V4 presents a preview of two strong MoE language models, DeepSeek-V4-Pro (1.6T total / 49B activated) and DeepSeek-V4-Flash (284B total / 13B activated), both supporting a context length of one million tokens.

Key architectural innovations:

- Hybrid Compressed Attention: Combines Compressed Sparse Attention (CSA, compression rate m=4 with top-k sparse selection) and Heavily Compressed Attention (HCA, compression rate m’=128 with dense attention) in an interleaved configuration. At 1M-token context, this reduces single-token inference FLOPs to 27% and the KV cache to 10% relative to DeepSeek-V3.2.
- Manifold-Constrained Hyper-Connections (mHC): Constrains the residual mapping matrix to the manifold of doubly stochastic matrices (Birkhoff polytope), ensuring spectral norm ≤ 1 for stable deep-layer signal propagation. Uses Sinkhorn-Knopp iterations (t=20) for projection.
- Muon Optimizer: Adopted for most modules with hybrid Newton-Schulz iterations for orthogonalization. Paired with Anticipatory Routing (decoupling backbone and routing network updates) and SwiGLU clamping for training stability.

Post-training paradigm shift: Replaces mixed RL with domain-specific expert training (SFT → GRPO RL) followed by multi-teacher On-Policy Distillation (OPD) with full-vocabulary KL divergence. Over 10 teacher models are distilled into a single unified model. ...
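The Sinkhorn-Knopp step mentioned for mHC can be illustrated with a small, pure-Python sketch: alternately normalizing rows and columns of a positive matrix drives it toward a doubly stochastic one. This is not the paper's implementation. The exponential used to make entries positive, the function name `sinkhorn_knopp`, and the 3×3 example values are all assumptions for illustration; only the iteration count t=20 comes from the summary.

```python
import math

def sinkhorn_knopp(logits, iterations=20):
    """Approximately project a square matrix onto the doubly stochastic
    matrices by alternating row and column normalization.
    Assumption: entries are made positive with exp() before iterating."""
    n = len(logits)
    m = [[math.exp(v) for v in row] for row in logits]
    for _ in range(iterations):
        # Row step: scale each row so it sums to 1.
        for i in range(n):
            row_sum = sum(m[i])
            m[i] = [v / row_sum for v in m[i]]
        # Column step: scale each column so it sums to 1.
        for j in range(n):
            col_sum = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= col_sum
    return m

# Hypothetical unconstrained parameters for a 3x3 residual mapping.
example_logits = [
    [0.1, 0.5, 0.9],
    [0.7, 0.2, 0.4],
    [0.3, 0.8, 0.6],
]
projected = sinkhorn_knopp(example_logits)

row_sums = [sum(row) for row in projected]
col_sums = [sum(projected[i][j] for i in range(3)) for j in range(3)]
print("row sums:", [round(s, 6) for s in row_sums])
print("col sums:", [round(s, 6) for s in col_sums])
```

Once rows and columns each sum to 1 with nonnegative entries, the bound ‖A‖₂ ≤ √(‖A‖₁ ‖A‖∞) gives a spectral norm of at most 1, which is the stability property the summary attributes to the constraint.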

What 81,000 People Told Us About the Economics of AI

This is a Chinese translation with commentary of the original article by Anthropic.

Read the original here: What 81,000 people told us about the economics of AI
By Maxim Massenkoff, Anthropic · April 22, 2026

For the Chinese translation and annotated version, switch to the Chinese version (中文版).