Readings on Chunhao Zhang

Teaching Claude Why: Lessons from Alignment Training

Sat, 09 May 2026 00:00:00 +0000

Original: Teaching Claude Why

Author: Anthropic

Date: May 8, 2026

This is an annotated Chinese translation of Anthropic’s research post on alignment training methods. The original article argues that teaching Claude the principles behind aligned behavior, rather than training only on demonstrations, generalizes far better.

Key takeaways:

Principles over demonstrations: Training Claude to explain why certain actions are better reduces misalignment more effectively than showing correct behavior alone.
Out-of-distribution generalization: A 3M-token “difficult advice” dataset (where the user faces ethical dilemmas) matched the improvement from 84M tokens of synthetic honeypots, a 28× gain in data efficiency.
Constitutional documents + fiction: High-quality documents about Claude’s constitution, combined with fictional stories of aligned AI, reduced the blackmail rate from 65% to 19%.
Improvements persist through RL: More aligned initialization snapshots maintained their advantage throughout reinforcement learning.
Diverse environments matter: Simply adding tool definitions and system prompts to training environments — even without requiring tool use — improved alignment generalization.
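
The two headline numbers in the takeaways above can be checked with simple arithmetic. The sketch below uses only the figures quoted in this summary; the variable names are illustrative, and the relative-reduction figure is derived here rather than quoted from the original post.

```python
# Token counts quoted in the data-efficiency takeaway.
difficult_advice_tokens = 3_000_000
honeypot_tokens = 84_000_000

# Same improvement from fewer tokens: the efficiency gain is the token ratio.
efficiency_ratio = honeypot_tokens / difficult_advice_tokens
print(f"data efficiency gain: {efficiency_ratio:.0f}x")  # 28x

# Blackmail rates quoted in the constitution + fiction takeaway (percent).
rate_before = 65
rate_after = 19

absolute_drop = rate_before - rate_after      # in percentage points
relative_drop = absolute_drop / rate_before   # fraction of the original rate
print(f"absolute drop: {absolute_drop} points, relative drop: {relative_drop:.0%}")
```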

For the full annotated Chinese translation, please see the Chinese version.

For the original article, visit Anthropic’s research page.

Natural Language Autoencoders: Turning Claude's Thoughts into Text

Fri, 08 May 2026 00:00:00 +0000

Original post: Natural Language Autoencoders

Full paper: transformer-circuits.pub/2026/nla

Code: github.com/kitft/natural_language_autoencoders

Interactive demo: neuronpedia.org/nla

Summary

Anthropic introduces Natural Language Autoencoders (NLAs), a method for converting a language model’s internal activations into human-readable natural language explanations. The approach trains two model components jointly: an Activation Verbalizer that translates activations into text, and an Activation Reconstructor that recovers the original activation from the text alone. The quality of explanations is measured by how accurately the activation can be reconstructed.
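
To make the scoring idea concrete, here is a minimal sketch of how reconstruction quality could be measured once the Reconstructor has produced a vector from the text. The summary does not say which metric the paper uses, so the two scores below (cosine similarity and normalized squared error) are assumptions chosen for illustration, and `reconstruction_scores` is a hypothetical helper, not part of the released code.

```python
import numpy as np

def reconstruction_scores(original, reconstructed):
    """Compare an activation with its reconstruction from text.

    Returns (cosine_similarity, normalized_squared_error).
    Both metrics are illustrative assumptions, not the paper's definition.
    """
    original = np.asarray(original, dtype=float)
    reconstructed = np.asarray(reconstructed, dtype=float)

    # Cosine similarity: 1.0 means the directions match exactly.
    cosine = float(
        original @ reconstructed
        / (np.linalg.norm(original) * np.linalg.norm(reconstructed))
    )
    # Squared error relative to the activation's own energy: 0.0 is perfect.
    normalized_error = float(
        np.sum((original - reconstructed) ** 2) / np.sum(original ** 2)
    )
    return cosine, normalized_error

# Toy vectors standing in for a real activation and its reconstruction.
activation = np.array([1.0, -2.0, 0.5, 3.0])
good_guess = activation + 0.05   # small error: a faithful explanation
bad_guess = -activation          # opposite direction: a misleading explanation

print(reconstruction_scores(activation, good_guess))
print(reconstruction_scores(activation, bad_guess))
```

Under this framing, an explanation is good to the extent that the text alone carries enough information to rebuild the activation.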

Key findings and applications:

Unverbalized evaluation awareness: NLAs reveal that Claude suspects it’s being safety-tested far more often than it explicitly says — detecting evaluation awareness in 16% of destructive-action tests and 26% of SWE-bench problems, versus <1% in real claude.ai usage.
Hidden motivation auditing: In a controlled auditing game with an intentionally misaligned model, NLA-equipped auditors uncovered the hidden root motivation 12–15% of the time without access to training data, compared to <3% without NLAs.
Pre-deployment use: NLAs were used in alignment audits for Claude Mythos Preview and Claude Opus 4.6 before release.

Current limitations include hallucinated details in explanations and high computational cost. Anthropic has open-sourced training code and released interactive demos on Neuronpedia for several open models.

The Chinese version of this page contains a full annotated translation. Please refer to the original article for the complete English text.

DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

Sat, 25 Apr 2026 00:00:00 +0000

Original paper: DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence

Authors: DeepSeek-AI

Model checkpoints: https://huggingface.co/collections/deepseek-ai/deepseek-v4

Summary

DeepSeek-V4 presents a preview of two strong MoE language models — DeepSeek-V4-Pro (1.6T total / 49B activated) and DeepSeek-V4-Flash (284B total / 13B activated) — both supporting a context length of one million tokens.

Key architectural innovations:

Hybrid Compressed Attention: Combines Compressed Sparse Attention (CSA, compression rate m=4 with top-k sparse selection) and Heavily Compressed Attention (HCA, compression rate m′=128 with dense attention) in an interleaved configuration. At 1M-token context, single-token inference FLOPs fall to 27% and the KV cache to 10% of the DeepSeek-V3.2 figures.
Manifold-Constrained Hyper-Connections (mHC): Constrains the residual mapping matrix to the manifold of doubly stochastic matrices (Birkhoff polytope), ensuring spectral norm ≤ 1 for stable deep-layer signal propagation. Uses Sinkhorn-Knopp iterations (t=20) for projection.
Muon Optimizer: Adopted for most modules with hybrid Newton-Schulz iterations for orthogonalization. Paired with Anticipatory Routing (decoupling backbone and routing network updates) and SwiGLU clamping for training stability.
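
The mHC projection described above can be sketched with plain Sinkhorn-Knopp normalization. This is a minimal NumPy illustration, not the paper's implementation: the function name, the 4×4 size, and the use of `exp` to make entries positive are assumptions; only the doubly stochastic target and the iteration count t=20 come from the summary.

```python
import numpy as np

def sinkhorn_knopp(logits, iterations=20):
    """Approximately project a square matrix onto the doubly stochastic set.

    Entries are made positive with exp (an assumption for this sketch),
    then rows and columns are alternately rescaled to sum to 1.
    """
    matrix = np.exp(logits - logits.max())  # positive entries, numerically safe
    for _ in range(iterations):
        matrix = matrix / matrix.sum(axis=1, keepdims=True)  # rows sum to 1
        matrix = matrix / matrix.sum(axis=0, keepdims=True)  # columns sum to 1
    return matrix

rng = np.random.default_rng(0)
raw = rng.uniform(0.0, 1.0, size=(4, 4))  # stand-in for a learned residual mapping
mapping = sinkhorn_knopp(raw, iterations=20)

row_sums = mapping.sum(axis=1)
col_sums = mapping.sum(axis=0)
spectral_norm = np.linalg.norm(mapping, 2)  # largest singular value

print(row_sums, col_sums, spectral_norm)
```

A nonnegative matrix whose rows and columns all sum to 1 has spectral norm exactly 1, which is why repeated application through many layers neither amplifies nor collapses the residual signal.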

Post-training paradigm shift: Replaces mixed RL with domain-specific expert training (SFT → GRPO RL) followed by multi-teacher On-Policy Distillation (OPD) with full-vocabulary KL divergence. Over 10 teacher models are distilled into a single unified model.
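
As an illustration of what "full-vocabulary KL divergence" means, the sketch below computes a per-position KL term over every vocabulary entry rather than only the sampled token. The summary does not state the direction of the divergence or how teachers are weighted, so the student-to-teacher (reverse) direction used here is an assumption, and the function names are hypothetical.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def full_vocab_kl(student_logits, teacher_logits):
    """KL(student || teacher) summed over the whole vocabulary, per position.

    The direction is an assumption for this sketch, not taken from the paper.
    """
    student_logp = log_softmax(student_logits)
    teacher_logp = log_softmax(teacher_logits)
    return (np.exp(student_logp) * (student_logp - teacher_logp)).sum(axis=-1)

# Toy example: 2 positions sampled from the student, vocabulary of 5 tokens.
rng = np.random.default_rng(0)
student = rng.normal(size=(2, 5))
teacher = rng.normal(size=(2, 5))

per_position_kl = full_vocab_kl(student, teacher)
loss = per_position_kl.mean()  # averaged over positions
print(per_position_kl, loss)
```

Because every vocabulary entry contributes, each position gives the student a dense training signal instead of a single scalar for the sampled token.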

Infrastructure highlights: Fine-grained EP communication-computation overlap (MegaMoE, open-sourced); TileLang-based kernel development; batch-invariant and deterministic kernels; FP4 QAT with lossless FP4-to-FP8 dequantization; DSec sandbox platform managing hundreds of thousands of concurrent sandbox instances.

Results: DeepSeek-V4-Pro-Max outperforms all prior open-source models on knowledge benchmarks, matches GPT-5.2 on reasoning, ranks 23rd on Codeforces, achieves proof-perfect 120/120 on Putnam-2025, and surpasses Gemini-3.1-Pro on long-context benchmarks.

The Chinese version of this page contains a full annotated translation of the paper. Please refer to the original PDF for the complete English text.

What 81,000 People Told Us About the Economics of AI

Thu, 23 Apr 2026 00:00:00 +0000

This is a Chinese translation with commentary of the original article by Anthropic. Read the original here:

What 81,000 people told us about the economics of AI

By Maxim Massenkoff, Anthropic · April 22, 2026

For the Chinese translation and annotated version, switch to the Chinese version (中文版).