MoE | Chunhao Zhang

Original paper: DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence
Authors: DeepSeek-AI
Model checkpoints: https://huggingface.co/collections/deepseek-ai/deepseek-v4

Summary

DeepSeek-V4 presents a preview of two strong MoE language models, DeepSeek-V4-Pro (1.6T total / 49B activated) and DeepSeek-V4-Flash (284B total / 13B activated), both supporting a context length of one million tokens.

Key architectural innovations:

Hybrid Compressed Attention: Combines Compressed Sparse Attention (CSA, compression rate m=4 with top-k sparse selection) and Heavily Compressed Attention (HCA, compression rate m’=128 with dense attention) in an interleaved configuration. At 1M-token context, this reduces single-token inference FLOPs to 27% and KV cache to 10% compared to DeepSeek-V3.2.

Manifold-Constrained Hyper-Connections (mHC): Constrains the residual mapping matrix to the manifold of doubly stochastic matrices (Birkhoff polytope), ensuring spectral norm ≤ 1 for stable deep-layer signal propagation. Uses Sinkhorn-Knopp iterations (t=20) for projection.

Muon Optimizer: Adopted for most modules with hybrid Newton-Schulz iterations for orthogonalization. Paired with Anticipatory Routing (decoupling backbone and routing network updates) and SwiGLU clamping for training stability.

Post-training paradigm shift: Replaces mixed RL with domain-specific expert training (SFT → GRPO RL) followed by multi-teacher On-Policy Distillation (OPD) with full-vocabulary KL divergence. Over 10 teacher models are distilled into a single unified model.

...
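To make the mHC point above concrete, here is a minimal sketch of a Sinkhorn-Knopp projection with t=20 iterations, plus a check of the spectral-norm claim. It is an illustration under stated assumptions, not the paper's implementation: the summary does not say how the raw matrix is made positive or what shape it has, so the `exp` parameterization, the 4x4 size, the example values, and the helper names `sinkhorn_knopp` and `spectral_norm` are all hypothetical. The bound itself follows from a standard inequality: for a doubly stochastic matrix, every row and column sums to 1, so the spectral norm is at most sqrt(max column sum × max row sum) = 1.

```python
import math


def sinkhorn_knopp(logits, iters=20):
    """Approximately map a square matrix to a doubly stochastic one.

    Assumption: entries are made positive with exp (not confirmed by the
    summary), then rows and columns are normalized alternately.
    """
    n = len(logits)
    # Subtract the global max before exponentiating for numerical stability.
    peak = max(max(row) for row in logits)
    m = [[math.exp(x - peak) for x in row] for row in logits]
    for _ in range(iters):
        # Row normalization: every row sums to 1.
        m = [[x / sum(row) for x in row] for row in m]
        # Column normalization: every column sums to 1.
        col_sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
        m = [[m[i][j] / col_sums[j] for j in range(n)] for i in range(n)]
    return m


def spectral_norm(m, power_iters=100):
    """Estimate the largest singular value by power iteration on M^T M."""
    n = len(m)
    v = [float(j + 1) for j in range(n)]  # arbitrary nonzero start vector
    for _ in range(power_iters):
        mv = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
        w = [sum(m[i][j] * mv[i] for i in range(n)) for j in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(m[i][j] * v[j] for j in range(n)) for i in range(n)]
    return math.sqrt(sum(x * x for x in mv))


# Hypothetical unconstrained residual-mixing parameters (illustrative values).
raw = [
    [0.5, -0.2, 0.1, 0.0],
    [-0.3, 0.8, 0.0, 0.2],
    [0.1, 0.0, -0.5, 0.4],
    [0.0, 0.3, 0.2, -0.1],
]
mixing = sinkhorn_knopp(raw, iters=20)
print("row sums:", [round(sum(row), 6) for row in mixing])
print("col sums:", [round(sum(mixing[i][j] for i in range(4)), 6) for j in range(4)])
print("spectral norm estimate:", spectral_norm(mixing))
```

Because the projected matrix cannot amplify any input vector, repeated application across many layers keeps the residual signal bounded, which is the stability property the summary attributes to mHC.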