<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>MoE on Chunhao Zhang</title>
    <link>https://blog-6sm.pages.dev/en/tags/moe/</link>
    <description>Recent content in MoE on Chunhao Zhang</description>
    <image>
      <title>Chunhao Zhang</title>
      <url>https://blog-6sm.pages.dev/images/og-default.png</url>
      <link>https://blog-6sm.pages.dev/images/og-default.png</link>
    </image>
    <generator>Hugo</generator>
    <language>en</language>
    <copyright>2026</copyright>
    <lastBuildDate>Sat, 25 Apr 2026 00:00:00 +0000</lastBuildDate>
    <atom:link href="https://blog-6sm.pages.dev/en/tags/moe/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence</title>
      <link>https://blog-6sm.pages.dev/en/readings/deepseek-v4/</link>
      <pubDate>Sat, 25 Apr 2026 00:00:00 +0000</pubDate>
      <guid>https://blog-6sm.pages.dev/en/readings/deepseek-v4/</guid>
      <description>DeepSeek-V4 introduces two MoE models (1.6T and 284B total parameters) with hybrid compressed attention (CSA&#43;HCA), manifold-constrained hyper-connections, and the Muon optimizer. At 1M-token context, single-token inference FLOPs drop to 27% and KV cache size to 10% of DeepSeek-V3.2's.</description>
      <content:encoded><![CDATA[<blockquote>
<p>Original paper: <a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf">DeepSeek-V4: Towards Highly Efficient Million-Token Context Intelligence</a></p>
<p>Authors: DeepSeek-AI</p>
<p>Model checkpoints: <a href="https://huggingface.co/collections/deepseek-ai/deepseek-v4">https://huggingface.co/collections/deepseek-ai/deepseek-v4</a></p>
</blockquote>
<hr>
<h2 id="summary">Summary</h2>
<p>DeepSeek-V4 presents a preview of two strong MoE language models — <strong>DeepSeek-V4-Pro</strong> (1.6T total / 49B activated) and <strong>DeepSeek-V4-Flash</strong> (284B total / 13B activated) — both supporting a context length of <strong>one million tokens</strong>.</p>
<p><strong>Key architectural innovations:</strong></p>
<ul>
<li><strong>Hybrid Compressed Attention</strong>: Combines Compressed Sparse Attention (CSA, compression rate m=4 with top-k sparse selection) and Heavily Compressed Attention (HCA, compression rate m&prime;=128 with dense attention) in an interleaved configuration. At 1M-token context, single-token inference FLOPs fall to 27% and KV cache size to 10% of DeepSeek-V3.2's.</li>
<li><strong>Manifold-Constrained Hyper-Connections (<em>m</em>HC)</strong>: Constrains the residual mapping matrix to the manifold of doubly stochastic matrices (Birkhoff polytope), ensuring spectral norm ≤ 1 for stable deep-layer signal propagation. Uses Sinkhorn-Knopp iterations (t=20) for projection.</li>
<li><strong>Muon Optimizer</strong>: Adopted for most modules with hybrid Newton-Schulz iterations for orthogonalization. Paired with Anticipatory Routing (decoupling backbone and routing network updates) and SwiGLU clamping for training stability.</li>
</ul>
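<p>The Sinkhorn-Knopp projection mentioned for <em>m</em>HC can be illustrated with a minimal pure-Python sketch. It is not the paper's implementation: the function name <code>sinkhorn_knopp</code>, the exponentiation used to make entries positive, and the 3x3 input are assumptions chosen for illustration. Only the iteration count (20) comes from the summary above.</p>

```python
import math


def sinkhorn_knopp(logits, iterations=20):
    """Push a square matrix toward the doubly stochastic set.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    n = len(logits)
    # Assumption: exponentiate so every entry is strictly positive.
    m = [[math.exp(x) for x in row] for row in logits]
    for _ in range(iterations):
        # Scale each row so it sums to 1.
        scaled = []
        for row in m:
            total = sum(row)
            scaled.append([x / total for x in row])
        # Scale each column so it sums to 1.
        col_totals = [sum(scaled[i][j] for i in range(n)) for j in range(n)]
        m = [[scaled[i][j] / col_totals[j] for j in range(n)] for i in range(n)]
    return m


# Made-up 3x3 input, only to exercise the function.
raw = [[0.0, 1.0, 0.5], [0.3, 0.2, 0.9], [1.0, 0.1, 0.4]]
mixing = sinkhorn_knopp(raw)
row_sums = [sum(row) for row in mixing]
col_sums = [sum(mixing[i][j] for i in range(3)) for j in range(3)]
print(row_sums)
print(col_sums)
```

<p>Both printed lists should be very close to 1. When every row and column of a non-negative matrix sums to 1, its induced 1-norm and ∞-norm both equal 1, and since the squared spectral norm is at most their product, the spectral norm cannot exceed 1.</p>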
<p><strong>Post-training paradigm shift</strong>: Replaces mixed RL with domain-specific expert training (SFT → GRPO RL) followed by multi-teacher <strong>On-Policy Distillation (OPD)</strong> with full-vocabulary KL divergence. Over 10 teacher models are distilled into a single unified model.</p>
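<p>A full-vocabulary KL term compares the student's and teacher's entire next-token distributions at a position, rather than scoring only the sampled token. The sketch below shows that computation for a single position. The direction KL(student ‖ teacher) is an assumption, since the summary above does not state it, and the function names and logits are hypothetical.</p>

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities over the whole vocabulary."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_z for x in logits]


def full_vocab_kl(student_logits, teacher_logits):
    """KL(student || teacher) summed over every vocabulary entry.

    Assumption: reverse-KL direction; hypothetical helper, not the paper's code.
    """
    log_p = log_softmax(student_logits)
    log_q = log_softmax(teacher_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Made-up logits over a toy 4-token vocabulary.
student = [2.0, 0.5, -1.0, 0.0]
teacher = [1.5, 1.0, -0.5, -2.0]
loss = full_vocab_kl(student, teacher)
print(loss)
```

<p>In a multi-teacher setup, one plausible arrangement is to evaluate this term against whichever teacher matches the domain of the prompt that produced the student's sample. That routing is an assumption here, not something the summary specifies.</p>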
<p><strong>Infrastructure highlights</strong>: Fine-grained EP communication-computation overlap (MegaMoE, open-sourced); TileLang-based kernel development; batch-invariant and deterministic kernels; FP4 QAT with lossless FP4-to-FP8 dequantization; DSec sandbox platform managing hundreds of thousands of concurrent sandbox instances.</p>
<p><strong>Results</strong>: DeepSeek-V4-Pro-Max outperforms all prior open-source models on knowledge benchmarks, matches GPT-5.2 on reasoning, ranks 23rd on Codeforces, achieves proof-perfect 120/120 on Putnam-2025, and surpasses Gemini-3.1-Pro on long-context benchmarks.</p>
<hr>
<p><em>The Chinese version of this page contains a full annotated translation of the paper. Please refer to the <a href="https://huggingface.co/deepseek-ai/DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf">original PDF</a> for the complete English text.</em></p>
]]></content:encoded>
    </item>
  </channel>
</rss>
