DeepSeek-AI has released a preview version of the DeepSeek-V4 series, two mixture-of-experts (MoE) language models built around a single core challenge: making a million-token context window practical and economical at inference time.
The series includes DeepSeek-V4-Pro, with 1.6T total parameters and 49B active parameters per token, and DeepSeek-V4-Flash, with 284B total parameters and 13B active parameters per token. Both models natively support a context length of one million tokens. DeepSeek-V4-Pro was pre-trained on 33T tokens and DeepSeek-V4-Flash on 32T tokens. Model checkpoints for all four variants (deepseek-v4-pro, deepseek-v4-pro-base, deepseek-v4-flash, and deepseek-v4-flash-base) are publicly available on Hugging Face.


Long-context architectural challenges
The vanilla attention mechanism in a standard Transformer has quadratic computational complexity with respect to sequence length: doubling the context roughly quadruples the attention compute and memory. At one million tokens, this becomes prohibitive without architectural intervention. DeepSeek-V4 addresses this through four coordinated innovations: a hybrid attention architecture, a new residual connection design, a different optimizer, and FP4 quantization-aware training.


Hybrid attention: CSA and HCA
The central architectural innovation is a hybrid mechanism combining Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA), interleaved across the transformer layers.
CSA compresses the key-value (KV) cache of every m tokens into a single entry using a learned token-level compressor; DeepSeek Sparse Attention (DSA) is then applied, with each query token attending only to the top-k selected compressed KV entries. A component called the Lightning Indexer handles the sparse selection by scoring queries against the compressed KV blocks. Both CSA and HCA include a sliding-window attention branch covering the latest n_win tokens for local dependency modeling.
HCA is more aggressive: it consolidates the KV entries of every m′ tokens, where m′ ≫ m, into a single compressed entry and then applies dense attention over those compressed representations. No sparse selection step is required; the compression ratio alone reduces the KV cache size.
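To make the CSA idea concrete, below is a minimal single-head PyTorch sketch. The block size m, the top-k selection, and the n_win sliding window follow the description above; the learned token-level compressor is replaced by mean-pooling and the Lightning Indexer by a plain dot-product score, purely for illustration, so this should not be read as DeepSeek's implementation.

```python
# Minimal single-head sketch of CSA: compress every m cached tokens into one KV
# entry, then let each query attend only to its top-k compressed entries plus a
# raw sliding window of recent tokens.
import torch
import torch.nn.functional as F

def csa_sketch(q, k, v, m=8, top_k=4, n_win=16):
    """q: (1, d) current query; k, v: (T, d) cached keys/values."""
    T, d = k.shape
    # 1) Block-compress the KV cache: one entry per m tokens
    #    (mean-pooling stands in for the learned compressor).
    n_blocks = T // m
    k_c = k[: n_blocks * m].reshape(n_blocks, m, d).mean(dim=1)   # (n_blocks, d)
    v_c = v[: n_blocks * m].reshape(n_blocks, m, d).mean(dim=1)
    # 2) Sparse selection: score the query against compressed entries and keep
    #    the top-k (the role the Lightning Indexer plays in the article).
    scores = k_c @ q.squeeze(0)                                   # (n_blocks,)
    idx = scores.topk(min(top_k, n_blocks)).indices
    k_sel, v_sel = k_c[idx], v_c[idx]
    # 3) Local branch: raw KV for the most recent n_win tokens.
    k_win, v_win = k[-n_win:], v[-n_win:]
    # 4) Attend over the selected compressed entries plus the sliding window.
    k_all = torch.cat([k_sel, k_win], dim=0)
    v_all = torch.cat([v_sel, v_win], dim=0)
    attn = F.softmax((q @ k_all.T) / d ** 0.5, dim=-1)
    return attn @ v_all                                           # (1, d)

q = torch.randn(1, 64)
k = torch.randn(1024, 64)
v = torch.randn(1024, 64)
out = csa_sketch(q, k, v)  # attends to 4 compressed blocks + 16 recent tokens
```

The point of the sketch is the bookkeeping: per query, attention touches top_k compressed entries plus n_win raw tokens instead of the full million-token cache.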
The efficiency gains are substantial. In a million-token setting, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs (measured in FP8-equivalent FLOPs) and 10% of the KV cache size of DeepSeek-V3.2. DeepSeek-V4-Flash requires only 10% of the single-token FLOPs and 7% of the KV cache relative to DeepSeek-V3.2.
Manifold-constrained hyper-connection (MHC)
DeepSeek-V4 replaces traditional residual connections with manifold-constrained hyper-connections (MHC). Hyper-connections (HC) generalize residual connections by expanding the width of the residual stream by an expansion factor (set to 4 in both models) and introducing learned input, residual, and output mapping matrices. Naive HC suffers from numerical instability when many layers are stacked.
MHC solves this by constraining the residual mapping matrix B_l to the Birkhoff polytope, the set of doubly stochastic matrices in which all rows and columns sum to one and all entries are non-negative. This limits the spectral norm of the mapping to 1, preventing signal amplification in both the forward pass and backpropagation. The constraint is implemented via the Sinkhorn–Knopp algorithm with t_max = 20 iterations, and the mapping parameters are generated dynamically per input for expressiveness.
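The projection onto the Birkhoff polytope can be illustrated with a short PyTorch sketch of Sinkhorn–Knopp normalization. The t_max = 20 iteration count and the 4x4 size (matching the expansion factor) follow the article; the random logits are only a placeholder for the per-input, dynamically generated mapping parameters.

```python
# Sinkhorn-Knopp normalization: turn unconstrained logits into a (nearly)
# doubly stochastic matrix, i.e. a point on the Birkhoff polytope.
import torch

def sinkhorn_knopp(logits, t_max=20, eps=1e-8):
    """Project an n x n matrix of unconstrained logits onto the Birkhoff polytope."""
    M = torch.exp(logits)                                # ensure positivity
    for _ in range(t_max):
        M = M / (M.sum(dim=1, keepdim=True) + eps)       # normalize rows
        M = M / (M.sum(dim=0, keepdim=True) + eps)       # normalize columns
    return M

# Expansion factor 4 as in the article: a 4x4 residual mapping matrix.
B = sinkhorn_knopp(torch.randn(4, 4))
print(B.sum(dim=0), B.sum(dim=1))  # both approximately all-ones vectors
```

Because a doubly stochastic matrix has row and column sums of one with non-negative entries, its spectral norm cannot exceed 1, which is exactly the amplification bound the article attributes to MHC.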
Muon Optimizer and FP4 QAT
DeepSeek-V4 adopts the Muon optimizer for most of its parameters. Muon uses Newton–Schulz iterations to approximately orthogonalize the gradient update matrix before applying it as a weight update. The implementation uses a hybrid two-stage schedule: 8 iterations with coefficients (3.4445, −4.7750, 2.0315) for faster convergence, then 2 stabilization iterations with coefficients (2, −1.5, 0.5). AdamW is retained for the embedding module, the prediction head, the constant biases and gating factors of the MHC module, and all RMSNorm weights.
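The orthogonalization step at the heart of Muon can be sketched as follows. The two-stage coefficient schedule matches the numbers quoted above, but the momentum, learning-rate, and per-layer scaling logic of the full optimizer are omitted, so this is an illustration rather than DeepSeek's implementation.

```python
# Quintic Newton-Schulz iteration used by Muon to approximately orthogonalize a
# gradient matrix: each step applies f(x) = a*x + b*x^3 + c*x^5 to the singular
# values, pushing them toward 1.
import torch

def newton_schulz_orthogonalize(G,
                                main=((3.4445, -4.7750, 2.0315), 8),
                                stabilize=((2.0, -1.5, 0.5), 2)):
    """Approximately orthogonalize a gradient matrix G of shape (n, m)."""
    X = G / (G.norm() + 1e-7)            # scale so all singular values are <= 1
    for (a, b, c), steps in (main, stabilize):
        for _ in range(steps):
            A = X @ X.T
            X = a * X + (b * A + c * A @ A) @ X   # one quintic Newton-Schulz step
    return X

G = torch.randn(128, 256)
O = newton_schulz_orthogonalize(G)
# O now has singular values close to 1, i.e. it behaves like an orthogonalized
# update direction rather than a raw gradient.
```

The stabilization coefficients (2, −1.5, 0.5) have a fixed point at 1 with zero derivative, which is why a couple of extra iterations tighten the singular values after the faster but less precise main schedule.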
For deployment efficiency, FP4 (MXFP4) quantization-aware training (QAT) is applied to the MoE expert weights and the query-key (QK) paths in CSA's Lightning Indexer. During inference and RL rollouts, actual FP4 weights are used directly instead of simulated quantization, reducing memory traffic and sampling latency.
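A rough sketch of what simulated MXFP4 quantization looks like during QAT is shown below. The block size of 32 and the E2M1 value grid are standard MXFP4 conventions, the straight-through estimator is a common QAT choice, and the per-block scale here is not constrained to a power of two; none of these details are confirmed as DeepSeek's exact recipe.

```python
# Fake (simulated) MXFP4 quantization with a straight-through estimator:
# the forward pass sees quantized weights, the backward pass sees the identity.
import torch

FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def fake_quant_mxfp4(w, block=32):
    orig_shape = w.shape
    x = w.reshape(-1, block)
    # Per-block scale so the largest magnitude maps to the largest FP4 value (6.0).
    # Real MXFP4 restricts this scale to a power of two; omitted here for brevity.
    scale = x.abs().amax(dim=1, keepdim=True) / 6.0 + 1e-12
    mag = (x / scale).abs()
    # Snap each magnitude to the nearest FP4 grid point, keep the sign.
    idx = (mag.unsqueeze(-1) - FP4_GRID).abs().argmin(dim=-1)
    q = (FP4_GRID[idx] * x.sign() * scale).reshape(orig_shape)
    # Straight-through estimator: forward uses q, backward flows through w.
    return w + (q - w).detach()

w = torch.randn(64, 128, requires_grad=True)
w_q = fake_quant_mxfp4(w)
w_q.sum().backward()  # gradients reach w despite the non-differentiable rounding
```

The article's point is that at inference and rollout time this simulation is dropped entirely and real FP4 weights are loaded, which is where the memory-traffic and latency savings come from.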
Training stability at scale
Training the trillion-parameter MoE model introduced notable instabilities, and two techniques proved effective. Anticipatory routing decouples the backbone and router network updates: the routing index at step t is computed using historical parameters θ_{t−Δt}, breaking the feedback loop in which routing decisions reinforce outlier values in the MoE layers. SwiGLU clamping limits the linear component of SwiGLU to (−10, 10) and caps the gate component at 10, directly suppressing abnormal activations. Both techniques were applied during the training of both models.
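The SwiGLU clamping rule is simple enough to state directly in code. The sketch below assumes a standard SwiGLU feed-forward block; the exact placement of the clamps inside DeepSeek's kernels is an assumption.

```python
# SwiGLU feed-forward block with the clamping described above: the linear branch
# is clamped to (-10, 10) and the gate branch is capped at 10 before the SiLU gate.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ClampedSwiGLU(nn.Module):
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_gate = nn.Linear(d_model, d_ff, bias=False)
        self.w_up = nn.Linear(d_model, d_ff, bias=False)
        self.w_down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        up = self.w_up(x).clamp(-10.0, 10.0)    # linear component clamped to (-10, 10)
        gate = self.w_gate(x).clamp(max=10.0)   # gate component capped at 10
        return self.w_down(F.silu(gate) * up)

layer = ClampedSwiGLU(d_model=512, d_ff=2048)
y = layer(torch.randn(4, 512))
```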
Post-training: domain experts and on-policy distillation
The post-training pipeline replaces the mixed RL stage of DeepSeek-V3.2 with On-Policy Distillation (OPD). Independent domain experts are first trained on mathematics, coding, agentic tasks, and instruction following via Supervised Fine-Tuning (SFT), followed by reinforcement learning with Group Relative Policy Optimization (GRPO). The ten teacher models are then distilled into a single unified student model by minimizing the reverse KL divergence between the student's and each teacher's output distributions over the student's own generated trajectories, using full-vocabulary logit distillation for stable gradient estimates.
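The shape of that objective can be sketched per trajectory as follows. The loss form (reverse KL over the student's own samples, computed on full-vocabulary distributions) follows the description above; the random logits stand in for real model outputs, and all sampling and batching logic is elided.

```python
# On-policy distillation loss sketch: reverse KL between student and teacher
# token distributions, evaluated at every position of a student-generated trajectory.
import torch
import torch.nn.functional as F

def reverse_kl_distillation_loss(student_logits, teacher_logits):
    """KL(student || teacher), averaged over the positions of one trajectory.

    student_logits, teacher_logits: (seq_len, vocab_size)
    """
    log_p_student = F.log_softmax(student_logits, dim=-1)
    log_p_teacher = F.log_softmax(teacher_logits, dim=-1)
    # Reverse KL: the expectation is taken under the student's own distribution.
    kl = (log_p_student.exp() * (log_p_student - log_p_teacher)).sum(dim=-1)
    return kl.mean()

seq_len, vocab = 32, 1000
student_logits = torch.randn(seq_len, vocab, requires_grad=True)
teacher_logits = torch.randn(seq_len, vocab)
loss = reverse_kl_distillation_loss(student_logits, teacher_logits)
loss.backward()
```

Because the trajectories are sampled from the student itself, the loss is "on-policy": the student is corrected exactly where it actually goes, rather than on teacher-generated text.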
The resulting model supports three reasoning-effort modes: Non-Thinking (fast, with no explicit chain-of-thought), Think High (deliberate reasoning), and Think Max (maximum reasoning effort, with a dedicated system prompt and a low length penalty during RL training).
Benchmark results
DeepSeek-V4-Pro-Max achieves a Codeforces rating of 3206, ahead of GPT-5.4-xHigh (3168) and Gemini-3.1-Pro-High (3052). On SimpleQA Verified, it scores 57.9 pass@1, outperforming Claude Opus 4.6 Max (46.2) and GPT-5.4-xHigh (45.3), though it lags behind Gemini-3.1-Pro-High (75.6). On SWE-bench Verified, DeepSeek-V4-Pro-Max achieves an 80.6% resolve rate, slightly behind Claude Opus 4.6 Max (80.8%), while Gemini-3.1-Pro-High also scores 80.6%.
On long-context benchmarks, DeepSeek-V4-Pro-Max scores 83.5 on OpenAI MRCR (1M) and 62.0 accuracy on CorpusQA (1M), ahead of Gemini-3.1-Pro-High (76.3 and 53.8, respectively) but behind Claude Opus 4.6 Max (92.9 and 71.7) on both.
Key takeaways
- The hybrid CSA and HCA attention reduces the KV cache to 10% of DeepSeek-V3.2's at 1M tokens.
- Manifold-constrained hyper-connections (MHC) replace residual connections for more stable deep layer training.
- The Muon optimizer replaces AdamW for most parameters, providing faster convergence and training stability.
- In post-training, on-policy distillation from 10+ domain experts is used instead of a traditional mixed RL stage.
- DeepSeek-v4-flash-base outperforms DeepSeek-v3.2-base despite having 3x fewer active parameters.
