How can a trillion-parameter large language model achieve state-of-the-art enterprise performance while reducing its total parameter count by 33.3% and increasing pre-training efficiency by 49%? Yuan Lab AI has released Yuan3.0 Ultra, an open-source mixture-of-experts (MoE) large language model with 1T total parameters and 68.8B active parameters. The architecture is designed to optimize performance on enterprise-specific tasks while maintaining competitive general-purpose capabilities. Unlike traditional dense models, Yuan3.0 Ultra uses sparsity to increase capacity without a linear increase in computational cost.
Layer-Adaptive Expert Pruning (LAEP)
The primary innovation in Yuan3.0 Ultra's training is the Layer-Adaptive Expert Pruning (LAEP) algorithm. While expert pruning is usually applied after training, LAEP identifies and removes underutilized experts directly during the pre-training phase.
Research into expert load distribution revealed two distinct phases during pre-training:
- Initial fluctuation phase: Characterized by high volatility in expert weights inherited from random initialization.
- Stationary phase: Expert weights converge, and the relative ranking of experts by token assignments remains largely fixed.
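One way to operationalize this phase detection is to track how stable the expert ranking by token load is across training checkpoints; the following is a minimal sketch (not from the paper — the window size and agreement threshold are illustrative assumptions):

```python
import numpy as np

def rank_correlation(loads_a, loads_b):
    """Kendall-style rank agreement between two per-expert token-load
    snapshots: the fraction of expert pairs whose relative order is the
    same in both snapshots (1.0 = identical ranking)."""
    a, b = np.asarray(loads_a), np.asarray(loads_b)
    n = len(a)
    agree = total = 0
    for i in range(n):
        for j in range(i + 1, n):
            if a[i] != a[j]:                      # skip ties in snapshot A
                total += 1
                if (a[i] - a[j]) * (b[i] - b[j]) > 0:
                    agree += 1
    return agree / total if total else 1.0

def is_stationary(history, window=3, threshold=0.95):
    """Declare the stationary phase once the last `window` snapshots agree
    strongly. `window` and `threshold` are illustrative, not from the paper."""
    recent = history[-window:]
    return all(rank_correlation(recent[k], recent[k + 1]) >= threshold
               for k in range(len(recent) - 1))
```

Once `is_stationary` holds, the expert ranking is effectively frozen and pruning decisions made at that point should remain valid for the rest of pre-training.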
Once the stationary phase is reached, LAEP applies pruning based on two constraints:
- Individual weight constraint (α): Targets experts whose per-layer token load is well below the average.
- Cumulative load constraint (β): Identifies the subgroup of experts that contributed the least to total token processing.
Applying LAEP with β = 0.1 and varying α reduced the model from its initial 1.5T parameters down to 1T. This 33.3% decrease in total parameters preserved the model's multi-domain performance while significantly reducing memory requirements for deployment. In the 1T configuration, the number of experts per layer was reduced from 64 to at most 48 retained experts.
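A per-layer pruning step under the two constraints above might look like the following hypothetical sketch; the article does not specify how α and β combine, so this version prunes an expert only when it violates both constraints:

```python
import numpy as np

def laep_prune(token_loads, alpha=0.3, beta=0.1):
    """Hypothetical sketch of LAEP's two pruning constraints for one MoE layer.

    token_loads: per-expert token counts for this layer.
    alpha: individual weight constraint -- flags experts whose load is
           below alpha * mean layer load (alpha value is illustrative).
    beta:  cumulative load constraint -- flags the lowest-loaded experts
           whose combined share of tokens is under beta (0.1 per the article).
    Returns the indices of experts to keep.
    """
    loads = np.asarray(token_loads, dtype=float)
    n = len(loads)

    # Constraint 1 (alpha): individual load well below the layer average.
    under_alpha = loads < alpha * loads.mean()

    # Constraint 2 (beta): smallest experts with cumulative share < beta.
    order = np.argsort(loads)                      # ascending by load
    cum_share = np.cumsum(loads[order]) / loads.sum()
    under_beta = np.zeros(n, dtype=bool)
    under_beta[order[cum_share < beta]] = True

    # Prune an expert only if it violates both constraints (an assumption).
    prune = under_alpha & under_beta
    return np.where(~prune)[0]
```

With four experts carrying loads `[100, 100, 100, 1]`, only the last expert falls below both thresholds and is pruned, leaving experts `[0, 1, 2]`.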

Hardware efficiency and expert rearrangement
MoE models often suffer from device-level load imbalance when experts are distributed across a compute cluster. To address this, Yuan3.0 Ultra implements an expert rearrangement algorithm.
This algorithm ranks experts by token load and uses a greedy strategy to distribute them across GPUs so that the variance in cumulative token load is minimized.
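A minimal sketch of such a greedy placement, assuming the simplest "assign the next-heaviest expert to the currently lightest GPU" strategy (the function name and signature are illustrative, not from the paper):

```python
def rearrange_experts(token_loads, num_gpus):
    """Greedy load-balancing placement sketch: visit experts in descending
    token-load order and always place the next expert on the GPU with the
    smallest cumulative load so far."""
    order = sorted(range(len(token_loads)),
                   key=lambda e: token_loads[e], reverse=True)
    gpu_loads = [0] * num_gpus
    placement = [[] for _ in range(num_gpus)]
    for e in order:
        g = min(range(num_gpus), key=lambda i: gpu_loads[i])  # lightest GPU
        placement[g].append(e)
        gpu_loads[g] += token_loads[e]
    return placement, gpu_loads
```

This is the classic longest-processing-time heuristic for makespan balancing; for example, loads `[8, 7, 6, 5, 4, 3, 2, 1]` spread over 4 GPUs end up perfectly balanced at 9 tokens per GPU.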
| Method | TFLOPS per GPU |
| --- | --- |
| Base model (1515B) | 62.14 |
| DeepSeek-V3 aux loss | 80.82 |
| Yuan3.0 Ultra (LAEP) | 92.60 |
Overall pre-training efficiency improved by 49%. This improvement is attributed to two factors:
- Model pruning: contributes 32.4% of the efficiency gain.
- Expert rearrangement: contributes 15.9% of the efficiency gain.
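The 49% figure is consistent with the throughput table above; a quick arithmetic check:

```python
# Per-GPU throughput (TFLOPS) reported in the table.
base_tflops = 62.14   # 1515B base model
laep_tflops = 92.60   # Yuan3.0 Ultra with LAEP and expert rearrangement

gain = laep_tflops / base_tflops - 1
print(f"{gain:.1%}")  # -> 49.0%
```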
Reducing Overthinking with the Revised RIRM
In the reinforcement learning (RL) stage, the model employs a Reflection Inhibition Reward Mechanism (RIRM) to prevent excessively long reasoning chains for simple tasks.
The reward for reflection, $R_{ver}$, is calculated using a threshold-based penalty system:
- $R_{min} = 0$: The ideal number of reflection steps for direct answers.
- $R_{max} = 3$: The maximum tolerable reflection limit.
For correct samples, the reward decreases as the reflection count approaches $R_{max}$, whereas erroneous samples that "overthink" (exceed $R_{max}$) receive the maximum penalty. This mechanism yielded a 16.33% gain in training accuracy and a 14.38% reduction in output token length.
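One way to realize such a threshold-based reward is sketched below; only $R_{min} = 0$ and $R_{max} = 3$ come from the article, while the linear decay and the penalty magnitude are illustrative assumptions:

```python
def reflection_reward(num_reflections, correct,
                      r_min=0, r_max=3, max_penalty=-1.0):
    """RIRM-style reflection reward sketch (decay shape and penalty value
    are assumptions, not from the paper).

    - Correct samples: full reward at r_min reflections, decaying linearly
      toward 0 as the reflection count approaches r_max.
    - Incorrect samples that exceed r_max ("overthinking") receive the
      maximum penalty.
    """
    if correct:
        if num_reflections <= r_min:
            return 1.0
        if num_reflections >= r_max:
            return 0.0
        return 1.0 - (num_reflections - r_min) / (r_max - r_min)
    return max_penalty if num_reflections > r_max else 0.0
```

Under this shape, a correct answer with zero reflections earns the full reward, a correct answer at the $R_{max}$ threshold earns none, and a wrong answer produced after more than three reflection steps is penalized hardest.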


Enterprise benchmark performance
Yuan3.0 Ultra was evaluated on specialized enterprise benchmarks against multiple industry models, including GPT-5.2 and Gemini 3.1 Pro.
| Benchmark | Task category | Yuan3.0 Ultra score | Leading competitor score |
| --- | --- | --- | --- |
| Docmatix | Multimodal RAG | 67.4% | 48.4% (GPT-5.2) |
| ChatRAG | Text retrieval (average) | 68.2% | 53.6% (Kimi K2.5) |
| MMTab | Table reasoning | 62.3% | 66.2% (Kimi K2.5) |
| Summary | Text summarization | 62.8% | 49.9% (Claude Opus 4.6) |
| Spider 1.0 | Text-to-SQL | 83.9% | 82.7% (Kimi K2.5) |
| BFCL V3 | Tool calling | 67.8% | 78.8% (Gemini 3.1 Pro) |
The results show that Yuan3.0 Ultra achieves state-of-the-art accuracy in multimodal retrieval (Docmatix) and long-context retrieval (ChatRAG) while maintaining strong performance in structured data processing and tool calling.
Check out the paper and repo.
