NVIDIA AI Releases Star Elastic: A Checkpoint Including 30B, 23B, and 12B Reasoning Models with Zero-Shot Slicing

by ai-intensify

Training a family of large language models (LLMs) comes with a cost multiplier: each model in the family, whether 8B, 30B, or 70B, typically requires its own complete training run, its own storage, and its own deployment stack. For a team running large-scale inference, compute cost scales with the number of model sizes it wants to support. NVIDIA researchers are now proposing a different approach called Star Elastic.

Star Elastic is a post-training method that embeds multiple nested submodels at different parameter budgets inside a single parent reasoning model using a single training run. Applied to Nemotron Nano v3 (a hybrid Mamba-Transformer-MoE model with 30B total parameters and 3.6B active parameters), Star Elastic produces 23B (2.8B active) and 12B (2.0B active) nested variants trained with approximately 160B tokens. All three variants reside in one checkpoint, and the smaller ones can be extracted without any additional fine-tuning.

Here’s what “nested” really means

If you haven’t encountered elastic or nested architectures before, the idea is this: instead of training three separate 30B, 23B, and 12B models, you train one model that contains the smaller ones as subsets of itself. The smaller submodels reuse the most important weights of the parent, which are identified through an importance estimation process.

Star Elastic scores each model component (embedding channels, attention heads, Mamba SSM heads, MoE experts, and FFN channels) according to how much it contributes to model accuracy. The components are then ranked, so a smaller-budget submodel always uses the highest-ranked contiguous subset of components from the larger model. This property is called nested weight-sharing.
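The ranking-then-slicing idea can be illustrated with a minimal sketch. The importance scores and the helper names `rank_components` and `slice_submodel` are invented for illustration; they are not taken from the Star Elastic code, and the real method scores several component types at once.

```python
# Hypothetical importance scores for 8 FFN channels (illustrative values only).
importance = [0.12, 0.91, 0.05, 0.47, 0.78, 0.33, 0.66, 0.20]

def rank_components(scores):
    """Return component indices sorted from most to least important."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

def slice_submodel(ranking, keep):
    """A submodel keeps the first `keep` entries of the shared ranking."""
    return ranking[:keep]

ranking = rank_components(importance)
large = slice_submodel(ranking, 8)   # stands in for the full parent
medium = slice_submodel(ranking, 6)
small = slice_submodel(ranking, 3)

# Nested property: every smaller submodel is a subset of every larger one.
assert set(small) <= set(medium) <= set(large)
print(small)  # [1, 4, 6]
```

Because every budget reads a prefix of the same ranking, no weights need to be duplicated across sizes.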

The method supports nesting along multiple axes: SSM (State Space Model) dimension, embedding channels, attention heads, Mamba heads and head channels, MoE expert count, and FFN intermediate dimension. For MoE layers specifically, Star Elastic uses Router-Weighted Expert Activation Pruning (REAP), which ranks experts based on both routing gate values and expert output magnitude. This is a more principled signal than naive frequency-based sorting, which ignores how much each expert actually contributes to the layer output.
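To see why a gate-and-magnitude score can disagree with a frequency count, here is a simplified sketch. It is a stand-in for the idea described above, not the exact REAP criterion: the score used here (mean of gate value times output norm over routed tokens) and all numbers are assumptions made for illustration.

```python
from collections import defaultdict

# (expert_id, gate_value, output_norm) for each routed token; values are made up.
routed = [
    (0, 0.9, 2.0), (0, 0.8, 1.5),
    (1, 0.2, 0.3), (1, 0.1, 0.2), (1, 0.2, 0.4), (1, 0.1, 0.1),
    (2, 0.6, 1.0), (2, 0.5, 1.2),
]

def frequency_scores(records):
    """Naive baseline: how often each expert was selected."""
    counts = defaultdict(int)
    for expert, _, _ in records:
        counts[expert] += 1
    return dict(counts)

def gate_weighted_scores(records):
    """Mean of gate_value * output_norm over the tokens routed to each expert."""
    totals, counts = defaultdict(float), defaultdict(int)
    for expert, gate, norm in records:
        totals[expert] += gate * norm
        counts[expert] += 1
    return {expert: totals[expert] / counts[expert] for expert in totals}

def ranked(scores):
    """Expert ids from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

print(ranked(frequency_scores(routed)))      # expert 1 first: it is chosen most often
print(ranked(gate_weighted_scores(routed)))  # expert 1 last: its contribution is small
```

In this toy data expert 1 is selected most often but with low gate values and small outputs, so a frequency ranking would keep it while a contribution-aware ranking would prune it first.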

A learnable router, not a fixed compression method

A significant difference from prior compression methods such as Minitron is that Star Elastic uses an end-to-end trainable router to determine the nested submodel architectures. The router takes a target budget (for example, “give me a 2.8B active parameter model”) as a one-hot input and outputs differentiable masks that select which components are active at that budget level. These masks are trained jointly with the model using Gumbel-Softmax, which allows gradients to flow through discrete architectural decisions.
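The sampling step behind such masks can be sketched in plain Python. This is only an illustration of the Gumbel-Softmax relaxation: the candidate widths, the logits, and the `width_mask` helper are hypothetical, and a real implementation would compute the logits with a router network and run inside an autograd framework so that gradients actually flow.

```python
import math
import random

def gumbel_softmax(logits, tau=1.0, rng=random):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / tau)."""
    noisy = []
    for logit in logits:
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)  # keep log() finite
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]

def width_mask(probs, candidate_widths, full_width):
    """Soft per-channel mask: total probability of the widths that keep the channel."""
    return [
        sum(p for p, width in zip(probs, candidate_widths) if channel < width)
        for channel in range(full_width)
    ]

# Hypothetical router output for one target budget: logits over candidate widths.
candidate_widths = [4, 6, 8]
router_logits = [0.1, 2.0, 0.3]

rng = random.Random(0)
probs = gumbel_softmax(router_logits, tau=0.5, rng=rng)
mask = width_mask(probs, candidate_widths, full_width=8)
print([round(m, 3) for m in mask])
```

The mask is non-increasing over the ranked channels, so the highest-ranked channels are always kept and lower-ranked ones are progressively switched off, which matches the nested structure described earlier.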

The loss function combines knowledge distillation (KD), where the non-elastified parent model acts as the teacher, with a router loss that penalizes deviations from the target resource budget (parameter count, memory, or latency). This means the router learns to make architecture choices that actually improve accuracy under KD rather than simply minimizing a proxy metric.
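A rough sketch of how such a combined objective could look for a single token position follows. The KL-divergence form of the distillation term, the relative-deviation penalty, and the weight `lam` are assumptions chosen for illustration; the article does not specify the exact formulation.

```python
import math

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(teacher_logits, student_logits):
    """KL(teacher || student) over the vocabulary for one token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q))

def budget_penalty(actual_cost, target_cost):
    """Relative deviation of the selected architecture from the target budget."""
    return abs(actual_cost - target_cost) / target_cost

def total_loss(teacher_logits, student_logits, actual_cost, target_cost, lam=1.0):
    """Distillation term plus weighted budget term (lam is a hypothetical weight)."""
    return kd_loss(teacher_logits, student_logits) + lam * budget_penalty(
        actual_cost, target_cost
    )

teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.8]
# Example: the router picked a 3.0B-active architecture for a 2.8B-active target.
print(total_loss(teacher, student, actual_cost=3.0, target_cost=2.8))
```

The budget term vanishes only when the selected architecture hits the target, so the distillation term is left to decide which components to keep within that budget.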

Training uses a two-stage curriculum: a short-context phase (sequence length 8,192 tokens) with uniform budget sampling, followed by an extended-context phase (sequence length 49,152 tokens) with non-uniform sampling that prioritizes the full 30B model (p(30B)=0.5, p(23B)=0.3, p(12B)=0.2). The extended-context phase is critical to reasoning performance. The research team’s analysis on Nano v2, cited explicitly as the empirical basis for the same curriculum choice on Nano v3, shows a gain of up to 19.8% on AIME-2025 for the 6B variant and a gain of 4.0 percentage points for the 12B variant when the extended-context phase is added, motivating its use here.
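The budget sampling schedule is simple enough to sketch directly. The probabilities and sequence lengths come from the description above; the names `PHASES` and `sample_budget` are illustrative and not part of any released training code.

```python
import random

BUDGETS = ["30B", "23B", "12B"]

# Phase 1: short context, uniform sampling. Phase 2: long context, favors 30B.
PHASES = {
    1: {"seq_len": 8_192, "weights": [1 / 3, 1 / 3, 1 / 3]},
    2: {"seq_len": 49_152, "weights": [0.5, 0.3, 0.2]},
}

def sample_budget(phase, rng):
    """Pick the submodel budget to train on for the next step."""
    return rng.choices(BUDGETS, weights=PHASES[phase]["weights"], k=1)[0]

rng = random.Random(0)
counts = {budget: 0 for budget in BUDGETS}
for _ in range(10_000):
    counts[sample_budget(2, rng)] += 1

print(counts)  # roughly 50% / 30% / 20% of the steps
```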

Elastic budget control: different models for different reasoning phases

Existing budget control in reasoning models, including Nemotron Nano v3’s own default behavior, works by capping the number of tokens generated during the thinking phase before the model is forced to give a final answer. This approach uses the same model throughout. Star Elastic unlocks a different strategy: using different nested submodels for the thinking phase and the answering phase.

The researchers evaluated four configurations. The best one, called ℳS → ℳL (small model for thinking, large model for answering), allocates a cheaper model to generate extended reasoning traces and reserves the full-capacity model to synthesize the final answer. The 23B → 30B configuration in particular pushes the accuracy-latency Pareto frontier, achieving up to 16% higher accuracy and 1.9× lower latency compared to the default Nemotron Nano v3 budget control. The intuition: reasoning tokens are high-volume but tolerate some capacity reduction, whereas the final answer requires high precision.
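The small-thinks, large-answers control flow can be sketched with two interchangeable generation callables. Everything here is hypothetical: the `Generate` signature, the prompt layout, and the stub models are assumptions for illustration, and the sketch ignores how a real serving stack would switch between nested submodels.

```python
from typing import Callable

# (prompt, max_new_tokens) -> generated text; a hypothetical interface.
Generate = Callable[[str, int], str]

def think_then_answer(
    question: str,
    think_model: Generate,
    answer_model: Generate,
    think_budget: int,
    answer_budget: int,
) -> str:
    """Generate the reasoning trace with one model and the final answer with another."""
    trace = think_model(question, think_budget)
    prompt = f"{question}\n\nReasoning:\n{trace}\n\nFinal answer:"
    return answer_model(prompt, answer_budget)

# Stub models so the sketch runs without any weights.
def small_stub(prompt: str, max_new_tokens: int) -> str:
    return "step 1; step 2; step 3"[:max_new_tokens]

def large_stub(prompt: str, max_new_tokens: int) -> str:
    return f"answer based on {len(prompt)} prompt characters"

print(think_then_answer("What is 17 * 3?", small_stub, large_stub, 4096, 256))
```

Swapping the two callables gives the other configurations (same model for both phases, or large for thinking and small for answering), which makes it easy to compare them under one harness.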

Quantization without breaking nested structure

A simple way to deploy a quantized elastic model would be to quantize each variant separately after slicing. This breaks the nested weight-sharing property and requires a separate quantization pass per size. Instead, Star Elastic applies quantization-aware distillation (QAD) directly to the elastic checkpoint, preserving the nested mask hierarchy.

For FP8 (E4M3 format), post-training quantization (PTQ) is sufficient, recovering 98.69% of BF16 accuracy on the 30B variant. For NVFP4 (NVIDIA’s 4-bit floating-point format), PTQ alone causes a 4.12% average accuracy drop, so a short nested QAD stage (~5B tokens at 48K context) is applied, bringing recovery back to 97.79% for the 30B variant. In both cases, zero-shot slicing of the 23B and 12B variants from a single quantized checkpoint is preserved.

The memory implications are significant. Storing separate 12B, 23B, and 30B BF16 checkpoints requires 126.1 GB; a single elastic checkpoint requires 58.9 GB. The 30B NVFP4 elastic checkpoint fits into 18.7 GB, enabling the 12B NVFP4 variant to run on the RTX 5080, where every BF16 configuration runs out of memory. On the RTX Pro 6000, the 12B NVFP4 variant reaches 7,426 tokens/s, a 3.4× throughput improvement over the 30B BF16 baseline.
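The storage comparison is straightforward arithmetic. The per-variant BF16 sizes below are the ones listed in the memory table later in this article; the variable names are illustrative.

```python
# BF16 footprint of each variant if stored as its own checkpoint (GB).
separate_gb = {"30B": 58.9, "23B": 44.0, "12B": 23.2}

# The elastic checkpoint is the size of the 30B parent; smaller variants are slices of it.
elastic_gb = 58.9

total_separate = sum(separate_gb.values())
saved = total_separate - elastic_gb
fraction = saved / total_separate

print(f"{total_separate:.1f} GB separate vs {elastic_gb:.1f} GB elastic")
print(f"saves {saved:.1f} GB ({fraction:.1%})")
```

The three separate checkpoints add up to the 126.1 GB quoted above, so the single elastic checkpoint cuts storage by a little over half.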

Depth vs. width: why Star Elastic compresses width

One design choice is worth noting: the research team compared two compression strategies, removing layers entirely (depth compression) versus reducing internal dimensions such as hidden size, expert count, and head count (width compression). With a 15% parameter reduction and 25B tokens of knowledge distillation, width compression recovered 98.1% of baseline performance, while depth compression recovered only 95.2%, with noticeable degradation on HumanEval and MMLU-Pro. As a result, Star Elastic prefers width-based elasticity for its core results, although depth compression (layer skipping) remains available as a mechanism for extremely latency-constrained scenarios.

On the evaluation suites (AIME-2025, GPQA, LiveCodeBench v5, MMLU-Pro, IFBench, and Tau Bench), the Elastic-30B variant matches its parent Nemotron Nano v3 30B on most benchmarks, while the Elastic-23B and Elastic-12B variants remain competitive against independently trained models of similar size. Despite having fewer active parameters, Elastic-23B scores 85.63 on AIME-2025, while Qwen3-30B-A3B scores 80.00.

On training cost, the research team reports a 360× token reduction compared with pretraining each variant from scratch, and a 7× reduction over earlier state-of-the-art compression methods, which required sequential distillation per model size. The 12B variant runs at 2.4× the throughput of the 30B parent on an H100 GPU in bfloat16 with the same input/output sequence lengths.

How to use NVIDIA Star Elastic

Step by Step Guide

Nemotron Nano v3 Elastic – 30B/23B/12B BF16/FP8/NVFP4 in one checkpoint

Star Elastic models are distributed through Hugging Face and support both Transformers (for experimentation) and vLLM (recommended for production inference). Choose the option that fits your use case.

Bash

# Option A — vLLM (recommended for production serving)
pip install vllm

# Option B — Transformers (for local experimentation)
pip install transformers torch accelerate

# Optional: log in to Hugging Face if needed
pip install huggingface_hub
huggingface-cli login



Hardware note: The 30B BF16 checkpoint requires ~60 GB of VRAM for the full nested family. Use FP8 (~31 GB) or NVFP4 (~19 GB) for H100/A100 or RTX-class deployments.

A single checkpoint includes all three nested variants: 30B (3.6B active), 23B (2.8B active), and 12B (2.0B active). Load once; extract any variant without retraining. The model requires trust_remote_code=True for the hybrid Mamba-Transformer-MoE architecture.

Python

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# The 30B BF16 elastic checkpoint — contains all 3 nested variants
model_id = "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16"

tokenizer = AutoTokenizer.from_pretrained(
    model_id,
    trust_remote_code=True
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    device_map="auto"     # distributes across available GPUs
)

print(f"Model loaded: {model_id}")



Active vs. total parameters: “30B total / 3.6B active” means that the model stores 30B weights but routes each token through only 3.6B parameters per forward pass. This is how Mixture-of-Experts (MoE) works.

The model generates a reasoning chain before giving its final answer. Control the total token budget through max_new_tokens: higher values allow longer reasoning on difficult problems.

Python

# Chat messages must be a list of role/content dicts
messages = [
    {
        "role": "user",
        "content": "What is the time complexity of QuickSort, and why?"
    }
]

# Apply chat template and tokenize
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)

# Generate: the model produces its reasoning trace, then the final answer
outputs = model.generate(
    **inputs,
    max_new_tokens=4096,    # thinking + answer budget
    temperature=0.6,
    top_p=0.95,
    do_sample=True
)

# Decode only the newly generated tokens (skip the prompt)
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[-1]:],
    skip_special_tokens=True
)
print(response)



Reasoning budget tip: For math/coding problems, set max_new_tokens to 8192-32768. For simple queries, 2048-4096 is sufficient and reduces latency.

For production deployment, use vLLM to serve the model via an OpenAI-compatible REST API. It provides batched inference with continuous batching and high throughput; the 12B variant achieves 2.4× the throughput of the 30B parent on an H100 GPU.

Bash

# Start the vLLM server (OpenAI-compatible)
vllm serve "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16"

# --- In a separate terminal ---

# Query the server via curl
curl -X POST "http://localhost:8000/v1/chat/completions" \
  -H "Content-Type: application/json" \
  --data '{
    "model": "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16",
    "messages": [
      {
        "role": "user",
        "content": "Explain gradient descent in 3 steps."
      }
    ],
    "max_tokens": 4096,
    "temperature": 0.6
  }'

# Or run via Docker
docker model run hf.co/nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16



SGLang option: SGLang is also supported. Run python3 -m sglang.launch_server --model-path "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16" --port 30000 as a drop-in alternative to vLLM.
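The same chat-completions request can be issued from Python using only the standard library. This is a minimal sketch that assumes a vLLM server is already running locally on port 8000, as in the command above; the helper names `build_request` and `ask` are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1/chat/completions"
MODEL_ID = "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16"

def build_request(prompt, max_tokens=4096, temperature=0.6):
    """Build the POST request for an OpenAI-compatible chat completion."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        BASE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(prompt):
    """Send the request and return the assistant's reply (requires a running server)."""
    with urllib.request.urlopen(build_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `ask("Explain gradient descent in 3 steps.")` should return the generated text once the server is up. For SGLang, changing the port in `BASE_URL` to 30000 would be the corresponding adjustment, assuming its OpenAI-compatible endpoint is enabled.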

Three precision checkpoints are available. All preserve the nested structure: the 23B and 12B submodels can be zero-shot sliced from whichever precision checkpoint you load. NVFP4 uses Quantization-Aware Distillation (QAD) to recover the accuracy lost from PTQ.

Bash

# BF16 — full precision, all nested variants in 58.9 GB
vllm serve "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-BF16"

# FP8 (E4M3) — ~2x smaller, 30B fits in 31.4 GB
# Post-training quantization, 98.69% accuracy recovery on 30B
vllm serve "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-FP8"

# NVFP4 — smallest footprint, 30B fits in 18.7 GB
# 12B NVFP4 variant runs on RTX 5080 (BF16 OOMs)
# 12B NVFP4 on RTX Pro 6000: 7,426 tokens/s (3.4x vs 30B BF16)
vllm serve "nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B-NVFP4"

Variant | 30B memory | 23B memory | 12B memory | Best for
BF16 (full) | 58.9 GB | 44.0 GB | 23.2 GB | A100/H100
FP8 (PTQ) | 31.4 GB | 23.7 GB | 13.0 GB | H100/A100/RTX 5090
NVFP4 (QAD) | 18.7 GB | 14.1 GB | 8.0 GB | RTX 5080/5090/Pro 6000
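The table can be turned into a small lookup for picking a precision and variant that fits a given GPU. The footprints are the values from the table; the `options_that_fit` helper and the `headroom_gb` parameter are illustrative assumptions, and the figures cover weights only, so KV cache and activations need additional headroom in practice.

```python
# Weight footprint in GB per precision and nested variant (from the table above).
FOOTPRINT_GB = {
    "BF16": {"30B": 58.9, "23B": 44.0, "12B": 23.2},
    "FP8": {"30B": 31.4, "23B": 23.7, "12B": 13.0},
    "NVFP4": {"30B": 18.7, "23B": 14.1, "12B": 8.0},
}

def options_that_fit(vram_gb, headroom_gb=0.0):
    """Return (precision, variant, size) tuples that fit, largest footprint first."""
    budget = vram_gb - headroom_gb
    fits = [
        (precision, variant, size)
        for precision, sizes in FOOTPRINT_GB.items()
        for variant, size in sizes.items()
        if size <= budget
    ]
    return sorted(fits, key=lambda item: item[2], reverse=True)

# Example with a 16 GB budget and no extra headroom reserved.
for precision, variant, size in options_that_fit(16.0):
    print(f"{variant} {precision}: {size} GB")
```

With a 16 GB budget no BF16 configuration fits, which is consistent with the RTX 5080 observation earlier in the article.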



Key takeaways

  • Star Elastic trains 30B, 23B, and 12B nested reasoning models from a single 160B-token post-training run, achieving a 360× token reduction over pretraining from scratch.
  • Elastic budget control (23B for thinking, 30B for answering) pushes the accuracy-latency Pareto frontier, with up to 16% higher accuracy and 1.9× lower latency.
  • Learnable routers with Gumbel-Softmax enable end-to-end trainable architecture selection, eliminating the need for separate compression runs per model size.
  • Quantization preserves zero-shot slicing in both the FP8 (PTQ) and NVFP4 (nested QAD) checkpoints, reducing the 30B elastic checkpoint to 18.7 GB in NVFP4.
  • All three precision variants (BF16, FP8, NVFP4) are publicly available on Hugging Face under nvidia/NVIDIA-Nemotron-Labs-3-Elastic-30B-A3B.
