NVIDIA AI releases C-RADIOv4 vision backbone integrating SigLIP2, DINOv3, SAM3 for classification, dense prediction, large-scale segmentation workloads


How do you combine SigLIP2, DINOv3, and SAM3 into a single vision backbone without sacrificing density or segmentation performance? NVIDIA’s C-RADIOv4 is a new agglomerative vision backbone that distills three robust teacher models, SigLIP2-g-384, DINOv3-7B, and SAM3, into a single student encoder. It extends the AM-RADIO and RADIOv2.5 lines, maintaining the same computational cost while improving dense prediction quality, resolution robustness, and drop-in compatibility with SAM3.

The main idea is simple. Instead of choosing between a vision language model, a self-supervised dense model, and a segmentation model, C-RADIOv4 attempts to approximate all three simultaneously with a single backbone.

https://www.arxiv.org/pdf/2601.17237

Group distillation in RADIO

The RADIO family uses group distillation: a single ViT-style student is trained to match both the dense feature maps and the summary tokens of multiple diverse teachers.

Earlier RADIO models used a combination of DFN CLIP, DINOv2, and SAM. They already supported multi-resolution training but exhibited 'mode switching', where the representation changed qualitatively as the input resolution changed. Later work such as PHI-S, RADIOv2.5, and FeatSharp added better multi-resolution distillation and regularization, but the teacher set remained limited.

C-RADIOv4 upgrades the teacher set:

  • SigLIP2-g-384 for strong image-text alignment
  • DINOv3-7B for high-quality self-supervised dense features
  • SAM3 for segmentation-oriented features and compatibility with the SAM3 decoder

The student is trained so that its dense features match DINOv3 and SAM3, while its summary tokens match SigLIP2 and DINOv3. The result is a single encoder that supports classification, retrieval, dense prediction, and segmentation.
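The split described above (dense targets from some teachers, summary targets from others) can be sketched as a simple combined loss. This is an illustrative toy, not the exact C-RADIOv4 objective: the dict layout, the plain MSE for dense features, and the cosine loss for summary tokens are all assumptions.

```python
import numpy as np

def cosine_summary_loss(s, t):
    """1 - cosine similarity between L2-normalized summary vectors."""
    s = s / np.linalg.norm(s, axis=-1, keepdims=True)
    t = t / np.linalg.norm(t, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

def group_distillation_loss(student, teachers):
    """student/teachers: dicts keyed by teacher name.

    Dense feature targets (e.g. DINOv3, SAM3) are matched with MSE;
    summary-token targets (e.g. SigLIP2, DINOv3) with cosine distance.
    """
    total = 0.0
    for name, tgt in teachers.items():
        if "dense" in tgt:
            total += float(np.mean((student["dense"][name] - tgt["dense"]) ** 2))
        if "summary" in tgt:
            total += cosine_summary_loss(student["summary"][name], tgt["summary"])
    return total
```

In the real model, each teacher also gets its own projection head on the student side; here a single flat dict stands in for that machinery.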

Stochastic multi-resolution training

C-RADIOv4 uses stochastic multi-resolution training instead of a small fixed set of resolutions.

Training samples input size from two partitions:

  • Low resolution: {128, 192, 224, 256, 384, 432}
  • High Resolution: {512, 768, 1024, 1152}
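A minimal sketch of the two-partition sampling is below. The 50/50 split between partitions (`p_high`) is an illustrative assumption; the paper's actual sampling ratio may differ.

```python
import random

LOW_RES = [128, 192, 224, 256, 384, 432]
HIGH_RES = [512, 768, 1024, 1152]

def sample_resolution(rng, p_high=0.5):
    """Pick a training resolution: first choose a partition,
    then sample uniformly within it."""
    pool = HIGH_RES if rng.random() < p_high else LOW_RES
    return rng.choice(pool)
```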

SigLIP2 natively operates at 384 px. Its features are upsampled 3× with FeatSharp to align with the 1152 px SAM3 features. SAM3 is trained at 1152×1152 with mosaic augmentations.

This smooths the performance curve across resolutions and improves low-resolution behavior. For example, on the ADE20k linear probe, C-RADIOv4-H reaches roughly:

  • 55.20 mIoU at 512 px
  • 57.02 mIoU at 1024 px
  • 57.72 mIoU at 1536 px

The scaling trend is close to that of DINOv3-7B, while using roughly an order of magnitude fewer parameters.

Removing teacher noise with shift-equivariant losses and MESA

Distilling from large vision models copies not only their useful structure but also their artifacts. SigLIP2 exhibits boundary noise patterns, and ViTDet-style windowed models can show window-boundary artifacts. Direct feature regression would force the student to reproduce those patterns.

C-RADIOv4 introduces two shift-equivariant mechanisms to suppress such noise:

  1. Shift-equivariant dense loss: the teacher and the student see differently shifted crops of the same image. Before the squared error is computed, the features are aligned via the shift mapping, and the loss uses only the overlapping spatial positions. Because the student never sees exactly the same view as the teacher, it cannot simply memorize position-fixed noise and is instead forced to track input-dependent structure.
  2. Shift-equivariant MESA: C-RADIOv4 also applies MESA-style regularization between the online network and its EMA copy. Here too, the student and its EMA see differently shifted crops, the features are aligned by the shift transform, and the loss is applied after layer normalization. This encourages a smoother loss landscape and greater robustness.
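The core of the first mechanism, computing a loss only on the overlapping region of two shifted views, can be sketched as follows. The convention that the student's view is offset by `(dy, dx)` tokens relative to the teacher's, and the plain MSE, are simplifying assumptions for illustration.

```python
import numpy as np

def shift_equivariant_dense_loss(f_student, f_teacher, dy, dx):
    """f_student, f_teacher: (H, W, C) feature maps of two views of the
    same image, where student position (i, j) corresponds to teacher
    position (i + dy, j + dx). The loss uses only positions where the
    two views overlap, so fixed positional noise cannot be matched."""
    H, W, _ = f_student.shape
    sy, ty = max(-dy, 0), max(dy, 0)   # row offsets into student/teacher
    sx, tx = max(-dx, 0), max(dx, 0)   # column offsets
    h, w = H - abs(dy), W - abs(dx)    # size of the overlap window
    s = f_student[sy:sy + h, sx:sx + w]
    t = f_teacher[ty:ty + h, tx:tx + w]
    return float(np.mean((s - t) ** 2))
```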

In addition, training uses DAMP, which injects multiplicative noise into the weights. This further improves robustness to corruptions and small distribution shifts.
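Multiplicative weight noise in the spirit of DAMP can be sketched as scaling each weight by a random factor around 1 during training. The dict-of-arrays weight layout and the `sigma` value are illustrative assumptions, not DAMP's exact formulation.

```python
import numpy as np

def damp_perturb(weights, rng, sigma=0.1):
    """Return a copy of the weights with multiplicative Gaussian noise:
    each weight is scaled by a factor drawn from N(1, sigma)."""
    return {name: w * rng.normal(1.0, sigma, size=w.shape)
            for name, w in weights.items()}
```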

Balancing teachers with an angular-dispersion-aware summary loss

The summary loss in previous RADIO models used the cosine distance between student and teacher embeddings. Cosine distance removes magnitude but not directional spread on the sphere. Some teachers, such as SigLIP2, produce embeddings concentrated in a narrow cone, while DINOv3 variants produce more dispersed embeddings.

With raw cosine distance, teachers with wide angular dispersion incur large losses and dominate the optimization. In practice, DINOv3 dominated SigLIP2 in the summary term.

C-RADIOv4 replaces it with an angle-normalized loss: the squared angle between the student and teacher embeddings is divided by the teacher's angular dispersion. The measured dispersions put SigLIP2-g-384 at around 0.694, while DINOv3-H+ and DINOv3-7B sit around 2.12 and 2.19. Normalizing by these values equalizes the teachers' influence and preserves both vision-language alignment and dense semantics.
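The normalization step can be sketched directly from the numbers above. The dispersion values come from the article; treating them as a fixed lookup table and averaging over a batch are illustrative simplifications.

```python
import numpy as np

# Angular dispersions reported for each teacher (from the article).
DISPERSION = {"siglip2-g-384": 0.694, "dinov3-h+": 2.12, "dinov3-7b": 2.19}

def angle_normalized_loss(student, teacher, teacher_name):
    """Mean squared angle between embedding pairs, divided by the
    teacher's angular dispersion so no single teacher dominates."""
    s = student / np.linalg.norm(student, axis=-1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=-1, keepdims=True)
    cos = np.clip(np.sum(s * t, axis=-1), -1.0, 1.0)
    angle = np.arccos(cos)  # per-pair angle in radians
    return float(np.mean(angle ** 2) / DISPERSION[teacher_name])
```

For the same angular error, the loss against SigLIP2 (dispersion 0.694) comes out roughly three times larger than against DINOv3-7B (dispersion 2.19), which is exactly the rebalancing the normalization is meant to achieve.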

Performance: classification, dense prediction, and Probe3D

On ImageNet-1k zero-shot classification, C-RADIOv4-H reaches approximately 83.09% top-1 accuracy. It matches or improves on RADIOv2.5-H and C-RADIOv3-H, with its best performance near 1024 px.

On k-NN classification, C-RADIOv4-H improves on RADIOv2.5 and C-RADIOv3, and matches or surpasses DINOv3 from about 256 px upward. DINOv3 peaks around 192-256 px and then degrades, while C-RADIOv4 stays stable or improves at higher resolutions.

Dense and 3D-aware metrics show the desired tradeoffs. On ADE20k, PASCAL VOC, NAVI, and SPair, the C-RADIOv4-H and SO400M variants outperform earlier RADIO models and are competitive with DINOv3-7B on dense benchmarks. For C-RADIOv4-H, typical scores are:

  • ADE20k: 55.20 mIoU
  • VOC: 87.24 mIoU
  • NAVI: 63.44
  • SPair: 60.57

On Probe3D, which covers depth, surface normals, NAVI, and SPair, C-RADIOv4-H achieves the best NAVI and SPair scores in the RADIO family. Depth and surface-normal metrics are close to C-RADIOv3-H, with small differences in both directions rather than a uniform improvement.

Integration with SAM3 and ViTDet-mode deployment

C-RADIOv4 is designed as a drop-in replacement for the Perception Encoder backbone in SAM3; the SAM3 decoder and memory components remain unchanged. A reference implementation is provided in a SAM3 fork. Qualitative examples show that segmentation behavior is preserved both for text prompts such as "shoe", "helmet", "bike", and "spectator" and for box prompts, and in some reported cases the C-RADIOv4-based SAM3 resolves failure cases of the native encoder.

For deployment, C-RADIOv4 exposes a ViTDet-mode layout: most transformer blocks use windowed attention, while a few use global attention. Supported window sizes range from 6×6 to 32×32 tokens, subject to divisibility constraints with the patch size and image resolution. On an A100, the SO400M model with a maximum window size of 12 is faster than the SAM3 ViT-L+ encoder across a wide range of input sizes, and the giant model with a window size of 8 is close in latency.
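The windowed-attention layout rests on a standard ViTDet-style window partition, which can be sketched as a reshape of the token grid. This is the generic partition operation, not C-RADIOv4's actual implementation; the divisibility assert mirrors the constraint mentioned above.

```python
import numpy as np

def window_partition(tokens, win):
    """Split a (H, W, C) token grid into non-overlapping (win x win)
    windows so attention can run within each window independently.
    H and W must be divisible by the window size."""
    H, W, C = tokens.shape
    assert H % win == 0 and W % win == 0, "resolution must divide by window"
    x = tokens.reshape(H // win, win, W // win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)       # (nH, nW, win, win, C)
    return x.reshape(-1, win * win, C)   # (num_windows, win*win, C)
```

Running attention over `win*win` tokens per window instead of all `H*W` tokens is what makes the high-resolution regimes affordable; the few global-attention blocks then propagate information across windows.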

This makes C-RADIOv4 a practical backbone for high-resolution dense tasks where full global attention in every layer is too expensive.

Key takeaways

  1. Single unified backbone: C-RADIOv4 distills SigLIP2-g-384, DINOv3-7B, and SAM3 into one ViT-style encoder that supports classification, retrieval, dense prediction, and segmentation.
  2. Any-resolution behavior: stochastic multi-resolution training on {128…1152} px, plus FeatSharp upsampling for SigLIP2, stabilizes performance across resolutions and tracks DINOv3-7B scaling with far fewer parameters.
  3. Noise suppression via shift equivariance: the shift-equivariant dense loss and shift-equivariant MESA prevent the student from copying the teachers' border and window artifacts, focusing learning on input-dependent semantics.
  4. Balanced multi-teacher distillation: an angular-dispersion-normalized summary loss equalizes the contributions of SigLIP2 and DINOv3, preserving both text alignment and dense representation quality.
  5. SAM3- and ViTDet-ready deployment: C-RADIOv4 can directly replace the SAM3 Perception Encoder, provides ViTDet-mode windowed attention for fast high-resolution inference, and is released under the NVIDIA Open Model License.

Check out the paper, repo, and model releases for more details.

