GPUs power today’s most advanced AI workloads – from predictions and recommendations to multimodal foundation models. However, teams struggle to procure and manage GPU infrastructure, configure distributed training environments, and overcome data loading bottlenecks. Deep learning researchers would rather focus on modeling than on troubleshooting infrastructure.
We’re excited to announce the public preview of AI Runtime (AIR), a new training stack that enables on-demand, distributed GPU training on A10s and H100s. AI Runtime includes the same technologies used for large-scale training of LLMs like MPT and DBRX. Even in beta, hundreds of customers, including Rivian, FactSet, and YipitData, have used AIR to train deep learning models and ship them to production. Use cases span computer vision models, recommender systems, and fine-tuning LLMs for agentic tasks. Our own Databricks AI research team used AIR for reinforcement learning in our recent CARL paper.
With AI Runtime, Databricks users now have:
- Serverless, on-demand NVIDIA GPUs: Configure your notebook in just 2-3 clicks and connect to fast serverless A10 and H100 GPUs to start training – no cluster setup required. Pay only for the GPUs you use, with no charge for idle time.
- Robust orchestration tools: Harness the full power of Databricks’ orchestration suite, with Lakeflow Jobs and DAB support for long-running GPU workloads
- Optimized training stack: AIR bundles GPU performance enhancements such as RDMA and high-performance data loading
- Centralized governance and oversight: Run, inspect, and control GPU workloads where your data lives, with built-in experiment management through MLflow, access management with Unity Catalog, and agent-assisted debugging.
On-demand NVIDIA H100 and A10 GPUs in notebooks
Connect on-demand A10s and H100s to Databricks notebooks with just a few clicks for interactive development and debugging. From there, take advantage of all the developer ergonomics Databricks is known for, from environment management to agent-driven authoring and debugging with Genie code for common Python packages. Easily mount data from the Lakehouse to train deep learning models, or even spin up a fleet of remote CPUs from your GPU-powered notebook to run Spark data processing workloads that prepare your data.
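For example, here is a minimal sketch of what loading Lakehouse data into a PyTorch training loop can look like from a notebook. The table name is a hypothetical placeholder, and `spark` is the session that Databricks notebooks provide automatically:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical Unity Catalog table, used for illustration only.
# `spark` is the SparkSession available in every Databricks notebook.
df = spark.read.table("main.demo.training_features").toPandas()

# Assumes a numeric feature table with a "label" column (placeholder schema).
features = torch.tensor(df.drop(columns=["label"]).values, dtype=torch.float32)
labels = torch.tensor(df["label"].values, dtype=torch.float32)

loader = DataLoader(TensorDataset(features, labels), batch_size=256, shuffle=True)

device = "cuda" if torch.cuda.is_available() else "cpu"
for batch_features, batch_labels in loader:
    batch_features = batch_features.to(device)
    # ... forward/backward pass for your model goes here ...
    break  # single batch shown for illustration
```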

Use Genie code to help resolve performance bottlenecks, experiment with new architectures, or debug tricky issues such as model convergence problems or cryptic framework errors.
Lakeflow for production-ready workloads
AI Runtime is a production-grade platform for accelerated computing. Develop your deep learning code in an interactive notebook, then harness the full power of Lakeflow to submit and orchestrate jobs on GPU compute. Lakeflow can execute both notebooks and custom code repositories as long-running or scheduled tasks. For production requirements such as CI/CD (continuous integration and continuous deployment), AI Runtime is fully compatible with Databricks Asset Bundles (DABs).
With our Lakeflow integration, customers can keep model training and fine-tuning tightly synchronized with upstream data pipelines and downstream production systems.
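As a rough illustration, a scheduled training job can also be created programmatically with the Databricks Python SDK. The job name, notebook path, and schedule below are placeholders, and the serverless GPU compute configuration itself is covered in the documentation:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service import jobs

# Picks up workspace credentials from the environment or a config profile.
w = WorkspaceClient()

# Placeholder job name and notebook path; compute settings for serverless
# GPU jobs are configured as described in the docs.
job = w.jobs.create(
    name="nightly-finetune",
    tasks=[
        jobs.Task(
            task_key="train",
            notebook_task=jobs.NotebookTask(
                notebook_path="/Workspace/Users/me/train"
            ),
        )
    ],
    schedule=jobs.CronSchedule(
        quartz_cron_expression="0 0 2 * * ?",  # run daily at 2 AM
        timezone_id="UTC",
    ),
)
print(f"Created job {job.job_id}")
```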
“Databricks’ AI Runtime greatly streamlined the process of training a custom Text to Formula (TTF) model. With no infrastructure setup or delays, it was easy to choose the right compute based on instance size and output token generation. This allowed us to move quickly, stay within our Lakehouse workflow, and deliver a high-quality model with full governance, reducing the time it took to set up, train, and deploy our model from days to hours.” – Nikhil Sundarraj, Principal Machine Learning Engineer, FactSet Research Systems, Inc.

Runtime optimized for distributed deep learning
Setting up, debugging, and inspecting distributed training workloads can be painful. From troubleshooting RDMA setup to tracking telemetry across multiple GPUs to getting the software configuration right, users can easily miss critical details that dramatically slow down model training.
AI Runtime, by contrast, is optimized for the entire deep learning lifecycle – and designed to save you time. Key dependencies like PyTorch and CUDA come pre-installed, along with optimized support for distributed training frameworks like Ray, Hugging Face Transformers, Composer, and other libraries, so you can start training immediately without managing the environment. Customers are also welcome to bring their own libraries, from Unsloth to TorchRec to custom training loops.
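As one example of the kind of code that runs on this stack, here is a minimal PyTorch DistributedDataParallel sketch with a toy model and synthetic data, assuming it is launched with `torchrun` so that each worker process receives its rank via environment variables:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE for each worker process.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Toy model and data, standing in for a real architecture and dataset.
    model = torch.nn.Linear(128, 1).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

    for step in range(10):
        inputs = torch.randn(64, 128, device=local_rank)
        targets = torch.randn(64, 1, device=local_rank)
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
        optimizer.zero_grad()
        loss.backward()  # gradients are all-reduced across GPUs here
        optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```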

An integrated SDK and observability tools simplify management of distributed training workloads. MLflow enables deep observability into GPU workloads, with automated tracking of GPU usage and training experiments. Whether you’re fine-tuning foundation models or training prediction and personalization models, the runtime is optimized to accelerate the training workflow with minimal setup.
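On the experiment-tracking side, a minimal MLflow sketch might look like the following; the run name, parameter, and loss values are illustrative placeholders, and GPU utilization metrics are captured by the platform rather than by this code:

```python
import mlflow

# Autologging captures parameters, metrics, and models for supported
# frameworks such as PyTorch and Transformers.
mlflow.autolog()

with mlflow.start_run(run_name="finetune-demo"):
    mlflow.log_param("learning_rate", 1e-4)  # illustrative hyperparameter
    for epoch in range(3):
        train_loss = 1.0 / (epoch + 1)  # placeholder for a real training loop
        mlflow.log_metric("train_loss", train_loss, step=epoch)
```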

Today’s public preview of AI Runtime supports distributed training on up to 8 H100s in a single node, with multi-node support currently in private preview.
“Databricks’ AI Runtime enables us to efficiently run LLM workloads (fine-tuning and inference) directly in our lakehouse, without managing infrastructure. This seamless integration simplifies our pipelines and provides efficient use of GPUs, allowing us to deliver high-quality AI insights to our customers and focus on innovation, not infrastructure.” – Lucas Froguel, Senior AI Platform Engineer, YipitData
Centralized data governance and observability
AI Runtime integrates seamlessly with the Databricks Lakehouse, enabling you to run GPU workloads where your data lives. This eliminates fragmented workflows and streamlines the path from experimentation to production.
- Centralized governance with Unity Catalog: Enforce consistent access controls, lineage, and governance policies across both data and AI workloads, enabling secure and compliant use of GPU resources.
- Integrated observability: Track and monitor all workloads – CPU and GPU – in one place using native system tables for integrated auditing, usage tracking, and operational insights (see the sketch after this list).
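As a hedged example of what usage tracking with system tables can look like, the following query aggregates billing usage by SKU; the exact columns and SKU names available depend on your workspace:

```python
# Illustrative query against the Databricks billing system table;
# adjust the SKU filter and columns to match your workspace.
usage = spark.sql("""
    SELECT usage_date, sku_name, SUM(usage_quantity) AS total_usage
    FROM system.billing.usage
    WHERE sku_name LIKE '%GPU%'
    GROUP BY usage_date, sku_name
    ORDER BY usage_date DESC
""")
usage.show()
```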
Your AI workloads run entirely within your enterprise data perimeter, providing strong governance and security without sacrificing flexibility for experimentation and scale.
“Leveraging Databricks’ serverless GPU support within our Lakehouse enables us to efficiently train advanced audio and multimodal models without the infrastructure overhead. This seamless integration simplifies workflows and provides efficient use of GPU resources, ensuring we deliver high-performance systems and focus on innovation.”– Arjun Shiva, Vice President of Infotainment and Connectivity, Rivian and Volkswagen Group Technologies
Integrating next-generation GPU innovation from NVIDIA
There is increasing demand for accelerated computing across AI workloads and agentic systems. AI Runtime enables more Databricks customers to leverage NVIDIA hardware to accelerate their AI workloads and advance their business. We’re excited to continue partnering with NVIDIA to bring the latest NVIDIA technology, like the RTX Pro 4500 Blackwell Server Edition announced at GTC 2026, to our customers.
“As AI adoption accelerates across industries, organizations need scalable, high-performance infrastructure to power their data and AI workloads. NVIDIA technologies bring accelerated performance to the AI Runtime offering for the Databricks Lakehouse platform.” – Pat Lee, Vice President, Strategic Partnerships at NVIDIA
Get started with AI Runtime today
To help you get started, we’ve put together several template notebooks and starter guides:
- See our documentation for detailed instructions on setup and everyday use.
- Explore starter templates for training recommender systems, classic ML models, fine-tuning LLMs, and more!
- Follow the migration guide to move GPU workloads from classic compute to serverless.
To learn more or if you have any questions, please contact your account team!