Serverless GPUs
from the ground up
What "serverless" actually means when a single GPU costs $30/hr to rent — and why that changes everything about how you train models.
5–12 minutes. Every time.
That's how long it takes to start a GPU cluster on Databricks the classic way. You open the compute tab, pick an instance type, choose a runtime version, configure autoscaling, pick a driver version, set up your spot instance policy, and wait. Then you do it again next week when you need slightly different hardware.
I'm not being unfair. This is just the reality of managed GPU infrastructure. It's not a Databricks problem — it's a fundamental property of spinning up cloud VMs. Somebody has to provision the hardware. Somebody has to install CUDA. Somebody has to make sure NCCL can find all the nodes.
For a while, that somebody was you.
Databricks AI Runtime changes that. You open a notebook, click "Connect", select a GPU type from a dropdown, and within seconds to a couple of minutes you're running PyTorch code on a real NVIDIA GPU. No cluster configuration. No driver selection. No autoscaling policy. No idle charges.
That's the headline. But the headline understates what's actually interesting here. In this post, we're going to cover:
• Why GPU cluster setup is genuinely hard — and what infrastructure you're actually dealing with
• What "serverless GPU" means under the hood — where the hardware lives and how your code gets there
• How the distributed training API works — including what that @distributed decorator is doing behind the scenes
• The cost model — and why idle GPU time matters more than you think
• What the tradeoffs are — because there are real ones, and you should know them before you commit
If you already know how CUDA and distributed training work and just want to see the API, feel free to jump straight to the distributed training section. But if you want to understand why the design is the way it is, start here.
Why is GPU cluster setup so complicated, anyway?
To understand what AI Runtime is solving, you need to understand what it's replacing.
When you set up a classic Databricks cluster with GPU workers, you're actually configuring a lot of things: instance type, runtime version, driver version, autoscaling behavior, spot instance policy.
The instance type question alone is genuinely hard. A100s and H100s have different memory bandwidth profiles. A10s are great for inference and fine-tuning but won't saturate high-bandwidth NVLink interconnects. The p4d.24xlarge gives you 8x A100 40GB but a different price profile than p4de.24xlarge with 8x A100 80GB. You're making hardware architecture decisions before you've written a single line of training code.
AI Runtime makes those decisions for you. You pick A10 or H100. Databricks handles the rest.
So where does the compute actually run? In Databricks' own serverless compute plane — not in your AWS/Azure/GCP account. This is the key architectural difference. Classic clusters provision VMs in your cloud account (with your IAM roles, your VPCs, your instance quotas). Serverless GPU runs in Databricks-managed infrastructure. Your code and data still access Unity Catalog as normal, but the compute lives elsewhere.
Starting from scratch: what does a GPU actually need?
Let's build this from the bottom up. Forget Databricks for a second — what does it actually take to run a PyTorch training job across multiple GPUs?
At minimum, you need:
• A physical GPU with enough VRAM to hold your model.
• A CUDA toolkit that matches the GPU driver version.
• A PyTorch installation compiled against that CUDA version.
• A way for multiple GPUs to communicate with each other during training.
The last one is where things get interesting.
The GPU communication problem
Modern LLM training doesn't just use one GPU. A 70B parameter model in 16-bit precision needs about 140GB of GPU memory for the weights alone — way more than a single H100's 80GB. So you split the model across multiple GPUs and run distributed training.
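The arithmetic behind that claim is simple enough to sketch. The 2 bytes/parameter figure assumes bf16/fp16 weights; optimizer state and activations would add substantially more on top:

```python
def model_memory_gb(n_params: float, bytes_per_param: float) -> float:
    # Weights only — Adam optimizer state and activations during training
    # add substantially more memory on top of this.
    return n_params * bytes_per_param / 1e9

# 70B parameters at 2 bytes each (bf16/fp16) ≈ 140 GB,
# far beyond a single H100's 80 GB of VRAM.
print(model_memory_gb(70e9, 2))  # → 140.0
```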
The GPUs need to constantly talk to each other — syncing gradients after each backward pass, sharing activations during forward passes, reducing parameters across devices. This inter-GPU communication is handled by a library called NCCL (NVIDIA Collective Communications Library), and getting NCCL to work correctly across nodes is... not trivial.
NVLink interconnects GPUs within a node at ~900 GB/s. Across nodes, communication falls back to network fabric — which is why multi-node training has higher latency than single-node.
NCCL needs to know the network topology. It needs proper process group initialization. It needs a "rendezvous" mechanism where rank-zero (the coordinator process) broadcasts connection information to all other ranks. If anything in that chain is wrong — the wrong IP, a firewall rule, a mismatched NCCL version — your training job silently hangs or crashes with an error that looks like a networking problem.
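To make the rendezvous concrete, here is an illustrative sketch of the environment a torchrun-style launcher exports to each training process. The variable names are the standard PyTorch ones; the address, port, and 8-GPUs-per-node layout are made-up example values:

```python
def rendezvous_env(rank, world_size, master_addr="10.0.0.1", gpus_per_node=8):
    # The environment a torchrun-style launcher exports to each training
    # process so the NCCL process group can form. Values are illustrative.
    return {
        "MASTER_ADDR": master_addr,        # rank-0's host; all ranks connect here
        "MASTER_PORT": "29500",            # agreed-upon rendezvous port
        "RANK": str(rank),                 # global rank, 0 .. world_size-1
        "WORLD_SIZE": str(world_size),
        "LOCAL_RANK": str(rank % gpus_per_node),  # GPU index within its node
    }

# 16 processes across two 8-GPU nodes: global rank 9 is GPU 1 on the second node
print(rendezvous_env(9, 16)["LOCAL_RANK"])  # → 1
```

If any rank gets a wrong MASTER_ADDR, a blocked port, or a mismatched world size, the process group never forms — which is exactly the class of silent hang described above.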
AI Runtime handles all of this. When you use the @distributed decorator, Databricks automatically sets up the process group, the rendezvous, and the NCCL environment. You don't configure any of it.
The @distributed decorator: what it's actually doing
Here's the thing about the distributed training API that surprised me when I first looked at it: it's doing a lot of work behind a very thin surface area.
The API looks like this:
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from serverless_gpu import distributed

@distributed(num_gpus=8, gpu_type="H100")
def train(rank, world_size):
    # This function runs on ALL 8 GPUs simultaneously;
    # rank 0 is the coordinator, ranks 1-7 are workers
    model = MyModel().to("cuda")
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.AdamW(model.parameters())
    for batch in dataloader:
        optimizer.zero_grad()
        loss = model(batch)
        loss.backward()
        optimizer.step()

# Kick off the distributed run — the notebook blocks until it's done
train.distributed()
Simple, right? But what actually happens when you call train.distributed()?
When you call train.distributed(), the library does three things:
• Snapshots your notebook as the training script — an _air.py file is auto-generated as the entrypoint.
• Snapshots your environment, so import transformers on your notebook works identically on rank-3 of a remote H100.
• Sets up the rendezvous — it sets MASTER_ADDR and MASTER_PORT, assigns ranks 0 through N-1, and launches the function on all GPUs simultaneously using a torchrun-style process launch. This is the same mechanism you'd configure manually for distributed training, just automated.
Does @distributed work with frameworks other than raw PyTorch DDP? Yes. The environment snapshot includes your installed packages, so if you have Hugging Face Accelerate, DeepSpeed, or Axolotl installed, those work as normal inside the function. The Databricks environment v4 ships with Transformers 4.56.1, PEFT 0.17.1, Accelerate 1.10.1, and Ray 2.49.1 pre-installed — so for most fine-tuning workflows you're not installing anything extra.
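The real library does far more than this, but the shape of the API is easy to approximate. Here is a toy sketch — distributed_sketch is hypothetical, and where the real decorator serializes the function and launches one process per remote GPU, this version just loops over "ranks" in-process:

```python
import functools

def distributed_sketch(num_gpus, gpu_type):
    # Toy approximation of the decorator's surface area.
    # gpu_type is accepted but unused in this sketch.
    def wrap(fn):
        @functools.wraps(fn)
        def local(*args, **kwargs):
            return fn(*args, **kwargs)  # still callable for a local dry run
        def launch():
            # Real launch: torchrun-style spawn + NCCL process group.
            # Here: one sequential call per "rank".
            return [fn(rank, num_gpus) for rank in range(num_gpus)]
        local.distributed = launch  # the .distributed() entry point
        return local
    return wrap

@distributed_sketch(num_gpus=4, gpu_type="A10")
def train(rank, world_size):
    return f"rank {rank} of {world_size}"

print(train.distributed()[0])  # → rank 0 of 4
```

The design point the sketch captures: the decorated function stays an ordinary function, and the launch semantics live behind a single attached method rather than a separate launcher CLI.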
The hardware: two GPUs, very different tradeoffs
AI Runtime currently gives you a choice of two GPUs. The choice is real — they're not interchangeable.
A10s are the pragmatic choice. At 24GB per card, you can fine-tune a 7–13B model with LoRA on a single A10 without breaking a sweat. They support multi-node, which means you can spread across more than one physical machine — useful if you need more than 32 A10s (which would be a very large workload).
H100s are the heavy machinery. At 80GB per card with HBM3 memory bandwidth, they're what you want for full-precision fine-tuning of 30B+ models, or when raw throughput matters. The tradeoff is that multi-node H100 is still in Private Preview — so you're currently capped at one 8-GPU node (640GB total VRAM). That's enough for most things, but not everything.
What about A100s? Not currently available in AI Runtime. The H100 is the successor to the A100 with roughly 3× the training throughput on transformer workloads due to Hopper architecture improvements (transformer engine, FP8 support, higher memory bandwidth). Databricks has also announced upcoming support for NVIDIA's Blackwell RTX PRO 4500 Server Edition from GTC 2026, but that doesn't have a Public Preview date yet.
What do people actually use this for?
Let's make this concrete. These are the documented use cases, each with the setup that makes sense for them.
The cost model: why "serverless" changes the economics
The pricing model for AI Runtime is fundamentally different from classic GPU clusters, and the difference matters more than you might expect.
Classic GPU clusters charge you for the entire lifetime of the cluster — including idle time. If your cluster is running but your notebook isn't actively executing, you're still paying. This creates perverse incentives: teams tend to keep clusters alive longer than necessary (to avoid the 5–12 minute restart), which means more idle charges.
AI Runtime auto-terminates after 60 minutes of inactivity. You only pay for active compute seconds. There's no cluster to forget to shut down.
(Cost figures in this section are approximate — based on A10 GPU-hours at typical cloud rates. Actual Databricks DBU pricing varies by contract.)
The billing mechanism itself is also simpler. Classic GPU clusters produce a bill with two line items: Databricks DBUs for the platform and cloud infrastructure costs for the VMs. You get two bills, from two vendors, and reconciling them requires some arithmetic.
AI Runtime bundles both into a single DBU rate under the "Model Training" SKU. One bill. Pay-per-second. That's the operational simplicity argument, independent of the absolute dollar cost.
Is serverless GPU always cheaper, then? Not necessarily. For long-running training jobs where you're saturating the GPU the entire time (say, a multi-day pre-training run), idle-time savings are minimal and the classic cluster might be comparable or cheaper. AI Runtime's cost advantage is sharpest when your GPU utilization is bursty — experimentation, iterative fine-tuning, notebooks where you run training for an hour, adjust hyperparameters, re-run. That's most ML teams, most of the time.
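The bursty-utilization argument is just arithmetic. With a made-up rate of $10/GPU-hour and a workday where the GPU is busy for two hours out of eight:

```python
def classic_cost(hours_alive, rate_per_hour):
    # Classic cluster: billed for the whole cluster lifetime, idle or not.
    return hours_alive * rate_per_hour

def serverless_cost(active_seconds, rate_per_hour):
    # Serverless: billed per active compute second only.
    return active_seconds / 3600 * rate_per_hour

# An 8-hour notebook session where the GPU is actually busy for 2 hours:
print(classic_cost(8, 10.0))            # → 80.0
print(serverless_cost(2 * 3600, 10.0))  # → 20.0
```

At 100% sustained utilization the two converge (per-hour rates aside), which is why the multi-day pre-training case looks different.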
How it fits into the rest of Databricks
AI Runtime isn't an island. It plugs into the Databricks ecosystem in ways that matter for production ML.
MLflow integration is automatic. When you run a distributed training job, Databricks auto-creates an MLflow experiment and starts streaming GPU utilization metrics, memory usage, and system-level signals into it. You don't call mlflow.start_run() — it just happens. If you do add mlflow.pytorch.autolog(), your model parameters and training metrics are also captured. Registered models go straight to Unity Catalog.
Your data stays in Unity Catalog. Training data in Delta tables, Parquet files in Volumes, or anything else governed through UC is accessible from AI Runtime compute with the same permissions model. This is the "Lakehouse-native training" argument: you don't copy data to an S3 bucket and hand IAM keys to your training job. The data governance doesn't change when you add GPUs.
Production orchestration works via Lakeflow Jobs. A notebook using AI Runtime can be scheduled as a Lakeflow Job. The Jobs API accepts a hardware_accelerator field. Your CI/CD pipeline, Databricks Asset Bundles, all the production engineering patterns — they work with serverless GPU exactly as they do with SQL warehouses and classic compute.
The real tradeoffs (don't skip this section)
I'd be doing you a disservice if I just described the benefits and stopped there. AI Runtime is not the right choice for every workload. Here's what you give up.
No RDD APIs. Classic clusters give you full Spark — DataFrames, RDDs, Datasets, Scala, R. AI Runtime only supports Spark Connect, which is a remote Spark connection over gRPC. Most DataFrame operations work fine, but if your code uses RDD-level operations or Scala UDFs, it won't run on serverless GPU.
No custom containers. With classic clusters you can use Databricks Container Services to build your own Docker image with whatever dependencies you need. Serverless GPU doesn't support this — you're working within the managed environment or installing packages at notebook startup. For most ML work this is fine, but if you have complex native library requirements, it could be a blocker.
No HIPAA or PCI compliance workspaces. The compute runs in Databricks' serverless plane, not in your cloud account's compliance boundary. If your data governance requirements mandate that all compute run within your VPC, AI Runtime doesn't qualify.
7-day maximum runtime. Serious pre-training runs can go for weeks. AI Runtime caps a single run at 7 days. For fine-tuning, that's almost never a constraint. For pre-training from scratch on large datasets, you'd need to implement checkpoint-and-restart logic.
Slower data loading for very large datasets. Your Unity Catalog Volumes are remote storage, accessed over the network. For large training datasets (tens of millions of samples), reading from /Volumes can bottleneck your GPU — the data pipeline can't keep up with what the GPU wants to consume. The recommended pattern is to cache preprocessed batches to local NVMe (/tmp) first, or use MosaicML Streaming for online data loading.
How fast is "fast"?
The startup time story is worth its own section because it's one of the biggest operational differences.
Startup time for AI Runtime depends on GPU pool availability. A10 pools are typically larger; H100s in high demand can take longer. You can also use reserved GPU pools (DATABRICKS_USE_POOL flag) to guarantee faster allocation.
The A10 case is genuinely fast. If a GPU is available in the pool, your notebook is talking to a live GPU in under two minutes — often much less. The H100 case is more variable, because H100 demand across the platform is higher.
For iterative experimentation, even the "slow" AI Runtime case (8 minutes for a contested H100) is competitive with a classic cluster cold start. And if you keep your serverless connection alive between runs, subsequent runs don't incur startup time at all.
The part where I admit what surprised me
When I first looked at AI Runtime, I assumed "serverless GPU" meant a convenience layer that would be fine for prototyping but not serious training. The 7-day limit, the no-custom-containers constraint, the Spark Connect limitation — I read those as signs that this was a product aimed at beginners.
I was wrong about that.
FactSet is using it to build a production Text-to-Formula model. Rivian is training multimodal vehicle AI. Databricks' own research team used it for a published reinforcement learning paper. These are not prototype use cases. The operational improvements — automatic MLflow, Unity Catalog integration, no idle billing, two-click distributed training — are genuinely useful at production scale, not just for experimentation.
The constraints are real, but they're narrow. If your workload is fine-tuning (which most enterprise ML is), doesn't require HIPAA compliance, doesn't depend on RDDs, and runs in under 7 days, AI Runtime removes a lot of overhead that wasn't adding value.
The main thing I want you to take away from this post is that serverless GPU is ready for production fine-tuning workflows. The tradeoffs are specific and knowable. If your workload fits within them — and most enterprise fine-tuning does — you get distributed GPU training with two-click setup, automatic MLflow tracking, no idle billing, and deep Lakehouse integration. That's a different product category from "GPU cluster with easier UI."
AI Runtime (Serverless GPU) is currently in Public Preview on AWS (us-east-1, us-west-2), Azure, and GCP. Workspace admins need to enable the preview in their settings before notebooks can connect to serverless GPU.
If you run into anything unexpected — data loading bottlenecks, NCCL errors, environment hydration failures — the best practices guide in the docs covers most of the common failure modes.