Serverless GPUs
from the ground up
What "serverless" actually means when a single GPU costs $30/hr to rent — and why that changes everything about how you train models.
5–12 minutes. Every time.
That's how long it takes to start a GPU cluster on Databricks the classic way. You open the compute tab, pick an instance type, choose a runtime version, configure autoscaling, pick a driver version, set up your spot instance policy, and wait. Then you do it again next week when you need slightly different hardware.
I'm not being unfair. This is just the reality of managed GPU infrastructure. It's not a Databricks problem — it's a fundamental property of spinning up cloud VMs. Somebody has to provision the hardware. Somebody has to install CUDA. Somebody has to make sure NCCL can find all the nodes.
For a while, that somebody was you.
Databricks AI Runtime changes that. You open a notebook, click "Connect", select a GPU type from a dropdown, and within seconds to a couple of minutes you're running PyTorch code on a real NVIDIA GPU. No cluster configuration. No driver selection. No autoscaling policy. No idle charges.
That's the headline. But the headline understates what's actually interesting here. In this post, we're going to cover:
• Why GPU cluster setup is genuinely hard — and what infrastructure you're actually dealing with
• What "serverless GPU" means under the hood — where the hardware lives and how your code gets there
• How the distributed training API works — including what that @distributed decorator is doing behind the scenes
• The cost model — and why idle GPU time matters more than you think
• What the tradeoffs are — because there are real ones, and you should know them before you commit
If you already know how CUDA and distributed training work and just want to see the API, feel free to jump straight to the distributed training section. But if you want to understand why the design is the way it is, start here.
Why is GPU cluster setup so complicated, anyway?
To understand what AI Runtime is solving, you need to understand what it's replacing.
When you set up a classic Databricks cluster with GPU workers, you're actually configuring a lot of things: instance type, runtime version, driver version, autoscaling behavior, spot instance policy.
The instance type question alone is genuinely hard. A100s and H100s have different memory bandwidth profiles. A10s are great for inference and fine-tuning but won't saturate high-bandwidth NVLink interconnects. The p4d.24xlarge gives you 8x A100 40GB but a different price profile than p4de.24xlarge with 8x A100 80GB. You're making hardware architecture decisions before you've written a single line of training code.
AI Runtime makes those decisions for you. You pick A10 or H100. Databricks handles the rest.
So where does the compute actually run? In Databricks' own serverless compute plane — not in your AWS/Azure/GCP account. This is the key architectural difference. Classic clusters provision VMs in your cloud account (with your IAM roles, your VPCs, your instance quotas). Serverless GPU runs in Databricks-managed infrastructure. Your code and data still access Unity Catalog as normal, but the compute lives elsewhere.
Starting from scratch: what does a GPU actually need?
Let's build this from the bottom up. Forget Databricks for a second — what does it actually take to run a PyTorch training job across multiple GPUs?
At minimum, you need:
• A physical GPU with enough VRAM to hold your model.
• A CUDA toolkit that matches the GPU driver version.
• A PyTorch installation compiled against that CUDA version.
• A way for multiple GPUs to communicate with each other during training.
The last one is where things get interesting.
The GPU communication problem
Modern LLM training doesn't just use one GPU. A 70B parameter model in 16-bit precision needs about 140GB of GPU memory for the weights alone — way more than a single H100's 80GB. So you split the model across multiple GPUs and run distributed training.
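The arithmetic behind that claim is simple enough to sketch. The 2 bytes/parameter figure assumes bf16/fp16 weights; optimizer state and activations would add substantially more on top:

```python
def model_memory_gb(n_params: float, bytes_per_param: float) -> float:
    # Weights only — Adam optimizer state and activations during training
    # add substantially more memory on top of this.
    return n_params * bytes_per_param / 1e9

# 70B parameters at 2 bytes each (bf16/fp16) ≈ 140 GB,
# far beyond a single H100's 80 GB of VRAM.
print(model_memory_gb(70e9, 2))  # → 140.0
```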
The GPUs need to constantly talk to each other — syncing gradients after each backward pass, sharing activations during forward passes, reducing parameters across devices. This inter-GPU communication is handled by a library called NCCL (NVIDIA Collective Communications Library), and getting NCCL to work correctly across nodes is... not trivial.
NVLink interconnects GPUs within a node at ~900 GB/s. Across nodes, communication falls back to network fabric — which is why multi-node training has higher latency than single-node.
NCCL needs to know the network topology. It needs proper process group initialization. It needs a "rendezvous" mechanism where rank-zero (the coordinator process) broadcasts connection information to all other ranks. If anything in that chain is wrong — the wrong IP, a firewall rule, a mismatched NCCL version — your training job silently hangs or crashes with an error that looks like a networking problem.
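To make the rendezvous concrete, here is an illustrative sketch of the environment a torchrun-style launcher exports to each training process. The variable names are the standard PyTorch ones; the address, port, and 8-GPUs-per-node layout are made-up example values:

```python
def rendezvous_env(rank, world_size, master_addr="10.0.0.1", gpus_per_node=8):
    # The environment a torchrun-style launcher exports to each training
    # process so the NCCL process group can form. Values are illustrative.
    return {
        "MASTER_ADDR": master_addr,        # rank-0's host; all ranks connect here
        "MASTER_PORT": "29500",            # agreed-upon rendezvous port
        "RANK": str(rank),                 # global rank, 0 .. world_size-1
        "WORLD_SIZE": str(world_size),
        "LOCAL_RANK": str(rank % gpus_per_node),  # GPU index within its node
    }

# 16 processes across two 8-GPU nodes: global rank 9 is GPU 1 on the second node
print(rendezvous_env(9, 16)["LOCAL_RANK"])  # → 1
```

If any rank gets a wrong MASTER_ADDR, a blocked port, or a mismatched world size, the process group never forms — which is exactly the class of silent hang described above.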
AI Runtime handles all of this. When you use the @distributed decorator, Databricks automatically sets up the process group, the rendezvous, and the NCCL environment. You don't configure any of it.
The @distributed decorator: what it's actually doing
Here's the thing about the distributed training API that surprised me when I first looked at it: it's doing a lot of work behind a very thin surface area.
The API looks like this:
import torch
from torch.nn.parallel import DistributedDataParallel as DDP
from serverless_gpu import distributed

@distributed(num_gpus=8, gpu_type="H100")
def train(rank, world_size):
    # This function runs on ALL 8 GPUs simultaneously;
    # rank 0 is the coordinator, ranks 1-7 are workers
    model = MyModel().to("cuda")
    model = DDP(model, device_ids=[rank])
    optimizer = torch.optim.AdamW(model.parameters())
    for batch in dataloader:
        optimizer.zero_grad()
        loss = model(batch)
        loss.backward()
        optimizer.step()

# Kick off the distributed run — the notebook blocks until it's done
train.distributed()
Simple, right? But what actually happens when you call train.distributed()?
When you call train.distributed(), the library does three things:
• Snapshots your notebook as the training script — an _air.py file is auto-generated as the entrypoint.
• Snapshots your environment, so import transformers on your notebook works identically on rank-3 of a remote H100.
• Sets up the rendezvous — it sets MASTER_ADDR and MASTER_PORT, assigns ranks 0 through N-1, and launches the function on all GPUs simultaneously using a torchrun-style process launch. This is the same mechanism you'd configure manually for distributed training, just automated.
Does @distributed work with frameworks other than raw PyTorch DDP? Yes. The environment snapshot includes your installed packages, so if you have Hugging Face Accelerate, DeepSpeed, or Axolotl installed, those work as normal inside the function. The Databricks environment v4 ships with Transformers 4.56.1, PEFT 0.17.1, Accelerate 1.10.1, and Ray 2.49.1 pre-installed — so for most fine-tuning workflows you're not installing anything extra.
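The real library does far more than this, but the shape of the API is easy to approximate. Here is a toy sketch — distributed_sketch is hypothetical, and where the real decorator serializes the function and launches one process per remote GPU, this version just loops over "ranks" in-process:

```python
import functools

def distributed_sketch(num_gpus, gpu_type):
    # Toy approximation of the decorator's surface area.
    # gpu_type is accepted but unused in this sketch.
    def wrap(fn):
        @functools.wraps(fn)
        def local(*args, **kwargs):
            return fn(*args, **kwargs)  # still callable for a local dry run
        def launch():
            # Real launch: torchrun-style spawn + NCCL process group.
            # Here: one sequential call per "rank".
            return [fn(rank, num_gpus) for rank in range(num_gpus)]
        local.distributed = launch  # the .distributed() entry point
        return local
    return wrap

@distributed_sketch(num_gpus=4, gpu_type="A10")
def train(rank, world_size):
    return f"rank {rank} of {world_size}"

print(train.distributed()[0])  # → rank 0 of 4
```

The design point the sketch captures: the decorated function stays an ordinary function, and the launch semantics live behind a single attached method rather than a separate launcher CLI.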
The hardware: two GPUs, very different tradeoffs
AI Runtime currently gives you a choice of two GPUs. The choice is real — they're not interchangeable.
A10s are the pragmatic choice. At 24GB per card, you can fine-tune a 7–13B model with LoRA on a single A10 without breaking a sweat. They support multi-node, which means you can spread across more than one physical machine — useful if you need more than 32 A10s (which would be a very large workload).
H100s are the heavy machinery. At 80GB per card with HBM3 memory bandwidth, they're what you want for full-precision fine-tuning of 30B+ models, or when raw throughput matters. The tradeoff is that multi-node H100 is still in Private Preview — so you're currently capped at one 8-GPU node (640GB total VRAM). That's enough for most things, but not everything.
What about A100s? Not currently available in AI Runtime. The H100 is the successor to the A100 with roughly 3× the training throughput on transformer workloads due to Hopper architecture improvements (transformer engine, FP8 support, higher memory bandwidth). Databricks has also announced upcoming support for NVIDIA's Blackwell RTX PRO 4500 Server Edition from GTC 2026, but that doesn't have a Public Preview date yet.
What do people actually use this for?
Let's make this concrete. These are the documented use cases, each with the setup that makes sense for them.
The cost model: why "serverless" changes the economics
The pricing model for AI Runtime is fundamentally different from classic GPU clusters, and the difference matters more than you might expect.
Classic GPU clusters charge you for the entire lifetime of the cluster — including idle time. If your cluster is running but your notebook isn't actively executing, you're still paying. This creates perverse incentives: teams tend to keep clusters alive longer than necessary (to avoid the 5–12 minute restart), which means more idle charges.
AI Runtime auto-terminates after 60 minutes of inactivity. You only pay for active compute seconds. There's no cluster to forget to shut down.
(Cost figures in this section are approximate — based on A10 GPU-hours at typical cloud rates. Actual Databricks DBU pricing varies by contract.)
The billing mechanism itself is also simpler. Classic GPU clusters produce a bill with two line items: Databricks DBUs for the platform and cloud infrastructure costs for the VMs. You get two bills, from two vendors, and reconciling them requires some arithmetic.
AI Runtime bundles both into a single DBU rate under the "Model Training" SKU. One bill. Pay-per-second. That's the operational simplicity argument, independent of the absolute dollar cost.
Is serverless GPU always cheaper, then? Not necessarily. For long-running training jobs where you're saturating the GPU the entire time (say, a multi-day pre-training run), idle-time savings are minimal and the classic cluster might be comparable or cheaper. AI Runtime's cost advantage is sharpest when your GPU utilization is bursty — experimentation, iterative fine-tuning, notebooks where you run training for an hour, adjust hyperparameters, re-run. That's most ML teams, most of the time.
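The bursty-utilization argument is just arithmetic. With a made-up rate of $10/GPU-hour and a workday where the GPU is busy for two hours out of eight:

```python
def classic_cost(hours_alive, rate_per_hour):
    # Classic cluster: billed for the whole cluster lifetime, idle or not.
    return hours_alive * rate_per_hour

def serverless_cost(active_seconds, rate_per_hour):
    # Serverless: billed per active compute second only.
    return active_seconds / 3600 * rate_per_hour

# An 8-hour notebook session where the GPU is actually busy for 2 hours:
print(classic_cost(8, 10.0))            # → 80.0
print(serverless_cost(2 * 3600, 10.0))  # → 20.0
```

At 100% sustained utilization the two converge (per-hour rates aside), which is why the multi-day pre-training case looks different.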
How it fits into the rest of Databricks
AI Runtime isn't an island. It plugs into the Databricks ecosystem in ways that matter for production ML.
MLflow integration is automatic. When you run a distributed training job, Databricks auto-creates an MLflow experiment and starts streaming GPU utilization metrics, memory usage, and system-level signals into it. You don't call mlflow.start_run() — it just happens. If you do add mlflow.pytorch.autolog(), your model parameters and training metrics are also captured. Registered models go straight to Unity Catalog.
Your data stays in Unity Catalog. Training data in Delta tables, Parquet files in Volumes, or anything else governed through UC is accessible from AI Runtime compute with the same permissions model. This is the "Lakehouse-native training" argument: you don't copy data to an S3 bucket and hand IAM keys to your training job. The data governance doesn't change when you add GPUs.
Production orchestration works via Lakeflow Jobs. A notebook using AI Runtime can be scheduled as a Lakeflow Job. The Jobs API accepts a hardware_accelerator field. Your CI/CD pipeline, Databricks Asset Bundles, all the production engineering patterns — they work with serverless GPU exactly as they do with SQL warehouses and classic compute.
The real tradeoffs (don't skip this section)
I'd be doing you a disservice if I just described the benefits and stopped there. AI Runtime is not the right choice for every workload. Here's what you give up.
No RDD APIs. Classic clusters give you full Spark — DataFrames, RDDs, Datasets, Scala, R. AI Runtime only supports Spark Connect, which is a remote Spark connection over gRPC. Most DataFrame operations work fine, but if your code uses RDD-level operations or Scala UDFs, it won't run on serverless GPU.
No custom containers. With classic clusters you can use Databricks Container Services to build your own Docker image with whatever dependencies you need. Serverless GPU doesn't support this — you're working within the managed environment or installing packages at notebook startup. For most ML work this is fine, but if you have complex native library requirements, it could be a blocker.
No HIPAA or PCI compliance workspaces. The compute runs in Databricks' serverless plane, not in your cloud account's compliance boundary. If your data governance requirements mandate that all compute run within your VPC, AI Runtime doesn't qualify.
7-day maximum runtime. Serious pre-training runs can go for weeks. AI Runtime caps a single run at 7 days. For fine-tuning, that's almost never a constraint. For pre-training from scratch on large datasets, you'd need to implement checkpoint-and-restart logic.
Slower data loading for very large datasets. Your Unity Catalog Volumes are remote storage, accessed over the network. For large training datasets (tens of millions of samples), reading from /Volumes can bottleneck your GPU — the data pipeline can't keep up with what the GPU wants to consume. The recommended pattern is to cache preprocessed batches to local NVMe (/tmp) first, or use MosaicML Streaming for online data loading.
How fast is "fast"?
The startup time story is worth its own section because it's one of the biggest operational differences.
Startup time for AI Runtime depends on GPU pool availability. A10 pools are typically larger; H100s in high demand can take longer. You can also use reserved GPU pools (DATABRICKS_USE_POOL flag) to guarantee faster allocation.
The A10 case is genuinely fast. If a GPU is available in the pool, your notebook is talking to a live GPU in under two minutes — often much less. The H100 case is more variable, because H100 demand across the platform is higher.
For iterative experimentation, even the "slow" AI Runtime case (8 minutes for a contested H100) is competitive with a classic cluster cold start. And if you keep your serverless connection alive between runs, subsequent runs don't incur startup time at all.
The part where I admit what surprised me
When I first looked at AI Runtime, I assumed "serverless GPU" meant a convenience layer that would be fine for prototyping but not serious training. The 7-day limit, the no-custom-containers constraint, the Spark Connect limitation — I read those as signs that this was a product aimed at beginners.
I was wrong about that.
FactSet is using it to build a production Text-to-Formula model. Rivian is training multimodal vehicle AI. Databricks' own research team used it for a published reinforcement learning paper. These are not prototype use cases. The operational improvements — automatic MLflow, Unity Catalog integration, no idle billing, two-click distributed training — are genuinely useful at production scale, not just for experimentation.
The constraints are real, but they're narrow. If your workload is fine-tuning (which most enterprise ML is), doesn't require HIPAA compliance, doesn't depend on RDDs, and runs in under 7 days, AI Runtime removes a lot of overhead that wasn't adding value.
The main thing I want you to take away from this post is that serverless GPU is ready for production fine-tuning workflows. The tradeoffs are specific and knowable. If your workload fits within them — and most enterprise fine-tuning does — you get distributed GPU training with two-click setup, automatic MLflow tracking, no idle billing, and deep Lakehouse integration. That's a different product category from "GPU cluster with easier UI."
AI Runtime (Serverless GPU) is currently in Public Preview on AWS (us-east-1, us-west-2), Azure, and GCP. Workspace admins need to enable the preview in their settings before notebooks can connect to serverless GPU.
If you run into anything unexpected — data loading bottlenecks, NCCL errors, environment hydration failures — the best practices guide in the docs covers most of the common failure modes.