Open your Azure portal right now. Go to Azure ML workspaces. Count the compute instances.
I'll bet you'll find a pattern: every data scientist has their own GPU VM. Some are running. Some have been idle for days. A few haven't been touched in weeks—but nobody shut them down because "I might need it later." Each one is burning $5-15 per hour whether it's training a model or sitting at a Jupyter login screen.
This is the default state of GPU compute in most ML teams. And it's absurdly wasteful.
I've lived this problem. Our Azure bill had a line item for GPU compute that made finance ask uncomfortable questions. The answer was always the same: data scientists need GPUs, Azure ML Compute Instances are the easiest path, and nobody wants to deal with Kubernetes. So everyone gets their own VM, utilization hovers around 15%, and we pay for 100%.
The move from dedicated VMs to orchestrated Kubernetes GPU workloads cut our GPU spend by more than half. Here's the playbook.
The Azure ML VM Trap
Let's be honest about why teams end up here. Azure ML Compute Instances are seductive:
- Zero friction: Click a button, pick a GPU SKU, get a Jupyter notebook in 5 minutes
- Familiar environment: It's just a VM. Install whatever you want. pip install until it works
- Isolation: My VM, my packages, my data, no conflicts with anyone else
- No Kubernetes knowledge required: The average data scientist shouldn't need to know what a Pod is
These are legitimate advantages. For prototyping and exploration, a dedicated VM is fast and flexible. The problem isn't using VMs—it's using VMs as your production GPU strategy.
Here's what happens at scale:
- 10 data scientists × 1 GPU VM each = 10 GPUs permanently allocated, even when 7 of them are in meetings
- No sharing: When a training job finishes, the GPU sits idle until the next experiment. Nobody else can use it
- No scheduling: If all VMs are busy and someone needs to run an urgent training job, the answer is "wait" or "spin up another VM"
- No autoscaling: VMs don't scale to zero when unused. You pay 24/7 for 9-to-5 usage
- Snowflake environments: Every VM has different packages, different CUDA versions, different "it works on my machine" problems
The total cost of this approach isn't just the Azure bill. It's the wasted GPU hours, the blocked experiments, and the operational fragility of managing a fleet of snowflake VMs.
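To put rough numbers on the waste, here is a minimal Python sketch of the arithmetic. It reuses the figures quoted earlier (10 VMs, the low end of $5-15 per hour, about 15% utilization). The 730-hour month and the function name are my own assumptions for illustration.

```python
HOURS_PER_MONTH = 730  # assumption: average number of hours in a month


def idle_cost_summary(num_vms, hourly_rate, utilization):
    """Return (monthly_bill, wasted_spend, cost_per_useful_gpu_hour)."""
    monthly_bill = num_vms * hourly_rate * HOURS_PER_MONTH
    # Everything paid for while the GPU is not doing work.
    wasted_spend = monthly_bill * (1 - utilization)
    # What one hour of actual GPU work effectively costs.
    cost_per_useful_hour = hourly_rate / utilization
    return monthly_bill, wasted_spend, cost_per_useful_hour


bill, wasted, effective = idle_cost_summary(num_vms=10, hourly_rate=5.0, utilization=0.15)
print(f"bill ${bill:,.0f}/month, wasted ${wasted:,.0f}, "
      f"effective ${effective:.2f} per useful GPU-hour")
```

At 15% utilization, a GPU listed at $5 per hour effectively costs over $33 per hour of real work.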
The Migration Path: VMs to AKS
Moving from Azure ML VMs to AKS GPU orchestration isn't a weekend project. But the path is well-defined:
- Containerize workloads: Package training scripts and inference services into Docker images with pinned dependencies
- Set up AKS with GPU node pools: Terraform-managed, with autoscaling configured from day one
- Implement GPU sharing: MIG, MPS, or time-slicing depending on the workload type
- Add batch scheduling: Kueue for fair queuing and priority management
- Give data scientists a friendly interface: JupyterHub on K8s, or Kubeflow Notebooks—same Jupyter experience, shared GPU backend
The key insight: data scientists don't need to learn Kubernetes. They need a Jupyter notebook that launches on shared infrastructure instead of a dedicated VM. The orchestration is invisible to them.
GPU Sharing: MIG, MPS, and Time-Slicing
Once your workloads are on AKS, NVIDIA gives you three ways to share a single physical GPU. Each has trade-offs.
Multi-Instance GPU (MIG)
MIG physically partitions an A100 or H100 into up to seven isolated instances. Each gets its own memory, cache, and compute cores. Like cutting a pizza into slices before serving—each person gets a guaranteed portion.
When to use it: Inference workloads with predictable memory requirements. MIG profiles come in fixed sizes, so if your model needs about 10GB of VRAM, pick the smallest profile that covers it.
When to avoid it: Training jobs needing the full GPU, or workloads with variable memory needs. MIG partitions are static—defined at node configuration time.
On AKS with Terraform:
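resource "azurerm_kubernetes_cluster_node_pool" "gpu_mig" {
  name                  = "gpumig"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = "Standard_NC24ads_A100_v4" # one A100 80GB per node
  node_count            = 2

  # Read by the NVIDIA GPU Operator's MIG manager, which must be installed separately.
  # The 80GB A100 exposes 3g.40gb slices (3g.20gb is the 40GB card's profile).
  node_labels = {
    "nvidia.com/mig.config" = "all-3g.40gb"
  }
}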
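To make the sizing decision concrete, here is a small Python sketch that picks the smallest MIG profile covering a given memory requirement. The profile table lists the standard A100 80GB profiles. The function name is hypothetical, and usable memory per slice can be somewhat lower than the nominal figure, so treat the result as a starting point.

```python
# (profile name, nominal memory in GB, max instances per A100 80GB)
A100_80GB_PROFILES = [
    ("1g.10gb", 10, 7),
    ("2g.20gb", 20, 3),
    ("3g.40gb", 40, 2),
    ("4g.40gb", 40, 1),
    ("7g.80gb", 80, 1),
]


def smallest_fitting_profile(required_gb):
    """Return (profile, instances_per_gpu) for the smallest profile that fits."""
    # The table is ordered from smallest to largest, so the first match wins.
    for name, memory_gb, instances in A100_80GB_PROFILES:
        if memory_gb >= required_gb:
            return name, instances
    raise ValueError(f"{required_gb}GB does not fit on a single A100 80GB")


print(smallest_fitting_profile(10))
print(smallest_fitting_profile(30))
```

A model that needs 30GB lands on a 3g.40gb slice, which leaves room for only two instances per GPU. That is the kind of trade-off worth checking before fixing a profile in the node pool.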
Multi-Process Service (MPS)
MPS lets multiple CUDA processes run on one GPU at the same time. Their work is funneled through a shared MPS server, so kernels from different processes can execute concurrently instead of taking turns. There is no hardware partitioning: processes compete for the same memory and compute, and fault isolation is weaker than with MIG.
When to use it: Multiple small inference workloads that individually use less than 30% of GPU compute.
When to avoid it: Security-sensitive multi-tenant environments (no hardware isolation between clients) and training workloads that can already saturate a GPU on their own (sharing only adds contention).
Time-Slicing
The simplest approach. The NVIDIA device plugin advertises a single GPU as multiple "virtual" GPUs. With four replicas configured, Kubernetes sees 4 GPUs instead of 1, and workloads take turns on the hardware. Memory is not partitioned, so everything sharing a GPU has to fit in its VRAM together.
When to use it: Development and experimentation, where it is the fastest path to better utilization. This is your direct replacement for the VM model. Instead of 10 data scientists on 10 VMs with 10 GPUs, you have 10 data scientists on 3 GPUs with time-slicing. Same Jupyter experience, about 70% fewer GPUs to pay for.
When to avoid it: Production inference with latency SLAs. Latency becomes unpredictable because each workload has to wait for its turn on the GPU.
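The "10 data scientists on 3 GPUs" figure follows from simple arithmetic, sketched below in Python. It assumes one notebook session per advertised replica and four replicas per physical GPU, and it ignores whether the sessions fit in GPU memory together. The function names are illustrative.

```python
import math


def physical_gpus_needed(users, replicas_per_gpu):
    """Physical GPUs required if each user holds one time-sliced replica."""
    return math.ceil(users / replicas_per_gpu)


def gpu_reduction(dedicated_gpus, shared_gpus):
    """Fraction of GPUs saved compared with one dedicated GPU per user."""
    return 1 - shared_gpus / dedicated_gpus


shared = physical_gpus_needed(users=10, replicas_per_gpu=4)
print(shared, f"{gpu_reduction(10, shared):.0%}")
```

Raising the replica count lowers the GPU count further, but each session then gets a smaller share of compute and memory, so the ratio is worth tuning against real usage.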
My Recommendation
- Time-slicing for development/experimentation (replaces VM sprawl)
- MIG for production inference (guaranteed resources, isolation)
- MPS only for specific high-throughput inference pipelines
Scheduling: Kueue vs. Volcano
Shared GPUs are pointless if your scheduler can't manage who gets them and when. Default Kubernetes scheduling is first-come-first-served—the same "whoever grabs the VM first wins" problem you already have, just on Kubernetes.
Kueue (The Kubernetes Way)
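Kueue is a job queueing system maintained as a Kubernetes SIG project (kubernetes-sigs/kueue). It does not replace the scheduler: Kueue decides when a job is admitted, and the default scheduler still places the pods. It introduces ClusterQueues and LocalQueues to manage resource quotas and priorities.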
What makes Kueue powerful:
- Fair sharing: Define quotas per team. Data scientists get 40%, ML engineers get 40%, experimentation gets 20%
- Preemption: Low-priority notebook sessions get evicted when a high-priority training job needs resources
- Borrowing: If the ML engineering queue is empty, data scientists can temporarily use those GPUs (both queues must belong to the same cohort)
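apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: ml-training
spec:
  cohort: ml-teams        # hypothetical cohort name; borrowing only works within a cohort
  namespaceSelector: {}   # accept workloads from any namespace
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a100-mig-3g   # a ResourceFlavor with this name must exist
      resources:
      - name: "nvidia.com/gpu"
        nominalQuota: 8
        borrowingLimit: 4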
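To show how nominalQuota and borrowingLimit interact, here is a deliberately simplified Python model of the admission decision. It is not Kueue's actual algorithm: it ignores priorities, preemption, and multiple flavors, and the function and parameter names are my own.

```python
def admission_decision(requested, in_use, nominal_quota, borrowing_limit, unused_elsewhere):
    """Return 'admit', 'borrow', or 'queue' for a request of `requested` GPUs.

    unused_elsewhere: GPUs that other queues sharing quota are not using right now.
    """
    total_after = in_use + requested
    if total_after <= nominal_quota:
        return "admit"  # fits inside this queue's own quota
    to_borrow = total_after - nominal_quota
    if to_borrow <= borrowing_limit and to_borrow <= unused_elsewhere:
        return "borrow"  # fits only by using idle quota from other queues
    return "queue"  # wait until capacity frees up


# Quota values from the ClusterQueue above: 8 nominal, up to 4 borrowed.
print(admission_decision(2, 4, 8, 4, unused_elsewhere=0))
print(admission_decision(4, 6, 8, 4, unused_elsewhere=4))
print(admission_decision(8, 6, 8, 4, unused_elsewhere=4))
```

With these settings the queue can never hold more than 12 GPUs, and it only gets past 8 when someone else's quota is sitting idle.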
Volcano
Volcano focuses on gang scheduling—ensuring all pods in a distributed training job start simultaneously or not at all. Prevents the classic deadlock where 3 of 4 required GPU pods start and sit idle waiting for the 4th.
My take: Use Kueue for most workloads. Add Volcano only for distributed training across multiple nodes (Horovod, DeepSpeed, PyTorch DDP).
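The difference between gang scheduling and pod-by-pod placement fits in a few lines of Python. This is an illustrative model with hypothetical function names, not Volcano's implementation.

```python
def naive_admit(free_gpus, pods_in_job, gpus_per_pod=1):
    """Pod-by-pod placement: start as many pods as currently fit."""
    return min(pods_in_job, free_gpus // gpus_per_pod)


def gang_admit(free_gpus, pods_in_job, gpus_per_pod=1):
    """Gang scheduling: start every pod in the job, or none of them."""
    needed = pods_in_job * gpus_per_pod
    return pods_in_job if needed <= free_gpus else 0


# A 4-pod distributed job arrives while only 3 GPUs are free.
print(naive_admit(free_gpus=3, pods_in_job=4))  # 3 pods start and hold GPUs while waiting
print(gang_admit(free_gpus=3, pods_in_job=4))   # nothing starts, GPUs stay available
```

In the naive case, three GPUs are held without any training progress. In the gang case, those GPUs stay free for jobs that can actually run.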
Cost Optimization Patterns
Spot Instances for Training
Azure Spot VMs can save 60-90% on GPU compute, depending on region and SKU. Training jobs are strong candidates as long as they checkpoint: they are long-running, tolerate interruption, and are flexible on timing.
resource "azurerm_kubernetes_cluster_node_pool" "gpu_spot" {
  name                  = "gpuspot"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.main.id
  vm_size               = "Standard_NC24ads_A100_v4"
  priority              = "Spot"
  eviction_policy       = "Delete"
  spot_max_price        = -1 # cap at the on-demand price; no eviction for price reasons

  # Training pods need a matching toleration to land on these nodes.
  node_taints = ["kubernetes.azure.com/scalesetpriority=spot:NoSchedule"]
}
Critical: Always implement checkpointing. Without it, a spot eviction means restarting from scratch. With it, you lose at most one checkpoint interval, for example 30 minutes if you checkpoint every 30 minutes.
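A minimal sketch of checkpoint-and-resume in Python, using only the standard library. The training step is a placeholder, the state is a single step counter, and `stop_after` exists only to simulate an eviction. A real job would save model and optimizer state with its framework's own tools and write to storage that survives the node.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write atomically so an eviction mid-write cannot corrupt the last good checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)  # atomic rename within the same filesystem


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"step": 0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, checkpoint_every, stop_after=None):
    """Run (or resume) training and return the last step reached."""
    state = load_checkpoint(path)
    for step in range(state["step"], total_steps):
        # ... one real training step would go here ...
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
        if stop_after is not None and state["step"] >= stop_after:
            return state["step"]  # simulated spot eviction
    save_checkpoint(path, state)
    return state["step"]
```

If the job is evicted at step 37 with a checkpoint every 10 steps, the next run resumes from step 30 and repeats only 7 steps.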
Scale-to-Zero
This is the single biggest cost saver versus the VM model. Configure the AKS cluster autoscaler to scale GPU node pools to zero when no jobs are pending:
- Set min_count = 0 on GPU node pools
- Use Kueue to queue jobs; once a job is admitted, its pending pods trigger scale-up automatically
- Keep scale-down delays short (around 10 minutes) so idle GPU nodes are released quickly
Azure ML Compute Instances don't work this way. An instance can be stopped manually or by an idle-shutdown schedule, but a stopped instance does not start itself when someone submits a job, and there is no shared queue behind it. Scale-to-zero alone cut our GPU bill by 45% because weekends and nights had near-zero utilization.
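As a rough upper bound on what scale-to-zero can save, the Python sketch below compares always-on hours with the hours a 9-to-5 team actually needs, plus the scale-down delay at the end of each day. The usage pattern (8 hours a day, 5 days a week, one idle period per day) is an assumption for illustration, not measured data.

```python
HOURS_PER_WEEK = 7 * 24


def weekly_billed_hours(active_hours_per_day, workdays, scale_down_delay_min,
                        idle_periods_per_day=1):
    """Hours billed per node per week when the pool scales to zero while idle."""
    active = active_hours_per_day * workdays
    # Nodes keep running for the scale-down delay after each burst of work ends.
    tail = (scale_down_delay_min / 60) * idle_periods_per_day * workdays
    return active + tail


billed = weekly_billed_hours(active_hours_per_day=8, workdays=5, scale_down_delay_min=10)
print(f"{billed:.1f} of {HOURS_PER_WEEK} hours billed ({billed / HOURS_PER_WEEK:.0%})")
```

Real savings come in below this bound, since overnight training runs, node start-up time, and always-on inference all add billed hours.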
Right-Sizing GPU SKUs
Not every workload needs an A100:
| Workload | Recommended SKU | Monthly Cost (approx.) |
|---|---|---|
| Small inference | T4 (NC4as_T4_v3) | ~$500 |
| Medium training | A10 (NC8ads_A10_v4) | ~$1,500 |
| Large training | A100 (NC24ads_A100_v4) | ~$7,500 |
| Distributed training | Multiple A100s | $15,000+ |
With VMs, people pick the biggest GPU "just in case." With Kueue and node affinity, you route workloads to the right SKU automatically.
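The routing rule can be sketched as "cheapest SKU whose GPU memory covers the job". The Python below uses the SKUs and approximate monthly costs from the table. The memory figures are the nominal per-card values and are my assumption for these VM sizes, and the function name is hypothetical.

```python
# (VM size, GPU, assumed GPU memory in GB, approx. monthly cost in USD from the table)
SKUS = [
    ("NC4as_T4_v3", "T4", 16, 500),
    ("NC8ads_A10_v4", "A10", 24, 1500),
    ("NC24ads_A100_v4", "A100", 80, 7500),
]


def cheapest_sku_for(required_vram_gb):
    """Return the cheapest VM size whose GPU memory covers the requirement."""
    candidates = [sku for sku in SKUS if sku[2] >= required_vram_gb]
    if not candidates:
        raise ValueError("no single-GPU SKU fits; consider distributed training")
    return min(candidates, key=lambda sku: sku[3])[0]


print(cheapest_sku_for(8))
print(cheapest_sku_for(20))
print(cheapest_sku_for(40))
```

In a cluster, the same decision would be expressed through node labels and queue configuration rather than application code. The point is that the choice is made by policy, not by whoever clicks the biggest option.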
Monitoring: What to Track
You can't optimize what you can't measure. Compare these to the VM world where your only metric was "is the VM running?":
- GPU utilization % (target: >70% for production, >40% for dev)
- GPU memory utilization (identifies over-provisioned workloads)
- Queue wait time (how long jobs wait for GPUs)
- Cost per training run (track trends, not absolutes)
- Spot eviction rate (if >20%, consider reserved instances for critical workloads)
NVIDIA's DCGM exporter plus Prometheus gives you all of this. On AKS, Azure Monitor also integrates with GPU metrics natively.
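Once the numbers are available, checking them against the targets above is straightforward. This Python sketch assumes the utilization samples have already been pulled from Prometheus (for example from the exporter's DCGM_FI_DEV_GPU_UTIL gauge). The function name and input shape are illustrative.

```python
UTILIZATION_TARGETS = {"production": 70.0, "dev": 40.0}  # percent, from the list above
MAX_SPOT_EVICTION_RATE = 0.20


def review_pool(environment, gpu_util_samples, spot_jobs, spot_evictions):
    """Return a list of human-readable findings for one GPU node pool."""
    findings = []
    average_util = sum(gpu_util_samples) / len(gpu_util_samples)
    target = UTILIZATION_TARGETS[environment]
    if average_util <= target:
        findings.append(
            f"GPU utilization {average_util:.0f}% is below the {target:.0f}% target"
        )
    if spot_jobs > 0 and spot_evictions / spot_jobs > MAX_SPOT_EVICTION_RATE:
        findings.append(
            "spot eviction rate above 20%: consider non-spot capacity for critical jobs"
        )
    return findings


print(review_pool("dev", [10, 20, 15], spot_jobs=10, spot_evictions=1))
print(review_pool("production", [80, 90], spot_jobs=10, spot_evictions=3))
```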
The Terraform Advantage
Everything described—node pools, autoscaler config, MIG profiles, spot priorities—lives in Terraform. Not in click-ops, not in someone's Azure portal bookmark. Infrastructure as code means:
- Reproducible environments across dev/staging/prod
- Version-controlled changes with PR reviews
- Disaster recovery in minutes, not days
- New team members can understand the entire GPU topology by reading the .tf files
If your GPU infrastructure isn't in Terraform, you're carrying the same "snowflake environment" risk you had with VMs—just at a different layer.
The Bottom Line: The path from "every data scientist gets a GPU VM" to orchestrated Kubernetes workloads isn't about Kubernetes for its own sake. It's about going from 15% GPU utilization at full cost to 70%+ utilization at half the cost—while giving your team the same Jupyter experience they already know. Share the hardware, schedule the workloads, scale to zero when idle. Your finance team will thank you.