
Scaling ML with Kubernetes: What Production Actually Looks Like

March 7, 2026 · 7 min read

Every Kubernetes tutorial for ML covers the same ground. Containerise your model. Set up horizontal pod autoscaling. Use namespaces. Great. Now your first training job fails silently at 3x the dataset size and you have no idea why.

This is what scaling ML workloads with Kubernetes actually looks like once you are past the tutorials and running real pipelines in production.

Why Kubernetes for ML

The reason Kubernetes works for ML infrastructure is resource isolation and declarative scheduling. ML teams run wildly different workload shapes simultaneously: long-running training jobs that need full GPU nodes, short inference pods that need fractions of a GPU, data preprocessing jobs that need high memory and no GPU, and experiment tracking services that run 24/7 at low resource usage.

Without a cluster scheduler that handles all of these on shared infrastructure with isolation guarantees, you either overprovision (expensive) or have teams blocking each other (slow). Kubernetes solves this when configured correctly.

The GPU scheduling problem most teams hit first

The default Kubernetes scheduler does not understand GPU topologies. It knows a node has GPUs but it does not know whether two GPUs share NVLink bandwidth, whether you need GPUs from the same NUMA node for a multi-GPU job, or whether a node is already running a memory-intensive job that will compete with yours.

We had a client running 4-GPU training jobs on 8-GPU nodes. Half the time the scheduler placed the job across four GPUs on different PCIe switches with no NVLink. Training was 30 to 40% slower than it should have been, consistently and silently: the jobs still finished, just not efficiently.

Fix | What it does | Where to configure
NVIDIA GPU Operator | Installs device plugin, DCGM exporter, node feature discovery | Helm chart deployment
Topology-aware scheduling | Ensures multi-GPU jobs land on GPUs with NVLink | topologySpreadConstraints + node labels
Node affinity by GPU type | Training jobs stay on training nodes, inference on inference nodes | nodeSelector in job spec
Namespace GPU quotas | Prevents one team consuming all cluster GPU capacity | ResourceQuota per namespace

The Kubernetes GPU scheduling docs cover the device plugin setup. The topology constraints require additional node labelling based on your specific hardware layout.
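
As a concrete example, here is a minimal sketch of pinning a 4-GPU job to a single NVLink-connected node. The pod name and image are placeholders; the nvidia.com/gpu.product label is applied by the GPU Operator's node feature discovery, and the right value depends on your hardware.

    # Sketch: keep a 4-GPU training job on one NVLink-connected node.
    # Names and image are placeholders; adjust the product label to
    # whatever node feature discovery reports on your cluster.
    apiVersion: v1
    kind: Pod
    metadata:
      name: multi-gpu-train
    spec:
      restartPolicy: Never
      nodeSelector:
        nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
      containers:
        - name: trainer
          image: registry.example.com/train:2.1.0
          resources:
            limits:
              nvidia.com/gpu: 4   # all four GPUs from the same node

Requesting all four GPUs from one node rules out cross-node placement; whether those GPUs actually share NVLink still depends on the node's physical layout, which is why the label matters.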

Kubeflow Pipelines: the value is not the UI

Kubeflow Pipelines is the right tool for ML pipeline orchestration on Kubernetes. The way it is typically introduced, as a UI-first drag-and-drop system, undersells how you actually get value from it.

The value is not the UI. The value is that every pipeline run is a Kubernetes-native artifact. Every step has its own pod, its own resource request, its own logs, its own retry policy. When step 4 of a 7-step pipeline fails, you rerun from step 4. You do not restart from scratch.

What we configure for clients:

  • Pipeline components defined as Docker containers with explicit input/output contracts. No shared state between steps.
  • Resource requests per component. Data loading step gets high memory and no GPU. Training step gets GPU. Evaluation gets CPU only.
  • Caching enabled for expensive steps. If data preprocessing ran successfully yesterday on the same data hash, skip it.
  • Output artifacts stored in S3 or GCS with versioning. Every run produces traceable lineage from raw data to model artifact.

The pipeline YAML becomes the source of truth for how an ML workflow runs. Version-controlled, reproducible, and debuggable in a way that ad-hoc training scripts are not.
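
To make that concrete, a single component contract in the Kubeflow Pipelines v1 component.yaml format might look like the sketch below. The component name, image, and script path are hypothetical; resource requests and caching settings are attached when the component is wired into a pipeline.

    # Sketch: explicit input/output contract for one pipeline step.
    # All names here are placeholders, not a prescribed layout.
    name: preprocess-data
    description: Deterministic preprocessing step with no shared state
    inputs:
      - {name: raw_data, type: Dataset}
    outputs:
      - {name: processed_data, type: Dataset}
    implementation:
      container:
        image: registry.example.com/preprocess:1.4.2
        command: [python, /app/preprocess.py]
        args:
          - --input-path
          - {inputPath: raw_data}
          - --output-path
          - {outputPath: processed_data}

Because the contract names every input and output, the step can be cached on its input hash and rerun in isolation when it fails.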

KServe for inference, not a custom FastAPI server

The default pattern for ML inference on Kubernetes is to wrap a model in FastAPI, containerise it, and deploy it as a standard Deployment. This works until you need model versioning, canary deployments between model versions, autoscaling based on request rate, or fractional GPU sharing across multiple models.

KServe handles all of this natively:

  • InferenceService CRD: declare your model server, its storage location, and its resource requirements. KServe manages the pod lifecycle.
  • Canary rollouts: route 10% of traffic to a new model version while 90% hits the current one. Promote or roll back based on metrics.
  • Scale to zero: inference pods that receive no traffic scale down and cold-start when requests arrive. For models serving intermittent traffic, this is a significant cost saving.
  • Multi-model serving: pack multiple smaller models onto a single GPU instance via model agent rather than one pod per model.

For most production inference use cases, KServe is worth the setup overhead over a custom deployment.
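
As a sketch, a canary rollout with scale-to-zero can be declared in a single InferenceService manifest. The model name, format, and S3 path below are assumptions for illustration, not a prescribed setup.

    # Sketch: canary rollout plus scale-to-zero in one resource.
    apiVersion: serving.kserve.io/v1beta1
    kind: InferenceService
    metadata:
      name: sentiment-classifier
    spec:
      predictor:
        canaryTrafficPercent: 10   # 10% of traffic to the new revision
        minReplicas: 0             # scale to zero when traffic stops
        model:
          modelFormat:
            name: sklearn
          storageUri: s3://models-bucket/sentiment/v2
          resources:
            limits:
              cpu: "1"
              memory: 2Gi

Once the canary's metrics look healthy, removing canaryTrafficPercent promotes the new revision to 100% of traffic.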

The HPA trap for training workloads

Horizontal Pod Autoscaling works for stateless services. It does not work for training jobs the way people expect.

The trap is applying HPA to training pods. A training job is not a stateless service that scales horizontally on demand. It is a fixed-size distributed job whose worker count is decided at launch based on dataset and model size. Adding a worker mid-run requires the job framework to support it (Horovod with elastic training handles this, but it is not the default).

For training workloads, the right patterns are:

  • Kueue for Kubernetes-native job queuing. Jobs wait in queue, get admitted when resources are available, run to completion (see the sketch after this list).
  • Karpenter for node autoscaling on cloud clusters. When a job requests GPU nodes that do not exist yet, Karpenter provisions them. When the job finishes, the nodes terminate. You pay only for what you use.
  • HPA only on inference services, where horizontal scaling on request rate is the correct model.
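
A minimal sketch of the Kueue pattern, assuming a LocalQueue named team-a-queue already exists in the namespace (job name, image, and worker count are placeholders):

    # Sketch: a fixed-size training job admitted through Kueue.
    apiVersion: batch/v1
    kind: Job
    metadata:
      name: train-run-042
      labels:
        kueue.x-k8s.io/queue-name: team-a-queue
    spec:
      suspend: true        # Kueue unsuspends the job once quota is free
      parallelism: 4       # worker count fixed at launch, never autoscaled
      completions: 4
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: trainer
              image: registry.example.com/train:2.1.0
              resources:
                limits:
                  nvidia.com/gpu: 1

The job is created suspended; Kueue admits it when the queue's quota allows, and on a Karpenter-managed cluster the GPU nodes are provisioned at admission and torn down after completion.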

Monitoring ML workloads on Kubernetes

Standard Kubernetes monitoring covers pod health, resource usage, and node status. It does not cover what matters for ML.

Monitoring need | Tool | What it surfaces
GPU utilisation per pod | NVIDIA DCGM Exporter | SM utilisation, memory, power draw, NVLink bandwidth
Training job progress | Kubeflow Pipeline metrics to Prometheus | Training loss, validation accuracy, custom metrics per run
Inference latency | Prometheus + Grafana | p50, p95, p99 latency per model endpoint
GPU memory OOM events | DCGM Exporter + Alertmanager | CUDA OOM events that do not always crash the pod
Pipeline health | Kubeflow UI + custom dashboards | Step duration, failure rate, cache hit rate

The insight that surprises most teams: GPU memory OOM events in Kubernetes do not always crash the pod. Sometimes the CUDA driver handles the error internally and the job continues but runs 5x slower because it fell back to CPU operations. You only know this happened if you are watching the right metrics.
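
One way to catch that silent fallback is an alert on collapsed GPU utilisation. The sketch below assumes the Prometheus Operator CRDs are installed and that dcgm-exporter is attributing metrics to pods; the 5% threshold and 15-minute window are illustrative, not recommendations.

    # Sketch: alert when a GPU-holding pod's SM utilisation collapses.
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      name: gpu-silent-fallback
    spec:
      groups:
        - name: gpu.rules
          rules:
            - alert: GpuUtilisationCollapsed
              expr: avg by (pod) (DCGM_FI_DEV_GPU_UTIL) < 5
              for: 15m    # sustained, not a transient dip between batches
              labels:
                severity: warning
              annotations:
                summary: "GPU on pod {{ $labels.pod }} is nearly idle"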

Things that require ongoing attention

Storage class selection. NFS-backed PVCs on most cloud providers are slow for large file reads. Use EFS with provisioned throughput on AWS, or go direct to S3 with a dataset library that supports streaming reads. The wrong storage class can make the data pipeline the bottleneck regardless of GPU speed.

Container image size. ML containers with full CUDA toolkit, PyTorch, and model weights can be 15 to 20GB. Use multi-stage builds to separate the CUDA base from the application layer, and cache aggressively in your registry.

Namespace GPU quotas. Without hard quotas, one team's experiment can consume all cluster GPU capacity. Define quotas and build a lightweight process for temporary quota increases.
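
The quota itself is a one-resource manifest; a sketch for a hypothetical team-a namespace:

    # Sketch: hard cap on the GPUs one namespace can request.
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: gpu-quota
      namespace: team-a
    spec:
      hard:
        requests.nvidia.com/gpu: "8"   # raised only via the quota process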

We have been running Kubernetes-based ML infrastructure for clients since before Kubeflow reached 1.0. The platform has matured. The configuration mistakes have not changed much. Our MLOps service covers GPU cluster setup, Kubeflow pipeline deployment, and KServe inference configuration for production workloads.

If you are setting up ML infrastructure on Kubernetes or running into scaling issues with an existing setup, book a free 30-minute call.

