MLOps

GPU Workload Optimization: What Actually Moves the Needle

March 14, 2026 · 6 min read

Most GPU optimization guides tell you to tune your batch size and enable mixed precision training. That is useful, but it is also table stakes. Every ML engineer knows that.

What actually costs teams time and money is what happens before the model starts training, and the infrastructure decisions made months earlier that nobody revisits. We manage GPU clusters for AI clients across AWS, on-prem bare metal, and hybrid setups. These are the things we actually find and fix when doing a GPU infrastructure review.

The real problem is usually not the GPU

The most common situation we see: GPU utilisation sitting at 30 to 40% while the team assumes the model is just slow. You check nvidia-smi and the numbers look plausible. But the GPU is waiting.

It is almost always one of three things: the data pipeline cannot keep up, memory is being mismanaged between jobs, or the cluster scheduler is making poor placement decisions. None of these show up obviously in standard monitoring until you know where to look.

| Root cause | Symptom | What it looks like in metrics |
| --- | --- | --- |
| Data pipeline bottleneck | GPU idle between batches | Low SM utilisation, normal memory usage |
| Memory fragmentation between jobs | Next job starts slow or fails | GPU memory allocation errors or OOM in logs |
| Bad scheduler placement (multi-GPU) | Training slower than expected | Correct utilisation but poor throughput |
| Wrong GPU type for workload | High cost, low performance | A100 running inference at 20% utilisation |

Data loading is where most training time gets lost

A GPU processes batches faster than most data pipelines supply them. If your data loading is sequential, your GPU is idle for a significant fraction of every training step.

The fix is straightforward but frequently skipped; a minimal configuration sketch follows the list:

  • Set num_workers greater than 0 in PyTorch DataLoader. Start at the number of available CPU cores and tune from there.
  • Enable pin_memory=True when training on GPU. This speeds up CPU to GPU memory transfer.
  • Use NVIDIA DALI for heavy image preprocessing. It moves decoding and augmentation onto the GPU, taking most of the preprocessing load off the CPU.
  • Prefetch aggressively. The next batch should always be loading while the current one is training.
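
Putting those settings together, a minimal sketch of the DataLoader configuration looks like this. The dataset is a random stand-in so the example is self-contained, and the batch size, num_workers, and prefetch_factor values are starting points to tune, not recommendations:

```python
import os

import torch
from torch.utils.data import DataLoader, TensorDataset

# Stand-in dataset so the example runs on its own; swap in your real Dataset.
dataset = TensorDataset(
    torch.randn(2_048, 3, 64, 64),
    torch.randint(0, 10, (2_048,)),
)

# On platforms that spawn worker processes (Windows, macOS), wrap this in
# an `if __name__ == "__main__":` guard.
loader = DataLoader(
    dataset,
    batch_size=256,
    shuffle=True,
    num_workers=os.cpu_count() or 4,  # start at the CPU core count, tune down if workers contend
    pin_memory=True,                  # page-locked host memory, faster host-to-GPU copies
    prefetch_factor=4,                # each worker keeps 4 batches queued ahead of the GPU
    persistent_workers=True,          # keep workers alive across epochs instead of re-forking
)

for images, labels in loader:
    # non_blocking=True overlaps the copy with compute, but only when pin_memory is set.
    images = images.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    # ... forward / backward / optimiser step ...
```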

We benchmarked this on a client's image classification pipeline. Going from a single-worker DataLoader to a properly configured multi-worker setup with DALI cut per-epoch time by 38% without touching the model.

Kubernetes GPU scheduling: where clusters quietly waste money

If you are running GPU workloads on Kubernetes and have not revisited your scheduling configuration, there is a good chance capacity is being wasted.

Specific issues we see most often:

No GPU resource limits set. A pod requests one GPU and uses it correctly, but another pod gets scheduled to the same node and competes for GPU time. Set explicit nvidia.com/gpu resource requests and limits on every GPU workload.

Jobs not using node affinity. If you have mixed GPU types in your cluster (training nodes and inference nodes), jobs should be pinned to the right pool. A small inference job landing on an A100 training node is money wasted. See the Kubernetes GPU scheduling docs for the right configuration.

No GPU time-sharing for inference. NVIDIA MIG on A100 and H100 lets you partition a single GPU into isolated instances. For inference workloads that do not saturate a full GPU, MIG can run 4 to 7 inference pods on a single card that would otherwise sit at 20 to 30% utilisation.
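
The first three of these come down to a handful of fields on the pod spec. Most teams write this as YAML manifests; purely as an illustration, here is the same idea with the official kubernetes Python client. The gpu-pool=training node label, the namespace, and the trainer image are placeholders, not conventions from this post:

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running inside the cluster

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="resnet-train"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        # Pin the job to the training pool; assumes nodes are labelled gpu-pool=training.
        # Full node affinity rules achieve the same thing with more expressive matching.
        node_selector={"gpu-pool": "training"},
        containers=[
            client.V1Container(
                name="trainer",
                image="registry.example.com/trainer:latest",  # hypothetical image
                resources=client.V1ResourceRequirements(
                    # Explicit GPU request and limit so the scheduler never co-locates
                    # another GPU consumer on this device.
                    requests={"nvidia.com/gpu": "1"},
                    limits={"nvidia.com/gpu": "1"},
                    # With the NVIDIA device plugin in its mixed MIG strategy, an inference
                    # pod would request a slice instead, e.g. {"nvidia.com/mig-1g.5gb": "1"}.
                ),
            )
        ],
    ),
)

client.CoreV1Api().create_namespaced_pod(namespace="training", body=pod)
```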

Missing resource quotas per namespace. Without quotas, a single pipeline can consume all cluster GPU capacity. Namespace-level GPU quotas prevent this without requiring manual intervention.
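
A namespace-level GPU quota is a single ResourceQuota object. Sketched again with the Python client, under the assumption of a team-nlp namespace and an 8-GPU cap (both placeholders); note that for extended resources like nvidia.com/gpu, Kubernetes quotas only support the requests. prefix:

```python
from kubernetes import client, config

config.load_kube_config()

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="gpu-quota", namespace="team-nlp"),
    spec=client.V1ResourceQuotaSpec(
        # Caps the total GPUs all pods in this namespace can request at once.
        hard={"requests.nvidia.com/gpu": "8"},
    ),
)

client.CoreV1Api().create_namespaced_resource_quota(namespace="team-nlp", body=quota)
```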

Mixed precision: real gains, but do it correctly

FP16 training via PyTorch AMP or TensorFlow mixed precision is genuinely impactful on Volta, Turing, Ampere, and Ada architectures. You get roughly 2x throughput on matrix operations and half the memory footprint. That means larger batch sizes.

The detail that trips people up is loss scaling. When training in FP16, small gradient values can underflow to zero. AMP handles this automatically with dynamic loss scaling, but watch for:

  • Loss scale being cut repeatedly toward zero: gradients are overflowing, usually a sign the learning rate is too high
  • NaN losses early in training: usually a poorly initialised layer that benefits from FP32 warmup
  • Models with batch normalisation layers that need careful handling at reduced precision

For most standard architectures, AMP just works. Enable it, verify the loss curve is stable in the first 100 steps, and move on.
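
A minimal AMP training loop looks like the sketch below. The model and optimiser are stand-ins (the loader is the one from the data-loading sketch above), and the finite-loss check mirrors the first-100-steps advice:

```python
import torch
import torch.nn.functional as F

# Stand-in model and optimiser; reuse the loader from the data-loading sketch above.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 64 * 64, 10)).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()  # dynamic loss scaling: grows the scale, backs off on overflow

for step, (x, y) in enumerate(loader):
    x, y = x.cuda(non_blocking=True), y.cuda(non_blocking=True)
    optimizer.zero_grad(set_to_none=True)

    with torch.cuda.amp.autocast():  # matmuls and convs run in FP16, reductions stay in FP32
        loss = F.cross_entropy(model(x), y)

    scaler.scale(loss).backward()  # scale the loss so small gradients do not underflow in FP16
    scaler.step(optimizer)         # unscales first, skips the step if gradients overflowed
    scaler.update()                # adjust the loss scale for the next step

    if step < 100 and not torch.isfinite(loss):
        raise RuntimeError("Non-finite loss in the first 100 steps: check init and learning rate")
```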

Memory management between jobs

On shared clusters, GPU memory fragmentation between jobs is a silent performance problem.

When a training job finishes, PyTorch does not always release memory back to the OS immediately. The next job starts, requests memory, and may fail or run slowly because the allocator is working around fragmented blocks.

Two things that help:

  • Call torch.cuda.empty_cache() at the end of training scripts. This releases the allocator's cached blocks so the next job starts clean.
  • If you are using Kubernetes job queues, add a brief pod cleanup delay between jobs on the same node. Even 30 seconds is enough for the CUDA context to fully release.
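
The first of those is a one-liner, but it only helps once Python references to the tensors are gone. A small end-of-script helper, as an illustrative sketch:

```python
import gc

import torch

def release_gpu_memory() -> None:
    """Best-effort cleanup at the end of a training script so the next job starts clean."""
    gc.collect()                           # drop lingering Python references to tensors first
    torch.cuda.synchronize()               # wait for in-flight kernels that still hold memory
    torch.cuda.empty_cache()               # return the allocator's cached blocks to the driver
    torch.cuda.reset_peak_memory_stats()   # optional: reset counters if you log peak usage per job

if __name__ == "__main__":
    try:
        ...  # train()
    finally:
        release_gpu_memory()
```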

Profiling first, optimising second

| Tool | What it surfaces | When to use it |
| --- | --- | --- |
| NVIDIA Nsight Systems | CPU/GPU overlap, data transfer time, kernel execution | Start here for any unknown bottleneck |
| PyTorch Profiler | Operator-level breakdown, TensorBoard integration | Bottlenecks in specific model layers |
| nvidia-smi dmon | Streaming GPU metrics at 1-second intervals | Long-running jobs, catch utilisation drops over time |
| NVIDIA DCGM Exporter | SM utilisation, memory, temperature, NVLink per pod | Continuous production monitoring across the cluster |
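
Of these, the PyTorch profiler is the quickest to wire into an existing training script. A minimal sketch, where loader and train_step are placeholders for your own code and the schedule keeps the trace small by profiling only a handful of steps:

```python
from torch.profiler import ProfilerActivity, profile, schedule, tensorboard_trace_handler

prof = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),  # trace 3 steps after a short warmup
    on_trace_ready=tensorboard_trace_handler("./profiler_logs"),
    record_shapes=True,
)

with prof:
    for step, (x, y) in enumerate(loader):  # loader: your existing DataLoader
        train_step(x, y)                    # train_step: your existing forward/backward/step
        prof.step()                         # tell the profiler a training step has finished
        if step >= 6:
            break

# Quick look without TensorBoard: where did CUDA time actually go?
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```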

The metric that matters most in production is SM (streaming multiprocessor) utilisation, not overall GPU utilisation. A GPU can show 70% overall utilisation while SMs sit at 30% if memory bandwidth is the bottleneck. These are different problems with different fixes.

The infrastructure decision that matters most

All tuning aside, wrong hardware for the workload undoes everything:

Using A100s for inference. A100 is optimised for training. For inference, an A10G or L4 typically delivers better throughput per dollar. If A100s are running inference 24/7, you are paying for training capability you are not using.

On-prem clusters with no burst path. If on-prem GPU capacity hits its limit, jobs queue. Setting up EKS with Karpenter and GPU node pools gives you on-demand burst without a permanent cloud commitment.

GPU optimization is a review process, not a one-time task. The cluster configured for a specific workload six months ago is probably running three different workloads now, and none of the original sizing assumptions still hold. A quarterly review takes half a day and almost always surfaces at least one real improvement.

Our MLOps service covers GPU cluster audits, Kubernetes GPU scheduling configuration, and ongoing infrastructure management for AI workloads.

If you are running GPU clusters and have not reviewed your scheduling configuration recently, book a free 30-minute call.


AWS · Kubernetes · Docker · Terraform · Python · React · ArgoCD · Prometheus