
MLOps Tools We Actually Use in Production and Why We Picked Them

March 21, 2026 · 7 min read

There is no shortage of MLOps tool comparison posts. Most of them are written by people who have read the documentation, not run the tools under real production load.

This is the toolchain we actually use across our ML infrastructure engagements, why we chose each component over the alternatives, and where each one has let us down.

What we mean by production MLOps

Before listing tools, it is worth being specific about the problem. MLOps covers a range of concerns that are easy to conflate:

| Concern | What it means | Tools in this space |
|---|---|---|
| Training pipeline orchestration | Scheduling, dependency management, retries, resource allocation | Kubeflow Pipelines, Airflow, Prefect, Argo Workflows |
| Experiment tracking | Logging metrics, parameters, and artifacts per run | MLflow, Weights and Biases, Neptune |
| Model registry | Versioning, staging, and promoting models to production | MLflow Registry, Vertex AI |
| Serving infrastructure | Deploying models as inference endpoints with autoscaling | KServe, Seldon, custom FastAPI |
| Monitoring | Model performance, data drift, and infrastructure health post-deployment | Prometheus, Grafana, Evidently |

Different tools own different parts of this. Picking one tool and assuming it covers everything, or picking five tools that duplicate each other, are both common mistakes.

Kubeflow Pipelines for orchestration

What it does: Kubeflow Pipelines lets you define ML workflows as directed acyclic graphs where each step runs in its own Kubernetes pod.

Why we chose it over alternatives: We evaluated Airflow, Prefect, and Argo Workflows before settling on Kubeflow Pipelines. Airflow is better suited to general data engineering. It was not designed for ML and the ML-specific features feel grafted on. Prefect is excellent for Python-native teams but does not have the same depth of Kubernetes integration. Argo Workflows is closer (Kubeflow Pipelines actually uses Argo Workflows under the hood) but requires more manual configuration for ML-specific use cases.

Kubeflow Pipelines wins because it treats ML artifacts as first-class citizens. Input datasets, output models, and evaluation metrics are tracked automatically per run with a lineage graph. When a model in production degrades, you trace it back to the exact run, the exact data version, and the exact hyperparameters.

Where it falls short: The Python SDK has a learning curve and the documentation has not kept pace with API changes between v1 and v2. Plan for a few days of setup before things feel natural.

MLflow for experiment tracking and model registry

What it does: MLflow tracks parameters, metrics, and artifacts during training runs. The model registry manages versioning and lifecycle from staging to production.

Why we chose it: We considered Weights and Biases and Neptune. Both are excellent, particularly W&B for visualisation. But they are SaaS products with per-seat pricing that scales up fast, and they require sending training data and model metadata to an external service. That is a non-starter for clients with data residency requirements.

MLflow is self-hosted, open source, and integrates with every major framework. The model registry works cleanly with our Kubeflow deployment: the pipeline logs metrics to MLflow and pushes the final model artifact to the registry. KServe can read MLflow model artifacts directly if you use the correct model flavor, so when a training run completes, KServe picks up the new model from the registry with no manual step in between.

Where it falls short: The MLflow UI is functional but not beautiful. If your team cares about experiment comparison visualisation, W&B is genuinely better. For production deployments where the UI is not the daily driver, MLflow is the right call.

ArgoCD for GitOps deployment

What it does: ArgoCD syncs Kubernetes cluster state with a Git repository. You define what should be running in Git, and ArgoCD ensures the cluster matches it.

Why we chose it: For ML infrastructure specifically, every configuration change is a Git commit. A new model version goes from staging to canary to production. Each transition is a configuration change. With ArgoCD, every change is auditable, reversible, and reviewable via pull request. The review and merge process becomes the deployment approval process.
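As a sketch, a single Application manifest is enough to put a serving namespace under GitOps control; the repository URL and paths below are hypothetical:

```yaml
# Illustrative ArgoCD Application; repo URL and paths are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: model-serving
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/ml-gitops.git
    targetRevision: main
    path: serving/production
  destination:
    server: https://kubernetes.default.svc
    namespace: model-serving
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual drift back to the Git-defined state
```

With selfHeal enabled, a kubectl edit made directly against the cluster is reverted on the next sync, which is exactly the property you want when the Git history is your audit trail.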

Where it falls short: ArgoCD adds operational overhead. You maintain a GitOps repository structure, manage sync policies, and occasionally manually intervene when drift occurs. For teams not already comfortable with Kubernetes YAML, the initial setup takes time.

Prometheus and Grafana for infrastructure monitoring

What it does: Prometheus scrapes metrics from infrastructure and applications. Grafana visualises them and evaluates alert rules.

Why we chose it: The kube-prometheus-stack Helm chart gives you Prometheus, Alertmanager, Grafana, and pre-built Kubernetes dashboards in a single deployment. We extend it with the NVIDIA DCGM Exporter for GPU metrics per pod and custom metrics exported from Kubeflow Pipelines and MLflow. One Grafana instance shows cluster health, GPU utilisation per training job, pipeline execution times, and model serving latency.
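As a sketch of the custom-metrics side, using the official prometheus_client library; the metric name and label are illustrative:

```python
# Sketch: expose a custom pipeline metric for Prometheus to scrape.
# Metric name, label, and value are illustrative.
from prometheus_client import Gauge

PIPELINE_DURATION = Gauge(
    "ml_pipeline_duration_seconds",
    "Wall-clock duration of the most recent pipeline run",
    ["pipeline"],
)


def record_run(pipeline_name: str, seconds: float) -> None:
    # Set the gauge for this pipeline; Prometheus reads the current
    # value on each scrape interval.
    PIPELINE_DURATION.labels(pipeline=pipeline_name).set(seconds)


record_run("nightly-training", 812.4)
```

In a real deployment you would also call start_http_server (or use the pushgateway for short-lived jobs) so Prometheus has an endpoint to scrape.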

The dashboard we build for every ML client: an overlay of training job runs, deployments, and production anomalies on a shared time axis. When a model in production starts degrading, the first question is "what changed?" This dashboard makes that answerable in under a minute.

Where it falls short: Prometheus is a pull-based system with local storage. For long-term metric retention, configure remote write to Thanos or VictoriaMetrics. The default 15-day retention is enough for operational monitoring but not for ML trend analysis over months.
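A minimal remote_write sketch, assuming a Thanos receive endpoint at a hypothetical in-cluster address; the relabel rule keeps only custom ML metrics to limit write volume:

```yaml
# Illustrative Prometheus remote_write config; the endpoint URL and
# metric-name prefix are hypothetical.
remote_write:
  - url: http://thanos-receive.monitoring.svc:19291/api/v1/receive
    write_relabel_configs:
      # Forward only custom ML metrics for long-term retention.
      - source_labels: [__name__]
        regex: "ml_.*"
        action: keep
```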

Tools we evaluated and did not pick

| Tool | Why we evaluated it | Why we did not pick it |
|---|---|---|
| Airflow | Most widely used data pipeline tool | Not designed for ML; ML features feel grafted on. |
| Weights and Biases | Best visualisation in the category | SaaS pricing scales fast; data residency issues for some clients. |
| Seldon Core | Rich A/B testing and multi-armed bandit features | KServe is more actively maintained, with better Kubernetes operator integration. |
| Feast (feature store) | Good for complex feature engineering shared across models | Overhead not justified for most clients we work with. |
| DVC | Dataset versioning | Good tool, but for clients already on S3 with bucket versioning enabled, the overhead is not justified. |

The right MLOps toolchain depends on your team size, data residency requirements, and how much operational overhead you can absorb. The stack above works well for production ML at startup and mid-market scale. Our MLOps service covers the full setup from Kubeflow deployment to KServe inference configuration and ongoing management.

If you are evaluating MLOps toolchains or struggling with an existing setup, book a free 30-minute call.
