
MLOps Tools We Actually Use in Production and Why We Picked Them

March 21, 2026 · 7 min read

There is no shortage of MLOps tool comparison posts. Most of them are written by people who have read the documentation, not run the tools under real production load.

This is the toolchain we actually use across our ML infrastructure engagements, why we chose each component over the alternatives, and where each one has let us down.

What we mean by production MLOps

Before listing tools, it is worth being specific about the problem. MLOps covers a range of concerns that are easy to conflate:

| Concern | What it means | Tools in this space |
|---|---|---|
| Training pipeline orchestration | Scheduling, dependency management, retries, resource allocation | Kubeflow Pipelines, Airflow, Prefect, Argo Workflows |
| Experiment tracking | Logging metrics, parameters, and artifacts per run | MLflow, Weights and Biases, Neptune |
| Model registry | Versioning, staging, and promoting models to production | MLflow Registry, Vertex AI |
| Serving infrastructure | Deploying models as inference endpoints with autoscaling | KServe, Seldon, custom FastAPI |
| Monitoring | Model performance, data drift, and infrastructure health post-deployment | Prometheus, Grafana, Evidently |

Different tools own different parts of this. Picking one tool and assuming it covers everything, or picking five tools that duplicate each other, are both common mistakes.

Kubeflow Pipelines for orchestration

What it does: Kubeflow Pipelines lets you define ML workflows as directed acyclic graphs where each step runs in its own Kubernetes pod.

Why we chose it over alternatives: We evaluated Airflow, Prefect, and Argo Workflows before settling on Kubeflow Pipelines. Airflow is better suited to general data engineering. It was not designed for ML and the ML-specific features feel grafted on. Prefect is excellent for Python-native teams but does not have the same depth of Kubernetes integration. Argo Workflows is closer (Kubeflow Pipelines actually uses Argo Workflows under the hood) but requires more manual configuration for ML-specific use cases.

Kubeflow Pipelines wins because it treats ML artifacts as first-class citizens. Input datasets, output models, and evaluation metrics are tracked automatically per run with a lineage graph. When a model in production degrades, you trace it back to the exact run, the exact data version, and the exact hyperparameters.

Where it falls short: The Python SDK has a learning curve and the documentation has not kept pace with API changes between v1 and v2. Plan for a few days of setup before things feel natural.

MLflow for experiment tracking and model registry

What it does: MLflow tracks parameters, metrics, and artifacts during training runs. The model registry manages versioning and lifecycle from staging to production.

Why we chose it: We considered Weights and Biases and Neptune. Both are excellent, particularly W&B for visualisation. But they are SaaS products with per-seat pricing that scales up fast, and they require sending training data and model metadata to an external service. That is a non-starter for clients with data residency requirements.

MLflow is self-hosted, open source, and integrates with every major framework. The model registry works cleanly with our Kubeflow deployment: the pipeline logs metrics to MLflow and pushes the final model artifact to the registry. KServe can read MLflow model artifacts directly if you use the correct model flavor, so when a training run completes, KServe picks up the new model from the registry with no manual step in between.

Where it falls short: The MLflow UI is functional but not beautiful. If your team cares about experiment comparison visualisation, W&B is genuinely better. For production deployments where the UI is not the daily driver, MLflow is the right call.

ArgoCD for GitOps deployment

What it does: ArgoCD syncs Kubernetes cluster state with a Git repository. You define what should be running in Git, and ArgoCD ensures the cluster matches it.

Why we chose it: For ML infrastructure specifically, every configuration change is a Git commit. A new model version goes from staging to canary to production. Each transition is a configuration change. With ArgoCD, every change is auditable, reversible, and reviewable via pull request. The review and merge process becomes the deployment approval process.
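As a sketch, a single Application manifest is enough to put a serving namespace under GitOps control; the repository URL and paths below are hypothetical:

```yaml
# Illustrative ArgoCD Application; repo URL and paths are hypothetical.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: model-serving
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/ml-gitops.git
    targetRevision: main
    path: serving/production
  destination:
    server: https://kubernetes.default.svc
    namespace: model-serving
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert manual drift back to the Git-defined state
```

With selfHeal enabled, a kubectl edit made directly against the cluster is reverted on the next sync, which is exactly the property you want when the Git history is your audit trail.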

Where it falls short: ArgoCD adds operational overhead. You maintain a GitOps repository structure, manage sync policies, and occasionally manually intervene when drift occurs. For teams not already comfortable with Kubernetes YAML, the initial setup takes time.

Prometheus and Grafana for infrastructure monitoring

What it does: Prometheus scrapes metrics from infrastructure and applications. Grafana visualises them and evaluates alert rules.

Why we chose it: The kube-prometheus-stack Helm chart gives you Prometheus, Alertmanager, Grafana, and pre-built Kubernetes dashboards in a single deployment. We extend it with the NVIDIA DCGM Exporter for GPU metrics per pod and custom metrics exported from Kubeflow Pipelines and MLflow. One Grafana instance shows cluster health, GPU utilisation per training job, pipeline execution times, and model serving latency.
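As a sketch of the custom-metrics side, using the official prometheus_client library; the metric name and label are illustrative:

```python
# Sketch: expose a custom pipeline metric for Prometheus to scrape.
# Metric name, label, and value are illustrative.
from prometheus_client import Gauge

PIPELINE_DURATION = Gauge(
    "ml_pipeline_duration_seconds",
    "Wall-clock duration of the most recent pipeline run",
    ["pipeline"],
)


def record_run(pipeline_name: str, seconds: float) -> None:
    # Set the gauge for this pipeline; Prometheus reads the current
    # value on each scrape interval.
    PIPELINE_DURATION.labels(pipeline=pipeline_name).set(seconds)


record_run("nightly-training", 812.4)
```

In a real deployment you would also call start_http_server (or use the pushgateway for short-lived jobs) so Prometheus has an endpoint to scrape.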

The dashboard we build for every ML client: an overlay of training job runs, deployments, and production anomalies on a shared time axis. When a model in production starts degrading, the first question is "what changed?" This dashboard makes that answerable in under a minute.

Where it falls short: Prometheus is a pull-based system with local storage. For long-term metric retention, configure remote write to Thanos or VictoriaMetrics. The default 15-day retention is enough for operational monitoring but not for ML trend analysis over months.
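A minimal remote_write sketch, assuming a Thanos receive endpoint at a hypothetical in-cluster address; the relabel rule keeps only custom ML metrics to limit write volume:

```yaml
# Illustrative Prometheus remote_write config; the endpoint URL and
# metric-name prefix are hypothetical.
remote_write:
  - url: http://thanos-receive.monitoring.svc:19291/api/v1/receive
    write_relabel_configs:
      # Forward only custom ML metrics for long-term retention.
      - source_labels: [__name__]
        regex: "ml_.*"
        action: keep
```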

Tools we evaluated and did not pick

| Tool | Why we evaluated it | Why we did not pick it |
|---|---|---|
| Airflow | Most widely used data pipeline tool | Not designed for ML; ML features feel grafted on. |
| Weights and Biases | Best visualisation in the category | SaaS pricing scales fast; data residency issues for some clients. |
| Seldon Core | Rich A/B testing and multi-armed bandit features | KServe is more actively maintained, with better Kubernetes operator integration. |
| Feast (feature store) | Good for complex feature engineering shared across models | Overhead not justified for most clients we work with. |
| DVC | Dataset versioning | Good tool, but for clients already on S3 with bucket versioning enabled, the overhead is not justified. |

The right MLOps toolchain depends on your team size, data residency requirements, and how much operational overhead you can absorb. The stack above works well for production ML at startup and mid-market scale. Our MLOps service covers the full setup from Kubeflow deployment to KServe inference configuration and ongoing management.

If you are evaluating MLOps toolchains or struggling with an existing setup, book a free 30-minute call.
