This engagement is under NDA. The client is a research lab at a prestigious UK university. What we can say: Eprecisio deployed a production-grade on-premises Kubeflow platform, has operated it for two years with 100% uptime, performed one major version upgrade without disrupting active research workloads, and manages the full infrastructure layer so data scientists can focus on their work.
## What the lab needed
The lab runs ML research at scale. Data scientists need to train models on large datasets, track experiments, manage pipelines, and collaborate across research teams. The requirement was a self-hosted ML platform that could do all of this on the lab's own infrastructure, without research data transiting external cloud services.
The specific requirements shaped every architectural decision:
| Requirement | Detail | Why it mattered |
|---|---|---|
| On-premises hosting | All compute and data on university-controlled infrastructure | Research data under NDA and data sharing agreements cannot leave controlled infrastructure |
| Azure SSO integration | Single sign-on through the university's existing Azure AD | Researchers use university credentials; no separate account management |
| Custom branding | Platform skinned to the institution's identity | Lab-specific deployment, not a generic Kubeflow instance |
| Data replication strategy | Persistent storage with replication for research databases | Research outputs, model artifacts, and experiment data must be protected against storage failure |
| Zero-friction for researchers | Data scientists access the platform and run work without infrastructure knowledge | The platform exists to remove infrastructure as a barrier to research |
| Long-term stability | Two-plus years of reliable operation | Research projects run for months or years; disruption costs more than it would in a typical product environment |
## The deployment
Kubeflow was deployed on the lab's on-premises Kubernetes cluster. The full Kubeflow component set was configured: Pipelines for workflow orchestration, Notebooks for interactive development, KFServing for model serving, and the central dashboard as the researcher-facing interface.
**Authentication with Azure and Dex.** The university runs Azure Active Directory for identity management. Researchers authenticate to every university system with their institutional credentials. Connecting Kubeflow to that identity layer required configuring Dex as the OIDC connector between Kubeflow's authentication layer and Azure AD. The result: researchers access the ML platform with the same login they use for every other university system, with no separate account and no extra password to manage.
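On the Dex side, this kind of integration comes down to a `microsoft` connector entry in the Dex configuration. The sketch below is illustrative only: the hostname, tenant ID, and group name are placeholders, not the client's actual values.

```yaml
# Dex config fragment (sketch): bridge Kubeflow's OIDC auth to Azure AD.
# All identifiers below are placeholders.
connectors:
  - type: microsoft
    id: microsoft
    name: University Azure AD
    config:
      clientID: $AZURE_CLIENT_ID          # app registration in the tenant
      clientSecret: $AZURE_CLIENT_SECRET
      redirectURI: https://kubeflow.lab.example.ac.uk/dex/callback
      tenant: <azure-tenant-id>           # restrict logins to the university tenant
      groups:                             # only members of these AD groups may log in
        - ml-researchers
```

With `tenant` set, only accounts from the university's directory can authenticate, and the `groups` list both restricts access and causes Dex to emit group claims downstream.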
**Custom branding.** The Kubeflow interface was branded to the institution. The dashboard, login page, and documentation surfaces reflect the lab's identity rather than generic upstream Kubeflow. For a research institution, the tooling presented to researchers and external collaborators carries the institution's credibility.
**Data replication for research databases.** Research databases hold experiment logs, model checkpoints, pipeline artifacts, and dataset versions. The storage layer was configured with replication to protect against single-node storage failure. For a research environment where a corrupted or lost dataset might represent months of collection work, storage reliability is not optional.
| Component | Configuration | Purpose |
|---|---|---|
| Kubeflow Pipelines | On-prem Kubernetes deployment | ML workflow orchestration, reproducible research pipelines |
| Kubeflow Notebooks | JupyterHub-based, resource-managed | Interactive model development for data scientists |
| KFServing | On-prem model serving | Deploy trained models for inference without infrastructure work |
| Dex OIDC | Configured against university Azure AD | Single sign-on with institutional credentials |
| Custom branding | Dashboard, login page, documentation | Institution-specific interface |
| PostgreSQL with replication | Persistent storage for all Kubeflow state | Research data, pipeline metadata, experiment artifacts |
| Kubernetes cluster | On-prem, managed by Eprecisio | Workload management, resource scheduling, automated recovery |
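For the replicated PostgreSQL layer in the table above, one standard approach is PostgreSQL streaming replication. The fragment below is a minimal primary-side sketch with illustrative values, not the client's actual topology or settings.

```ini
# postgresql.conf on the primary (illustrative values only)
wal_level = replica                       # emit enough WAL for a streaming standby
max_wal_senders = 3                       # allowed concurrent replication connections
wal_keep_size = 1GB                       # retain WAL for standbys that fall behind
synchronous_standby_names = 'standby1'    # commits wait for this standby to confirm
synchronous_commit = on                   # no acknowledged-but-unreplicated writes
```

With `synchronous_commit = on` and a named synchronous standby, a committed experiment record or pipeline state change exists on two nodes before the client sees success, which is the property that protects against single-node storage failure.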
## Two years of operations
The measure of a production deployment in a research environment is not the launch. It is what happens over the following months and years as the platform becomes critical infrastructure for active research.
Kubeflow has been running on this infrastructure for two years. In that time there have been no unplanned outages. Data scientists have been able to train and run models without infrastructure-related interruptions to their work.
One major version upgrade was performed over the two-year period. Kubeflow version upgrades involve changes to the underlying Kubernetes operators, the pipeline backend, and the notebook configurations. Doing this without disrupting active research workloads required careful planning: a staging upgrade to validate the new version, a maintenance window coordinated with the research teams, and a rollback plan that was tested before the upgrade began. The upgrade completed without disrupting any active notebooks, pipelines, or model serving deployments.
## What was hard
**Kubeflow on-prem storage for long-running research workloads.** Research pipelines run for hours or days. Checkpoint storage, intermediate outputs, and experiment artifacts accumulate. Configuring persistent volume management that handled this gracefully, without filling storage unexpectedly or losing data on pod restarts, required more careful tuning than a typical application deployment.
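In Kubernetes terms, the two failure modes above map to two levers: a `Retain` reclaim policy so volume data survives accidental claim deletion, and explicit capacity requests so usage stays visible and bounded. A generic sketch, with hypothetical names and sizes:

```yaml
# Illustrative manifests; names, provisioner, and sizes are placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: research-retain
provisioner: kubernetes.io/no-provisioner   # or the cluster's on-prem CSI driver
reclaimPolicy: Retain                       # keep the underlying volume and its data
volumeBindingMode: WaitForFirstConsumer     # bind only where the pod is scheduled
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: experiment-artifacts
  namespace: vision-team
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: research-retain
  resources:
    requests:
      storage: 500Gi                        # explicit, auditable capacity
```

Pairing claims like this with per-namespace `ResourceQuota` objects is one way to keep long-running pipelines from filling shared storage unexpectedly.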
**Dex and Azure AD integration edge cases.** University Azure AD tenants have specific configurations: group membership policies, conditional access rules, and token lifetimes that differ from a standard Azure deployment. Getting Dex to pass group membership claims through to Kubeflow correctly, so that multi-tenancy and namespace isolation worked for different research teams, took careful configuration and testing with real university accounts.
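The namespace-isolation side of this is Kubeflow's `Profile` resource: one Profile per research group creates a namespace with RBAC scoped to its owner, keyed to the identity Dex asserts. A sketch with hypothetical names:

```yaml
# Hypothetical team and user; a Profile per research group gives each
# group its own isolated namespace in Kubeflow.
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: vision-team                 # also becomes the namespace name
spec:
  owner:
    kind: User
    name: lead.researcher@university.ac.uk   # must match the email claim from Dex
```

The subtlety the section describes lives in the `name` field: it must match the claim Azure AD emits and Dex forwards exactly, which is why testing with real university accounts, rather than synthetic ones, mattered.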
**Upgrading without disrupting active research.** Research workloads do not pause conveniently. Data scientists run training jobs that last days. Coordinating a major version upgrade around active workloads required mapping every running job and notebook, understanding which could be safely interrupted and which could not, and planning the maintenance sequence accordingly.
## Results
| Metric | Outcome |
|---|---|
| Uptime | 100% over 2 years of operation |
| Version upgrades | 1 major upgrade completed without disrupting active workloads |
| Authentication | Azure SSO via Dex, researchers use institutional credentials |
| Data sovereignty | All research data and model artifacts on university-controlled infrastructure |
| Researcher experience | Train and run ML models without infrastructure knowledge or IT involvement |
| Storage | Replicated research databases, protected against storage failure |
| Branding | Custom institutional identity across all researcher-facing surfaces |
## What changed for the research team
Before this platform, data scientists needed IT involvement to provision compute for training jobs, access was managed through separate credentials, and there was no unified interface for managing experiments, pipelines, and model deployments. Research infrastructure was a recurring friction point.
Two years later, the lab's data scientists open a browser, authenticate with their university credentials, launch a notebook or pipeline, and run their work. The infrastructure layer is invisible to them. That is what the platform was built to achieve, and it has operated that way since deployment.
For on-premises ML infrastructure built for long-term stability in research or regulated environments, see our MLOps service.
If you need a production ML platform on your own infrastructure with zero data sovereignty compromises, book a free 30-minute call.