This engagement is under NDA. The client is a research lab at a prestigious UK university. What we can say: Eprecisio deployed a production-grade on-premises Kubeflow platform, has operated it for two years with 100% uptime, performed one major version upgrade without disrupting active research workloads, and manages the full infrastructure layer so data scientists can focus on their work.
## What the lab needed
The lab runs ML research at scale. Data scientists need to train models on large datasets, track experiments, manage pipelines, and collaborate across research teams. The requirement was a self-hosted ML platform that could do all of this on the lab's own infrastructure, without research data transiting external cloud services.
The specific requirements shaped every architectural decision:
| Requirement | Detail | Why it mattered |
|---|---|---|
| On-premises hosting | All compute and data on university-controlled infrastructure | Research data under NDA and data sharing agreements cannot leave controlled infrastructure |
| Azure SSO integration | Single sign-on through the university's existing Azure AD | Researchers use university credentials; no separate account management |
| Custom branding | Platform skinned to the institution's identity | Lab-specific deployment, not a generic Kubeflow instance |
| Data replication strategy | Persistent storage with replication for research databases | Research outputs, model artifacts, and experiment data must be protected against storage failure |
| Zero-friction for researchers | Data scientists access the platform and run work without infrastructure knowledge | The platform exists to remove infrastructure as a barrier to research |
| Long-term stability | Two-plus years of reliable operation | Research projects run for months or years; disruption costs more than it would in a typical product environment |
## The deployment
Kubeflow was deployed on the lab's on-premises Kubernetes cluster. The full Kubeflow component set was configured: Pipelines for workflow orchestration, Notebooks for interactive development, KFServing for model serving, and the central dashboard as the researcher-facing interface.
**Authentication with Azure and Dex.** The university runs Azure Active Directory for identity management. Researchers authenticate to every university system with their institutional credentials. Connecting Kubeflow to that identity layer required configuring Dex as the OIDC connector between Kubeflow's authentication layer and Azure AD. The result: researchers access the ML platform with the same login they use for every other university system, with no separate account and no extra password to manage.
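On the Dex side, this kind of integration comes down to a `microsoft` connector entry in the Dex configuration. The sketch below is illustrative only: the hostname, tenant ID, and group name are placeholders, not the client's actual values.

```yaml
# Dex config fragment (sketch): bridge Kubeflow's OIDC auth to Azure AD.
# All identifiers below are placeholders.
connectors:
  - type: microsoft
    id: microsoft
    name: University Azure AD
    config:
      clientID: $AZURE_CLIENT_ID          # app registration in the tenant
      clientSecret: $AZURE_CLIENT_SECRET
      redirectURI: https://kubeflow.lab.example.ac.uk/dex/callback
      tenant: <azure-tenant-id>           # restrict logins to the university tenant
      groups:                             # only members of these AD groups may log in
        - ml-researchers
```

With `tenant` set, only accounts from the university's directory can authenticate, and the `groups` list both restricts access and causes Dex to emit group claims downstream.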
**Custom branding.** The Kubeflow interface was branded to the institution. The dashboard, login page, and documentation surfaces reflect the lab's identity rather than generic upstream Kubeflow. For a research institution, the tooling presented to researchers and external collaborators carries the institution's credibility.
**Data replication for research databases.** Research databases hold experiment logs, model checkpoints, pipeline artifacts, and dataset versions. The storage layer was configured with replication to protect against single-node storage failure. For a research environment where a corrupted or lost dataset might represent months of collection work, storage reliability is not optional.
| Component | Configuration | Purpose |
|---|---|---|
| Kubeflow Pipelines | On-prem Kubernetes deployment | ML workflow orchestration, reproducible research pipelines |
| Kubeflow Notebooks | JupyterHub-based, resource-managed | Interactive model development for data scientists |
| KFServing | On-prem model serving | Deploy trained models for inference without infrastructure work |
| Dex OIDC | Configured against university Azure AD | Single sign-on with institutional credentials |
| Custom branding | Dashboard, login page, documentation | Institution-specific interface |
| PostgreSQL with replication | Persistent storage for all Kubeflow state | Research data, pipeline metadata, experiment artifacts |
| Kubernetes cluster | On-prem, managed by Eprecisio | Workload management, resource scheduling, automated recovery |
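For the replicated PostgreSQL layer in the table above, one standard approach is PostgreSQL streaming replication. The fragment below is a minimal primary-side sketch with illustrative values, not the client's actual topology or settings.

```ini
# postgresql.conf on the primary (illustrative values only)
wal_level = replica                       # emit enough WAL for a streaming standby
max_wal_senders = 3                       # allowed concurrent replication connections
wal_keep_size = 1GB                       # retain WAL for standbys that fall behind
synchronous_standby_names = 'standby1'    # commits wait for this standby to confirm
synchronous_commit = on                   # no acknowledged-but-unreplicated writes
```

With `synchronous_commit = on` and a named synchronous standby, a committed experiment record or pipeline state change exists on two nodes before the client sees success, which is the property that protects against single-node storage failure.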
## Two years of operations
The measure of a production deployment in a research environment is not the launch. It is what happens over the following months and years as the platform becomes critical infrastructure for active research.
Kubeflow has been running on this infrastructure for two years. In that time there have been no unplanned outages. Data scientists have been able to train and run models without infrastructure-related interruptions to their work.
One major version upgrade was performed over the two-year period. Kubeflow version upgrades involve changes to the underlying Kubernetes operators, the pipeline backend, and the notebook configurations. Doing this without disrupting active research workloads required careful planning: a staging upgrade to validate the new version, a maintenance window coordinated with the research teams, and a rollback plan that was tested before the upgrade began. The upgrade completed without disrupting any active notebooks, pipelines, or model serving deployments.
## What was hard
**Kubeflow on-prem storage for long-running research workloads.** Research pipelines run for hours or days. Checkpoint storage, intermediate outputs, and experiment artifacts accumulate. Configuring persistent volume management that handled this gracefully, without filling storage unexpectedly or losing data on pod restarts, required more careful tuning than a typical application deployment.
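In Kubernetes terms, the two failure modes above map to two levers: a `Retain` reclaim policy so volume data survives accidental claim deletion, and explicit capacity requests so usage stays visible and bounded. A generic sketch, with hypothetical names and sizes:

```yaml
# Illustrative manifests; names, provisioner, and sizes are placeholders.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: research-retain
provisioner: kubernetes.io/no-provisioner   # or the cluster's on-prem CSI driver
reclaimPolicy: Retain                       # keep the underlying volume and its data
volumeBindingMode: WaitForFirstConsumer     # bind only where the pod is scheduled
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: experiment-artifacts
  namespace: vision-team
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: research-retain
  resources:
    requests:
      storage: 500Gi                        # explicit, auditable capacity
```

Pairing claims like this with per-namespace `ResourceQuota` objects is one way to keep long-running pipelines from filling shared storage unexpectedly.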
**Dex and Azure AD integration edge cases.** University Azure AD tenants have specific configurations: group membership policies, conditional access rules, and token lifetimes that differ from a standard Azure deployment. Getting Dex to pass group membership claims through to Kubeflow correctly, so that multi-tenancy and namespace isolation worked for different research teams, took careful configuration and testing with real university accounts.
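The namespace-isolation side of this is Kubeflow's `Profile` resource: one Profile per research group creates a namespace with RBAC scoped to its owner, keyed to the identity Dex asserts. A sketch with hypothetical names:

```yaml
# Hypothetical team and user; a Profile per research group gives each
# group its own isolated namespace in Kubeflow.
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: vision-team                 # also becomes the namespace name
spec:
  owner:
    kind: User
    name: lead.researcher@university.ac.uk   # must match the email claim from Dex
```

The subtlety the section describes lives in the `name` field: it must match the claim Azure AD emits and Dex forwards exactly, which is why testing with real university accounts, rather than synthetic ones, mattered.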
**Upgrading without disrupting active research.** Research workloads do not pause conveniently. Data scientists run training jobs that last days. Coordinating a major version upgrade around active workloads required mapping every running job and notebook, understanding which could be safely interrupted and which could not, and planning the maintenance sequence accordingly.
## Results
| Metric | Outcome |
|---|---|
| Uptime | 100% over 2 years of operation |
| Version upgrades | 1 major upgrade completed without disrupting active workloads |
| Authentication | Azure SSO via Dex, researchers use institutional credentials |
| Data sovereignty | All research data and model artifacts on university-controlled infrastructure |
| Researcher experience | Train and run ML models without infrastructure knowledge or IT involvement |
| Storage | Replicated research databases, protected against storage failure |
| Branding | Custom institutional identity across all researcher-facing surfaces |
## What changed for the research team
Before this platform, data scientists needed IT involvement to provision compute for training jobs, access was managed through separate credentials, and there was no unified interface for managing experiments, pipelines, and model deployments. Research infrastructure was a recurring friction point.
Two years later, the lab's data scientists open a browser, authenticate with their university credentials, launch a notebook or pipeline, and run their work. The infrastructure layer is invisible to them. That is what the platform was built to achieve, and it has operated that way since deployment.
For on-premises ML infrastructure built for long-term stability in research or regulated environments, see our MLOps service.
If you need a production ML platform on your own infrastructure with zero data sovereignty compromises, book a free 30-minute call.