Research / Education
MLOps · InfraOps · Platform Engineering · DevSecOps

On-Prem Kubeflow for University Research: 2 Years, Zero Downtime

Published April 6, 2026

- 100% uptime over 2 years
- 1 version upgrade, zero disruption
- On-prem hosting for data sovereignty
- Azure SSO integrated with university auth
Kubeflow · Kubernetes · Azure SSO · Dex · PostgreSQL · Terraform · Custom Branding

University research labs have infrastructure requirements that sit between enterprise IT and startup engineering. The data is sensitive: research data under NDA, patient cohort data for health studies, proprietary datasets from industry partnerships. The users are data scientists who need to train and run large ML models without filing IT tickets. And the platform has to operate reliably for years, not just through an initial deployment.

This engagement is under NDA. The client is a research lab at a prestigious UK university. What we can say: Eprecisio deployed a production-grade on-premises Kubeflow platform, has operated it for two years with 100% uptime, performed one major version upgrade without disrupting active research workloads, and manages the full infrastructure layer so data scientists can focus on their work.

What the lab needed

The lab runs ML research at scale. Data scientists need to train models on large datasets, track experiments, manage pipelines, and collaborate across research teams. The requirement was a self-hosted ML platform that could do all of this on the lab's own infrastructure, without research data transiting external cloud services.

The specific requirements shaped every architectural decision:

| Requirement | Detail | Why it mattered |
| --- | --- | --- |
| On-premises hosting | All compute and data on university-controlled infrastructure | Research data under NDA and data-sharing agreements cannot leave controlled infrastructure |
| Azure SSO integration | Single sign-on through the university's existing Azure AD | Researchers use university credentials; no separate account management |
| Custom branding | Platform skinned to the institution's identity | Lab-specific deployment, not a generic Kubeflow instance |
| Data replication strategy | Persistent storage with replication for research databases | Research outputs, model artifacts, and experiment data must be protected against storage failure |
| Zero friction for researchers | Data scientists access the platform and run work without infrastructure knowledge | The platform exists to remove infrastructure as a barrier to research |
| Long-term stability | Two-plus years of reliable operation | Research projects run for months or years; disruption costs more than in a typical product environment |

The deployment

Kubeflow was deployed on the lab's on-premises Kubernetes cluster. The full Kubeflow component set was configured: Pipelines for workflow orchestration, Notebooks for interactive development, KFServing for model serving, and the central dashboard as the researcher-facing interface.
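Kubeflow's upstream manifests are assembled with kustomize, so a deployment of this shape is typically driven by a single kustomization file. A minimal sketch with hypothetical component paths (the actual layout depends on the Kubeflow manifests release in use, and is not the lab's actual configuration):

```yaml
# kustomization.yaml -- illustrative only; component paths are placeholders
# and vary between Kubeflow manifests releases
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - common/cert-manager        # TLS certificates for in-cluster services
  - common/istio               # ingress gateway in front of the dashboard
  - common/dex                 # OIDC provider bridging to Azure AD
  - apps/pipelines             # Kubeflow Pipelines backend and UI
  - apps/jupyter               # notebook controller and web app
  - apps/kfserving             # model serving
  - apps/central-dashboard     # researcher-facing entry point
```

Applied with `kustomize build . | kubectl apply -f -`, the pattern the upstream Kubeflow manifests document.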

Authentication with Azure and Dex. The university runs Azure Active Directory for identity management. Researchers authenticate to every university system with their institutional credentials. Connecting Kubeflow to that identity layer required configuring Dex as the OIDC connector between Kubeflow's authentication layer and Azure AD. As a result, researchers access the ML platform with the same login they use for every other university system: no separate account to create, no extra password to manage.
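Dex ships a native `microsoft` connector for Azure AD, which is the usual way to wire this up. A minimal sketch of the connector block; every identifier below is a placeholder, and requesting group claims assumes the Azure app registration is configured to emit them:

```yaml
# Fragment of the Dex configuration (a ConfigMap in a typical Kubeflow
# install). Tenant, client credentials, hostname, and group names are
# placeholders, not the lab's actual values.
connectors:
  - type: microsoft
    id: microsoft
    name: University Azure AD
    config:
      clientID: <azure-app-client-id>
      clientSecret: <azure-app-client-secret>
      redirectURI: https://kubeflow.example.ac.uk/dex/callback
      tenant: <university-tenant-id>   # restrict logins to the university tenant
      groups:                          # only admit members of these AD groups
        - ml-research-platform-users
```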

Custom branding. The Kubeflow interface was branded to the institution. The dashboard, login page, and documentation surfaces reflect the lab's identity rather than the generic Kubeflow upstream. For a research institution, the tooling presented to researchers and external collaborators carries the institution's credibility.

Data replication for research databases. Research databases hold experiment logs, model checkpoints, pipeline artifacts, and dataset versions. The storage layer was configured with replication to protect against single-node storage failure. For a research environment where a corrupted or lost dataset might represent months of collection work, storage reliability is not optional.
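For the database layer, the standard PostgreSQL approach is streaming replication from a primary to one or more standbys. A sketch of the primary-side settings, wrapped in a ConfigMap as they might appear in-cluster (names and values are illustrative, not the lab's actual configuration):

```yaml
# Illustrative primary-side replication settings for PostgreSQL.
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgres-primary-config
  namespace: kubeflow
data:
  postgresql.conf: |
    wal_level = replica                     # WAL detail required for streaming replication
    max_wal_senders = 3                     # connection slots for standby servers
    wal_keep_size = 1GB                     # retain WAL so a lagging standby can catch up (PostgreSQL 13+)
    synchronous_standby_names = 'standby1'  # commits wait for this standby, trading latency for durability
```

The synchronous setting is the durability lever: with it, a committed experiment record exists on two nodes before the client sees success.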

| Component | Configuration | Purpose |
| --- | --- | --- |
| Kubeflow Pipelines | On-prem Kubernetes deployment | ML workflow orchestration, reproducible research pipelines |
| Kubeflow Notebooks | JupyterHub-based, resource-managed | Interactive model development for data scientists |
| KFServing | On-prem model serving | Deploy trained models for inference without infrastructure work |
| Dex OIDC | Configured against university Azure AD | Single sign-on with institutional credentials |
| Custom branding | Dashboard, login page, documentation | Institution-specific interface |
| PostgreSQL with replication | Persistent storage for all Kubeflow state | Research data, pipeline metadata, experiment artifacts |
| Kubernetes cluster | On-prem, managed by Eprecisio | Workload management, resource scheduling, automated recovery |
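What serving looks like from the researcher's side: deploying a trained model is a short manifest, not an infrastructure project. A sketch using KFServing's v1beta1 `InferenceService` (model name, namespace, and storage path are hypothetical; the example assumes a scikit-learn model stored on a cluster PVC):

```yaml
apiVersion: serving.kubeflow.org/v1beta1
kind: InferenceService
metadata:
  name: cohort-risk-model        # hypothetical model name
  namespace: research-team-a     # hypothetical team namespace
spec:
  predictor:
    sklearn:
      # pvc:// URIs keep model artifacts on in-cluster replicated storage,
      # so nothing transits an external object store
      storageUri: pvc://research-artifacts/models/cohort-risk-model
```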

Two years of operations

The measure of a production deployment in a research environment is not the launch. It is what happens over the following months and years as the platform becomes critical infrastructure for active research.

Kubeflow has been running on this infrastructure for two years. In that time there have been no unplanned outages. Data scientists have been able to train and run models without infrastructure-related interruptions to their work.

One major version upgrade was performed over the two-year period. Kubeflow version upgrades involve changes to the underlying Kubernetes operators, the pipeline backend, and the notebook configurations. Doing this without disrupting active research workloads required careful planning: a staging upgrade to validate the new version, a maintenance window coordinated with the research teams, and a rollback plan that was tested before the upgrade began. The upgrade completed without disrupting any active notebooks, pipelines, or model serving deployments.

What was hard

Kubeflow on-prem storage for long-running research workloads. Research pipelines run for hours or days. Checkpoint storage, intermediate outputs, and experiment artifacts accumulate. Configuring persistent volume management that handled this gracefully, without filling storage unexpectedly or losing data on pod restarts, required more careful tuning than a typical application deployment.
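One concrete lever here is the reclaim policy: with `Retain`, deleting a PVC (or the notebook that owned it) does not destroy the underlying volume. A sketch of the kind of storage configuration this implies; the provisioner, names, and sizes are placeholders, since the actual on-prem storage backend is not disclosed:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: research-retain
provisioner: kubernetes.io/no-provisioner   # placeholder; depends on the on-prem storage backend
reclaimPolicy: Retain                       # volume survives PVC deletion
volumeBindingMode: WaitForFirstConsumer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: experiment-checkpoints              # hypothetical claim for long-running pipeline output
  namespace: research-team-a
spec:
  storageClassName: research-retain
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 200Gi                        # sized for days of checkpoints; illustrative
```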

Dex and Azure AD integration edge cases. University Azure AD tenants have institution-specific configurations: group membership policies, conditional access rules, and token lifetimes that differ from a standard Azure deployment. Getting Dex to pass group membership claims through to Kubeflow, so that multi-tenancy and namespace isolation worked correctly for each research team, took careful configuration and testing with real university accounts.
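Namespace isolation in Kubeflow is built on the `Profile` custom resource: each team gets a Profile, which provisions a namespace, and access is then granted through RoleBindings keyed on the identities and groups Dex passes through. A minimal sketch (the owner email and names are hypothetical):

```yaml
apiVersion: kubeflow.org/v1
kind: Profile
metadata:
  name: research-team-a                   # becomes the team's namespace
spec:
  owner:
    kind: User
    name: lead.researcher@example.ac.uk   # hypothetical; matched against the OIDC email claim
```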

Upgrading without disrupting active research. Research workloads do not pause conveniently. Data scientists run training jobs that last days. Coordinating a major version upgrade around active workloads required mapping every running job and notebook, understanding which could be safely interrupted and which could not, and planning the maintenance sequence accordingly.

Results

| Metric | Outcome |
| --- | --- |
| Uptime | 100% over 2 years of operation |
| Version upgrades | 1 major upgrade completed without disrupting active workloads |
| Authentication | Azure SSO via Dex; researchers use institutional credentials |
| Data sovereignty | All research data and model artifacts on university-controlled infrastructure |
| Researcher experience | Train and run ML models without infrastructure knowledge or IT involvement |
| Storage | Replicated research databases, protected against storage failure |
| Branding | Custom institutional identity across all researcher-facing surfaces |

What changed for the research team

Before this platform, data scientists needed IT involvement to provision compute for training jobs, access was managed through separate credentials, and there was no unified interface for managing experiments, pipelines, and model deployments. Research infrastructure was a recurring friction point.

Two years later, the lab's data scientists open a browser, authenticate with their university credentials, launch a notebook or pipeline, and run their work. The infrastructure layer is invisible to them. That is what the platform was built to achieve, and it has operated that way since deployment.

For on-premises ML infrastructure built for long-term stability in research or regulated environments, see our MLOps service.

If you need a production ML platform on your own infrastructure, without compromising data sovereignty, book a free 30-minute call.
