Infrastructure / SaaS
Platform Engineering · InfraOps · MLOps

Stack8s: Rebuilding a Kubernetes Platform That Actually Works

Published November 15, 2025

Full Platform Rebuild: 2 months
Eprecisio Engineers: 5 to 6
Client Relationship: 5+ years
Hardware or Environment: Any
Open Source Charts: 100+

Vanilla Kubernetes · ArgoCD · Terraform · Kubeflow · VMware · GPU Operator · Service Mesh · Helm · CAST AI · AI Architect Plugin

Stack8s is a Kubernetes automation platform built around data sovereignty. It deploys vanilla Kubernetes clusters on hardware that customers own and control, whether that is a bare metal server in their own data centre, a private cloud, or any combination. Customers link their own hardware into clusters, control which workloads run in which environment, and are never locked into a managed cloud provider's Kubernetes offering. The vision was right. The infrastructure holding it together was not. Eprecisio joined as the founding engineering partner, rebuilt the entire platform from the infrastructure layer up, and has been the core delivery team through the product's growth to its current position as a recognised player at KubeCon.

How the relationship started

The engagement did not start with a Kubernetes platform. It started with a healthcare project.

Ehtisham joined Dr. Jeremy Murray's team to work on a healthcare compliance project. The team was small, the stack was complex, and the in-house team struggled to manage the infrastructure to the standard that healthcare compliance demands. Ehtisham stepped in individually to address those blockers.

That engagement built the trust that led to Stack8s. When Dr. Murray started building his vision for a Kubernetes automation product, Eprecisio was the partner he turned to. The relationship that started with one engineer on a healthcare project is now a team of 5 to 6 engineers working full-time on a product that is being presented at KubeCon.

Stack8s is a commercially ambitious product. Its differentiator is data sovereignty. Customers bring their own hardware, register it into the platform, and get production-grade Kubernetes without handing their workloads to a cloud provider's managed service. They control what runs where. The platform also ships a marketplace of plugins that teams can deploy directly into their clusters: AI Architect for AI workflow orchestration, Kubeflow for ML pipelines, Laravel stack integrations, and 100+ other open source tools. All of this is available through the Stack8s interface without the customer needing Kubernetes expertise. To deliver that experience credibly, the platform itself has to be faultless.

The state of the platform when active development began

Seven months ago, when the current active engagement began in earnest, the platform was failing repeatedly. Not occasionally. Continuously.

The core problem was that the architecture had accumulated instability at every layer. Networking was unreliable when customers connected their own hardware from different environments. State management did not exist in any meaningful form, so the platform had no consistent picture of what was running, what had failed, or what needed attention.

| Area | State at the start | Impact on customers |
| --- | --- | --- |
| Platform stability | Continuously failing, no root cause tracking | Customers could not trust clusters they provisioned |
| State management | No unified state layer | Node status, provisioning state, and cluster health were inconsistent across views |
| Networking | Unreliable when customers connected hardware from different environments | Workload connectivity failed silently when hardware was registered from mixed environments |
| GPU management | No operator-level control over GPU allocation | ML teams could not rely on GPU provisioning |
| Marketplace | Charts deployed inconsistently, no deployment framework | 100+ open source charts had no reliable install path |
| Customer onboarding | Node registration and NACL creation unreliable | New customer setup required manual intervention |
| Alerting | No structured alerting or status notifications | Failures went undetected until customers reported them |
| Pricing | Connectivity issues with external cloud provider billing APIs | Cost data was inaccurate or unavailable |

When funding came in and the product needed to scale, the architecture underneath it was not ready. The decision was made to stop patching and do a full rebuild.

The team and how the engagement evolved

The engagement grew the way most of our strongest relationships do. It started with one person, proved its value, and expanded as the scope became clear.

| Role | What Eprecisio owns |
| --- | --- |
| Platform engineering lead | Infrastructure architecture, Kubernetes operator design, cross-cloud networking |
| DevOps engineers (x2) | CI/CD, cluster lifecycle management, ArgoCD GitOps, Terraform modules |
| Full-stack engineer | Platform UI, customer-facing APIs, marketplace frontend |
| Product manager | Roadmap, PRDs, delivery process, sprint management |
| AI-native development | AI-assisted feature development and code quality processes |

This is not a vendor relationship. Eprecisio owns the roadmap process, manages delivery, writes the PRDs, and makes architecture decisions. Dr. Murray focuses on business development, partnerships, and product vision. The engineering execution is ours.

The rebuild: what we actually did

The 2-month rebuild was not a rewrite of features. It was a reconstruction of the foundation the features run on.

Infrastructure and state management layer. The platform had no consistent state model. We designed and implemented a state management architecture that tracks every cluster, node, and workload across all three cloud environments in real time. Every provisioning operation now has defined state transitions with persistence and recovery paths.
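
To make "defined state transitions with persistence and recovery paths" concrete, here is a minimal sketch in Python. The state names, the `ALLOWED` transition table, and the `NodeRecord` class are all hypothetical and chosen for illustration; the in-memory `history` list stands in for whatever persistent store a real platform would use. It is not the Stack8s implementation.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeState(Enum):
    # Hypothetical lifecycle states for a registered node.
    REGISTERED = "registered"
    PROVISIONING = "provisioning"
    READY = "ready"
    FAILED = "failed"
    DECOMMISSIONED = "decommissioned"


# Allowed transitions; anything not listed here is rejected.
ALLOWED = {
    NodeState.REGISTERED: {NodeState.PROVISIONING, NodeState.DECOMMISSIONED},
    NodeState.PROVISIONING: {NodeState.READY, NodeState.FAILED},
    NodeState.READY: {NodeState.FAILED, NodeState.DECOMMISSIONED},
    # Recovery path: a failed node may be retried or retired.
    NodeState.FAILED: {NodeState.PROVISIONING, NodeState.DECOMMISSIONED},
    NodeState.DECOMMISSIONED: set(),
}


@dataclass
class NodeRecord:
    name: str
    state: NodeState = NodeState.REGISTERED
    # Stand-in for a persistent transition log (assumption: in-memory only here).
    history: list = field(default_factory=list)

    def transition(self, target: NodeState) -> None:
        """Move to `target` only if the transition table permits it."""
        if target not in ALLOWED[self.state]:
            raise ValueError(f"{self.state.value} -> {target.value} is not allowed")
        self.history.append((self.state, target))
        self.state = target


node = NodeRecord("bare-metal-01")
node.transition(NodeState.PROVISIONING)
node.transition(NodeState.FAILED)
node.transition(NodeState.PROVISIONING)  # retry after a failure
node.transition(NodeState.READY)
```

Rejecting undeclared transitions is what gives every view of the platform the same answer about what a node is doing: a node cannot be "ready" in one place and "registered" in another if only one path leads between the two.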

Networking for bring-your-own-hardware. Stack8s does not provision managed Kubernetes services. It deploys vanilla Kubernetes clusters on hardware that customers register from wherever that hardware lives. That means the networking layer has to handle arbitrary hardware from arbitrary environments connecting into a single control plane. We rebuilt the networking layer to handle hardware registration from any environment, normalise the connectivity model across mixed infrastructure, and maintain stable cluster networking as customers add or remove nodes from different physical or virtual locations.

GPU operator and compute management. We integrated the NVIDIA GPU Operator with custom resource allocators that give the platform real control over GPU scheduling, allocation, and monitoring across customer clusters.
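
As an illustration of what a custom allocator decides, the sketch below picks a node for a GPU request using a best-fit rule. The `GpuNode` class, the `allocate` function, and the best-fit policy are assumptions made for this example, not the allocators Stack8s runs. The one detail taken from Kubernetes itself is the `nvidia.com/gpu` resource name, which is what the NVIDIA device plugin advertises and what a container requests in its resource limits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GpuNode:
    # Hypothetical view of one GPU node in a customer cluster.
    name: str
    total_gpus: int
    allocated_gpus: int = 0

    @property
    def free_gpus(self) -> int:
        return self.total_gpus - self.allocated_gpus


def allocate(nodes: list[GpuNode], requested: int) -> Optional[GpuNode]:
    """Best fit: choose the node with the fewest free GPUs that still fits."""
    candidates = [n for n in nodes if n.free_gpus >= requested]
    if not candidates:
        return None  # nothing can satisfy the request right now
    chosen = min(candidates, key=lambda n: n.free_gpus)
    chosen.allocated_gpus += requested
    return chosen


def gpu_limits(requested: int) -> dict:
    """Container `resources` stanza requesting whole GPUs."""
    return {"limits": {"nvidia.com/gpu": requested}}


nodes = [GpuNode("gpu-a", total_gpus=8), GpuNode("gpu-b", total_gpus=4, allocated_gpus=2)]
first = allocate(nodes, 2)   # fits on both; best fit prefers gpu-b
second = allocate(nodes, 4)  # only gpu-a has room
third = allocate(nodes, 8)   # no node has 8 free GPUs left
```

Best fit keeps large nodes free for large requests, which matters when ML jobs ask for several GPUs at once. Other policies (spread, first fit) are equally valid depending on the workload mix.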

Service mesh. We designed and implemented the service mesh layer for inter-cluster communication, traffic management, and observability, resolving the connectivity issues that had made the platform unpredictable.

Marketplace and plugin framework. Stack8s ships a marketplace of plugins that customers deploy directly into their clusters from within the platform. This includes AI Architect for AI workflow orchestration, Kubeflow for ML pipelines, Laravel stack integrations, and 100+ other open source Helm charts. We rebuilt the framework that governs how plugins are packaged, versioned, deployed, and updated across customer clusters, so every chart in the marketplace installs reliably regardless of what hardware the cluster is running on.
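
One common way to make a Helm chart install repeatably is to describe it as an Argo CD `Application` and let GitOps handle the sync. The sketch below builds such a manifest in Python. It assumes the marketplace could express a plugin this way; the function name, the chart name, the version, and the `https://charts.example.com` repository are placeholders, and the source does not confirm this is how the Stack8s chart operator works.

```python
import json


def argocd_application(name, chart, repo_url, version, namespace, parameters=None):
    """Build an Argo CD Application manifest for one pinned Helm chart version."""
    app = {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Application",
        "metadata": {"name": name, "namespace": "argocd"},
        "spec": {
            "project": "default",
            "source": {
                "repoURL": repo_url,
                "chart": chart,
                # Pinning the chart version keeps installs reproducible.
                "targetRevision": version,
            },
            "destination": {
                "server": "https://kubernetes.default.svc",
                "namespace": namespace,
            },
            "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
        },
    }
    if parameters:
        # Helm parameter overrides, passed as name/value string pairs.
        app["spec"]["source"]["helm"] = {
            "parameters": [{"name": k, "value": str(v)} for k, v in parameters.items()]
        }
    return app


# Placeholder chart, repository, and version for illustration only.
manifest = argocd_application(
    name="example-plugin",
    chart="example-plugin",
    repo_url="https://charts.example.com",
    version="1.2.3",
    namespace="plugins",
    parameters={"replicaCount": 2},
)
print(json.dumps(manifest, indent=2))
```

Because the manifest is plain data, the same function can serve every chart in a catalogue: versioning becomes a change to `targetRevision`, and an update is a new commit rather than a manual `helm upgrade`.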

Customer onboarding infrastructure. Node registration and NACL creation for new customers were manual and error-prone. We automated the full onboarding flow so new customer environments provision without manual intervention.

| Component | What we rebuilt | Technology |
| --- | --- | --- |
| State management | Unified state layer across all cloud providers | Custom Kubernetes operators, etcd |
| Networking for BYOH | Stable cluster networking across hardware registered from any environment | Vanilla Kubernetes networking, custom node registration layer |
| GPU management | Operator-level GPU provisioning and allocation | NVIDIA GPU Operator, custom allocators |
| Service mesh | Fast, stable inter-cluster communication | Custom service mesh implementation |
| Plugin marketplace | Deployment framework for AI Architect, Kubeflow, Laravel stack, 100+ charts | Helm, ArgoCD, custom chart operator |
| Customer onboarding | Automated node registration and NACL creation | Terraform, Kubernetes admission controllers |
| Alerting | Structured cluster and node health alerting | Prometheus, Alertmanager |
| GitOps pipeline | Automated cluster lifecycle management | ArgoCD, GitHub Actions |
| Pricing integration | Reliable connectivity to cloud provider billing APIs | CAST AI integration, custom billing adapters |

The hardest parts

Redesigning the infrastructure layer without taking the product offline. Stack8s had paying customers during the rebuild. The platform could not simply go dark for 2 months. The approach was to build the new infrastructure layer in parallel, migrate workloads incrementally, and cut over component by component.

Networking for arbitrary hardware configurations. Because Stack8s registers customer-owned hardware rather than provisioning managed cloud nodes, the networking layer has to handle a much wider range of physical and virtual configurations. Customers were registering nodes from bare metal, from private clouds, from VMware environments, and from various provider setups. Getting the control plane to maintain stable connectivity across all of these took significantly longer than a more constrained networking model would have.

State recovery for existing clusters. When we introduced the new state management layer, existing customer clusters had no state history. Building a reconciliation process that reconstructed accurate state for live clusters without disrupting them was the most technically delicate work of the rebuild. A single error would have made existing deployments unmanageable.
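
The core of a reconciliation pass like this can be sketched as a read-only comparison between what the state layer has recorded and what the live cluster reports. The function below is illustrative only: the `reconcile` name, the three plan buckets, and the node/status dictionaries are assumptions, and a real process would read the observed side from the Kubernetes API rather than from a literal.

```python
def reconcile(recorded: dict[str, str], observed: dict[str, str]) -> dict[str, list[str]]:
    """Compare stored state with observed state without touching the cluster."""
    plan: dict[str, list[str]] = {"adopt": [], "update": [], "orphaned": []}
    for node, status in observed.items():
        if node not in recorded:
            plan["adopt"].append(node)      # live node with no state history
        elif recorded[node] != status:
            plan["update"].append(node)     # stored state has drifted
    for node in recorded:
        if node not in observed:
            plan["orphaned"].append(node)   # recorded, but no longer reported
    return plan


# Hypothetical snapshot: the state layer knows two nodes, the cluster reports two.
recorded = {"node-1": "Ready", "node-3": "Ready"}
observed = {"node-1": "NotReady", "node-2": "Ready"}
plan = reconcile(recorded, observed)
```

Producing a plan first, and applying it to the state store only after review, is what keeps the process safe for live clusters: the comparison itself never changes a running workload.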

Dead code and architectural debt. The AI-assisted rebuild process surfaced a significant amount of duplicate and dead code. Removing it required understanding which code was genuinely unused versus which was reached through uncommon paths not obvious from static analysis. This took longer than a clean codebase would have, but it was the right call.
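
The gap between "unused" and "reached through an uncommon path" is easy to demonstrate. The sketch below uses Python's standard `ast` module to flag functions that are never referenced by name, then runs it on a sample where one flagged function is in fact called through dynamic lookup. The sample code and the `unreferenced_functions` helper are hypothetical, and the source does not say which language or tooling the Stack8s codebase uses.

```python
import ast


def unreferenced_functions(source: str) -> set[str]:
    """Names of functions that are defined but never referenced by name."""
    tree = ast.parse(source)
    defined = {
        n.name
        for n in ast.walk(tree)
        if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))
    }
    referenced = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
    referenced |= {n.attr for n in ast.walk(tree) if isinstance(n, ast.Attribute)}
    return defined - referenced


SAMPLE = '''
def used():
    return 1

def looks_dead():
    return 2

def handler_export():
    return 3

def dispatch(kind):
    # Dynamic lookup: the function name never appears as an identifier.
    return globals()["handler_" + kind]()

print(used(), dispatch("export"))
'''

flagged = unreferenced_functions(SAMPLE)
```

Here `flagged` contains both `looks_dead` and `handler_export`, but only the first is genuinely dead: `handler_export` runs whenever `dispatch("export")` is called. Static analysis narrows the candidate list; confirming each removal still takes runtime evidence or a careful read of the call paths.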

Results

| Metric | Before | After |
| --- | --- | --- |
| Platform stability | Continuously failing | Stable. No recurring systemic failures since rebuild. |
| State management | No consistent state | Real-time state tracking across all clusters and nodes |
| Customer onboarding | Manual intervention required | Fully automated node registration and environment setup |
| ML setup time | Weeks of manual GPU cluster configuration | Hours with automated GPU provisioning |
| Release velocity | Blocked by instability | Regular feature releases on structured sprint cadence |
| Chart deployment | Inconsistent, manual troubleshooting | Reliable across all 100+ open source charts |
| Team model | 1 embedded engineer | 5 to 6 engineers, PM, roadmap ownership |
| Product positioning | Pre-funding, unstable product | KubeCon presence, CAST AI partnership |

"Their deep understanding of GPU infrastructure and MLOps made them the right choice for our project. Reduced our ML setup time by 60%."

Dr. Jeremy Murray, Founder at Stack8s

Where the product is now

Stack8s is no longer a product that is struggling to be stable. It is a product built on a clear and defensible position in the market: organisations that need production-grade Kubernetes without surrendering data sovereignty to a managed cloud provider. That means your hardware, your environment, your rules on where workloads run. Dr. Murray is now taking that product to KubeCon, presenting at Kubernetes automation working groups, and building partnerships with infrastructure players like CAST AI around it.

The Eprecisio team is not winding down. The engagement is actively growing. Dr. Murray has explicitly asked to scale the Pakistan-based engineering team further rather than continuing to hire in the UK, where previous direct hires did not work out.

For how we structure and manage production Kubernetes infrastructure at this scale, see our InfraOps service.

If you are building an infrastructure platform and need a team that can work at this level of technical depth and own the delivery process, book a free 30-minute call.
