Building a Platform Framework: Lessons Learned from Developing a Multi-Cluster Kubernetes Operator
- Ntongha Ekot
- Jan 20
- 5 min read
Cat Morris and Jake Klein’s KubeCon EU 2025 talk in London is a practical walkthrough of what happens when a “simple” Kubernetes setup grows into a full platform estate, and an exploration of the challenges that working with multiple clusters brings. They frame the problem with a familiar story: platform complexity rarely appears overnight. It creeps in, one sensible decision at a time.
They begin with a fictional startup called AI’m a little teapot, where the platform team’s job is to help developers deploy workloads and databases.

At first, one Kubernetes cluster is enough. Then production and development need separate clusters. Then an outage pushes the team toward availability zones and replicas. Then the business expands into the US with different regulatory requirements, so clusters now span regions. Next, an ML team brings Azure into the mix, and suddenly it is multi-cloud. Finally, an acquisition adds on-prem Kubernetes, along with the claim that it should be easy to manage because it is “just Kubernetes.”
By the time you look up, Cat and Jake argue, the platform is no longer “a cluster” but a fleet. And it is not even just Kubernetes anymore. It is also VMs, databases, edge compute, Terraform, Puppet, Ansible, and multiple orchestration approaches. Cat asks the audience who has experienced this slow build-up of complexity, and roughly half the room raises their hands. Her response is equal parts sympathetic and resigned: “Awesome… I’m sorry.”
This, they explain, is why they built Kratix, an open source platform framework to help teams build better platforms. The rest of the talk is a set of lessons learned from developing one of Kratix’s most challenging components: multi-cluster orchestration.
Don’t reinvent the wheel; use the ecosystem
Cat’s first message is simple: don’t rebuild what already exists. The Kubernetes ecosystem is vast and full of tools that already solve common platform problems. The people who built these tools have already made the mistakes, so you don’t have to.
She also highlights a key risk: platform teams often try to do everything themselves. They build bespoke interfaces, reinvent CI/CD patterns, or stitch together too many moving parts. The result is a platform that grows into something unmaintainable. Instead, she argues that teams should leverage the ecosystem for what it does well and focus engineering efforts on their differentiator.
For Kratix, that differentiator is orchestration. Cat introduces the Kratix concept of a Promise, their way of providing anything-as-a-service. A Promise includes the API users interact with, dependencies, workflows, and rules about where it ends up. Importantly, it is expressed as YAML and a CRD, so it fits naturally into Kubernetes rather than forcing a separate model.
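As a rough illustration, a Promise is a single Kubernetes manifest bundling the user-facing API, scheduling rules, and workflows. The sketch below follows the shape of Kratix’s published examples, but the group name, container image, and label values are placeholders, and the exact schema may differ between Kratix versions:

```yaml
apiVersion: platform.kratix.io/v1alpha1
kind: Promise
metadata:
  name: postgresql
spec:
  # The API users interact with: a CRD served by the platform cluster.
  api:
    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: postgresqls.example.kratix.io
    spec:
      group: example.kratix.io
      names:
        kind: Postgresql
        plural: postgresqls
        singular: postgresql
      scope: Namespaced
      versions:
        - name: v1alpha1
          served: true
          storage: true
          schema:
            openAPIV3Schema:
              type: object
              properties:
                spec:
                  type: object
                  properties:
                    size:
                      type: string
  # Rules about where the resulting workloads end up.
  destinationSelectors:
    - matchLabels:
        environment: dev
  # Workflows that turn a user request into deployable documents.
  workflows:
    resource:
      configure:
        - apiVersion: platform.kratix.io/v1alpha1
          kind: Pipeline
          metadata:
            name: instance-configure
          spec:
            containers:
              - name: configure
                image: registry.example.com/postgresql-configure:v0.1.0
```

Because the whole thing is a Kubernetes resource, it can be versioned, reviewed, and applied like any other manifest.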
Multi-cluster gets complicated fast
Jake explains why multi-cluster is inevitable as adoption grows. Developers get value from Kubernetes quickly by installing operators (he uses Kubeflow as an example), but as more people and teams adopt it, a “cluster per person” model collapses. Teams go multi-tenant, install different stacks, and start encountering the usual problems: conflicting requirements, hard upgrades, and accidental deletion of each other’s resources.
His conclusion is that developers should not be given direct access to Kubernetes for these workflows. They should interact with a platform, and the platform should manage the complexity.
Their goal was to enable a platform engineer to provide something like “Kubeflow as a service” using a Promise, and have Kratix schedule it to the right clusters based on requirements such as Kubernetes version or GPU nodes. Then, when users request training, the platform can choose the right cluster without the user needing to know anything about the fleet.
GitOps became the design choice that made multi-cluster practical
Jake says the simplest approach is to give a central platform credentials to talk to all clusters via the API, but Kratix avoided this because it assumes connectivity and centralised credentials. That is not realistic for edge clusters, airgapped environments, or hybrid and multi-cloud estates.
Instead, they leaned into GitOps: Kratix writes the desired state to Git repositories or S3 buckets, and clusters converge on that state using the tool of their choice (Argo, Flux, custom loops, or even manual file transfer). To enable scheduling, they reused Kubernetes-style labels at the cluster level, allowing the platform to filter clusters by region, GPUs, versions, and other capabilities.
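Concretely, each converging environment registers with the platform as a Destination carrying Kubernetes-style labels, and Promises filter on them. A sketch with placeholder names and labels (field names follow Kratix’s documented examples and may vary by version):

```yaml
# A GPU-equipped EU production cluster registering with the platform.
apiVersion: platform.kratix.io/v1alpha1
kind: Destination
metadata:
  name: eu-prod-gpu
  labels:
    region: eu
    environment: prod
    gpu: "true"
spec:
  # The Git repository (or S3 bucket) this destination converges from,
  # typically watched by Flux or Argo running on the remote cluster.
  stateStoreRef:
    kind: GitStateStore
    name: platform-state
```

A Promise’s destinationSelectors then match on these labels (for example `matchLabels: {gpu: "true"}`), so the platform can route workloads to clusters with the required capabilities without ever holding credentials for them.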
This worked well for simple cases, but Jake notes they quickly ran into more realistic requirements. One example was sending a training job to an expensive GPU cluster while sending its dashboards to a cheaper cluster. That forced them to evolve their scheduling to support sending multiple documents to multiple destinations simultaneously.
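Kratix handles this by letting a workflow split its output into directories and attach per-directory scheduling overrides. The sketch below is based on the documented pattern of the pipeline writing a destination-selectors file into its metadata directory; the file path, field names, and label values here are illustrative and may differ across Kratix versions:

```yaml
# /kratix/metadata/destination-selectors.yaml, written by the configure
# pipeline: each entry routes one output directory to matching destinations.
- directory: training-job
  matchLabels:
    gpu: "true"
- directory: dashboards
  matchLabels:
    cost-tier: low
```

With this in place, a single resource request can fan out across the fleet, with each part of the workload landing where it belongs.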
They also realised the GitOps pattern applies beyond Kubernetes. Terraform, Pulumi, Backstage, and other systems can converge from Git-backed state, so Kratix scheduling became more open-ended, supporting workflows across multiple tools, and not just clusters.
Reduce complexity in the parts you control
Once they adopted GitOps-style communication, Jake explains, they faced another challenge: how to get information back from remote clusters without introducing agents pushing to a central API. Rather than adding new connectivity assumptions, they reused the same pattern to communicate back through the same state store. The lesson, he argues, is to reduce complexity wherever you control the design.
A product lesson: prioritisation matters as much as engineering
Cat shares a cautionary tale about a versioning feature. A customer asked about upgrading Promises and managing dependency versions. The team went deep on design, expanding the scope into “boxes and boxes” of planning. They ultimately spent five weeks across design and execution, shipped the feature, and then heard nothing for twelve months. When feedback finally arrived, they struggled to remember the intent and reasoning behind earlier decisions.
Cat’s takeaway was that neither extreme works. Shipping something instantly can result in lots of small, hard-to-maintain features. Over-designing before shipping delays learning. Their CEO’s quote became their anchor: “Life is a series of prioritisation exercises.” User needs matter, but they must be prioritised against wider goals, and teams should remember why they started building in the first place.
The biggest lesson: never assume your users know Kubernetes
Cat closes with the strongest message in the talk. Kratix extends Kubernetes with CRDs and operators, and they assumed users would understand those primitives. In reality, many organisations are still early in their Kubernetes journey and do not have deep expertise. If you assume knowledge of Kubernetes, you risk baking in complexity and limiting adoption.
Multi-cluster complexity is becoming the norm, not the edge case.

