
A Guide to Fleet Management with Kratix

When embracing cloud and container technologies, it’s tempting to think of fleet management as a challenge for tomorrow. But the reality is that the majority of organisations quickly bump into the need to run multiple VMs, containers, and Kubernetes clusters. Often, this is triggered by the organisation seeing success and the business growing rapidly, but it can equally be triggered by increasing regulation and the need for compliance.


When running a set of services in a multi-tenant environment, teams often isolate workloads using namespaces and network policies in Kubernetes or similar mechanisms like user permissions or RBAC on other platforms. However, for teams operating under stricter compliance standards, this approach may not suffice. To maintain clear boundaries and meet audit requirements, a more common practice is to allocate a dedicated cluster per application or team.


The end result of this is usually the same: a fleet of services that run on different machines. Standing up infrastructure is the easy part; orchestrating deployments and ensuring consistency across all those clusters is the real challenge.


This is where Kratix comes in. In this guide, we’ll explore how Kratix enables you to orchestrate and deploy services seamlessly across a fleet. We will use Kubernetes clusters for our example, but Kratix also enables managing platform services across practically all types of infrastructure.


What is fleet management in platform engineering?

As platform engineering gains more ground, focus has shifted from the infrastructure to the experience of delivering software. This is typically achieved by giving developers the autonomy to provision and deploy their applications and required platform dependencies without the dreaded TicketOps experience. 


Fleet management, as it relates to platform engineering, moves the focus from treating your platform services and infrastructure as pets towards treating them as cattle.


Why fleet management matters

When the smoke clears and your infrastructure is all set up, the natural next step (day 2 operations) is figuring out how to orchestrate deploying your applications while factoring in individual team requests like “we need a database running alongside this app” or “this should only run in the EU region”.


A number of tools exist to handle these cases individually, but stitching them all together presents the challenge.


In the absence of fleet management in your platform, you risk running into a variation of cluster sprawl, where some teams run a cluster per project or per region, while others do shared clusters. Day 2 operations become even harder as everyone has a different way of doing things. In such a case, you end up treating not just your infrastructure but also your applications as pets, which platform engineering was meant to save you from in the first place.


How Kratix approaches fleet management 

Kratix approaches the fleet management problem by acting as a translation layer between the developer's intent and where it will run. It does so via Promises.


Promises provide platform users with an organisation-specific API for expressing the details of deploying their applications and the dependencies they need to run.


For developers, this could mean writing or using a Promise to deploy their app and specifying another Promise, this time for a relational database on AWS, as a dependency. The developer doesn't need to know how the database is provisioned; they simply have a guarantee that the database will exist before their app runs.


Fleet-level orchestration means updating one service equals updating all environments (or a selective subset of environments if required). Make a change to a Promise, and that change propagates across your entire fleet without needing to manually sync each cluster or maintain separate configurations for each environment.


This is where Kratix diverges from tools like ArgoCD. ArgoCD is good at syncing resources, but it's not built for full platform orchestration. GitOps is the most intuitive solution, but it's not the full story. When you leverage just GitOps, you end up with what Daniel Bryant calls the "portals and pipelines" antipattern, where you're stitching together a combination of GitOps tooling and a few scripts or an operator to reconcile external resources or create additional resources.


This problem isn't so apparent until you find yourself Googling "multi-cluster ArgoCD" to orchestrate a set of apps across clusters, even though you have your portals all set up. At that point, you're back to managing the orchestration layer yourself, which is exactly what fleet management is supposed to solve. 


Demo overview 

Fleet management with Kratix is quite involved, as it is built to scale with any number of clusters you might add.  For this demo, we will simulate how you would update a service across multiple clusters.


How this will work:

  1. Promise installation: Platform operators install Promises on the platform cluster; each Promise registers a new Custom Resource Definition (CRD), e.g., rabbitmq.

  2. Developer request: Developers create resource requests using the exposed CRDs, specifying their requirements in YAML files.

  3. Kratix processing: The platform cluster processes the request and generates the necessary Kubernetes manifests.

  4. Git state management: Processed manifests are committed to the Git repository under destination-specific paths (more on this shortly).

  5. Flux synchronisation: Worker clusters run Flux operators that continuously sync from their designated StateStores.

  6. Workload deployment: Flux deploys the synchronised manifests as actual workloads on the target worker clusters.


Here is a visual illustration of the setup you will be replicating.


Fleet management with Kratix

Prerequisites

Before you can follow along with this guide, make sure you have the following tools and setup ready:

  1. Kubectl: The Kubernetes command-line tool for managing and interacting with your clusters.

  2. Kubernetes clusters: You’ll need at least two active Kubernetes clusters.

  3. Git repository access (GitHub account): Required to host the Kratix state store and manage configuration through GitOps.

  4. Docker (for KinD users): A container runtime needed to run KinD clusters locally.


NOTE: This demo was tested using Civo Cloud with clusters in the US, London, and Frankfurt regions. If you don’t have access to managed clusters, you can use local tools like KinD or Minikube. However, a managed provider is recommended for the most realistic, production-like results.


Repository setup

Kratix uses GitOps to provision and keep your services in sync through one of the supported GitOps agents installed on target destinations. For this demo, the destinations will be Kubernetes clusters, and the GitOps agent on each cluster takes care of keeping services in sync.


By attaching destinations to a StateStore, GitOps agents know which resources to sync to a destination. At the time of writing, a StateStore can be one of two kinds: a GitStateStore, which is a standard git repository on a provider like GitHub, or a BucketStateStore, which is any S3-compatible storage.


For this demo, we will be using a GitStateStore as it mirrors the day-to-day workflow you might already be used to. 


Generate a key pair

In order for the GitOps agent to access your repository securely, you will need to generate an SSH key pair. 


To do this, run the following command in your terminal:
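A minimal sketch, assuming an ed25519 key and the file name used in the next step:

# generates kratix-deploy (private) and kratix-deploy.pub (public) with no passphrase
ssh-keygen -t ed25519 -C "kratix-deploy" -f kratix-deploy -N ""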


This will create a public and private key pair in your current directory: kratix-deploy and kratix-deploy.pub.


Step 1: Create a deploy key

Deploy keys allow external infrastructure, like a GitOps agent, to authenticate with your repository via SSH. This is great not only as a security measure, but also if you are working with a private repository. 


To add the keys you just generated, head back to your repository and navigate to Settings > Deploy keys > Add deploy key.



Paste in the public key you generated earlier, and be sure to enable write access, as Kratix will need to make commits to the repository.  


Step 2: Set up the platform cluster 

As illustrated in the diagram earlier, the platform cluster is what coordinates all deployments to each destination; as such, you will want to set this up first. 


Export the platform cluster context

Run the following command to export the kube-context of your platform cluster:
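A minimal sketch, assuming the environment variable is named PLATFORM (any name works, as long as you reuse it consistently in the later commands):

# replace the value with your platform cluster's context name
export PLATFORM="<your-platform-cluster-context>"
kubectl config use-context $PLATFORM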


Be sure to replace the placeholder with the name of the Kubernetes cluster you intend to use as the platform cluster. If you are unsure, run kubectl config get-contexts.


Deploy cert-manager

Kratix uses certificates to deploy its internal validating and mutating Kubernetes webhooks. By default, Kratix uses cert-manager to generate the certificates; as such, you will need to install this first.


To do this, run the command below: 
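A sketch of a standard cert-manager install from its release manifest; the version pinned here is an assumption, so check the cert-manager releases page for the current one:

# install cert-manager on the platform cluster (version shown is an example)
kubectl apply --context $PLATFORM -f https://github.com/cert-manager/cert-manager/releases/download/v1.15.0/cert-manager.yaml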


Install Kratix 

The command below will install the latest release of Kratix. 
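A sketch using the release manifest URL pattern from the Kratix documentation at the time of writing; confirm it against the docs for the release you want:

kubectl apply --context $PLATFORM -f https://github.com/syntasso/kratix/releases/latest/download/kratix.yaml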


Verify the installation is complete by running:
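One way to check, assuming Kratix is installed into its default kratix-platform-system namespace:

kubectl get pods --context $PLATFORM -n kratix-platform-system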


The output should show all Kratix controller pods in a Running state.


Create a StateStore 

With a repository created and Kratix installed, the next step is to create a StateStore.  This will be the source of truth for all deployments, and in this demo, the StateStore will be backed by Git. 


The first step is to provide Kratix with credentials to access the repository: 


Export private SSH key:
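A sketch that base64-encodes the private key generated earlier; the variable name SSH_PRIVATE_KEY is an assumption reused in the Secret below:

# base64-encode the private key so it can go straight into a Secret's data field
export SSH_PRIVATE_KEY=$(base64 < kratix-deploy | tr -d '\n')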


Create and export known hosts:
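A sketch using ssh-keyscan to capture GitHub's host key; the file and variable names are assumptions:

# capture GitHub's host key and base64-encode it
ssh-keyscan github.com > kratix_known_hosts
export KNOWN_HOSTS=$(base64 < kratix_known_hosts | tr -d '\n')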


Set your Git repository URL:

export GIT_REPO=git@github.com:<your-username>/<your-repo>.git


Replace <your-username>/<your-repo> with your actual GitHub repository. This is where Kratix will push all generated manifests.


Adding known hosts is SSH's way of preventing man-in-the-middle attacks. When you connect to a server via SSH for the first time, SSH stores that server's fingerprint (host key) in your ~/.ssh/known_hosts file. On subsequent connections, SSH verifies the server's identity by checking if the fingerprint matches what's stored.


Next, create a secret that will hold the GitHub credentials:
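A sketch of that Secret; the name kratix-git-secret is an assumption, and the sshPrivateKey/knownHosts key names follow the Kratix docs for SSH authentication, so double-check them against the version you are running:

cat <<EOF | kubectl apply --context $PLATFORM -f -
apiVersion: v1
kind: Secret
metadata:
  name: kratix-git-secret   # referenced by the StateStore's secretRef below
  namespace: default
type: Opaque
data:
  sshPrivateKey: ${SSH_PRIVATE_KEY}
  knownHosts: ${KNOWN_HOSTS}
EOF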


Deploy StateStore
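A sketch of the GitStateStore, assuming the field names from the Kratix API at the time of writing and a destinations/ base path so the repository layout matches the paths Flux syncs later:

cat <<EOF | kubectl apply --context $PLATFORM -f -
apiVersion: platform.kratix.io/v1alpha1
kind: GitStateStore
metadata:
  name: default
spec:
  authMethod: ssh
  url: ${GIT_REPO}
  branch: main
  path: destinations        # base path inside the repo (assumption)
  secretRef:
    name: kratix-git-secret # the Secret created in the previous step
    namespace: default
  gitAuthor:
    name: kratix            # shows up as the commit author
EOF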


NOTE: The manifest above uses SSH for auth, and its secretRef points to the credentials you created in the previous step. Finally, the author is set to kratix, which will show up in your Git commit history whenever Kratix pushes changes.


Kratix also supports other methods of authentication; see the Kratix documentation for more information.


Install GitOps agent

To sync resources from the StateStore you just configured, you will need a GitOps agent. For this demo, you will be using Flux. Head back to your terminal and run the following command while targeting your platform cluster. 
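A sketch that installs the Flux controllers from their release manifest:

kubectl apply --context $PLATFORM -f https://github.com/fluxcd/flux2/releases/latest/download/install.yaml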



Verify that all resources are up by running:
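One way to check:

kubectl get pods --context $PLATFORM -n flux-system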


The Flux controllers in the flux-system namespace should all show a Running status.


Step 3: Register Destinations

This section will focus on connecting additional Kubernetes clusters to Kratix by registering them as Destinations on the platform cluster. If you only need one additional Kubernetes cluster, feel free to skip the repeated steps.


Register infra-eu
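A sketch of the Destination, assuming the spec fields from the Kratix API at the time of writing:

cat <<EOF | kubectl apply --context $PLATFORM -f -
apiVersion: platform.kratix.io/v1alpha1
kind: Destination
metadata:
  name: infra-eu
  labels:
    zone: eu                # used to schedule workloads by region
spec:
  path: infra-eu            # sub-path in the repo that Flux will sync for this cluster
  stateStoreRef:
    name: default           # the GitStateStore created earlier
    kind: GitStateStore
EOF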


Register infra-fra 
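The Frankfurt cluster follows the same sketch, with its own name and path:

cat <<EOF | kubectl apply --context $PLATFORM -f -
apiVersion: platform.kratix.io/v1alpha1
kind: Destination
metadata:
  name: infra-fra
  labels:
    zone: eu
spec:
  path: infra-fra
  stateStoreRef:
    name: default
    kind: GitStateStore
EOF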


NOTE:  Observe the use of labels here, specifically zone: eu. Labels are used to distinguish which region each cluster is in, which becomes important when deploying workloads to specific regions. For example, you might have zone: eu for European clusters, zone: us for US clusters, and so on.


The path field (infra-eu) determines where in your Git repository Kratix will write manifests for this destination. Flux will sync from this specific path to deploy workloads to the corresponding cluster.


In stateStoreRef, you're telling Kratix to use the default GitStateStore you registered earlier. This is how Kratix knows which Git repository to push manifests to for this destination.


Configure Destinations 

With both clusters registered as destinations, Kratix is aware that it needs to write to those destinations when it receives a resource request; however, the clusters are not ready to process requests.  


To have destinations process requests, you need to install Flux on each worker cluster and configure it to sync from your Git repository.


First, export your worker cluster contexts:
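A sketch, assuming the variable names EU and FRA (reused in the commands that follow):

export EU="<your-infra-eu-cluster-context>"
export FRA="<your-infra-fra-cluster-context>"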


Install Flux on worker clusters

For infra-eu:
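A sketch using the same Flux release manifest as before:

kubectl apply --context $EU -f https://github.com/fluxcd/flux2/releases/latest/download/install.yaml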


For infra-fra:
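The same manifest, applied against the Frankfurt context:

kubectl apply --context $FRA -f https://github.com/fluxcd/flux2/releases/latest/download/install.yaml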


Configure Flux to sync from your Git repository

For Flux to pull manifests that Kratix generates, you need to configure a GitRepository resource and provide Flux with credentials to access the repo.


First, export your SSH private key and known hosts as plain text (not base64 this time, as Flux expects raw values):
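A sketch, with the variable names again being assumptions:

export SSH_PRIVATE_KEY_RAW=$(cat kratix-deploy)
export KNOWN_HOSTS_RAW=$(ssh-keyscan github.com)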


For infra-eu:
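A sketch that creates the Flux auth Secret and a GitRepository pointing at your repo; the names kratix-repo-auth and kratix-repo are assumptions, while the identity/known_hosts keys and the ssh:// URL form are what Flux expects for SSH sources:

# Secret holding the SSH credentials Flux will use to clone the repo
kubectl create secret generic kratix-repo-auth \
  --context $EU --namespace flux-system \
  --from-literal=identity="$SSH_PRIVATE_KEY_RAW" \
  --from-literal=known_hosts="$KNOWN_HOSTS_RAW"

# GitRepository source pointing at the Kratix state repository
cat <<EOF | kubectl apply --context $EU -f -
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: kratix-repo
  namespace: flux-system
spec:
  interval: 30s
  url: ssh://git@github.com/<your-username>/<your-repo>.git
  ref:
    branch: main
  secretRef:
    name: kratix-repo-auth
EOF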

 

For infra-fra:
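The same sketch applies to the Frankfurt cluster; only the kube context changes:

kubectl create secret generic kratix-repo-auth \
  --context $FRA --namespace flux-system \
  --from-literal=identity="$SSH_PRIVATE_KEY_RAW" \
  --from-literal=known_hosts="$KNOWN_HOSTS_RAW"
# ...then apply the same GitRepository manifest, this time with --context $FRA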


Configure Flux Kustomizations

Next, you need to tell Flux which paths in the Git repository to sync. Kratix writes manifests to predefined paths: ./destinations/<destination-name>/dependencies and ./destinations/<destination-name>/resources. You should follow this format for each destination.


For infra-eu:
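A sketch of the two Kustomizations, using the paths described above and the names referenced in the note below:

cat <<EOF | kubectl apply --context $EU -f -
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: kratix-workload-dependencies
  namespace: flux-system
spec:
  interval: 1m
  prune: true
  sourceRef:
    kind: GitRepository
    name: kratix-repo
  path: ./destinations/infra-eu/dependencies
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: kratix-workload-resources
  namespace: flux-system
spec:
  interval: 1m
  prune: true
  dependsOn:
    - name: kratix-workload-dependencies
  sourceRef:
    kind: GitRepository
    name: kratix-repo
  path: ./destinations/infra-eu/resources
EOF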


For infra-fra:
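The Frankfurt cluster gets the same pair of Kustomizations, applied with --context $FRA and with the paths ./destinations/infra-fra/dependencies and ./destinations/infra-fra/resources.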


NOTE: kratix-workload-resources depends on kratix-workload-dependencies, ensuring dependencies are deployed before the actual workloads.


Verify the setup

Check that Flux is syncing correctly on each worker cluster:
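One way to check, using the flux CLI (kubectl get kustomizations -n flux-system works too):

flux get kustomizations --context $EU
flux get kustomizations --context $FRA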


Both Kustomizations should report Ready as True.


You should also be able to verify that both clusters are available as destinations on your platform cluster:
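One way to check:

kubectl get destinations --context $PLATFORM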


Both infra-eu and infra-fra should be listed as destinations.


Step 4: Working with a fleet 

Ignoring the setup process for a moment, this is the typical starting point for many platform teams. You have some existing infrastructure set up, and you would like to roll out a service to a few teams running on different clusters. For this demo, that is the infra-eu and infra-fra teams. 


Additionally, you'd like to make updates to these services without manually orchestrating the deployments or waiting for a restart for changes to take effect. Let's take a look at how to perform both operations using Kratix.


Install a Promise

Promises are how you expose capabilities to developers on your platform. For this demo, we'll install a RabbitMQ Promise, which developers use for message passing between their services.


While targeting the platform cluster, run:
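A sketch that installs the Promise straight from the Kratix marketplace repository; the URL follows the pattern used in the Kratix docs at the time of writing:

kubectl apply --context $PLATFORM -f https://raw.githubusercontent.com/syntasso/kratix-marketplace/main/rabbitmq/promise.yaml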


This installs the RabbitMQ Promise on your platform cluster, which creates a new custom API that developers can use to request RabbitMQ instances.


Make a resource request

Now that the Promise is installed, you can roll out RabbitMQ to both infra-eu and infra-fra by creating a resource request:
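A minimal sketch of a request; the apiVersion and available spec fields come from the Promise itself, so check the installed CRD (for example with kubectl explain rabbitmq) before relying on these exact values:

cat <<EOF | kubectl apply --context $PLATFORM -f -
apiVersion: marketplace.kratix.io/v1alpha1
kind: rabbitmq
metadata:
  name: example        # the generated pods will be named example-server-*
  namespace: default
spec: {}
EOF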


This single command triggers Kratix to provision RabbitMQ across your entire fleet.


Verify the deployment

After a few moments, you should see RabbitMQ running on your worker clusters. Check either worker cluster:
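One way to check (the namespace the Promise deploys into depends on the Promise, so search across all namespaces):

kubectl get pods --context $EU -A | grep -i rabbitmq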


The RabbitMQ pods should appear and, after a short while, reach a Running state.


Updating a fleet

Now that developers are using RabbitMQ, the compliance team has come back and said you should have HA enabled across your message queues. Traditionally, you would have to update all deployments individually and restart them for the changes to take effect. With Kratix, you only need to make one change on your platform cluster.


With the modified Promise, the RabbitMQ operator will now provision three server replicas instead of one. Let's see that in action.


While targeting the platform cluster, apply the updated Promise:
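Assuming you have saved your edited copy of the Promise locally (the filename below is hypothetical), applying it looks like:

# rabbitmq-promise-ha.yaml is a local copy of the marketplace Promise,
# edited so the RabbitMQ cluster it generates runs 3 replicas instead of 1
kubectl apply --context $PLATFORM -f rabbitmq-promise-ha.yaml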


After a few minutes, you should notice the changes being applied to both clusters. Verify by running:
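For example, against infra-eu:

kubectl get pods --context $EU -A | grep example-server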


The output should now list three RabbitMQ server pods.



Notice that you now have three RabbitMQ instances running (example-server-0, example-server-1, example-server-2). The same change was automatically propagated to infra-fra as well.


So, one update on the platform cluster equals fleet-wide consistency, with no manual orchestration required.


Closing thoughts 

Platform engineering is freeing many developers by giving them back autonomy and providing guardrails that eliminate hours of back-and-forth over tickets.


However, many platform teams find themselves scrambling to orchestrate the processes they have built; rolling out a change to a single service across a fleet is a challenge that every team might otherwise build a custom solution for.


With Kratix, fleet management is built in from day one, meaning that after you build the platform, you have the tools to manage its lifecycle end to end.


