Kratix Architecture: Antifragile by Design

Abby Bangser
Sep 11

Most internal platforms don’t fail on day one: they fail on day 2, 200, or 2000. What starts as a solid solution to today’s needs often turns into tomorrow’s bottleneck: brittle delivery processes, endless ticket queues, and systems so complex they collapse under their own weight.

At Syntasso, we built Kratix to flip this pattern. After years of running platforms across various industries, we kept seeing these same failures, compounded by slow-moving teams and unclear responsibilities.

Unlike platforms that deteriorate and become riskier over time, Kratix is designed to strengthen with every change, making it an antifragile platform that thrives on evolution.

How We Have Stood on the Shoulders of Platform Giants

The history of distributed system orchestration shows a clear path towards more powerful and successful platform design.

Google’s Borg

Google’s Borg proved that a single platform could orchestrate thousands of workloads across tens of thousands of machines. It brought together declarative job specifications, workload isolation, self-healing, and automated rolling upgrades, which let Google evolve infrastructure and services continuously without disruption.

Borg abstracted physical machines into a single logical cluster, enabling massive multi-tenancy with strong isolation and reliable fleet management.

Cloud Foundry and Kubernetes

This approach influenced a generation of systems. In very close succession, Cloud Foundry and Kubernetes both extended and open-sourced these ideas. Cloud Foundry focused on providing a streamlined Platform-as-a-Service (PaaS) for enterprises, while Kubernetes became the vendor-neutral solution for compute scheduling.

Each also extended beyond scheduling, bringing Borg-like operational discipline (rolling upgrades, health monitoring, and resource orchestration) to the broader distributed systems world: Cloud Foundry through its BOSH tool, and Kubernetes through the wider CNCF landscape.

Kratix

Kratix continues this lineage. Where Borg and BOSH orchestrated workloads and infrastructure, Kratix gives platform teams a framework to deliver self-service capabilities as Promises that provision infrastructure, apply compliance, or configure services automatically.

Just as Borg abstracted fleets of machines into a logical cluster, Kratix creates consistent APIs in front of any organisational capability, which allows the instances of that capability to be managed as a fleet. While this can include low-level infrastructure such as servers, it is even more powerful when applied to higher-level capabilities such as test environments or complex distributed services. Kratix thereby applies lessons from decades of orchestration to today’s heterogeneous enterprise platforms.
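
As a rough illustration of what "a consistent API in front of a capability" can mean, the sketch below models a capability as a small declared schema plus a record of the instances created through it. This is not Kratix's API: `Capability`, `request`, `fleet`, and the field names are hypothetical, chosen only to show how requests that pass through one interface can later be treated as a fleet.

```python
from dataclasses import dataclass, field


@dataclass
class Capability:
    """Hypothetical model of one capability behind a consistent API."""

    name: str
    required_fields: dict[str, type]  # declared request schema
    instances: dict[str, dict] = field(default_factory=dict)

    def request(self, instance_name: str, spec: dict) -> dict:
        """Validate a declarative request and record the instance."""
        for key, expected in self.required_fields.items():
            if key not in spec:
                raise ValueError(f"missing field: {key}")
            if not isinstance(spec[key], expected):
                raise TypeError(f"{key} must be {expected.__name__}")
        self.instances[instance_name] = dict(spec)
        return self.instances[instance_name]

    def fleet(self) -> list[str]:
        """Every instance created through the API, sorted by name."""
        return sorted(self.instances)


# Illustrative capability and requests (names are made up).
test_env = Capability("test-environment", {"team": str, "replicas": int})
test_env.request("checkout-e2e", {"team": "payments", "replicas": 2})
test_env.request("search-e2e", {"team": "search", "replicas": 1})
```

Because every instance enters through the same `request` call, `test_env.fleet()` can enumerate all of them, which is the property that makes fleet-wide operations possible in this toy model.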

The Impact of Fragility: How to Avoid Platform Decay

Orchestration history indicates that scale alone is insufficient. Systems must evolve safely and continuously. Platforms become fragile when they centralise control into bottlenecks, couple components so tightly that failure cascades, or design for stability without planning for change. Neglecting “day 2” operations only accelerates decay.

Antifragile platforms take the opposite approach: failure is feedback, change is expected, and long-term value comes from constant evolution. Borg and BOSH demonstrated that automated upgrades, policy-driven scheduling, and separation of concerns make large systems more reliable over time. Kratix applies the same philosophy to the broader organisational platform, embedding these patterns so they work at the scale of the enterprise, not just the cluster.

Antifragility in Action with Kratix

Platforms are a technical solution to a business problem. Borg unified resource abstractions and scheduling to make Google’s computing resources more productive. Kratix’s architecture applies the same principles one level up, to platforms that support business processes such as developer workflows, data science analysis, and testing exercises.

Compliant Capabilities: Leveraging Multi-player Mode

DevOps decentralised operational knowledge to speed up delivery, but asking every team to support a large stack slowed many organisations over time. Concerns such as compliance, security, and performance still benefit from centralisation.

Kratix enables this “multi-player mode” by giving each stakeholder what they need. Platform engineers set guardrails for capability delivery, Ops and SRE teams manage infrastructure reliability, developers request services via declarative APIs, and security teams embed compliance into automated, auditable pipelines. Clear boundaries prevent collisions and allow each group to focus on its role, keeping teams operating independently yet coherently.
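
One way to picture this division of labour is as an ordered pipeline in which each stakeholder owns one stage. The sketch below is a simplified model under that assumption, not Kratix's implementation: `platform_defaults`, `security_policy`, `run_pipeline`, and the request fields are all hypothetical.

```python
from typing import Callable

Request = dict
Stage = Callable[[Request], Request]


def platform_defaults(req: Request) -> Request:
    """Platform engineers: fill in guardrail defaults the requester omitted."""
    return {"size": "small", **req}


def security_policy(req: Request) -> Request:
    """Security team: reject non-compliant requests and record the check."""
    if req.get("public", False):
        raise PermissionError("public exposure is not permitted")
    return {**req, "audit": req.get("audit", []) + ["security_policy"]}


def run_pipeline(req: Request, stages: list[Stage]) -> Request:
    """Apply each stage in order; every stage returns a new request."""
    for stage in stages:
        req = stage(req)
    return req


# A developer supplies only what they care about; the stages add the rest.
result = run_pipeline({"team": "payments"}, [platform_defaults, security_policy])
```

Each stage is an independent function, so in this model the security team could change `security_policy` without touching the platform team's defaults, and the `audit` list leaves a trace of which checks ran.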

Efficient Operations: Manage Fleets, Not Instances

Day one is easy; day 2000 is the test. Platform value extends beyond scheduling jobs to include managing upgrades, rescheduling workloads, and integrating new hardware across a fleet without downtime.

Kratix brings this operational discipline to enterprise platforms. Promises and pipelines are versioned independently, rollout rules remove drift, and resilience features like retries and failure isolation come as standard. Platform-wide observability, lifecycle management, and recovery tooling ensure that as capabilities evolve, they do so predictably and without service disruption, just as Borg did for infrastructure and BOSH did for applications.
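
To make "manage fleets, not instances" concrete, here is a minimal sketch of a rollout loop that converges every instance on a target version, retries transient failures, and isolates a failing instance so it cannot block the rest. It is an illustration of the general pattern only; `roll_out`, `flaky_upgrade`, and the instance names are hypothetical and do not describe how Kratix schedules work.

```python
from typing import Callable


def roll_out(
    fleet: dict[str, str],
    target_version: str,
    upgrade: Callable[[str, str], None],
    max_attempts: int = 3,
) -> dict[str, str]:
    """Bring every instance to target_version; return failures by instance."""
    failures: dict[str, str] = {}
    for name, version in list(fleet.items()):
        if version == target_version:
            continue  # already converged, nothing to do
        for attempt in range(1, max_attempts + 1):
            try:
                upgrade(name, target_version)
                fleet[name] = target_version
                break
            except Exception as err:  # isolate: one failure must not stop the rest
                if attempt == max_attempts:
                    failures[name] = str(err)
    return failures


attempts: dict[str, int] = {}


def flaky_upgrade(name: str, version: str) -> None:
    """Stand-in upgrader: 'db-b' fails once, 'db-c' always fails."""
    attempts[name] = attempts.get(name, 0) + 1
    if name == "db-c" or (name == "db-b" and attempts[name] == 1):
        raise RuntimeError(f"{name}: upgrade failed")


fleet = {"db-a": "1.0", "db-b": "1.0", "db-c": "1.0"}
failures = roll_out(fleet, "1.1", flaky_upgrade)
```

After this run, `db-a` and `db-b` are on the new version (the latter after one retry), while `db-c` stays on its old version and is reported in `failures` rather than halting the rollout.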

Unified Experience: Bridging Legacy, Kubernetes, and Beyond

Every organisation is brownfield. Kubernetes solved part of the orchestration puzzle, but its power is in being a substrate rather than a complete platform. Kratix utilises Kubernetes’ mature APIs for scheduling and resource management, while extending the model to handle VMs, mainframes, serverless, and emerging technologies.

By providing consistent lifecycle and compliance management across diverse workloads, Kratix lets organisations focus on change management, ownership boundaries, scale, and recovery without increasing complexity.

Scalable Teams: Enabling Flow of Value

Scaling a platform means scaling people. Kratix separates platform logic from workload ownership and infrastructure management, allowing teams to operate independently with shared governance.

Platform engineers define frameworks and guardrails; Ops and SREs ensure operational reliability. Developers consume APIs without ticket queues, and security codifies controls into reusable pipelines.

By formalising these interactions as versioned contracts, Kratix reduces friction, maintains high visibility, and enables teams to add or change capabilities without waiting for others. This ensures that value flows both into the platform (as capabilities improve) and through the platform (as those capabilities are used).
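
A versioned contract is only useful if both sides can tell when a change is safe. The sketch below assumes a semantic-versioning-style rule (same major version, provider minor version at least the consumer's), which is a common convention and not something stated about Kratix; `parse` and `is_compatible` are hypothetical helpers.

```python
def parse(version: str) -> tuple[int, int]:
    """Split a 'MAJOR.MINOR' string into integers."""
    major, minor = version.split(".")
    return int(major), int(minor)


def is_compatible(consumer: str, provider: str) -> bool:
    """True if a consumer built against `consumer` can use `provider`."""
    c_major, c_minor = parse(consumer)
    p_major, p_minor = parse(provider)
    # Assumed rule: minor releases only add to the contract, major ones may break it.
    return c_major == p_major and p_minor >= c_minor
```

Under this assumed rule, `is_compatible("1.2", "1.5")` holds, so a provider team can ship additive changes without coordinating with consumers, while `is_compatible("1.2", "2.0")` does not, signalling that a breaking change needs a migration.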

Product Experience Shaped by Implementation Principles

While distributed orchestration may be the vision for the platform engineering of the future, achieving it requires the support of lower-level principles.

The key principles that all great platforms need to maintain also apply to the tools behind the platform. Because Kratix itself is built on these principles, platforms built on top of Kratix can follow them with far less effort.

Minimise waste – Only differentiated code builds value. Own interfaces and, where possible, implement with proven and emerging tools.

Design with modularity and loose coupling – Increasing scope is inevitable. Systems need to support independent yet cohesive extension with clear contracts and integration points.

Speed of learning – Design for small, testable changes, fast feedback, and continuous delivery, keeping developer and operator experience central. Validation needs to be applied pragmatically at all levels of abstraction.

Be resilient by default – Failure is expected. Workloads are isolated, automated recovery is enabled, mutable state is minimised, and any necessary state is managed externally.

How to Know if Your Platform Is Antifragile

When evaluating a platform, the orchestration lessons that run from Borg to Kratix still hold: it should fail safely, evolve without disruption, scale across teams, and be extensible without rewrites. A platform that applies these lessons can grow easily through intuitive onboarding while treating day 2 operations as first-class.

Whether you’re reviewing your current solution or thinking of building or buying a new platform option, you need to ask yourself:

Can it fail safely, or does failure cascade?
Can multiple teams deliver changes independently?
Is the platform evolvable under real-world usage?
Can you extend it without core rewrites?
Are day 2 operations first-class?
Can you onboard new users without retraining everyone?

If the answer to any of these is “no,” you may have a robust system, but not an antifragile one.

Building Platforms That Thrive

Borg showed what was possible at Google scale. Kubernetes, Cloud Foundry, and BOSH brought those patterns to the wider world. Kratix is the next step, taking the automation, separation of concerns, and lifecycle management of its predecessors and applying them to modern, multi-cloud, multi-tenant enterprise platforms.

We built Kratix to make platforms antifragile, so that they become faster, safer, and more valuable with every use, regardless of how much they grow or change.
