
Slow performance on M1 Macbooks leads to supporting multi-architecture Docker builds

Syntasso’s engineering team has recently been excited to test drive the new M1 Macbooks. It has mainly been a positive experience, but there have been a few surprises!


In particular, we spent a few days debugging our software running on Docker Desktop, which uncovered an obvious (in hindsight) issue with how Docker Desktop runs containers built for an architecture that does not match the host architecture.


This article will describe our experience identifying, investigating, understanding, and finally fixing the runtime errors we saw. We hope it helps those who have experience with Docker containers, but may not have worked with ARM architectures before.


Already know you are seeing cross-architecture problems? Jump to the “Fixing it” section for specific solutions.


Putting our M1s to work


Initial local setup

An early task on the new computers was to run through our Workshop guides which provide hands-on labs for Syntasso’s Kratix. Kratix is a framework for platform teams to curate an API-driven, bespoke internal platform through a set of Kubernetes CRDs actioned by a custom controller, across one or more Kubernetes clusters.


The guides recommend KinD which allows us to create a local multi-cluster setup since each cluster is isolated to a single Docker container. Lucky for us, both Docker Desktop and KinD support M1 so we were able to run the commands as previously written.


At first, everything worked. We could bring up our base architecture including a Kratix deployment, a document repository (in our case MinIO) and a GitOps toolkit (in our case Flux) which can reconcile across multiple clusters any documents written by Kratix into the document store.



Uncovering strange behaviours

As the team continued using Kratix to install different self-serve services, they uncovered certain problems.


One engineer started identifying leader election issues with our Kubernetes controller:


1 leaderelection.go:367] Failed to update lock: context deadline exceeded
1 leaderelection.go:283] failed to renew lease kratix-platform-system/2743c979.kratix.io: timed out waiting for the condition
DEBUG   events  Normal  {"object": {"kind":"ConfigMap","name":"2743c979.kratix.io"}, "reason": "LeaderElection", "message": "kratix-platform-controller-manager-dc4338e66be8 stopped leading"}
DEBUG   events  Normal  {"object": {"kind":"Lease","name":"2743c979.kratix.io"}, "reason": "LeaderElection", "message": "kratix-platform-controller-manager-dc4338e66be8 stopped leading"}
INFO    Stopping and waiting for non leader election runnables
INFO    Stopping and waiting for leader election runnables
ERROR   setup   problem running manager {"error": "leader election lost"}

Another engineer reported regular restarts of the Jenkins Operator:

$ kubectl get pods
NAME                                READY   STATUS    RESTARTS   AGE
jenkins-operator-778d6fc487-twk7d   1/1     Running   5          15m

And finally, we saw specific Jenkins instances struggle to start. They would be terminated with a readiness and/or liveness probe error multiple times before finally becoming healthy almost 20 minutes later.


Curiously, we didn’t see these behaviours across all M1 machines, nor did we see them consistently on any single machine. Despite this, the behaviour was troubling and we decided to mob on tracking down the cause.


Investigations


Like any good investigation, ours included many dead ends, rabbit holes and turning points. It began with us trying to turn a collection of non-deterministic issues into a set of clear and repeatable steps to reproduce.


Where to start

We continued to see a healthy initial setup, but got mixed results from both Kratix and Jenkins after installing Jenkins, while other services remained healthy. This was our first crossroads: we could try to debug the unexpected Jenkins restarts, or we could continue to investigate Kratix.


We knew these Jenkins Operator restarts were odd, and Jenkins should start up the same way every time when deployed in a declarative way, but we also knew that investigating Jenkins could involve investigating MinIO, Flux, Kubernetes networking, the Jenkins Operator, Jenkins runtime dependencies and more.


While we acknowledged that this was not “normal” for Jenkins, we had far more confidence that we could isolate and track down the behaviour in our own software, so we decided to focus on the leader election issues first.


Narrowing in on performance

With the decision to focus on Kratix leader election issues, we again installed Jenkins on different M1 computers and saw the errors inconsistently.


On further analysis of the log lines, a particular line caught our attention:

1 leaderelection.go:367] Failed to update lock: context deadline exceeded

This specific line points at library code rather than Kratix code, so there was no simple answer. However, in thinking about a deadline being exceeded, we started sharing ideas on how a timeout could have knock-on effects throughout the whole system. Adding to the theory that performance could be our issue, the failing Jenkins instance was also being terminated by a type of timeout.


Generally, we can assume that timeouts are more likely to happen on systems with heavy load. So why the mixed results with the different machines that should be the same spec?


This was an Aha! moment. Because of our shared debugging, we were able to identify a nuance in how we were installing Kratix and the Jenkins Promise. Sometimes we were installing these as one-off commands (e.g. `kubectl apply --filename samples/jenkins/jenkins-promise.yaml`) whereas other times we were chaining commands to install multiple promises at the same time with ease (e.g. `kubectl apply --filename samples/jenkins/jenkins-promise.yaml; kubectl apply --filename samples/postgres/postgres-promise.yaml`).

All of a sudden we had a much higher rate of repeatability. While still not 100%, we saw failures well over 80% of the time when running multiple installations at the same time, despite the fact that these should not be high-cost activities.


Throwing resources at the problem

So why was Kratix working so hard when it was not being asked to do much? Surely a powerful laptop could handle a handful of parallel requests. This is when we compared the install timing on an Intel Mac with that on an M1. Where the logs during the Kratix install were printed incredibly slowly on the M1, the speed of logging was supersonic on the Intel. This again pointed at compute performance on the M1 Macbook as the issue. This led us to a few options:


  1. Investigate controller-runtime more deeply to understand leader election performance

  2. Turn off clustering within our controller (which was actually already off!) and also turn off leader elections

  3. Forget about performance tuning, just throw all of the Docker resources at it!


None of these options sounded great as they all seemed to tackle the M1 symptoms of slow running software rather than identifying the underlying cause. But we were finding a number of murmurings about performance for Docker Desktop on M1 so we decided to focus on that.


Identifying a solution

Expecting 32GB of RAM and 8 CPU cores for a local Docker Desktop setup is not viable for most people, so we needed to do something else to make local testing more performant on an M1.


Our first attempt was to configure Docker Desktop experimental features, but while this was recommended by a lot of blogs, it actually made our performance worse!


Then we decided to investigate moving away from Docker Desktop. This included a look at Podman, Rancher Desktop, and Lima-vm as solutions.


Specifically, when we trialled Podman, we hit a very early wall: Kratix could not be installed because its images were built for AMD64, and Podman, by default, only allows the host architecture to be used.


We saw the value in having a fully functioning Podman example to compare with Docker Desktop, so we built Kratix as an ARM64 image and saw a significant performance improvement on Podman. Unfortunately, from a commercial standpoint, switching to Podman felt untenable since most of our users are running Docker Desktop by default.


But wait! Rewind to how we got Podman working: we rebuilt the Kratix images for the host architecture. What if we did this and deployed those images on Docker Desktop?


This was the breakthrough: we realised that the container engine wasn’t the issue. The issue was Docker Desktop running images built for a different architecture through an emulation layer, whereas Podman by default disallows this behaviour with clear log messaging.


This gave us a clear directive: we needed to match the architecture of the Docker images to the architecture of the host computer running them. To develop Kratix consistently across everyone’s machines, we would need to:


  1. Add multi-architecture support for Kratix’s images

  2. Audit images we use but do not maintain

  3. Find ways to fail fast next time
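
Once you know to look for it, the mismatch is easy to spot. As a minimal sketch (the function names are hypothetical helpers for illustration, not part of Kratix), the following translates the host's `uname -m` value into Docker's architecture naming and compares it with what `docker image inspect` reports for an image that has already been pulled:

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of Kratix.

# Map a `uname -m` value to the architecture name Docker uses.
docker_arch() {
  case "$1" in
    x86_64|amd64)  echo "amd64" ;;
    arm64|aarch64) echo "arm64" ;;
    *)             echo "$1" ;;
  esac
}

# Print the architecture of an image that is already present locally.
image_arch() {
  docker image inspect --format '{{.Architecture}}' "$1"
}

# Return 0 when the image matches the host, 1 on a mismatch,
# and 2 when the image could not be inspected.
check_image_arch() {
  host="$(docker_arch "$(uname -m)")"
  img="$(image_arch "$1")" || return 2
  if [ "$host" != "$img" ]; then
    echo "WARNING: $1 is $img but the host is $host (it will run under emulation)"
    return 1
  fi
  echo "OK: $1 matches the host architecture ($host)"
}
```

For example, `check_image_arch syntasso/kratix-platform:main` would warn on an M1 if the local copy of the image is the amd64 build. Note that a terminal running under Rosetta may report `x86_64` from `uname -m`, so the host side of the comparison is only as reliable as the shell it runs in.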


Fixing it


Multi-architecture support for Kratix’s images

As part of Kratix, we build and maintain several Docker images. Some are for the Kratix deployment itself, some are included as examples in our samples directory. Step one is adding multi-architecture support for the images we own.


In order to do that, we needed two things:

  1. Use a base image that is compatible with both AMD and ARM architectures

  2. Build and push images for both AMD and ARM architectures


Ensuring support on the base image

When the base image in the Dockerfile does not provide multi-architecture support, it’s impossible for the derived image to do so. That meant we needed to scan the base images in our Dockerfiles and ensure we were selecting images built for both architectures.


You can check for compatibility on Docker Hub by checking the Digests for a given tag to see if more than one OS/Arch is listed. In our case, one of our images used lyft/kustomizer as a base. Its listing showed support for AMD64 only.


After some research, we found line/kubectl-kustomize as a replacement.
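
The same check can be scripted. As a rough sketch, assuming the Docker CLI's `docker manifest inspect` command is available, the helper below (`list_architectures` is a hypothetical name) pulls the architecture names out of the manifest JSON using standard text tools rather than a JSON parser, so treat it as a convenience and not a robust parser:

```shell
#!/bin/sh
# Hypothetical helper for illustration.
# Reads manifest JSON on stdin and prints each distinct architecture once.
list_architectures() {
  grep -o '"architecture"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | sed 's/.*"\([^"]*\)"$/\1/' \
    | sort -u \
    | grep -v '^unknown$'   # skip non-image entries such as build attestations
}
```

Usage would look like `docker manifest inspect "$image" | list_architectures`, where `$image` is the base image reference you want to check. An image published for a single architecture may have no manifest list at all, in which case the helper prints nothing.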


This allowed us to build either an AMD64 or an ARM64 final image. By default, Docker builds the image for the architecture of the machine running the build. That means if an M1 Macbook built our Dockerfile, we would end up with an ARM64 final image being pushed to Docker Hub. However, if an Intel Macbook built that same Dockerfile, we would end up with an AMD64 image.


We still needed to build our images on both architectures.


Building for multiple architectures

There are two options here:

  1. Build an image for each desirable architecture and then generate a unifying manifest before pushing (docs)

  2. Create a multi-architecture friendly custom BuildKit using the buildx command (docs)
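
For comparison, here is a sketch of what the first option involves, using the standard `docker build`, `docker push` and `docker manifest` commands. The function name and the per-architecture tag suffixes are our own illustrative convention, and older Docker versions may need experimental CLI features enabled for `docker manifest`:

```shell
#!/bin/sh
# Sketch of option 1: one image per architecture, joined by a manifest list.
# The "-amd64"/"-arm64" tag suffixes are an illustrative convention only.
build_multiarch_manually() {
  image="$1"   # e.g. syntasso/kratix-platform:main
  refs=""
  for arch in amd64 arm64; do
    docker build --platform "linux/$arch" --tag "$image-$arch" . || return 1
    docker push "$image-$arch" || return 1
    refs="$refs $image-$arch"
  done
  # $refs is intentionally unquoted so each reference is a separate argument.
  docker manifest create "$image" $refs || return 1
  docker manifest push "$image"
}
```

That is five Docker invocations and some tag bookkeeping per image, which is the work the second option folds into a single command.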


We decided to make use of the Docker buildx CLI plugin. The `buildx` command packages the build, manifest and push commands together into a user-friendly interface.


To get started with buildx, we need to create a custom BuildKit builder. This is as simple as:

docker buildx create --name kratix-image-builder

Once created, buildx is mostly compatible with previous docker commands. For example, these docker commands:

docker build --tag syntasso/kratix-platform:main .
docker push syntasso/kratix-platform:main

become this single docker buildx command:

docker buildx build --builder kratix-image-builder --push --platform linux/arm64,linux/amd64 --tag syntasso/kratix-platform:main .

You can see the full buildx docs here, and our code change can be seen here.


Optimising multi-architecture builds

While using buildx was enough to build and push both arm64 and amd64 images, compiling Kratix in an amd64 image on M1 Macs was very slow. This was for the same reason that Kratix itself was slow on M1s: emulating a different architecture is a lot of work.


Thankfully, Kratix is written in Go, and we can leverage the cross-compiling capabilities of the language to compile an amd64 binary on an arm64 host. That, combined with Docker’s multi-stage builds and the automatic platform ARGs in the global scope when using BuildKit, allowed us to compile fast while shipping images for the right architecture. The resulting Dockerfile can be seen here.
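
As a rough illustration of the pattern (this is not the actual Kratix Dockerfile), the snippet below writes out a minimal multi-stage Dockerfile that compiles on the builder's native platform and cross-compiles for the target. The function name, base image tags, paths and binary name are all assumptions made for the sketch; `BUILDPLATFORM`, `TARGETOS` and `TARGETARCH` are the ARGs BuildKit provides automatically:

```shell
#!/bin/sh
# Writes an illustrative multi-stage Dockerfile to the path given as $1.
# Base images, paths and the binary name are assumptions, not Kratix's own.
write_cross_compile_dockerfile() {
  cat > "$1" <<'EOF'
# The build stage always runs on the builder's native platform (no emulation).
FROM --platform=$BUILDPLATFORM golang:1.19 AS builder
# BuildKit sets these automatically; they must be re-declared inside the stage.
ARG TARGETOS
ARG TARGETARCH
WORKDIR /workspace
COPY . .
# Go cross-compiles for whichever platform is currently being built.
RUN CGO_ENABLED=0 GOOS=$TARGETOS GOARCH=$TARGETARCH go build -o manager .

# The final stage uses the target platform's base image and only copies the binary.
FROM gcr.io/distroless/static:nonroot
COPY --from=builder /workspace/manager /manager
ENTRYPOINT ["/manager"]
EOF
}
```

Calling `write_cross_compile_dockerfile Dockerfile.sketch` and then passing `--file Dockerfile.sketch` to a `docker buildx build --platform linux/arm64,linux/amd64` invocation should run the Go compiler natively once per target, so only the final stage differs per architecture.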


Auditing dependencies

Next, we shifted our attention to our dependencies: the images we use in the Kratix sample promises. As seen in this article, our Jenkins example promise was broken. Once we knew the wrong architecture was the issue, our attention turned to the underlying images we were deploying with Jenkins. We found we were using the following images:

  • Jenkins-operator: virtuslab/jenkins-operator:v0.7.0

  • Jenkins: jenkins/jenkins:2.332.2-lts-alpine

As expected, neither image provided an arm64 build. After upgrading to versions with multi-arch builds, we had a much more stable Jenkins promise. We did a similar exercise with all images in our sample promises directory.


Failing fast

Finally, we discussed whether we should try to fail fast next time we use an image that does not provide the right underlying architecture. We considered a few options:

  • Use a container engine that does not attempt emulation for development (e.g. Podman instead of Docker)

  • Disable execution of different architectures (through qemu-user-static) on Docker

  • Write an automated test that scans architectures


In the end, we decided to do, well, nothing.


We are still running some images that are not compiled for arm64 but are also not causing any significant impact (most do something very simple). Replacing Docker (or globally disabling multi-architecture) would cause more pain than benefit and separate us further from our customer setups. For now, we are happy to take on the risk and keep an eye open for any performance degradation we may encounter; we know where we should look first.


Conclusion


That was our long journey to get Kratix and Jenkins actually working on M1 Macs. As you can see, we found ourselves in multiple dead ends until we identified the performance impact of running images built for a different architecture. We hope that this article gives you some baseline information on how (and why) to build images for more than one platform.


And if you’d like to learn more about our experience or about Kratix, please get in touch!



1 Comment


Mohamed Shahat
Jan 14, 2023

Hello, thanks for the blog and for sharing your experience. I'm still working on some workflows moving from x86 to arm64 ... What risks do you see introduced by working locally on arm64 stacks while production deployments are largely x86? Cheers, /Mo
