Speed up building Bottlerocket images in AWS CodeBuild

When I first moved the Bottlerocket AMI build from an EC2 host to AWS CodeBuild, I was hit by very slow builds. On EC2, I built both the x86 and Arm versions on x86 instances, and fresh builds finished in 5 minutes. However, on CodeBuild with more vCPUs and memory, the builds were painfully slow.

  1. The x86 CodeBuild project uses the "145 GB memory, 72 vCPUs" compute type. The build finishes in 18 minutes, with the build-variant step taking 97% of the time.
  2. The Arm CodeBuild project uses the "96 GB memory, 48 vCPUs" compute type. The build finishes in 71 minutes, with the build-variant step taking 99% of the time.

The Bottlerocket team recommends cross-building the Arm variant on x86. With that change, the Arm AMI build finishes in 20 minutes, close to the x86 build. But what makes building Bottlerocket in CodeBuild so slow?
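
For reference, cross-building only changes the target architecture passed to the build. A minimal sketch, assuming the BUILDSYS_ARCH variable described in Bottlerocket's BUILDING.md:

# Native x86_64 AMI build on an x86 builder
cargo make ami

# Cross-build the aarch64 AMI on the same x86 builder
cargo make -e BUILDSYS_ARCH=aarch64 ami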

Building Bottlerocket in CodeBuild: Docker in Docker

Bottlerocket builds RPM packages in parallel, with each package built in a Docker container. Once all RPM packages are built, it assembles them into one OS image. Learn more at Building Bottlerocket.

AWS CodeBuild sets up a Docker container and runs commands specified in the buildspec.

To build Bottlerocket AMIs in CodeBuild, I build a Docker image with the Docker engine installed. During the AMI build, each docker build runs as a container inside the CodeBuild container: effectively, a Docker-in-Docker setup.

version: 0.2
phases:
  install:
    commands:
      - nohup /usr/bin/dockerd --host=unix:///var/run/docker.sock --host=tcp://127.0.0.1:2375 &
      - timeout 15 sh -c "until docker info; do echo .; sleep 1; done"
  build:
    commands:
      - cargo make ami   # Build the Bottlerocket AMI
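
One prerequisite for this setup: starting dockerd inside the build container only works when the CodeBuild project has privileged mode enabled. A sketch of turning that on with the AWS CLI (the project name and image here are hypothetical):

# BUILD_GENERAL1_2XLARGE is the "145 GB memory, 72 vCPUs" compute type mentioned above
aws codebuild update-project \
  --name bottlerocket-build \
  --environment type=LINUX_CONTAINER,computeType=BUILD_GENERAL1_2XLARGE,image=bottlerocket-builder:latest,privilegedMode=true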

Suspicious Build Logs

There are two suspicious kinds of log lines: "no space left" and "connection reset". First, no space left on device:

#11 [rpmbuild 5/6] RUN --mount=source=.cargo,target=/home/builder/.cargo     --mount=type=cache,target=/home/builder/.cache,from=cache,source=/cache     --mount=source=sources,target=/home/builder/rpmbuild/BUILD/sources     --mount=target=/host     /host/build/tools/unplug       rpmbuild -bb --clean         --undefine _auto_set_build_flags         --define "_target_cpu ${ARCH}"         --define "dist .${BUILD_ID_TIMESTAMP}.${BUILD_ID//-dirty/}.br1"         rpmbuild/SPECS/${PACKAGE}.spec
#11 ERROR: failed to prepare kvval4pm8xbe9mmwxih4c4aha as n7oxk3onfvk4pp9ioc73dy3pd: no space left on device

Second, repeated connection issues with the Docker daemon:

2025-02-13 01:32:41 2025/02/13 01:32:39 http2: server: error reading preface from client @: read unix /var/run/docker.sock->@: read: connection reset by peer
2025-02-13 01:32:41 2025/02/13 01:32:39 http2: server: error reading preface from client @: read unix /var/run/docker.sock->@: read: connection reset by peer
2025-02-13 01:32:41 2025/02/13 01:32:40 http2: server: error reading preface from client @: read unix /var/run/docker.sock->@: read: connection reset by peer
2025-02-13 01:32:41 2025/02/13 01:32:40 http2: server: error reading preface from client @: read unix /var/run/docker.sock->@: read: connection reset by peer
...

These logs tell a story: the Docker daemon running inside the CodeBuild container is under heavy pressure. The build is filling up the disk, and connections to the Docker daemon are being reset. This is strange, because on an EC2 instance with fewer vCPUs and less memory, the build completes quickly.
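
A few standard commands are enough to confirm this from inside the CodeBuild container while the variant build is running; a quick sketch:

# How full is the filesystem backing /var/lib/docker?
df -h /var/lib/docker

# What does Docker think it is using for images, containers, and build cache?
docker system df

# Is /var/lib/docker on the container's overlayfs or on a real disk?
mount | grep /var/lib/docker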

"VOLUME /var/lib/docker"

What we're hitting is a classic Docker quirk. During Docker builds, /var/lib/docker is extremely write-heavy: Docker constantly writes layer tarballs, unpacked layers, metadata, image diffs, and temporary files there. If /var/lib/docker is itself on the container's overlayfs, every write can trigger overlay copy-ups, metadata duplication, CoW penalties, and inode churn. The Bottlerocket build makes this worse because each package build runs in its own inner container, and all of those containers and their layers land in the same /var/lib/docker inside the CodeBuild container's filesystem. This is why we see "no space left on device".

Therefore, a performance hack for Docker-in-Docker is to add VOLUME /var/lib/docker to the Dockerfile of the outer container image. In Docker, the VOLUME instruction creates a mount point with the specified path and marks it as holding externally mounted volumes from the native host or other containers. In our case, /var/lib/docker becomes a volume backed by the host filesystem, bypassing the overlayfs overhead.
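
Outside CodeBuild, where you control the docker run invocation yourself, you can get the same effect without rebuilding the image by mounting an anonymous volume over that path; a sketch (builder-image is a placeholder):

# -v with only a destination path creates an anonymous volume backed by the host filesystem
docker run --privileged -d -v /var/lib/docker --name build builder-image

In CodeBuild, however, you don't control the docker run flags for the build container, so declaring VOLUME in the image itself is the practical lever.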

Experiments

We can see the difference between running with and without VOLUME clearly with the following experiment.

Step 1. On an EC2 instance, build two images, build-without-volume and build-with-volume, from the following Dockerfile.

FROM amazonlinux:2023

RUN dnf install -y docker && dnf clean all

# Add VOLUME for build-with-volume
# VOLUME /var/lib/docker

CMD ["sleep", "infinity"]
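
Both images come from this one Dockerfile: build the first as-is, then uncomment the VOLUME line and build the second.

# VOLUME line left commented out
docker build -t build-without-volume .

# Uncomment "VOLUME /var/lib/docker", then build again
docker build -t build-with-volume .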

Step 2. Run without VOLUME

# On the EC2 instance, run the outer container
$ docker run --privileged -d --name build build-without-volume

# /var/lib/docker is an overlayfs belonging to the container
$ docker exec -it build bash -c "mount | grep /var/lib/docker"
overlay on /var/lib/docker type overlay (rw,relatime,lowerdir=/local/docker/overlay2/l/7BVV334IW6FJEJNKDUWVXCV4E4:/local/docker/overlay2/l/FYA4RDDNBV4K4IQWUWCPV3DEBO:/local/docker/overlay2/l/BJAKNKSPFWTGBP2K5TOQNXN2GY,upperdir=/local/docker/overlay2/c74bd4bca415647bdfc909de86cbbc51b808328a85ab20273c13fd9c34e3c8e5/diff,workdir=/local/docker/overlay2/c74bd4bca415647bdfc909de86cbbc51b808328a85ab20273c13fd9c34e3c8e5/work)

# There is no mount on the host
$ docker inspect build | jq '.[0].Mounts'
[]

Step 3. Run with VOLUME

# On the EC2 instance, remove the previous container and run the new one
$ docker rm -f build
$ docker run --privileged -d --name build build-with-volume

$ docker exec -it build bash -c "mount | grep /var/lib/docker"
/dev/nvme0n1p1 on /var/lib/docker type ext4 (rw,noatime)

$ docker inspect build | jq '.[0].Mounts'
[
  {
    "Type": "volume",
    "Name": "45d34f9a515111612bf17d27a31a9a03bd17556be769e7a6cf11ca94ad3d69fb",
    "Source": "/local/docker/volumes/45d34f9a515111612bf17d27a31a9a03bd17556be769e7a6cf11ca94ad3d69fb/_data",
    "Destination": "/var/lib/docker",
    "Driver": "local",
    "Mode": "",
    "RW": true,
    "Propagation": ""
  }
]

As you can see, without VOLUME, /var/lib/docker is just part of the container's own overlayfs. With VOLUME, /var/lib/docker is backed by the host's native filesystem, and all of the inner containers' data bypasses the overlayfs entirely.
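
To put numbers on the difference, a simple follow-up is to start an inner Docker daemon in each outer container, the same way the buildspec does, and time an identical layer-heavy operation; a sketch, using an image pull as a rough proxy:

docker exec -it build bash -c '
  nohup dockerd >/var/log/dockerd.log 2>&1 &
  timeout 15 sh -c "until docker info >/dev/null 2>&1; do sleep 1; done"
  # pulling and unpacking an image exercises /var/lib/docker heavily
  time docker pull amazonlinux:2023
'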