Mysterious Image Pull Failures: "401 Unauthorized" and "Not Found" After Migrating Containerd to v2

Early this year, we migrated containerd from v1.7 to v2.0.5. Soon after, we noticed that image pulls from Amazon Elastic Container Registry (ECR) had started failing for both public and private repositories. For example:

# public ECR
FATA[0031] failed to resolve reference "public.ecr.aws/aws-cli/aws-cli:2.31.5@sha256:9cb6ab9c8852d7e1e63f43299dca0628e92448d1c7589f7ed40344e7f61aad59": \
unexpected status from HEAD request to https://public.ecr.aws/v2/aws-cli/aws-cli/blobs/sha256:9cb6ab9c8852d7e1e63f43299dca0628e92448d1c7589f7ed40344e7f61aad59: \
401 Unauthorized

# private ECR
FATA[0031] failed to resolve reference "<account-id>.dkr.ecr.us-west-2.amazonaws.com/xxxx@sha256:1db8db35b1afaa9d2df40f68e35cc0e4f406b5f73667b1ba2d0f73dbb15aed01": \
<account-id>.dkr.ecr.us-west-2.amazonaws.com/xxxx@sha256:1db8db35b1afaa9d2df40f68e35cc0e4f406b5f73667b1ba2d0f73dbb15aed01: \
not found

The failures occurred at a low frequency of around 0.5%. This was bizarre: 401 Unauthorized and 404 Not Found are errors you'd expect to happen either all the time or not at all. The different status codes for public versus private ECR repositories were also intriguing.

In this post, we'll show how upgrading the containerd client from v1.7 to v2 triggered a ticking bomb in our code. We'll prove the root cause by reproducing it with nerdctl.

Suspiciously long delay on the initial request

The first step of pulling an image is resolving the image reference (remotes.Resolver.Resolve), and this is where both errors originate. Looking at the logs, we noticed a pattern: whenever these errors occurred, the very first request to ECR took more than 10 seconds and then failed, whereas in successful pulls the initial request completed in less than half a second.

The Resolve function first checks if a digest is provided. If one is provided, it adds two paths corresponding to the digest for HTTP HEAD requests: one for /manifests and another for /blobs. We'll call these the manifest path and blob path for simplicity.

// https://github.com/containerd/containerd/blob/v2.0.5/core/remotes/docker/resolver.go#L260-L264
if dgst != "" {
    if err := dgst.Validate(); err != nil {
        return "", ocispec.Descriptor{}, err
    }
    paths = append(paths, []string{"manifests", dgst.String()})
    paths = append(paths, []string{"blobs", dgst.String()})
} else {
    paths = append(paths, []string{"manifests", refspec.Object})
    caps |= HostCapabilityResolve
}

// ...
for _, u := range paths {
    for i, host := range hosts {
        req := base.request(host, http.MethodHead, u...)
In our production system, we always pull images hosted on ECR by digest to avoid problems with reused image tags. So in our case, both the manifest path and the blob path are added for the digest, and the resolver tries the manifest path first (e.g., https://public.ecr.aws/v2/aws-cli/aws-cli/manifests/<digest>), falling back to the blob path (.../blobs/<digest>) only if that request fails. We expected the initial HEAD request to the manifest path to always succeed, so the fallback to the blob path should never happen.

Hypothesis: the initial request times out

If the initial request to the manifest path times out, the client falls back to a HEAD request to the blob path. In our system, we don't authenticate the client when pulling from public ECR repositories: HEAD requests to manifest paths on public ECR do not require authorization, but HEAD requests to blob paths do, so the fallback request fails with 401 Unauthorized.

For private ECR repositories, the client is authenticated and sends the correct authorization headers on every HEAD request. However, if the manifest request times out, the fallback request to the blob path returns 404 Not Found, because the digest identifies a manifest, not a blob.
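One way to spot-check this part of the hypothesis is to issue both HEAD requests anonymously and compare the status codes. The short Go program below is our own illustration (not containerd code); it simply reuses the URLs from the error log above and prints whatever public ECR returns.

// headcheck: anonymously HEAD the manifest path and the blob path for the
// same digest on public ECR and print the resulting status codes.
package main

import (
    "fmt"
    "net/http"
)

func main() {
    const digest = "sha256:9cb6ab9c8852d7e1e63f43299dca0628e92448d1c7589f7ed40344e7f61aad59"
    for _, u := range []string{
        "https://public.ecr.aws/v2/aws-cli/aws-cli/manifests/" + digest,
        "https://public.ecr.aws/v2/aws-cli/aws-cli/blobs/" + digest,
    } {
        resp, err := http.Head(u) // no Authorization header
        if err != nil {
            fmt.Println(u, "error:", err)
            continue
        }
        resp.Body.Close()
        fmt.Println(u, "->", resp.Status)
    }
}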

So the hypothesis explains the symptoms we observed. But there are more questions to answer.

The ticking bomb: optimistic pulling

Our production system runs on EC2. For security, we use separate Elastic Network Interfaces (ENIs) to run containers, including for pulling images. There is an unpredictable delay between ENI attachment and ENI readiness: Linux recognizing eth1 doesn't necessarily mean eth1 is ready to serve traffic. Since attaching an ENI is a separate asynchronous procedure not managed by the container runtime, we chose to start pulling images as soon as the container runtime was ready. At low frequency, the ENI was not yet ready; in those cases, the container runtime would retry the image pull and succeed, as sketched below. This choice simplified the container runtime code and worked well for years. However, it was a ticking bomb: it assumed that the error returned on a network timeout would always be retriable.
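A minimal sketch of that optimistic pulling, with all names hypothetical (our production code is more involved). The error classifier is passed in to make the retry policy explicit:

// pullWithRetry starts pulling immediately and retries with exponential
// backoff as long as the classifier reports the error as retriable.
func pullWithRetry(ctx context.Context, ref string,
    pull func(context.Context, string) error,
    retriable func(error) bool) error {

    backoff := time.Second
    for {
        err := pull(ctx, ref)
        if err == nil {
            return nil
        }
        if !retriable(err) {
            return fmt.Errorf("permanent error pulling %s: %w", ref, err)
        }
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(backoff):
            backoff *= 2
        }
    }
}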

Why the upgrade to containerd v2 triggers the bomb

Containerd v2 includes the change "docker: return most relevant error from docker resolution". After this change, the resolver returns the error with the highest priority among those it encountered (priorities listed from lowest to highest):

  • Underlying transport errors (TLS, TCP, etc)
  • 404 errors
  • Other 4XX/5XX errors
  • Manifest rejection (due to max size exceeded)

As a result, the container runtime no longer sees the underlying transport error; it sees the higher-priority error from the blob-path request instead. In our code, network timeout errors are retriable, but "401 Unauthorized" and "404 Not Found" are not, so the image pull fails outright instead of being retried.
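For concreteness, here's the kind of classifier (again hypothetical, using only the standard library) that could be plugged into the retry sketch above. With containerd v1.7, a pull started before the ENI was ready surfaced as a transport timeout and was retried; with v2, the same situation surfaces as the blob-path error and is treated as permanent.

// isRetriable treats transport-level failures (e.g., timeouts while the ENI
// is still coming up) as retriable and everything else as permanent.
func isRetriable(err error) bool {
    var netErr net.Error
    if errors.As(err, &netErr) && netErr.Timeout() {
        return true
    }
    if errors.Is(err, context.DeadlineExceeded) {
        return true
    }
    // With containerd v2, a timed-out manifest request surfaces as the
    // higher-priority blob-path error ("401 Unauthorized" or "not found"),
    // so we end up here and give up instead of retrying.
    return false
}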

Proof by construction: reproduce the error with iptables and nerdctl

Most of the time (around 99.5%), the ENI is already serving traffic when the container runtime starts pulling images, so reproducing the race with a real ENI requires a complicated setup that adds noise. Instead, we'll simulate an ENI that isn't ready yet by temporarily dropping all traffic to ECR.

Public ECR repository

# Step 1. Clean state
% sudo systemctl stop containerd && sudo rm -rf /var/lib/containerd/* && sudo systemctl start containerd

# Step 2. Drop all traffic to public ECR
% sudo iptables -A OUTPUT -d public.ecr.aws -j DROP

# Step 3. Pull image
% sudo nerdctl --debug-full pull public.ecr.aws/aws-cli/aws-cli:2.31.5@sha256:9cb6ab9c8852d7e1e63f43299dca0628e92448d1c7589f7ed40344e7f61aad59

# Step 4. Right after the image pull times out on the initial HEAD request (by default, nerdctl timeout is 30 seconds), re-enable ECR traffic
% sudo iptables -D OUTPUT -d public.ecr.aws -j DROP

The image pull fails with "401 Unauthorized":

FATA[0031] failed to resolve reference "public.ecr.aws/aws-cli/aws-cli:2.31.5@sha256:9cb6ab9c8852d7e1e63f43299dca0628e92448d1c7589f7ed40344e7f61aad59":
unexpected status from HEAD request to https://public.ecr.aws/v2/aws-cli/aws-cli/blobs/sha256:9cb6ab9c8852d7e1e63f43299dca0628e92448d1c7589f7ed40344e7f61aad59:
401 Unauthorized

Private ECR repository

# Step 1. Clean state
% sudo systemctl stop containerd && sudo rm -rf /var/lib/containerd/* && sudo systemctl start containerd

# Step 2. Drop all traffic to private ECR
% sudo iptables -A OUTPUT -d <account-id>.dkr.ecr.us-west-2.amazonaws.com -j DROP

# Step 3. Authenticate and pull image
# (you can use docker login and logout as well)
% sudo nerdctl logout <account-id>.dkr.ecr.us-west-2.amazonaws.com
% aws ecr get-login-password --region us-west-2 | sudo nerdctl login --username AWS --password-stdin <account-id>.dkr.ecr.us-west-2.amazonaws.com
% sudo nerdctl --debug-full pull <account-id>.dkr.ecr.us-west-2.amazonaws.com/al2023@sha256:1db8db35b1afaa9d2df40f68e35cc0e4f406b5f73667b1ba2d0f73dbb15aed01

# Step 4. Right after the image pull times out on the initial HEAD request (by default, nerdctl timeout is 30 seconds), re-enable ECR traffic
% sudo iptables -D OUTPUT -d <account-id>.dkr.ecr.us-west-2.amazonaws.com -j DROP

The image pull fails with "not found":

FATA[0031] failed to resolve reference "<account-id>.dkr.ecr.us-west-2.amazonaws.com/al2023@sha256:1db8db35b1afaa9d2df40f68e35cc0e4f406b5f73667b1ba2d0f73dbb15aed01":
<account-id>.dkr.ecr.us-west-2.amazonaws.com/al2023@sha256:1db8db35b1afaa9d2df40f68e35cc0e4f406b5f73667b1ba2d0f73dbb15aed01:
not found

Let's fix it

Now that we've proven our hypothesis, how do we fix the issue? It's no longer wise to depend on retrying image pulls on retriable errors, because the errors returned by the containerd client (code we don't control) cannot be trusted to reflect the underlying cause. Instead, we fixed the issue by adding a dependency that ensures the container runtime only pulls images once the ENI is ready to serve traffic. A simple readiness check would be to ping through the interface, e.g., "ping -I eth1 -c 1 1.1.1.1". However, this approach is not ideal: 1) ping is not installed on the Bottlerocket OS we use in production, and 2) 1.1.1.1 is not reachable in air-gapped regions. Therefore, we consider the ENI ready when the interface is RUNNING and has non-loopback IP addresses assigned.

// isEthReady checks if the given network interface is ready to serve traffic.
// Interface must be RUNNING and have non-loopback IP addresses assigned.
func isEthReady(interfaceName string) (bool, error) {
    iface, err := net.InterfaceByName(interfaceName)
    if err != nil {
        return false, fmt.Errorf("failed to get interface %s: %w", interfaceName, err)
    }

    if iface.Flags&net.FlagLoopback != 0 || iface.Flags&net.FlagUp == 0 || iface.Flags&net.FlagRunning == 0 {
        return false, nil
    }

    addrs, err := iface.Addrs()
    if err != nil {
        return false, fmt.Errorf("failed to get addresses for interface %s: %w", interfaceName, err)
    }

    return slices.ContainsFunc(addrs, func(addr net.Addr) bool {
        if ipNet, ok := addr.(*net.IPNet); ok {
            return !ipNet.IP.IsLoopback()
        }
        return false
    }), nil
}
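To tie it together, here's a minimal sketch (the function name and polling interval are illustrative, not our exact production code) of how the runtime can gate image pulls on this check:

// waitForENI polls isEthReady until the interface is ready or ctx is done.
// The container runtime calls this before starting any image pull.
func waitForENI(ctx context.Context, interfaceName string) error {
    ticker := time.NewTicker(500 * time.Millisecond)
    defer ticker.Stop()
    for {
        ready, err := isEthReady(interfaceName)
        if err == nil && ready {
            return nil
        }
        select {
        case <-ctx.Done():
            return fmt.Errorf("interface %s not ready: %w", interfaceName, ctx.Err())
        case <-ticker.C:
        }
    }
}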