EC2 IMDS is Unstable During Early Boot: Always Retry
In "Detect and fix rare cases where the primary ENI does not serve default traffic", we used the IMDS path "meta-data/mac" to get the primary ENI's MAC address. However, we encountered the following errors in 0.5% of EC2 instance launches:
failed to get IMDS /mac: operation error ec2imds: GetMetadata, exceeded maximum number of attempts, 3,
http response error StatusCode: 401, request to EC2 IMDS failed

failed to get IMDS /mac: operation error ec2imds: GetMetadata, exceeded maximum number of attempts, 3,
request send failed, Get "http://169.254.169.254/latest/meta-data/mac": dial tcp 169.254.169.254:80: connect: connection refused
What is IMDS
The Instance Metadata Service (IMDS) provides EC2 instance information such as user data, ENI details, IAM credentials, and spot interruption notices. IMDS is reachable at the link-local IPv4 address 169.254.169.254 or the IPv6 address fd00:ec2::254. It runs outside the guest OS as a hypervisor feature, so any EC2 instance with an active network interface can access it.
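To make the request shape concrete, here is a minimal std-only sketch of the two IMDSv2 calls involved in reading "meta-data/mac": a PUT to /latest/api/token to obtain a session token, then a GET that passes the token in a header. The function names (token_request, metadata_request, send) and the one-second timeouts are illustrative assumptions, not taken from our script; a real client would normally use an HTTP library or the AWS SDK.

```rust
use std::io::{self, Read, Write};
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Link-local IPv4 endpoint of IMDS (port 80, plain HTTP).
const IMDS_ADDR: &str = "169.254.169.254:80";

/// Builds the IMDSv2 session-token request. The TTL header is required;
/// 21600 seconds (6 hours) is the maximum allowed value.
fn token_request(ttl_secs: u32) -> String {
    format!(
        "PUT /latest/api/token HTTP/1.1\r\n\
         Host: 169.254.169.254\r\n\
         X-aws-ec2-metadata-token-ttl-seconds: {}\r\n\
         Content-Length: 0\r\n\
         Connection: close\r\n\r\n",
        ttl_secs
    )
}

/// Builds a metadata GET request for `path` (for example "mac"),
/// authenticated with a previously obtained session token.
fn metadata_request(path: &str, token: &str) -> String {
    format!(
        "GET /latest/meta-data/{} HTTP/1.1\r\n\
         Host: 169.254.169.254\r\n\
         X-aws-ec2-metadata-token: {}\r\n\
         Connection: close\r\n\r\n",
        path, token
    )
}

/// Sends one raw request and returns the raw response (status line,
/// headers, and body). The timeouts are assumed values for illustration.
fn send(request: &str) -> io::Result<String> {
    let addr: SocketAddr = IMDS_ADDR.parse().expect("valid socket address");
    let mut stream = TcpStream::connect_timeout(&addr, Duration::from_secs(1))?;
    stream.set_read_timeout(Some(Duration::from_secs(1)))?;
    stream.write_all(request.as_bytes())?;
    let mut response = String::new();
    stream.read_to_string(&mut response)?;
    Ok(response)
}
```

During early boot, send can fail with "connection refused" before any HTTP status is returned, which is why the connection attempt itself needs to be retried and not only the HTTP error responses.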
Why retry logic is essential
Initially, we thought "401 - Unauthorized" indicated a missing session token. However, "connection refused" appeared in more than half of the failures, which suggests that IMDS itself is unstable during early boot.
Bottlerocket's early-boot-config.service also queries IMDS and waits for "network-online.target":
Description=Bottlerocket userdata configuration system
# Need network online to talk to IMDS.
After=network-online.target apiserver.service storewolf.service
When querying IMDS, it retries on failure:
fn retry_strategy() -> impl Iterator<Item = Duration> {
    // Retry attempts at 0.25s, 0.5s, 1s, 1.75s, 3s, 5s, 8.25s, 13.5s, 22s and then every 10s after.
    FibonacciBackoff::from_millis(250).max_delay(Duration::from_secs(10))
}
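The times in that comment are cumulative: each delay is the sum of the previous two, starting from 250 ms, until the 10 s cap takes over. The following std-only sketch reproduces the schedule without any external crate. The FibonacciDelays type is a hypothetical stand-in written for illustration; it is not Bottlerocket's code or the retry crate's implementation.

```rust
use std::time::Duration;

/// Fibonacci-style delay schedule capped at `max_delay`.
/// Hypothetical helper for illustration only.
struct FibonacciDelays {
    current: Duration,
    upcoming: Duration,
    max_delay: Duration,
}

impl FibonacciDelays {
    /// Starts the sequence at `base, base, 2*base, 3*base, 5*base, ...`.
    fn new(base: Duration, max_delay: Duration) -> Self {
        Self {
            current: base,
            upcoming: base,
            max_delay,
        }
    }
}

impl Iterator for FibonacciDelays {
    type Item = Duration;

    fn next(&mut self) -> Option<Duration> {
        // Once the sequence passes the cap, keep returning the cap
        // and stop advancing, so the durations never overflow.
        if self.current > self.max_delay {
            return Some(self.max_delay);
        }
        let delay = self.current;
        let following = self.current + self.upcoming;
        self.current = self.upcoming;
        self.upcoming = following;
        Some(delay)
    }
}
```

With a 250 ms base and a 10 s cap, the individual delays are 250, 250, 500, 750, 1250, 2000, 3250, 5250, and 8500 ms, then 10 s for every later attempt. The first nine add up to 22 s, which matches the last cumulative time in the comment.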
Our script runs even earlier than early-boot-config.service. It starts immediately after eth0 comes up, so that it can fix ENI ordering before any internet traffic leaves the instance, which matters for security: see systemd-networkd-wait-online.service.
Since the IMDS implementation is opaque to us, and our observations confirm instability during early boot, we added retry logic similar to Bottlerocket's approach. This eliminated both the "401 - Unauthorized" and "connection refused" errors.
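A retry wrapper of this kind can be sketched as follows. This is a minimal illustration, not our actual script: retry_until, its parameters, and the overall deadline are assumptions made for the example. It retries any error, sleeps for the next delay from a caller-supplied schedule, and gives up once the schedule is exhausted or the next sleep would exceed the deadline.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Calls `op` until it succeeds, the delay schedule runs out, or waiting
/// for the next attempt would exceed `deadline`. Returns the last error
/// on failure. Hypothetical helper for illustration only.
fn retry_until<T, E>(
    mut op: impl FnMut() -> Result<T, E>,
    mut delays: impl Iterator<Item = Duration>,
    deadline: Duration,
) -> Result<T, E> {
    let start = Instant::now();
    loop {
        let err = match op() {
            Ok(value) => return Ok(value),
            Err(err) => err,
        };
        match delays.next() {
            // Sleep only if the next attempt still fits within the deadline.
            Some(delay) if start.elapsed() + delay <= deadline => sleep(delay),
            // No delays left, or the deadline would be exceeded: give up.
            _ => return Err(err),
        }
    }
}
```

The operation passed in would be the IMDS lookup, and the schedule could be a capped Fibonacci sequence like the one Bottlerocket uses. Bounding the loop with a deadline is a design choice for a boot-time script, so that a persistent IMDS failure surfaces as an error instead of blocking boot indefinitely.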
The IMDS documentation confirms that retrying on 401 and 503 is appropriate:
- 400 - Missing or Invalid Parameters – The PUT request is not valid.
- 401 - Unauthorized – The GET request uses an invalid token. The recommended action is to generate a new token.
- 403 - Forbidden – The request is not allowed or the IMDS is turned off.
- 404 - Not Found – The resource is not available or there is no such resource.
- 503 – The request could not be completed. Retry the request.
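The status codes above map naturally onto a small decision function. The sketch below is one possible reading of the documented guidance, with hypothetical names (Action, action_for): 401 triggers a token refresh before the next attempt, 503 triggers a plain retry, and every other error code is treated as permanent. Treating 400, 403, and 404 as non-retryable is an assumption for this example; a script that runs very early in boot may choose to be more lenient.

```rust
/// What the caller should do after an IMDS response.
/// Hypothetical type for illustration only.
#[derive(Debug, PartialEq)]
enum Action {
    /// 200: the response body holds the requested metadata.
    Done,
    /// 401: request a fresh session token, then try again.
    RefreshTokenAndRetry,
    /// 503: try again after a delay.
    Retry,
    /// Anything else (400, 403, 404, ...): retrying is not expected to help.
    Fail,
}

/// Maps an HTTP status code from IMDS to the next step.
fn action_for(status: u16) -> Action {
    match status {
        200 => Action::Done,
        401 => Action::RefreshTokenAndRetry,
        503 => Action::Retry,
        _ => Action::Fail,
    }
}
```

Note that "connection refused" never reaches this function, because no HTTP response exists in that case; it has to be handled as a retryable transport error by the surrounding retry loop.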