AL2023 vs. AL2: less disk space with ext4?

We started migrating from Amazon Linux 2 (AL2) to Amazon Linux 2023 (AL2023) a month ago. While testing workloads on AL2023 in the pre-production environment, I noticed slightly higher disk usage compared to the same workload on AL2. In this post, I'll share my investigation.

AL2023 has less free disk space with ext4 than AL2

Although the disk usage metric (used space as a percentage of the filesystem) increased on AL2023, the absolute "Used" space remained the same. The higher usage is due to a decrease in the total ext4 filesystem size. The table below compares the total size of ext4 filesystems created on identically sized EBS volumes.

EBS Volume (GiB)    AL2 (Bytes)        AL2023 (Bytes)     Decrease (MiB)
30                  31,526,391,808     31,526,391,808     0
32                  33,635,975,168     33,501,757,440     128
64                  67,452,317,696     67,049,664,512     384
128                 135,084,904,448    134,145,380,352    896

For example, on a 128 GiB EBS volume, the free space decreased by 0.7%, from 125.8 GiB to 124.9 GiB. You can reproduce this behavior using the steps below.

# Pin AMIs for reproducing.
# AL2:    ami-04907d7291cd8e06a, amzn2-ami-kernel-5.10-hvm-2.0.20241031.0-x86_64-gp2
# AL2023: ami-066a7fbea5161f451, al2023-ami-2023.6.20241031.0-kernel-6.1-x86_64
AMI_ID=ami-04907d7291cd8e06a  # AL2 shown here; use the AL2023 AMI for the comparison run.

# Launch the instance (SUBNET_ID and SECURITY_GROUP are set elsewhere).
INSTANCE_TYPE=t3.medium
INSTANCE_ID=$(aws ec2 run-instances \
    --image-id ${AMI_ID} \
    --instance-type ${INSTANCE_TYPE} \
    --subnet-id ${SUBNET_ID} \
    --key-name example-1 \
    --security-group-ids ${SECURITY_GROUP} \
    --associate-public-ip-address | jq -r '.Instances[0].InstanceId')

# On the instance: confirm the kernel version (AL2 shown).
uname -r
5.10.227-219.884.amzn2.x86_64

# Create a 128 GiB EBS volume and attach it to the instance.
VOLUME_ID=$(aws ec2 create-volume --availability-zone us-west-2a --size 128 --volume-type gp3 | jq -r '.VolumeId')
aws ec2 attach-volume --volume-id ${VOLUME_ID} --instance-id ${INSTANCE_ID} --device /dev/sdb

# On the instance: confirm the block device size is 128 GiB.
sudo blockdev --getsize64 /dev/nvme1n1
137438953472

# Create an ext4 filesystem and mount it.
sudo mkfs -t ext4 /dev/sdb
sudo mkdir /mnt/data1 && sudo mount /dev/sdb /mnt/data1

# Check the total filesystem size ("1B-blocks"); output from AL2 shown.
df --block-size=1 /mnt/data1
Filesystem        1B-blocks  Used    Available Use% Mounted on
/dev/nvme1n1   135084904448 24576 128196157440   1% /mnt/data1

Does this mean applications have less disk space?

It appears the filesystem uses more space for metadata on AL2023 than on AL2. Metadata accounting is tricky, and filesystem behavior as free space runs out is complicated, so I wanted to confirm whether the writable space available to applications is genuinely reduced. Using the script below, I created 50 MiB files until no space was left.

#!/bin/bash
# Note: bash (rather than plain sh) is needed for echo -ne below.

# Directory to write files to
TARGET_DIR=$1
# Size of each file in MiB
FILE_SIZE=$2

count=0
total_size_mb=0

echo "Writing ${FILE_SIZE}MiB files to ${TARGET_DIR} until disk is full..."

while true; do
    filename="${TARGET_DIR}/file_${count}.dat"
    if dd if=/dev/zero of="$filename" bs=1M count="$FILE_SIZE" status=none 2>/dev/null; then
        count=$((count + 1))
        total_size_mb=$((count * FILE_SIZE))
        echo -ne "\rFiles created: $count (Total: ${total_size_mb}MiB)"
    else
        echo -e "\n\nDisk full or error occurred!"
        break
    fi
done
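
For reference, an invocation might look like this (fill_disk.sh is a hypothetical filename for the script above, and /mnt/data1 is the mount point from the earlier steps):

# Fill the test volume with 50 MiB files, then check usage in MiB.
sudo ./fill_disk.sh /mnt/data1 50
df --block-size=1M /mnt/data1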

Here are the results for a 128 GiB EBS volume:

OS        Number of 50 MiB files    Total size of files (MiB)    Used space (MiB)    Used space / volume size
AL2       2,445                     122,250                      122,257             93.27%
AL2023    2,427                     121,350                      121,361             92.59%

Since the last 50 MiB file was only partially written, "Used space" is slightly larger than "Total size of files". The difference in used space between AL2 and AL2023 is 122,257 - 121,361 = 896 MiB, matching the filesystem size difference. Thus, applications indeed have less writable disk space on AL2023 when using ext4.

Is the difference caused by mke2fs.conf?

The mkfs -t ext4 command invokes mkfs.ext4 (mke2fs) to create the ext4 filesystem, so the difference could stem from different mke2fs.conf configurations on AL2 and AL2023. Comparing the filesystem features with dumpe2fs:

# AL2
sudo dumpe2fs /dev/sdb | grep features
dumpe2fs 1.42.9 (28-Dec-2013)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Journal features:         journal_64bit

# AL2023
sudo dumpe2fs /dev/sdb | grep features
dumpe2fs 1.46.5 (30-Dec-2021)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file dir_nlink extra_isize metadata_csum
Journal features:         journal_64bit journal_checksum_v3

While the configurations differ, swapping mke2fs.conf between AL2023 and AL2 yielded the same free space difference. Therefore, mke2fs.conf is not the cause.
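
Here is a rough sketch of that swap on the AL2 instance (the filename al2023-mke2fs.conf for the copied config is hypothetical; /etc/mke2fs.conf is where both distributions keep it):

# Back up the AL2 config, drop in the copy taken from an AL2023 host, and re-format.
sudo cp /etc/mke2fs.conf /etc/mke2fs.conf.bak
sudo cp ./al2023-mke2fs.conf /etc/mke2fs.conf
sudo umount /mnt/data1
sudo mkfs -t ext4 /dev/sdb    # answer the prompt about the existing filesystem, or pass -F
sudo mount /dev/sdb /mnt/data1
df --block-size=1 /mnt/data1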

What about the mkfs version?

The mkfs.ext4 utility is installed from the e2fsprogs package, whose version differs between AL2 (v1.42.9) and AL2023 (v1.46.5). To test its impact, I attempted to swap mkfs versions. However, the AL2 repository does not provide e2fsprogs-1.46.5, nor does the AL2023 repository provide e2fsprogs-1.42.9. So I built e2fsprogs-1.46.5 from source on AL2. Building e2fsprogs-1.42.9 on AL2023 failed with a compile error, "undefined reference to makedev", likely caused by an incompatibility with libc on AL2023.
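
For completeness, the build on AL2 looked roughly like this (a sketch; the kernel.org mirror path, the dependency list, and running the in-tree misc/mke2fs binary are my assumptions rather than an exact record):

# Install a minimal toolchain.
sudo yum install -y gcc make

# Fetch and build e2fsprogs 1.46.5 from source.
curl -LO https://mirrors.edge.kernel.org/pub/linux/kernel/people/tytso/e2fsprogs/v1.46.5/e2fsprogs-1.46.5.tar.gz
tar xf e2fsprogs-1.46.5.tar.gz
cd e2fsprogs-1.46.5
./configure
make -j"$(nproc)"

# Run the freshly built mke2fs without installing it over the system copy,
# pointing it at the system config so that only the binary version changes.
./misc/mke2fs -V
sudo MKE2FS_CONFIG=/etc/mke2fs.conf ./misc/mke2fs -t ext4 /dev/sdb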

Using v1.46.5 to format an EBS volume on AL2 resulted in a total filesystem size of 134,145,380,352 bytes, identical to AL2023. Running the file-writing script also yielded 2,427 files, the same as on AL2023. This confirms that the difference in free space is caused by the mkfs version.

The e2fsprogs package also installs tune2fs, which shows and modifies ext2/ext3/ext4 filesystem parameters. Comparing the parameters between v1.42.9 and v1.46.5 revealed that most were identical, except for an increase in "extra isize" from 28 to 32.

# sudo tune2fs -l /dev/sdb
Inode count:              8388608
Block count:              33554432
Reserved block count:     1677721
Inode size:               256
Required extra isize:     32  # For v1.42.9, this value is 28.
Desired extra isize:      32  # For v1.42.9, this value is 28.
Journal inode:            8
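
A quick way to compare the full parameter lists is to diff the tune2fs output of two volumes formatted by the two versions (a sketch; the device names are assumptions):

# Diff the ext4 parameters of a v1.42.9-formatted and a v1.46.5-formatted volume.
diff <(sudo tune2fs -l /dev/nvme1n1) <(sudo tune2fs -l /dev/nvme2n1)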

It's the journal size!

After I posted this on Hacker News, josephcsible commented, pointing out that the difference in disk usage comes from a change in the default journal size introduced in e2fsprogs commit bbd2f78c.

commit bbd2f78cf63ab4c635d76073605d6fb1a30c277c
Author: Theodore Ts'o <[email protected]>
Date:   Thu Sep 1 11:37:59 2016 -0400

    libext2fs: allow the default journal size to go as large as a gigabyte

    Recent research has shown that for a metadata-heavy workload, a 128 MB
    is journal be a bottleneck on HDD's, and that the optimal journal size
    is proportional to number of unique metadata blocks that can be
    modified (and written into the journal) in a 30 second window.  One
    gigabyte should be sufficient for most workloads, which will be used
    for file systems larger than 128 gigabytes.

--- a/lib/ext2fs/mkjournal.c
+++ b/lib/ext2fs/mkjournal.c

 int ext2fs_default_journal_size(__u64 num_blocks)
 {
        if (num_blocks < 2048)
                return -1;
-       if (num_blocks < 32768)
-               return (1024);
-       if (num_blocks < 256*1024)
-               return (4096);
-       if (num_blocks < 512*1024)
-               return (8192);
-       if (num_blocks < 1024*1024)
-               return (16384);
-       return 32768;
+       if (num_blocks < 32768)         /* 128 MB */
+               return (1024);                  /* 4 MB */
+       if (num_blocks < 256*1024)      /* 1 GB */
+               return (4096);                  /* 16 MB */
+       if (num_blocks < 512*1024)      /* 2 GB */
+               return (8192);                  /* 32 MB */
+       if (num_blocks < 4096*1024)     /* 16 GB */
+               return (16384);                 /* 64 MB */
+       if (num_blocks < 8192*1024)     /* 32 GB */
+               return (32768);                 /* 128 MB */
+       if (num_blocks < 16384*1024)    /* 64 GB */
+               return (65536);                 /* 256 MB */
+       if (num_blocks < 32768*1024)    /* 128 GB */
+               return (131072);                /* 512 MB */
+       return 262144;                          /* 1 GB */
 }

For a 128 GiB volume, the number of blocks allocated to the journal increased from 32,768 to 262,144. Since each block is 4 KiB, the journal grew from 32,768 * 4 KiB = 128 MiB to 262,144 * 4 KiB = 1 GiB, an increase of 896 MiB, which matches the discrepancy we observed. I repeated this calculation for the other three volume sizes mentioned at the start of this post, and the differences aligned with the changes in journal size as well.
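
You can see the journal allocation on an existing filesystem with dumpe2fs; the exact field names vary between e2fsprogs versions, so grepping loosely works well enough:

# Show journal-related fields (journal inode, features, and size) for the volume.
sudo dumpe2fs /dev/nvme1n1 | grep -i journal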

We can override the journal size using the -J option. For example, mkfs.ext4 -J size=128 /dev/sdb sets the journal size to 128 MiB, restoring the missing 896 MiB on a 128 GiB volume.
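
Putting that together for the 128 GiB volume from the earlier steps (a sketch; it assumes the volume is not currently mounted):

# Format with an AL2-sized 128 MiB journal instead of the new 1 GiB default, then verify.
sudo mkfs.ext4 -J size=128 /dev/sdb
sudo mount /dev/sdb /mnt/data1
df --block-size=1 /mnt/data1   # total size should be back to ~135,084,904,448 bytes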

Should you override the journal size?

It depends. In my case, I chose to override it — at least during the migration — for the following reasons.

  1. We use gp3 volumes. The commit that increased the default journal size was introduced in 2016, targeting metadata-heavy workloads on HDDs. However, gp3 volumes are SSDs, where the same benefits may not apply.
  2. Uncertainty about metadata-heavy workloads. The commit noted that a larger journal size helps metadata-heavy workloads. However, I cannot determine whether our customers' workloads are metadata-heavy, because I lack visibility into them.
  3. Minimizing changes during migration. To limit variables introduced during the migration from AL2 to AL2023, I opted to keep the journal size consistent with AL2. Our minimum volume size is 30GiB, and overriding the journal size to 128MiB ensures parity with defaults on AL2.

Closing thoughts

Catching this issue in pre-production was fortunate, but receiving help from the community was even more so. A special thanks to josephcsible.

Thank you for reading! Your feedback is greatly appreciated; feel free to comment.