Mind ordering cycles in systemd: how systemd breaks them can brick server startup
I had been building a service for a month, and the day finally arrived when I had the artifact - an EC2 AMI. The AMI passed my "rigorous" manual tests, so I launched 100 EC2 instances. Surprise! Around 28 of them failed to launch.
What's going on?
All failed instances were stuck in the "initializing" state, and the only way to connect to them was through EC2 Serial Console. There, I noticed something interesting.
[ 10.702819] systemd[1]: sysinit.target: Job systemd-tmpfiles-setup.service/start deleted to break ordering cycle starting with sysinit.target/start
 ...
[ SKIP ] Ordering cycle found, skipping Create Volatile Files and Directories
 ...
[ 17.064638] netdog[1369]: Failed to write primary interface to '/var/lib/netdog/primary_interface': No such file or directory (os error 2)
The systemd-tmpfiles service sets up files and directories based on the configuration snippets in tmpfiles.d, including paths under /etc/ and /var/. In the failed case, systemd detected an ordering cycle and broke it by skipping systemd-tmpfiles-setup - hence the log entry "skipping Create Volatile Files and Directories." Without its state directory under /var/lib/, netdog couldn't write its primary-interface file, failed to configure the network, and the instance never finished initializing.
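For context, tmpfiles.d entries are one-line declarations that systemd-tmpfiles applies at boot. Here is a sketch of the kind of entry that would create netdog's state directory (the file name and mode are assumptions for illustration, not taken from the actual package):

# /usr/lib/tmpfiles.d/netdog.conf (hypothetical file name)
# Type  Path             Mode  User  Group  Age
d       /var/lib/netdog  0755  root  root   -

When the setup service is skipped, entries like this are never applied, so the directory simply doesn't exist when netdog starts.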
Where is the cycle?
Going through the systemd units added in my change, foo.service looked suspicious.
1  bash-5.2# systemd-analyze verify /etc/systemd/system/foo.service
2  systemd-update-done.service: Found ordering cycle on local-fs.target/start
3  systemd-update-done.service: Found dependency on foo.service/start
4  systemd-update-done.service: Found dependency on sysinit.target/start
5  systemd-update-done.service: Found dependency on systemd-update-done.service/start
6  systemd-update-done.service: Job local-fs.target/start deleted to break ordering cycle starting with systemd-update-done.service/start
Here's the unit file for foo.service:
[Unit]
Before=local-fs.target
 ...
[Install]
WantedBy=local-fs.target
This configuration means local-fs.target depends on foo.service, corresponding to lines 2-3 in the systemd-analyze output. But how does foo.service depend on local-fs.target (lines 3-6)?
Default dependencies
According to man systemd.service:
Unless DefaultDependencies= is set to false, service units will implicitly have dependencies of type Requires= and After= on basic.target as well as dependencies of type Conflicts= and Before= on shutdown.target. These ensure that normal service units pull in basic system initialization, and are terminated cleanly prior to system shutdown. Only services involved with early boot or late system shutdown should disable this option.
Since foo.service doesn't disable DefaultDependencies, it implicitly depends on basic.target, which depends on sysinit.target, which in turn depends on local-fs.target. This closes a loop: local-fs.target → foo.service → basic.target → sysinit.target → local-fs.target.
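To see these implicit edges on a live system, you can ask systemd for the unit's effective properties (a quick sketch; any service without DefaultDependencies=no will show the same thing):

bash-5.2# systemctl show -p Requires -p After foo.service

Per the man page excerpt above, both Requires= and After= will include basic.target even though the unit file never mentions it.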
The fix is simple: add DefaultDependencies=no to foo.service.
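With that change, the unit looks something like this (the rest of the file is elided, as above):

[Unit]
DefaultDependencies=no
Before=local-fs.target
 ...
[Install]
WantedBy=local-fs.target

Note that disabling default dependencies also drops the implicit Conflicts= and Before= on shutdown.target, so an early-boot unit like this should declare any ordering it still needs explicitly.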
Why did only 28% of instances fail?
If there's a cycle and breaking it causes startup failures, we'd expect all instances to fail, right? Not necessarily. In graph theory, a cycle can be broken by removing any edge in the loop. For systemd, this means deleting the start job of some unit involved in the cycle, and which job it picks can differ from boot to boot. Not all units are equally critical: if systemd deletes a crucial job like systemd-tmpfiles-setup's, startup fails; otherwise, the system boots successfully.
We can confirm this from the log of a successful startup:
[ 9.867740] systemd[1]: systemd-tmpfiles-setup.service: Job local-fs.target/start deleted to break ordering cycle starting with systemd-tmpfiles-setup.service/start
In this case, systemd broke the same cycle by deleting the start job for local-fs.target instead. That worked because systemd-tmpfiles-setup still ran and created the necessary files, allowing netdog to configure the network successfully.
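On a system you can log into, you can check which job systemd dropped on the current boot by grepping the journal for the same message (a simple sketch):

bash-5.2# journalctl -b | grep "deleted to break ordering cycle"

The -b flag limits the search to the current boot, which is usually what you want when comparing a healthy boot against a broken one.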