Mind ordering cycles in systemd: how systemd breaks them can brick your server's startup

I'd been building a service for a month, and the day finally arrived when I had the artifact: an EC2 AMI. The AMI passed my "rigorous" manual tests, and I felt confident on a Ruby Tuesday, so I launched 100 EC2 instances from it. Surprise! Around 28 instances failed to launch.

What is going on?

All failed instances were stuck in the "initializing" state, and the only way to connect to them was the EC2 Serial Console. There, I noticed something interesting.

[   10.702819] systemd[1]: sysinit.target: Job systemd-tmpfiles-setup.service/start deleted to break ordering cycle starting with sysinit.target/start
...
[ SKIP ] Ordering cycle found, skipping Create Volatile Files and Directories
...
[   17.064638] netdog[1369]: Failed to write primary interface to '/var/lib/netdog/primary_interface': No such file or directory (os error 2)

The systemd-tmpfiles-setup service creates files and directories according to the configuration in tmpfiles.d, including ones under /etc/ and /var/. In the failed case, systemd detected an ordering cycle and skipped systemd-tmpfiles-setup, hence the log entry "skipping Create Volatile Files and Directories." Without the directories it expected under /var/lib/, netdog failed to configure the network and the instance failed.

Where is the cycle?

Going through the systemd units added in my change, five-little-ducks.service (hope you like the name) looked suspicious.

1  bash-5.2# systemd-analyze verify /etc/systemd/system/five-little-ducks.service
2  systemd-update-done.service: Found ordering cycle on local-fs.target/start
3  systemd-update-done.service: Found dependency on five-little-ducks.service/start
4  systemd-update-done.service: Found dependency on sysinit.target/start
5  systemd-update-done.service: Found dependency on systemd-update-done.service/start
6  systemd-update-done.service: Job local-fs.target/start deleted to break ordering cycle starting with systemd-update-done.service/start

Here is the unit file for five-little-ducks.service.

[Unit]
Before=local-fs.target
...
[Install]
WantedBy=local-fs.target

The file says that local-fs.target has to wait for five-little-ducks (Before=local-fs.target), which corresponds to lines 2-3 of the systemd-analyze output. But why would five-little-ducks in turn depend on local-fs (lines 3-6)?

Default dependencies

According to man systemd.service,

Unless DefaultDependencies= is set to false, service units will implicitly have dependencies of type Requires= and After= on basic.target as well as dependencies of type Conflicts= and Before= on shutdown.target. These ensure that normal service units pull in basic system initialization, and are terminated cleanly prior to system shutdown. Only services involved with early boot or late system shutdown should disable this option.

Since five-little-ducks doesn't disable DefaultDependencies, it implicitly depends on basic.target, which depends on sysinit.target, which depends on local-fs.target. This creates a cycle: local-fs -> five-little-ducks -> basic -> sysinit -> local-fs. The fix is simple: add DefaultDependencies=no to the [Unit] section of five-little-ducks.service.
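To make the cycle concrete, here is a minimal Python sketch that models the ordering relations named in this post as a graph and searches it for a loop. It is only an illustration, not how systemd itself detects cycles: the edges are limited to the ones discussed here, and `ordering_graph` and `find_cycle` are hypothetical helper names.

```python
def ordering_graph(default_dependencies):
    """Map each unit to the units it must start after (edges taken from this post)."""
    after = {
        # Before=local-fs.target in five-little-ducks.service
        "local-fs.target": ["five-little-ducks.service"],
        "basic.target": ["sysinit.target"],
        "sysinit.target": ["local-fs.target"],
        "five-little-ducks.service": [],
    }
    if default_dependencies:
        # Implicit After=basic.target unless DefaultDependencies=no is set
        after["five-little-ducks.service"].append("basic.target")
    return after


def find_cycle(after):
    """Return one ordering cycle as a list of unit names, or None if there is none."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    color = {}
    path = []

    def visit(unit):
        color[unit] = GREY
        path.append(unit)
        for dep in after.get(unit, ()):
            state = color.get(dep, WHITE)
            if state == GREY:
                # dep is already on the current path, so we closed a loop
                return path[path.index(dep):] + [dep]
            if state == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        color[unit] = BLACK
        return None

    for unit in list(after):
        if color.get(unit, WHITE) == WHITE:
            cycle = visit(unit)
            if cycle:
                return cycle
    return None


for default_dependencies in (True, False):
    cycle = find_cycle(ordering_graph(default_dependencies))
    label = "DefaultDependencies=yes" if default_dependencies else "DefaultDependencies=no"
    print(label, "->", " -> ".join(cycle) if cycle else "no cycle")
```

With the implicit edge in place the search walks from local-fs.target back to itself; once that edge is removed, which is what DefaultDependencies=no does in this model, no loop remains.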

Why did only 28% of instances fail?

If there's a cycle and breaking it causes startup failures, we'd expect all instances to fail, right? Not true. In graph theory, a cycle can be broken by removing any edge in the loop. For systemd, this means deleting the start job of one of the units involved in the cycle, and the unit it picks is not always the same. However, not all units are equally critical. If systemd drops a crucial unit like systemd-tmpfiles-setup, the startup fails. Otherwise, the system starts successfully. We can confirm this from the log of a successful startup.

[    9.867740] systemd[1]: systemd-tmpfiles-setup.service: Job local-fs.target/start deleted to break ordering cycle starting with systemd-tmpfiles-setup.service/start

As you can see, in this case systemd broke the same cycle by deleting the local-fs.target job instead. This worked out because systemd-tmpfiles-setup still ran and created the necessary files, allowing netdog to configure the network successfully.
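Since the outcome hinges on which job systemd deletes, one way to triage a fleet is to scan each instance's console log for that line. The following is a rough Python sketch under a couple of assumptions: the log text is already collected by whatever means you have, the set of "critical" units contains only the one this post identified, and `deleted_jobs` and `boot_at_risk` are hypothetical helper names.

```python
import re

# Matches the "Job <unit>/start deleted to break ordering cycle" message seen in the logs above.
DELETED_JOB = re.compile(r"Job (\S+)/start deleted to break ordering cycle")

# Assumption: only the unit this post found to be critical; extend it for your own image.
CRITICAL_UNITS = {"systemd-tmpfiles-setup.service"}

# Sample lines copied from the two boots shown in this post.
FAILED_BOOT = (
    "[   10.702819] systemd[1]: sysinit.target: Job systemd-tmpfiles-setup.service/start "
    "deleted to break ordering cycle starting with sysinit.target/start"
)
HEALTHY_BOOT = (
    "[    9.867740] systemd[1]: systemd-tmpfiles-setup.service: Job local-fs.target/start "
    "deleted to break ordering cycle starting with systemd-tmpfiles-setup.service/start"
)


def deleted_jobs(console_log):
    """Return the units whose start job was deleted to break an ordering cycle."""
    return DELETED_JOB.findall(console_log)


def boot_at_risk(console_log):
    """True if any deleted job belongs to a unit we consider critical."""
    return any(unit in CRITICAL_UNITS for unit in deleted_jobs(console_log))


for name, log in (("failed", FAILED_BOOT), ("healthy", HEALTHY_BOOT)):
    print(name, deleted_jobs(log), "at risk" if boot_at_risk(log) else "ok")
```

Note that even the "healthy" boot reports a deleted job, so any match at all is a sign that the image carries an ordering cycle worth fixing.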