Use KillMode=process with caution: a restart loop can deplete resources

I recently debugged a resource leak where a systemd service kept restarting while leaving a process behind after each restart. The root cause isn't particularly interesting: a backward-incompatible third-party dependency upgrade. But the debugging process and lessons learned are.

Thousands of leaked processes from a systemd service

I have a foo.service that runs /usr/bin/start-foo, which in turn spawns a new process to run /usr/bin/foo. A few minutes after boot, CPU and memory utilization reached 100%, growing linearly rather than spiking. The systemctl status output showed 1,250 tasks in foo.service, almost all of them /usr/bin/foo processes:

     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/foo.service; static)
    Drop-In: /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d
             └─00-aws-config.conf
     Active: active (running) since Wed 2025-12-10 00:06:08 UTC; 3s ago
TriggeredBy: ● foo.socket
   Main PID: 8856 (start-foo)
      Tasks: 1250 (limit: 18702)
     Memory: 3.0G
     CGroup: /system.slice/foo.service
             ├─2114 /usr/bin/start-foo
             ├─1870 /usr/bin/foo
             ├─2082 /usr/bin/foo
             ├─2121 /usr/bin/foo
...

The journal log confirms that foo.service stops and starts every 5 seconds:

bash-5.2# journalctl -u foo | grep "Stopped"
Dec 10 08:47:27 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: Stopped foo.
Dec 10 08:47:33 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: Stopped foo.
Dec 10 08:47:38 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: Stopped foo.
Dec 10 08:47:43 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: Stopped foo.
...

But why weren't the previous foo processes cleaned up? Because KillMode= is set to process, which means systemd kills only the main process (start-foo in this case) when the unit stops; the foo processes it spawned are left running.

[Unit]
Description=Setup foo
Requires=aws-otel-collector.service

[Service]
Type=simple
ExecStart=/usr/bin/start-foo
Restart=always
KillMode=process
RestartSec=3
StandardError=journal
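A quick way to confirm a unit's effective KillMode, especially when drop-ins are in play, is systemctl show; for the unit above it reports:

bash-5.2# systemctl show foo.service -p KillMode
KillMode=process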

What stopped foo.service?

The foo log shows no panic or abnormal crash. The interval between stops is about 5 seconds, which differs from RestartSec=3, so the stop-and-start cycle is unlikely to be driven by the service's own RestartSec= directive. The journal log indicates that systemd stopped foo.service, but neither the foo.service journal log nor the systemd log explains why.

The culprit is "Requires=aws-otel-collector.service". Requires= means foo.service will be stopped (or restarted) if aws-otel-collector.service is explicitly stopped (or restarted). The journal log shows aws-otel-collector.service stopping every 5 seconds:

bash-5.2# journalctl -u aws-otel-collector | grep "Stopped"
Dec 10 08:47:27 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: Stopped AWS OTEL collector.
Dec 10 08:47:33 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: Stopped AWS OTEL collector.
Dec 10 08:47:38 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: Stopped AWS OTEL collector.
Dec 10 08:47:43 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: Stopped AWS OTEL collector.
...
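To see which units a failing service like this drags down with it, systemctl can list reverse dependencies. The output below is abridged and illustrative; on this host, foo.service appears because of its Requires= on the collector:

bash-5.2# systemctl list-dependencies --reverse aws-otel-collector.service
aws-otel-collector.service
● └─foo.service
...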

aws-otel-collector.service keeps stopping because its configuration file, otel.yaml, is invalid:

Dec 10 08:49:07 ip-10-194-10-120.us-west-2.compute.internal aws-otel-collector[15696]: Error: failed to get config: cannot unmarshal the configuration: decoding failed due to the following error(s):
Dec 10 08:49:07 ip-10-194-10-120.us-west-2.compute.internal aws-otel-collector[15696]: 'service.telemetry.metrics' decoding failed due to the following error(s):
Dec 10 08:49:07 ip-10-194-10-120.us-west-2.compute.internal aws-otel-collector[15696]: '' has invalid keys: address
Dec 10 08:49:07 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: aws-otel-collector.service: Main process exited, code=exited, status=1/FAILURE
Dec 10 08:49:07 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: aws-otel-collector.service: Failed with result 'exit-code'.
Dec 10 08:49:12 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: aws-otel-collector.service: Scheduled restart job, restart counter is at 21.
Dec 10 08:49:12 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: Stopped AWS OTEL collector.

This was strange, since we hadn't changed the config YAML recently. The issue arose because we consume aws-otel-collector from bottlerocket-core-kit, which upgraded aws-otel-collector from v0.43.3 to v0.45.1 in its v11.1.0 release. Since we automatically pick up bottlerocket-core-kit updates in CI/CD, we inadvertently pulled in a backward-incompatible change: the config below, which exports aws-otel-collector's own metrics on port 51681, is no longer valid syntax in v0.45.1. Removing it resolved the issue for our use case:

service:
  telemetry:
    metrics:
      address: ":51681"
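For reference, if you still need the collector to expose its own metrics, newer versions configure this through metric readers instead of address. The snippet below is a sketch based on the upstream OpenTelemetry Collector internal-telemetry configuration; verify the exact keys against the docs for the version you run:

service:
  telemetry:
    metrics:
      readers:
        - pull:
            exporter:
              prometheus:
                host: "0.0.0.0"
                port: 51681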

Conclusion

  1. Spawning a separate process for a service's daemon is an anti-pattern. If you need a starter process (e.g., start-foo to start foo), don't fork and run foo as a child. Instead, have the starter replace its own process image with the daemon via an exec-family call such as execve (Go's unix.Exec, for example); a minimal sketch follows this list.
  2. Use "KillMode=process" with caution. The systemd documentation warns: "If set to process, only the main process itself is killed (not recommended!)." Use KillMode=process only if you're certain spawned processes should outlive the main process. The default "KillMode=control-group" should be used for most services.
  3. Use "Requires=" with caution. The systemd documentation states: "Often, it is a better choice to use Wants= instead of Requires= in order to achieve a system that is more robust when dealing with failing services." This is especially true for critical services (e.g., foo) that shouldn't depend on non-critical services (e.g., aws-otel-collector).
  4. To avoid process leaks, set TasksMax= to a reasonable number (see the revised unit sketch after this list). The default TasksMax= is 15% of the minimum of kernel.pid_max, kernel.threads-max, and the root cgroup's pids.max, which can be in the thousands on a typical server.
  5. Having engineers vet each third-party dependency version upgrade is prohibitively expensive for large systems. A more viable solution is catching breaking changes through functional tests in non-prod environments. For example, we added long-running task tests in Gamma.
  6. systemd doesn't keep a record of why a service was stopped, which makes root cause analysis difficult. On a live system, it will briefly show that a service stopped because its dependencies were not met, but this information is not persisted.
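For point 1, here is a minimal Go sketch of the exec approach: instead of forking a child, start-foo replaces its own process image with the daemon, so the PID systemd tracks as MainPID is foo itself. Paths and setup steps are placeholders, and it assumes the golang.org/x/sys/unix package.

package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

func main() {
	// ... whatever setup work start-foo is responsible for goes here ...

	daemon := "/usr/bin/foo" // placeholder path for the real daemon
	argv := []string{daemon} // argv[0] plus any flags foo needs
	env := os.Environ()      // pass the current environment through

	// unix.Exec wraps execve(2): on success it never returns, and the
	// current process is replaced by /usr/bin/foo. systemd's MainPID now
	// points at the daemon, so Restart=, KillMode=, and cgroup cleanup
	// all act on the process that actually does the work.
	if err := unix.Exec(daemon, argv, env); err != nil {
		log.Fatalf("exec %s: %v", daemon, err)
	}
}

For points 2-4, a revised foo.service might look like the sketch below: the hard Requires= is relaxed to Wants=, KillMode= is left at its control-group default, and TasksMax= caps how far a leak can grow. Treat the cap value as illustrative.

[Unit]
Description=Setup foo
# Wants= still pulls in the collector but no longer propagates its
# stops and restarts to foo.service.
Wants=aws-otel-collector.service

[Service]
Type=simple
ExecStart=/usr/bin/start-foo
Restart=always
RestartSec=3
# KillMode= is omitted so the default control-group mode applies.
# TasksMax=64 is an illustrative cap; size it for your workload.
TasksMax=64
StandardError=journal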