Lessons Learned from the Bumpy Journey of Socket-Activating soci-snapshotter

While migrating our container runtime OS from Amazon Linux 2 (AL2) to Bottlerocket, we encountered several interesting incidents and errors, documented in previous posts: 1, 2, 3. This post describes two more. The most notorious one leaked a process every 5 seconds, quickly consuming all available CPU and memory and forcing us to roll back.

Start soci-snapshotter on-demand using socket activation

soci-snapshotter is an open-source containerd snapshotter plugin that enables pulling OCI images in lazy-loading mode using a SOCI index. In our container runtime, we start soci-snapshotter only when an image has a SOCI index. Previously, this on-demand start used path activation: once the runtime found a SOCI index, it would write a sentinel file to disk, triggering systemd to start the soci-snapshotter service. About a year ago, soci-snapshotter added support for systemd socket activation by accepting a systemd-passed file descriptor via the "--address=fd://" flag. With socket activation, systemd creates and listens on the socket before the service starts; when something connects to it, systemd launches the service and hands over the already-open file descriptors.

At a high level:

  1. Given the following socket unit file "soci-snapshotter.socket", systemd creates and listens on /run/soci-snapshotter-grpc/soci-snapshotter-grpc.sock.
[Unit]
Description=Monitor soci snapshotter socket file for changes and start snapshotter

[Socket]
ListenStream=/run/soci-snapshotter-grpc/soci-snapshotter-grpc.sock
SocketMode=0660

[Install]
WantedBy=sockets.target
  2. Register the soci plugin in containerd.
[proxy_plugins]
  [proxy_plugins.soci]
    type = "snapshot"
    address = "/run/soci-snapshotter-grpc/soci-snapshotter-grpc.sock"
  3. The container runtime finds a SOCI index and asks the containerd client to pull the image with the soci-snapshotter plugin (see the client-side sketch after this list).
  4. The containerd client sends a gRPC request to the socket.
  5. Systemd receives the connection on the socket, starts the service of the same name (soci-snapshotter.service), and passes the open file descriptors to the main process started by the service.
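
For illustration, here is a minimal sketch of step 3 using containerd's Go client. The image reference is a placeholder, error handling is trimmed, and the SOCI-specific pull options (for example the handler wrapper that attaches lazy-loading labels) are elided; the point is only that selecting the "soci" snapshotter is what makes the client dial the socket above.

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	ctx := namespaces.WithNamespace(context.Background(), "default")

	// Pulling with the "soci" proxy plugin makes the containerd client talk to
	// /run/soci-snapshotter-grpc/soci-snapshotter-grpc.sock, which is what
	// triggers systemd to start soci-snapshotter.service.
	img, err := client.Pull(ctx, "public.ecr.aws/your-registry/your-image:latest", // placeholder ref
		containerd.WithPullUnpack,
		containerd.WithPullSnapshotter("soci"),
	)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("pulled %s", img.Name())
}
```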

Mysterious one-second latency

Soon after deployment, we observed approximately 1 second of additional image pull latency. Enabling debug log level for containerd showed the following:

2025-11-10T07:46:14.609236015Z level=debug msg="create image" name="xxx.dkr.ecr.us-west-2.amazonaws.com/telemetry:latest@sha256:xxx" target="sha256:xxx"
2025-11-10T07:46:15.174746106Z level=debug msg="prepare view snapshot" key=ping parent= snapshotter=soci
2025-11-10T07:46:16.177995366Z level=debug msg="prepare view snapshot" key=ping parent= snapshotter=soci
2025-11-10T07:46:16.184813541Z level=debug msg="prepare view snapshot" key=ping parent= snapshotter=soci
2025-11-10T07:46:16.194634746Z level=debug msg="prepare view snapshot" key=ping parent= snapshotter=soci
... retry every 10ms ...
2025-11-10T07:46:17.184668850Z level=debug msg="prepare view snapshot" key=ping parent= snapshotter=soci
2025-11-10T07:46:17.274659068Z level=debug msg="(*service).Write started" expected="sha256:xxx" ref="manifest-sha256:xxx" total=1577

The "ping" is a Snapshotter.View call with an empty "parent" argument. After about 100 retries, the ping finally succeeds, indicating soci-snapshotter is ready.

Interestingly, soci-snapshotter saw only the last ping request: the first request it ever received arrived more than 1 second after it reported itself started.

2025-11-10T07:46:15.560288947Z "address":"fd://","level":"info","msg":"soci-snapshotter-grpc successfully started"
2025-11-10T07:46:17.187325450Z "key":"fargate.task/103/ping","level":"debug","msg":"view"

Where did the failed ping requests go?

Note that the daemon software configured for socket activation with socket units needs to be able to accept sockets from systemd, either via systemd's native socket passing interface (see sd_listen_fds for details about the precise protocol used and the order in which the file descriptors are passed) or via traditional inetd(8)-style socket passing (i.e. sockets passed in via standard input and output, using StandardInput=socket in the service file). --- systemd socket activation document

Under this protocol, systemd does the following, and sd_listen_fds (or an equivalent check in the daemon) consumes it; a Go sketch of the consuming side follows the list:

  1. Sets two environment variables: LISTEN_FDS (the number of file descriptors passed) and LISTEN_PID (the PID of the process they are intended for, so other processes can ignore fds not meant for them)
  2. Passes the file descriptors themselves, starting at FD 3
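
For illustration, a minimal Go sketch of the consuming side, doing by hand what the coreos/go-systemd activation package (or C's sd_listen_fds) does. It assumes exactly one socket was passed:

```go
package main

import (
	"fmt"
	"net"
	"os"
	"strconv"
)

// activationListener turns the systemd-passed socket into a net.Listener.
func activationListener() (net.Listener, error) {
	// LISTEN_PID must name this exact process; otherwise the fds are not for us.
	if pid, err := strconv.Atoi(os.Getenv("LISTEN_PID")); err != nil || pid != os.Getpid() {
		return nil, fmt.Errorf("LISTEN_PID does not match this process")
	}
	// We expect exactly one passed descriptor; passed fds start at 3.
	if n, err := strconv.Atoi(os.Getenv("LISTEN_FDS")); err != nil || n != 1 {
		return nil, fmt.Errorf("expected 1 passed fd, got LISTEN_FDS=%q", os.Getenv("LISTEN_FDS"))
	}
	f := os.NewFile(3, "systemd-activated-socket")
	defer f.Close() // net.FileListener dups the descriptor
	return net.FileListener(f)
}

func main() {
	ln, err := activationListener()
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer ln.Close()
	fmt.Println("listening on", ln.Addr())
}
```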

However, the environment variables and file descriptors are delivered to the main process started by soci-snapshotter.service, a Go program called start-soci-snapshotter. This program first performs some preparation work and then spawns a new process to actually run soci-snapshotter.

func main() {
    flag.BoolVar(&inside, "inside", false, "Whether the soci-snapshotter program is running inside the new mount namespace")
    flag.Parse()
    if !inside {
        doSomePreparationWorks()
        // spawns a new process to actually run soci-snapshotter.
        cmd := exec.Command(os.Args[0], "-inside")
        cmd.SysProcAttr = &unix.SysProcAttr{
            Cloneflags: unix.CLONE_NEWNS,
        }
        cmd.Run()
    } else {
        args := []string{
            "nsenter",
            "-t", fmt.Sprintf("%d", os.Getpid()),
            "--mount",
            fmt.Sprintf("--net=%s", filepath.Join("/var/run/netns", netns)),
            "/usr/bin/soci-snapshotter-grpc",
            "--log-level=debug",
            "--address=fd://",
            "--config=/etc/soci-snapshotter-grpc/config.toml",
        }
        unix.Exec("/usr/bin/nsenter", args, os.Environ())
    }
}

The environment variables and file descriptors are passed to the main process, but exec.Command does not forward them to the child by default: only stdin, stdout, and stderr are passed on, so the listening socket (FD 3) never reaches the spawned process, and even though the environment is inherited when cmd.Env is nil, LISTEN_PID still names the parent rather than the child. You must explicitly set cmd.ExtraFiles and adjust cmd.Env (LISTEN_FDS and LISTEN_PID) on the command object. When soci-snapshotter starts in the spawned process, it therefore doesn't find the expected file descriptor, so it removes the socket file and binds a new one at the same path. See soci's listenUnix:

func listenUnix(addr string) (net.Listener, error) {
    // Try to remove the socket file to avoid EADDRINUSE
    if err := os.RemoveAll(addr); err != nil {
        return nil, fmt.Errorf("failed to remove %q: %w", addr, err)
    }
    return net.Listen("unix", addr)
}

Ping requests never reach soci-snapshotter because it deletes the socket systemd created and binds a fresh socket at the same path. The containerd client's original connection is presumably still sitting, unaccepted, in the backlog of the socket systemd created, so the pings sent over it go unanswered. This explains the retries, but why does a ping eventually succeed after about 1 second?

Containerd uses gRPC's default reconnection backoff, a 1 s base delay with 20% jitter. Once that backoff expires, containerd re-dials the address and this time connects to the socket soci-snapshotter is now listening on, so the ping succeeds.
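
The fix, in sketch form: forward the passed socket via cmd.ExtraFiles and make the LISTEN_* variables point at the process that will ultimately consume them. The snippet below is a simplified illustration rather than our exact patch; it assumes exactly one passed socket, omits nsenter and the snapshotter's config flags, and relies on the fact that execve (and nsenter entering only mount and network namespaces) keeps the same PID, so the "-inside" child can safely claim the descriptor by rewriting LISTEN_PID to its own PID.

```go
package main

import (
	"flag"
	"log"
	"os"
	"os/exec"
	"strconv"

	"golang.org/x/sys/unix"
)

// spawnInside re-executes this program with -inside in a new mount namespace,
// forwarding the systemd-passed socket and the LISTEN_FDS variable.
func spawnInside() error {
	// fd 3 is the first descriptor passed by systemd (SD_LISTEN_FDS_START).
	sock := os.NewFile(3, "soci-snapshotter-grpc.sock")

	cmd := exec.Command(os.Args[0], "-inside")
	cmd.SysProcAttr = &unix.SysProcAttr{Cloneflags: unix.CLONE_NEWNS}
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	// ExtraFiles[i] becomes fd 3+i in the child, so the socket is fd 3 again.
	cmd.ExtraFiles = []*os.File{sock}
	// LISTEN_PID is rewritten by the child itself once it knows its own PID.
	cmd.Env = append(os.Environ(), "LISTEN_FDS=1")
	return cmd.Run()
}

// execSnapshotter runs in the "-inside" child: it claims the passed fd by
// pointing LISTEN_PID at itself, then execs soci-snapshotter-grpc directly
// (nsenter and config flags omitted here; execve preserves the PID, so the
// LISTEN_PID check in the snapshotter still passes).
func execSnapshotter() error {
	os.Setenv("LISTEN_PID", strconv.Itoa(os.Getpid()))
	args := []string{"soci-snapshotter-grpc", "--address=fd://"}
	return unix.Exec("/usr/bin/soci-snapshotter-grpc", args, os.Environ())
}

func main() {
	inside := flag.Bool("inside", false, "run inside the new mount namespace")
	flag.Parse()

	var err error
	if *inside {
		err = execSnapshotter()
	} else {
		err = spawnInside()
	}
	if err != nil {
		log.Fatal(err)
	}
}
```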

Process leak drains available resources

After fixing the bug to pass the environment variables and file descriptors, we tested and confirmed the 1s latency was gone. However, soon after deployment, we encountered a more serious incident where the Linux host became unusable over time. Metrics showed CPU and memory utilization reaching 100%. Even in a test environment where the container only echoed "hello" periodically, CPU and memory utilization grew steadily until reaching 100%. Something was leaking!

The systemctl status output showed soci-snapshotter.service piling up leaked soci-snapshotter-grpc processes, reaching 1,250 tasks after just a few minutes:

     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/soci-snapshotter.service; static)
    Drop-In: /x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/service.d
             └─00-aws-config.conf
     Active: active (running) since Wed 2025-12-10 00:06:08 UTC; 3s ago
TriggeredBy: ● soci-snapshotter.socket
   Main PID: 8856 (start-soci-snap)
      Tasks: 1250 (limit: 18702)
     Memory: 3.0G
     CGroup: /system.slice/soci-snapshotter.service
             ├─2114 /usr/bin/start-soci-snapshotter
             ├─1870 /usr/bin/soci-snapshotter-grpc --log-level=debug --address=fd:// --config=/etc/soci-snapshotter-grpc/config.toml
             ├─2082 /usr/bin/soci-snapshotter-grpc --log-level=debug --address=fd:// --config=/etc/soci-snapshotter-grpc/config.toml
             ├─2121 /usr/bin/soci-snapshotter-grpc --log-level=debug --address=fd:// --config=/etc/soci-snapshotter-grpc/config.toml
...

Each new soci-snapshotter process deletes the socket file the previous one created. The previous process is then stranded: no more requests ever reach it, yet it keeps running indefinitely, consuming CPU and memory. The journal log confirms that soci-snapshotter.service stops and starts every 5 seconds:

bash-5.2# journalctl -u soci-snapshotter | grep "Stopped"
Dec 10 08:47:27 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: Stopped Setup soci-snapshotter-grpc.
Dec 10 08:47:33 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: Stopped Setup soci-snapshotter-grpc.
Dec 10 08:47:38 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: Stopped Setup soci-snapshotter-grpc.
Dec 10 08:47:43 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: Stopped Setup soci-snapshotter-grpc.
...

But why weren't the previous processes cleaned up when the service stopped? KillMode is set to process, which means systemd kills only the main process, in this case the outer Go program start-soci-snapshotter; the soci-snapshotter-grpc processes it spawned are left running in the cgroup.

[Unit]
Description=Setup soci-snapshotter-grpc
Requires=aws-otel-collector.service

[Service]
Type=simple
ExecStart=/usr/bin/start-soci-snapshotter
Restart=always
KillMode=process
RestartSec=3
StandardError=journal

What stopped soci-snapshotter.service?

The soci-snapshotter log shows no panic or abnormal crash. The interval between stops is about 5 seconds, which doesn't match "RestartSec=3", so the stop-and-start cycle is unlikely to be caused by the service's own RestartSec directive. The journal indicates that systemd decided to stop soci-snapshotter.service, but neither the soci-snapshotter.service journal nor the systemd log explains why.

The culprit is "Requires=aws-otel-collector.service". Requires= means soci-snapshotter.service will be stopped (or restarted) if aws-otel-collector.service is explicitly stopped (or restarted). The journal log shows aws-otel-collector.service stopping every 5 seconds:

bash-5.2# journalctl -u aws-otel-collector | grep "Stopped"
Dec 10 08:47:27 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: Stopped AWS OTEL collector.
Dec 10 08:47:33 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: Stopped AWS OTEL collector.
Dec 10 08:47:38 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: Stopped AWS OTEL collector.
Dec 10 08:47:43 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: Stopped AWS OTEL collector.
...

The stop occurs because the configuration file, otel.yaml, is invalid:

Dec 10 08:49:07 ip-10-194-10-120.us-west-2.compute.internal aws-otel-collector[15696]: Error: failed to get config: cannot unmarshal the configuration: decoding failed due to the following error(s):
Dec 10 08:49:07 ip-10-194-10-120.us-west-2.compute.internal aws-otel-collector[15696]: 'service.telemetry.metrics' decoding failed due to the following error(s):
Dec 10 08:49:07 ip-10-194-10-120.us-west-2.compute.internal aws-otel-collector[15696]: '' has invalid keys: address
Dec 10 08:49:07 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: aws-otel-collector.service: Main process exited, code=exited, status=1/FAILURE
Dec 10 08:49:07 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: aws-otel-collector.service: Failed with result 'exit-code'.
Dec 10 08:49:12 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: aws-otel-collector.service: Scheduled restart job, restart counter is at 21.
Dec 10 08:49:12 ip-10-194-10-120.us-west-2.compute.internal systemd[1]: Stopped AWS OTEL collector.

This was puzzling, since we hadn't changed the config YAML recently. The issue arose because we consume aws-otel-collector from bottlerocket-core-kit, which upgraded aws-otel-collector from v0.43.3 to v0.45.1 in its v11.1.0 release. Since we automatically pick up bottlerocket-core-kit updates in CI/CD, we inadvertently pulled in this backward-incompatible change.

The following config exposes aws-otel-collector's own metrics on port 51681. The syntax is no longer valid in v0.45.1; removing it resolved the issue for our use case:

service:
  telemetry:
    metrics:
      address: ":51681"

Why did we make soci-snapshotter.service require aws-otel-collector.service in the first place? We shouldn't have. The intention was simply to start the collector whenever soci-snapshotter starts, since we only ship soci-snapshotter's metrics to CloudWatch. "Wants=" expresses that without tying soci-snapshotter's lifecycle to the collector's; we should have used it instead of "Requires=".

Conclusion

  1. Spawning a new process from a socket-activated daemon is rarely a good idea. If you must do it, make sure the passed file descriptors and the LISTEN_* environment variables reach the process that will actually serve the socket.
  2. Use "KillMode=process" with caution. The systemd documentation warns: "If set to process, only the main process itself is killed (not recommended!)." Use KillMode=process only if you're certain spawned processes should outlive the main process. The default "KillMode=control-group" should be used for most services.
  3. Use "Requires=" with caution. The systemd documentation states: "Often, it is a better choice to use Wants= instead of Requires= in order to achieve a system that is more robust when dealing with failing services." This is especially true for critical services (e.g., soci-snapshotter) that shouldn't depend on non-critical services (e.g., aws-otel-collector).
  4. To limit the blast radius of a process leak, set TasksMax= to a reasonable number for the service. The default is 15% of the minimum of kernel.pid_max, kernel.threads-max, and the root cgroup's pids.max, which is easily in the thousands on a typical server (18702 on the host above).
  5. Having engineers vet each third-party dependency version upgrade is prohibitively expensive for large systems. A more viable solution is catching breaking changes through functional tests in non-prod environments. For example, we added long-running SOCI task tests in Gamma.