Who killed my service: collecting kernel kill logs with OTEL

Mar 10, 2026 · 6 min read · Linux

We run a container platform. For privacy and security reasons, we do not collect kernel logs because customer workloads use the same kernel as the host and kernel messages can contain sensitive customer data, such as command-line arguments surfaced in audit logs. However, we recently hit a blind spot: foo.service was killed with no trace in its own logs or systemd logs. No error, no panic, no graceful shutdown message. The process was just gone. Our suspicion pointed to the kernel, but we had no evidence. This post shows how to close the gap by collecting selected kernel kill logs.

When Does the Kernel Kill a Process?

The kernel protects the system: it kills a user-space process when that process violates a safety rule, or when the system runs out of a physical resource it needs to stay stable. Here are some common causes of a kernel kill.

Out of Memory (OOM) Killer: When RAM and swap are fully exhausted, the kernel picks a victim using an OOM score and kills it to prevent a total system hang. This is the most common cause of mysterious process deaths.

Log signal: Out of memory: Killed process ...

Memory cgroup limit: In containerized environments, each container typically has a memory cgroup limit. When a container exceeds its limit, the kernel's cgroup OOM killer fires even if the host has plenty of free memory.

Log signal: Memory cgroup out of memory: Killed process ...

Segmentation Fault: A process accesses memory it doesn't own, or writes to a read-only region. The CPU raises a hardware exception, the kernel catches it, and the process is terminated with SIGSEGV. Note that the Go runtime catches SIGSEGV and turns it into a panic; see Default behavior of signals in Go programs. The JVM does something similar: it catches SIGSEGV and throws a NullPointerException; see Handle Signals and Exceptions.

Note that the Go runtime does not catch the SIGSEGV if cgo is enabled and the segfault occurs in C code. Similarly, the JVM does not catch segfaults that occur in JNI code.

Log signal: segfault at ... ip ... sp ... error 4

Illegal Instruction: A process executes a CPU instruction it doesn't understand, often from a binary compiled for the wrong architecture, or a corrupted executable.

Log signal: traps: ... trap invalid opcode ... (on x86, the process name and PID sit between "traps:" and "trap invalid opcode")

Kernel Oops via a Driver: A system call triggers a bug in a kernel driver. The kernel tries to stay alive but kills the offending process to stabilize the system.

The common thread: none of these show up in the application's own logs. The process is gone before it can write anything. The evidence lives exclusively in the kernel ring buffer, accessible via dmesg or journalctl -k.
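The log signals listed above are enough to label a kernel message with a probable cause. The following is a minimal sketch, not part of the original setup: the function name and labels are made up for illustration, and the patterns are derived from the "Log signal" lines in this section. The exact message text can vary by kernel version and architecture, so treat the patterns as a starting point.

```python
import re

# Patterns derived from the "Log signal" lines above. Order matters: the cgroup
# message also contains the plain "out of memory" wording, so it is checked first.
KILL_SIGNATURES = [
    ("cgroup_oom", re.compile(r"Memory cgroup out of memory: Killed process", re.I)),
    ("oom", re.compile(r"Out of memory: Killed process", re.I)),
    ("segfault", re.compile(r"segfault at ")),
    # ".*" tolerates a process name and PID between "traps:" and the trap name.
    ("illegal_instruction", re.compile(r"traps: .*invalid opcode")),
]


def classify_kernel_line(line):
    """Return a cause label for a kernel log line, or None if no signature matches."""
    for label, pattern in KILL_SIGNATURES:
        if pattern.search(line):
            return label
    return None
```

Feeding it the output of journalctl -k line by line gives a quick summary of which kind of kill, if any, happened on a host.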

The Config Change

You can use the OpenTelemetry Collector's journald receiver to collect kernel logs. However, the kernel receiver needs to be separate from the service receiver. For example, the following config has one receiver for the systemd service (journalctl -u) and another for the kernel log (journalctl -k).

receivers:
  journald/service:
    directory: /var/log/journal
    units:
      - foo
    priority: info
    storage: file_storage
    start_at: beginning

  journald/kernel:
    directory: /var/log/journal
    matches:
      - _TRANSPORT: kernel
    grep: '(?i)(out of memory|killed process|memory cgroup out of memory|segfault at|invalid opcode|general protection)'
    priority: debug
    start_at: beginning

A few things worth explaining.

Why `priority: debug` for the kernel receiver?

Kernel messages don't follow the same severity conventions as application logs. Critical events like OOM kills are often emitted at debug or info priority in journald's view, not err or crit. The priority option is a threshold: the receiver keeps entries at that level and anything more severe, so setting priority: info or a stricter level silently drops the lower-priority messages you may be looking for. Setting priority: debug ensures the receiver reads all kernel messages and lets the grep filter do the real work of keeping only what's relevant.
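Which priority your kernel actually assigns to these messages is easy to check before settling on a threshold. The sketch below is an illustration rather than part of the original setup: it expects lines in the format produced by journalctl -k -o json, relies on the standard MESSAGE and PRIORITY journal fields, and uses a simplified stand-in for the receiver's grep pattern. The function name is hypothetical.

```python
import json
import re
from collections import Counter

# Syslog severity names keyed by journald's numeric PRIORITY field ("0" to "7").
PRIORITY_NAMES = dict(
    zip("01234567", ["emerg", "alert", "crit", "err", "warning", "notice", "info", "debug"])
)

# Simplified stand-in for the receiver's grep pattern (an assumption for this sketch).
KILL_PATTERN = re.compile(
    r"out of memory|killed process|segfault at|invalid opcode|general protection",
    re.IGNORECASE,
)


def tally_priorities(json_lines, pattern=KILL_PATTERN):
    """Count kill-related kernel messages per priority name."""
    counts = Counter()
    for raw in json_lines:
        raw = raw.strip()
        if not raw:
            continue
        entry = json.loads(raw)
        message = entry.get("MESSAGE")
        # journalctl emits non-UTF-8 messages as byte arrays; skip those here.
        if not isinstance(message, str) or not pattern.search(message):
            continue
        priority = str(entry.get("PRIORITY", ""))
        counts[PRIORITY_NAMES.get(priority, "unknown")] += 1
    return counts
```

In a small script, calling tally_priorities(sys.stdin) with the output of journalctl -k -o json piped in shows how many matching messages arrived at each level on that host.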

Why a separate receiver?

The journald receiver ANDs its filter options together. Combining units: [foo] with matches: [{_TRANSPORT: kernel}] would therefore look for kernel messages that also belong to the foo unit, which is contradictory: kernel messages don't have a unit. The matches option does offer limited OR logic, since separate list entries are ORed while the fields inside a single entry are ANDed, but the result is still ANDed with units, and priority and grep apply to the whole receiver. So if you want "service logs at info OR grep-filtered kernel logs at debug", you need two receivers. The only alternative would be to drop all filtering and read everything from journald, then use a filter processor downstream, but that means ingesting far more data than you need, which defeats the purpose.
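The contradiction is easy to see with a toy model. The sketch below is not the receiver's code: it is a hypothetical Python function that mimics ANDing a units filter with a single matches entry, applied to two hand-written journal entries. It assumes, as described above, that kernel entries carry _TRANSPORT=kernel and no _SYSTEMD_UNIT field.

```python
def entry_matches(entry, units=None, match=None):
    """Toy model: every filter that is set must pass (logical AND)."""
    if units is not None:
        # Accept both "foo" and "foo.service" spellings of a unit name.
        allowed = set(units) | {f"{u}.service" for u in units}
        if entry.get("_SYSTEMD_UNIT") not in allowed:
            return False
    if match is not None:
        # All fields inside one matches entry must be equal.
        if any(entry.get(field) != value for field, value in match.items()):
            return False
    return True


# Two illustrative entries: one from the service, one from the kernel.
service_entry = {"_SYSTEMD_UNIT": "foo.service", "_TRANSPORT": "stdout"}
kernel_entry = {"_TRANSPORT": "kernel"}  # no unit field
journal = [service_entry, kernel_entry]

kernel_match = {"_TRANSPORT": "kernel"}
combined = [e for e in journal if entry_matches(e, units=["foo"], match=kernel_match)]
service_only = [e for e in journal if entry_matches(e, units=["foo"])]
kernel_only = [e for e in journal if entry_matches(e, match=kernel_match)]

print(len(combined), len(service_only), len(kernel_only))  # prints: 0 1 1
```

One receiver with both filters selects nothing, while two receivers with one filter each select exactly the entries they were meant to.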

Each journald receiver is a separate journalctl process

OTEL doesn't read journald logs directly; it spawns one journalctl process per receiver. This also means the receiver's grep filter depends on the host's journalctl: systemd v239+ is required.

bash-5.3# systemctl status otelcol
● otelcol.service - Executes open telemetry collector to send logs.
     Loaded: loaded (/x86_64-bottlerocket-linux-gnu/sys-root/usr/lib/systemd/system/otelcol.service; static)
     Active: active (running) since Sat 2026-03-07 00:07:51 UTC; 3 days ago
   Main PID: 2341 (otelcol)
      Tasks: 12 (limit: 9256)
     Memory: 25.5M
        CPU: 24min 5.574s
     CGroup: /system.slice/otelcol.service
             ├─2341 /usr/sbin/otelcol
             ├─2348 journalctl --utc --output=json --follow --no-tail --unit foo --priority info --directory /var/log/journal --after-cursor xxxxx
             └─2349 journalctl --utc --output=json --follow --no-tail --priority debug --directory /var/log/journal _TRANSPORT=kernel --grep=xxxx
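Because the receiver is just a journalctl child process, you can rebuild its command line and run it by hand to see what a filter would collect. The helper below is a hypothetical sketch: the function name is made up, and the flags are copied from the process listing above. It leaves out --follow by default so the command exits after printing the stored entries.

```python
import shlex


def kernel_journalctl_args(directory, grep_pattern, priority="debug", follow=False):
    """Build an argument list resembling the kernel receiver's journalctl call."""
    args = ["journalctl", "--utc", "--output=json"]
    if follow:
        args.append("--follow")  # keep streaming new entries, as the receiver does
    args += [
        "--no-tail",
        "--priority", priority,
        "--directory", directory,
        f"--grep={grep_pattern}",
        "_TRANSPORT=kernel",  # journal field match: kernel messages only
    ]
    return args


if __name__ == "__main__":
    # Print a copy-pasteable command; shlex.join quotes arguments containing spaces.
    print(shlex.join(kernel_journalctl_args("/var/log/journal", "segfault at")))
```

Passing the same list to subprocess.run on the host would execute it directly, which is a convenient way to test a grep pattern before rolling out a config change.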

Appendix: Examples that trigger kernel kill

Trigger OOM (scenario-oom.service)

[Unit]
Description=Demo Scenario - OOM Kill
After=network.target

[Service]
Type=simple
MemoryMax=32M
# tail /dev/zero buffers zeros endlessly while waiting for a newline, so it hits the limit in seconds
ExecStart=/bin/bash -c 'echo "Eating memory..."; tail /dev/zero'
Restart=no

[Install]
WantedBy=multi-user.target

Trigger segfault

Go version

package main

/*
// Dereference NULL in C. This bypasses Go's runtime signal handler,
// so the CPU fault reaches the kernel and appears in journalctl -k:
//   "segfault at 0 ip ... error 6 in segfault[...]"
// (on x86, error 6 is a user-mode write; a read would show error 4)
#include <stdlib.h>
void trigger_segfault() {
    volatile int *p = NULL;
    *p = 1;
}
*/
import "C"

func main() {
    C.trigger_segfault()
}

Python version

import ctypes
print("Triggering segfault via null pointer dereference...")
ctypes.string_at(0)

Trigger sigill