Ways Go programs die
Our Go programs recently triggered an alarm due to excessive panics. A panic is a Go runtime mechanism that halts normal execution. It got me thinking about the different ways a Go program can die. I don't expect many, not like A Million Ways to Die in the West. In this post, we'll go through the various ways Go programs die. These fall into two categories: the program voluntarily chooses to die, or it is involuntarily killed by the Go runtime or the operating system.
The main goroutine stops
Go runs the main() function in the main goroutine, which is scheduled just like any other goroutine. However, the main goroutine is unique: when it has no more work to do (main() has returned and all of its deferred functions have run), the Go runtime calls exit_group(0), terminating the program even if other goroutines are still running.
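Because the runtime does not wait for other goroutines, main() has to wait for them explicitly if their work matters. The following is a minimal sketch of one way to do that with sync.WaitGroup; runWorkers is a hypothetical helper written for this illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// runWorkers (hypothetical helper) starts n goroutines and blocks until
// all of them have finished. It returns how many workers completed.
func runWorkers(n int) int {
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		done int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			done++
			mu.Unlock()
		}()
	}
	// Without this Wait, main could return first and the runtime would
	// exit the process while workers are still running.
	wg.Wait()
	return done
}

func main() {
	fmt.Println("workers finished:", runWorkers(3))
}
```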
Calls os.Exit()
The os.Exit function makes the exit_group syscall to terminate the program immediately; deferred functions are not run. Fatal-level logging in most libraries, such as log.Fatal, calls os.Exit after logging the message.
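Since os.Exit skips deferred functions, a common way to keep cleanup intact is to confine os.Exit to main() and let a separate function return the exit code. The following is a sketch of that pattern; the run function and its fail parameter are made up for this illustration.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// run (hypothetical) holds the real program logic. It returns an exit code
// instead of calling os.Exit, so its deferred functions always execute.
func run(fail bool) int {
	defer fmt.Println("cleanup done") // runs before run returns

	if fail {
		fmt.Fprintln(os.Stderr, "error:", errors.New("something went wrong"))
		return 1
	}
	return 0
}

func main() {
	// os.Exit is only reached after every defer inside run has finished.
	if code := run(false); code != 0 {
		os.Exit(code)
	}
}
```

The same reasoning applies to log.Fatal: calling it deep inside the program skips any pending defers, so returning an error up to main() is usually the safer choice.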
Out of memory
When there isn't enough memory to grow the stack or allocate on the heap, the Go runtime calls exit_group(2). For example,
// fatal error: runtime: out of memory
var f [1000000000000]int
f[0] = 1
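In production, it's common to tune GOGC (the heap growth percentage that triggers the next collection) and GOMEMLIMIT (a soft limit on the memory managed by the Go runtime) so that the garbage collector works harder as memory gets tight.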
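Both settings can also be applied from code through the runtime/debug package, which is handy when the limit is derived from a container's memory allowance. The following is a minimal sketch; applyMemorySettings is a hypothetical helper, and the values 50 and 512 MiB are arbitrary. Keep in mind that the memory limit is soft: the runtime may exceed it, so it reduces the chance of running out of memory rather than ruling it out.

```go
package main

import (
	"fmt"
	"runtime/debug"
)

// applyMemorySettings (hypothetical helper) sets the GC target percentage
// and the soft memory limit, and returns the values that were in effect
// before the call.
func applyMemorySettings(gcPercent int, limitBytes int64) (prevPercent int, prevLimit int64) {
	prevPercent = debug.SetGCPercent(gcPercent)   // same knob as GOGC
	prevLimit = debug.SetMemoryLimit(limitBytes) // same knob as GOMEMLIMIT
	return prevPercent, prevLimit
}

func main() {
	// Arbitrary example values: collect more eagerly, soft limit of 512 MiB.
	prevPercent, prevLimit := applyMemorySettings(50, 512<<20)
	fmt.Println("previous GOGC:", prevPercent)
	fmt.Println("previous memory limit (bytes):", prevLimit)
}
```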
Unrecovered panics
The Go runtime calls exit_group(2) if any goroutine panics without being recovered. A panic can be triggered by the Go runtime when it is unable to figure out what should happen next. This can result from bugs in the Go runtime or errors in the program. For example, integer division by zero, accessing an index out of range, or creating a slice with a negative length. A program can also explicitly call panic() when it does not know how to proceed. For example, if an SDK’s credentials have expired and the program cannot refresh them automatically, it might make sense to panic and restart later, once the environment has refreshed the credentials.
The following is an example of handling certain panics before the program exits:
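package main

import (
	"errors"
	"fmt"
	"os"
)

// FooPanic is the panic value this program knows how to handle.
type FooPanic struct {
	// Information of the panic. We use a simple error for example.
	Err error
}

// foo is a placeholder for real work; it reports whether that work failed.
func foo() bool { return true }

func main() {
	defer func() {
		if r := recover(); r != nil {
			if fooPanic, ok := r.(FooPanic); ok {
				fmt.Printf("panicked in Foo: %v. handling it\n", fooPanic.Err)
				os.Exit(1)
			}
			// Not a FooPanic: re-panic so the runtime reports it as usual.
			panic(r)
		}
	}()

	// ...
	if foo() {
		panic(FooPanic{Err: errors.New("foo error")})
	}
}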
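One caveat with the example above: recover only sees panics raised in the goroutine that runs the deferred function, so a deferred recover in main() does not protect against a panic in another goroutine. The following is a sketch of a per-goroutine wrapper that turns a panic into an error; safeGo is a hypothetical helper, not a standard library function.

```go
package main

import "fmt"

// safeGo (hypothetical helper) runs fn in a new goroutine. If fn panics,
// the panic is recovered inside that goroutine and reported as an error on
// the returned channel; otherwise nil is sent.
func safeGo(fn func()) <-chan error {
	errc := make(chan error, 1) // buffered so the goroutine never blocks
	go func() {
		defer func() {
			if r := recover(); r != nil {
				errc <- fmt.Errorf("recovered panic: %v", r)
				return
			}
			errc <- nil
		}()
		fn()
	}()
	return errc
}

func main() {
	err := <-safeGo(func() { panic("boom") })
	fmt.Println(err)
}
```

Whether swallowing the panic is appropriate depends on the program; for some failures, crashing and restarting is still the better outcome.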
Segmentation fault
Go is type safe and memory safe - until we step around safety using the unsafe package. For example, the unsafe package allows pointer arithmetic.
// unexpected fault address 0x78f0e8
// fatal error: fault
// [signal SIGSEGV: segmentation violation code=0x1 addr=0x78f0e8 pc=0x46c79c]
s := "hello"
// &s is the address of the string header, so the write below overwrites
// one byte of the header's data pointer.
ptr := unsafe.Pointer(&s)
*(*byte)(unsafe.Pointer(uintptr(ptr) + 2)) = 'L'
Note that this code might run without a segfault: the resulting memory address could happen to be valid, depending on how the compiler lays out memory.
The following code also causes a segmentation fault, but the Go runtime delivers it as a panic, so you can recover from it if you choose.
// panic: runtime error: invalid memory address or nil pointer dereference
// [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x466be2]
var ptr *int
*ptr = 42
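To illustrate the recovery path, the following sketch recovers a nil pointer dereference and returns it as an error. The deref function is hypothetical; it relies on the fact that panics raised by the runtime carry a value implementing runtime.Error, and it re-panics anything else.

```go
package main

import (
	"fmt"
	"runtime"
)

// deref (hypothetical helper) reads through p. A runtime panic such as a
// nil pointer dereference is recovered and returned as an error.
func deref(p *int) (v int, err error) {
	defer func() {
		if r := recover(); r != nil {
			if re, ok := r.(runtime.Error); ok {
				err = fmt.Errorf("recovered: %w", re)
				return
			}
			// Not a runtime error: pass it on unchanged.
			panic(r)
		}
	}()
	return *p, nil
}

func main() {
	_, err := deref(nil)
	fmt.Println(err)

	x := 42
	v, err := deref(&x)
	fmt.Println(v, err)
}
```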
Another, less obvious example of a nil pointer dereference is calling a method on a nil interface.
package main

type FooInterface interface {
	Foo()
}

func main() {
	// f has no concrete value, so the method call below panics.
	var f FooInterface
	f.Foo()
}
Deadlock
When a deadlock occurs and no goroutine can make progress, the Go runtime calls exit_group(2). For example,
// fatal error: all goroutines are asleep - deadlock!
ch := make(chan int)
ch <- 1

// fatal error: all goroutines are asleep - deadlock!
var mu sync.Mutex
mu.Lock()
mu.Lock()
Deadlocks can be difficult to debug in production. The Go toolchain provides a race detector, enabled with the -race flag, but it reports data races, not deadlocks. Race-enabled binaries can also use ten times the CPU and memory, so it is impractical to enable the race detector all the time.
Linux signals
A process can send a signal to another process using the kill system call (or the kill command). Each signal has a disposition that determines how the process behaves when it receives the signal. For example, the default disposition of SIGTERM is to terminate the process. It is common to catch SIGTERM, clean up resources, and exit. For example,
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	sigChan := make(chan os.Signal, 1)
	// You can also catch other catchable signals like SIGINT (Ctrl+C). For daemons, though, SIGTERM is enough in most cases.
	signal.Notify(sigChan, syscall.SIGTERM)
	<-sigChan
	fmt.Println("SIGTERM... terminating")
	fmt.Println("terminated")
}
SIGKILL is another signal used to stop processes. Unlike SIGTERM, SIGKILL cannot be caught, blocked, or ignored. SIGTERM and SIGKILL are often used in sequence. For example, "systemctl stop" sends SIGTERM to the service, followed by SIGKILL if the service hasn't stopped within TimeoutStopSec. Similarly, in AWS ECS, when a task is stopped, a SIGTERM signal is sent to each container’s entry process, followed by a SIGKILL signal after a default timeout of 30 seconds.