Set GOMAXPROCS for Go programs in containers
Every Go program has a runtime. The runtime implements garbage collection, concurrency, stack management, and other critical features. We can configure the runtime by setting environment variables. In this post, we will look at GOMAXPROCS, a variable that configures concurrency. You may get a free performance boost by setting GOMAXPROCS when running Go in containers.
What is GOMAXPROCS?
> The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit. — https://pkg.go.dev/runtime
There are two key phrases: "user-level" and "limit":
- GOMAXPROCS controls the number of OS threads executing user-level Go code. Threads used for non-user-level work, such as garbage collection, are not directly limited by GOMAXPROCS.
- GOMAXPROCS is an upper limit. The runtime is not required to create GOMAXPROCS OS threads.
The goroutine model has three key concepts:
- Machine (M) represents an OS thread.
- Logical Processor (P) represents an available processing unit.
- Goroutine (G) represents a unit of work.
Conceptually, the execution of a piece of Go code involves four steps. First, the goroutine is prepared and ready to execute the code. Then the goroutine is scheduled onto a logical processor. Then the logical processor is scheduled onto a machine. Finally, the operating system schedules a machine to carry out the execution. You can see definitions of m, p, and g at runtime/runtime2.go.
In the Machine-Processor-Goroutine model, GOMAXPROCS controls the number of Logical Processors (P): at most GOMAXPROCS machines can execute user-level Go code at the same time, though additional machines may exist for goroutines blocked in system calls. Since Go 1.5, GOMAXPROCS defaults to the number of logical CPUs. Prior to Go 1.5, it defaulted to 1.
GOMAXPROCS in containers
When a Go program runs in a container, runtime.NumCPU() reports the number of logical CPUs on the host OS. For instance, if the server has 32 vCPUs, GOMAXPROCS defaults to 32. The Go runtime then tries to utilize 32 vCPUs, even when the container's CPU limit is set much lower than 32. The mismatch can negatively impact performance.
Let's run some experiments. All experiments in this post run on my laptop: macOS 13.6, Apple M1 Max (10 cores), Go 1.21.4, and Docker Desktop 4.24.6.
Experiment 1: number of OS threads.
I created an example Go program, mpg/main.go, that spawns 10K goroutines in three groups. One third of the goroutines sum integers, sleep a millisecond, and repeat. One third sleep for 1 minute. One third sum random integers. I started the Go program and used gops to get the number of OS threads. See mpg/run.sh for details.
Experiment 2: impact of busy goroutines
The busyworker/main.go program runs two groups of goroutines. Goroutines in one group sum random integers in a busy loop. Goroutines in the other group do no heavy CPU work. I ran the program with a limit of 0.2 vCPU in Docker. For each GOMAXPROCS setting, I ran it three times and calculated the average time taken to complete the busy goroutine group. In the following diagram, busyWorkers is the number of goroutines in the first group, and otherWorkers is the number of goroutines in the other group. See busyworker/run.sh for details.
Conclusion
Go recommends setting GOMAXPROCS to the number of cores available. In container environments, however, the default number of cores detected by the Go runtime is often larger than the number of cores available to the container. We show that setting GOMAXPROCS to the number of cores available to the container yields significant performance improvements.
In this post, I set GOMAXPROCS through environment variables. You can also set it in application code, using runtime.GOMAXPROCS() or automaxprocs. Take the experiment results in this post as an encouraging initial finding. I ran the experiments on macOS, Apple Silicon, and Docker Desktop, which is nothing like the production environments Go containers are deployed in. Like all performance tuning, always verify on real servers with real workloads.