Be careful making thread-aware syscalls in Go: lock the thread
I recently shipped a bug that caused around 0.5% of container workloads to fail to start for one internal customer. The bug is an interesting mix of Linux namespaces, Go concurrency, and raw syscalls in Go. This post walks through the bug and its fix.
The need to run a program in its own network and mount namespace
soci-snapshotter is an open-source containerd snapshotter plugin that enables pulling OCI images
in lazy loading mode.
It first downloads the layers needed to start the container, then downloads the remaining layers. This has reduced time-to-container-start
by as much as 60%. The container runtime I work on pulls images from a separate ENI (let's call it the task ENI). The runtime waits for the
task ENI to be attached, then moves it to a new network namespace. To use soci-snapshotter, I need to run it in this new network namespace.
What's more, the ENI also comes with its own DNS resolver configuration, so we also need to run soci-snapshotter
in a new mount namespace with its own /etc/resolv.conf. Note that this path is hardcoded in Go's net package, so
I cannot configure soci-snapshotter to use a resolv.conf at a different path. Given this constraint, I have the following code:
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// error handling omitted for clarity.
func main() {
	// Step 1. Make /etc/resolv.conf a private mount onto itself in the parent mnt namespace.
	// Set the mount propagation to be private such that changes to it do not propagate to the parent mnt namespace.
	syscall.Mount("", "/etc/resolv.conf", "", syscall.MS_PRIVATE, "")

	// Step 2. Copy the parent mnt namespace into a new mnt namespace and keep the propagation from the parent
	// mount namespace unchanged.
	syscall.Unshare(syscall.CLONE_NEWNS)

	// Step 3. In the new namespace, mount the correct resolv.conf to /etc/resolv.conf.
	netns := "some-id"
	resolvPath := filepath.Join("/etc/netns", netns, "resolv.conf")
	syscall.Mount(resolvPath, "/etc/resolv.conf", "", syscall.MS_BIND, "")

	// Step 4. Start soci-snapshotter-grpc in the new mount namespace and net namespace.
	args := []string{
		"nsenter",
		"-t", fmt.Sprintf("%d", os.Getpid()),
		"--mount",
		fmt.Sprintf("--net=%s", filepath.Join("/var/run/netns", netns)),
		"/usr/bin/soci-snapshotter-grpc",
	}
	syscall.Exec("/usr/bin/nsenter", args, os.Environ())
}
The bug: Broken /etc/resolv.conf in root mount namespace
After shipping the code to some internal beta customers, we quickly got paged that around 0.5% of their workloads failed to start with errors like the following.
request send failed, Post \"https://logs.us-east-1.amazonaws.com/\": dial tcp:
lookup logs.us-east-1.amazonaws.com on 10.1.0.2:53:
read udp 10.0.11.111:34772->10.1.0.2:53: i/o timeout",
The last line says that an EC2 instance at 10.0.11.111 (in a 10.0.x.x subnet) is trying to use the DNS server at 10.1.0.2, which is in a different VPC. This is wrong because the default name server provided by EC2 is always at the VPC CIDR base + 2 (for example, 10.1.0.0 + 2 = 10.1.0.2), so 10.1.0.2 is the resolver of a VPC whose CIDR starts at 10.1.0.0, not of the instance's own VPC. One explanation is that syscall.Mount in step 3 mounts the resolv.conf intended for the task ENI to /etc/resolv.conf in the root mount namespace. How could this happen?
Thread-specific syscalls
unshare() allows a process (or thread) to disassociate parts of
its execution context that are currently being shared with other processes (or threads). While this sounds complex,
for our purposes we use unshare with CLONE_NEWNS, which creates a new mount namespace and moves the calling thread (not the whole process) into it.
Fun fact: NEWNS refers to the mount namespace because it was the first namespace type implemented. Another fun fact:
in Linux documentation, "process" is often used interchangeably with "thread" and "task".
Given a thread ID, the "Tgid" field in /proc/{pid}/status is the thread group ID, that is, the ID of the process the thread belongs to. For the main thread of a process, the Tgid and Pid values are identical.
Many other syscalls are also thread-specific, including mount and setns. The mount syscall changes mount points in the calling
thread's mount namespace, while setns moves the calling thread into a different namespace. The nsenter program used in our
Go code relies on setns.
Go concurrency model: MPG
Go concurrency uses the MPG model.
- Machine (M) represents an OS thread.
- Logical Processor (P) represents an available processing unit.
- Goroutine (G) represents a unit of work.
You can see the definitions of m, p, and g in runtime/runtime2.go.
Conceptually, the execution of a piece of Go code involves four steps.
- The goroutine is prepared and ready to execute the code.
- The goroutine is scheduled onto a logical processor.
- The logical processor is scheduled onto a machine.
- The operating system schedules a machine to carry out the execution.
For example, the following program shows different goroutines running on different OS threads.
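package main

import (
	"fmt"
	"os"
	"runtime"
	"sync"
	"time"

	"golang.org/x/sys/unix"
)

func main() {
	runtime.GOMAXPROCS(runtime.NumCPU())
	fmt.Printf("NumCPU: %d\n", runtime.NumCPU())

	var wg sync.WaitGroup
	for i := range 3 { // range over an integer requires Go 1.22+
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			// Pin this goroutine to its OS thread while it reports the thread ID.
			runtime.LockOSThread()
			defer runtime.UnlockOSThread()
			fmt.Printf("Goroutine %d - Process PID: %d, Thread ID: %d\n", id, os.Getpid(), unix.Gettid())
		}(i)
	}
	wg.Wait()
	// Keep the process alive long enough to inspect /proc/<pid>/task.
	time.Sleep(100 * time.Second)
}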
% ls -lh /proc/1142/task
total 0
dr-xr-xr-x 7 peng staff 0 Oct 23 07:32 1142
dr-xr-xr-x 7 peng staff 0 Oct 23 07:32 1143
dr-xr-xr-x 7 peng staff 0 Oct 23 07:32 1144
dr-xr-xr-x 7 peng staff 0 Oct 23 07:32 1145
dr-xr-xr-x 7 peng staff 0 Oct 23 07:32 1146
dr-xr-xr-x 7 peng staff 0 Oct 23 07:32 1147

% cat /proc/1143/status | grep -E "^(Pid|Tgid):"
Tgid: 1142
Pid: 1143

% cat /proc/1142/status | grep -E "^(Pid|Tgid):"
Tgid: 1142
Pid: 1142
The demo creates 3 goroutines to make it easier to see different thread IDs. Even if a Go program doesn't start goroutines explicitly with statements like go foo(), the Go runtime runs additional goroutines beyond the "main" goroutine, for example those used for garbage collection.
With sequential statements line-1; line-2; where neither uses goroutines, the Go runtime guarantees line-1 "happens before" line-2. However, there's no guarantee that line-1 and line-2 run on the same OS thread. For example:
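package main

import (
	"fmt"
	"runtime"
	"time"

	"golang.org/x/sys/unix"
)

func main() {
	runtime.GOMAXPROCS(runtime.NumCPU())
	// Background goroutines keep the scheduler busy so the main goroutine
	// is more likely to be moved to another OS thread.
	for i := 0; i < 10; i++ {
		go func() {
			for {
				time.Sleep(time.Millisecond * 100)
			}
		}()
	}

	fmt.Println("thread ID =", unix.Gettid())
	time.Sleep(1 * time.Second)
	fmt.Println("thread ID =", unix.Gettid())
}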
% go run main.go
thread ID = 17854
thread ID = 17861
The bug: Unshare, Mount and Exec do not use the same mount namespace
Now that we understand that the Linux syscalls unshare and mount are thread-specific and that the Go runtime may schedule consecutive statements on different OS threads,
the source of our 0.5% bug occurrence becomes clear. The following table shows one possible execution scenario:
| Thread | Code | Mount namespace |
|---|---|---|
| thread-1 | syscall.Unshare(syscall.CLONE_NEWNS) | created a new mount namespace |
| thread-2 | syscall.Mount(resolvPath, "/etc/resolv.conf", "", syscall.MS_BIND, "") | root namespace |
| thread-1 | syscall.Exec("nsenter", "-t", fmt.Sprintf("%d", os.Getpid()), "--mount", "soci-snapshotter") | root namespace |
The syscall.Exec replaces the current process with the soci-snapshotter process. The new process uses the mount namespace of os.Getpid(), which is the root namespace.
Fix it: one line!
Since our Go program uses only one goroutine (the main goroutine), we have a simple fix: lock the main goroutine
to one OS thread. There's no need to call runtime.UnlockOSThread() since unix.Exec replaces the Go process completely with the
soci-snapshotter-grpc program. The fix is indeed just one line!
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"runtime"

	"golang.org/x/sys/unix"
)

// error handling omitted for clarity.
func main() {
	// Pin the main goroutine to its current OS thread so that every
	// syscall below runs on the same thread.
	runtime.LockOSThread()

	unix.Mount("", "/etc/resolv.conf", "", unix.MS_PRIVATE, "")
	unix.Unshare(unix.CLONE_NEWNS)
	netns := "some-id"
	resolvPath := filepath.Join("/etc/netns", netns, "resolv.conf")
	unix.Mount(resolvPath, "/etc/resolv.conf", "", unix.MS_BIND, "")
	args := []string{
		"nsenter",
		"-t", fmt.Sprintf("%d", unix.Gettid()),
		"--mount",
		fmt.Sprintf("--net=%s", filepath.Join("/var/run/netns", netns)),
		"/usr/bin/soci-snapshotter-grpc",
	}
	unix.Exec("/usr/bin/nsenter", args, os.Environ())
}
You may have noticed that the fixed code no longer uses the syscall package. According to the Go documentation, the syscall package has been frozen (locked down to most new functionality) since Go 1.4, and new code is encouraged to use the golang.org/x/sys packages instead. Learn more at https://golang.org/s/go1.4-syscall.
Are LockOSThread and Gettid usage patterns common?
They're quite common in code dealing with Linux syscalls. Here are two examples from containerd:
func mountAt(chdir string, source, target, fstype string, flags uintptr, data string) error {
	// ....
	ch := make(chan error, 1)
	go func() {
		runtime.LockOSThread()

		// Do not unlock this thread.
		// If the thread is unlocked go will try to use it for other goroutines.
		// However it is not possible to restore the thread state after CLONE_FS.
		//
		// Once the goroutine exits the thread should eventually be terminated by go.

		if err := unix.Unshare(unix.CLONE_FS); err != nil {
			ch <- err
			return
		}
		// ....
	}()
	return <-ch
}

// getCurrentThreadNetNSPath copied from pkg/ns
func getCurrentThreadNetNSPath() string {
	// /proc/self/ns/net returns the namespace of the main thread, not
	// of whatever thread this goroutine is running on. Make sure we
	// use the thread's net namespace since the thread is switching around
	return fmt.Sprintf("/proc/%d/task/%d/ns/net", os.Getpid(), unix.Gettid())
}
Better solution: use runc
Running a process with its own namespace effectively means running it as a container. So why not prepare a
runtime bundle and call runc run? In our case, we don't even need a populated rootfs: a config.json and an empty rootfs
folder would suffice. This approach works well, but that's a topic for another post.