Be careful making thread-aware syscalls in Go: lock the thread

A bug caused around 0.5% of container workloads to fail to start for one internal customer. This post walks through the bug and its fix—an interesting mix of Linux namespaces, Go concurrency, and syscalls.

The need to run a program in its own network namespace and mount namespace

soci-snapshotter is an open-source containerd snapshotter plugin that pulls OCI images in lazy-loading mode. It downloads only the layers needed to start the container, then downloads the rest while the container is running. This has reduced time-to-container-start by as much as 60%.

The container runtime I work on pulls images through a separate Elastic Network Interface (ENI), called the task ENI. The runtime waits for the task ENI to be attached, then moves it to a new network namespace. To use soci-snapshotter, I need to run it in this new network namespace. Additionally, the ENI comes with its own DNS resolver configuration, so we must run soci-snapshotter in a new mount namespace with its own /etc/resolv.conf. This path is hardcoded in Go's net library, so I cannot configure soci-snapshotter to use a resolv.conf at a different path. Given this constraint, I have the following code:

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "syscall"
)

// Error handling omitted for clarity.
func main() {
    // Step 1. Make /etc/resolv.conf a private mount onto itself in the parent mnt namespace.
    // Set the mount propagation to be private such that changes to it do not propagate to the parent mnt namespace.
    syscall.Mount("/etc/resolv.conf", "/etc/resolv.conf", "", syscall.MS_BIND, "")
    syscall.Mount("", "/etc/resolv.conf", "", syscall.MS_PRIVATE, "")

    // Step 2. Copy the parent mnt namespace into a new mnt namespace and keep the propagation from the parent
    // mount namespace unchanged.
    syscall.Unshare(syscall.CLONE_NEWNS)

    // Step 3. In the new namespace, mount the correct resolv.conf to /etc/resolv.conf.
    netns := "some-id"
    resolvPath := filepath.Join("/etc/netns", netns, "resolv.conf")
    syscall.Mount(resolvPath, "/etc/resolv.conf", "", syscall.MS_BIND, "")

    // Step 4. Start soci-snapshotter-grpc in the new mount namespace and net namespace.
    args := []string{
        "nsenter",
        "-t", fmt.Sprintf("%d", os.Getpid()),
        "--mount",
        fmt.Sprintf("--net=%s", filepath.Join("/var/run/netns", netns)),
        "/usr/bin/soci-snapshotter-grpc",
    }
    syscall.Exec("/usr/bin/nsenter", args, os.Environ())
}

The bug: Broken /etc/resolv.conf in root mount namespace

After shipping the code to some internal beta customers, we quickly got paged that around 0.5% of their workloads failed to start with errors like the following.

request send failed, Post \"https://logs.us-east-1.amazonaws.com/\": dial tcp: 
lookup logs.us-east-1.amazonaws.com on 10.1.0.2:53: 
read udp 10.0.11.111:34772->10.1.0.2:53: i/o timeout",

The last line says an EC2 instance in a subnet "10.0.x.x/xx" is trying to use the DNS server at "10.1.0.2", which belongs to a different VPC. This is wrong because the default name server provided by EC2 is always the VPC CIDR base + 2. The address 10.1.0.2 (10.1.0.0 + 2) is the resolver for a VPC whose CIDR starts at 10.1.0.0, not for the VPC that contains 10.0.11.111. One explanation is that syscall.Mount in step 3 mounted the resolv.conf intended for the task ENI onto /etc/resolv.conf in the root mount namespace. How could this happen?

Thread-specific syscalls

unshare() allows a process (or thread) to disassociate parts of its execution context that are currently being shared with other processes (or threads). While this sounds complex, for our purposes we use unshare with CLONE_NEWNS, which creates a new mount namespace and moves the caller into it. The caller here is the calling thread, not the whole process: other threads of the same process stay in the namespace they were in.

Fun fact: NEWNS refers to the mount namespace because it was the first namespace type implemented. Another fun fact: in Linux documentation, "process" is often used interchangeably with "thread" and "task". Given a pid, the "Tgid" in /proc/{pid}/status indicates the process ID. If a pid represents a real process ID, the Tgid and Pid values are identical.
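
To make the Tgid/Pid relationship concrete, here is a small sketch that extracts both fields from the text of a /proc/{pid}/status file. The function name parseStatus is made up for this example, and the program in main reads /proc/self/status, which only exists on Linux.

```go
package main

import (
	"bufio"
	"errors"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// parseStatus extracts the Tgid and Pid fields from the text of a
// /proc/<id>/status file.
func parseStatus(status string) (tgid, pid int, err error) {
	tgid, pid = -1, -1
	sc := bufio.NewScanner(strings.NewReader(status))
	for sc.Scan() {
		key, val, ok := strings.Cut(sc.Text(), ":")
		if !ok {
			continue
		}
		switch key {
		case "Tgid":
			tgid, err = strconv.Atoi(strings.TrimSpace(val))
		case "Pid":
			pid, err = strconv.Atoi(strings.TrimSpace(val))
		}
		if err != nil {
			return -1, -1, err
		}
	}
	if err := sc.Err(); err != nil {
		return -1, -1, err
	}
	if tgid < 0 || pid < 0 {
		return -1, -1, errors.New("status text has no Tgid or Pid field")
	}
	return tgid, pid, nil
}

func main() {
	// /proc/self/status describes the process itself, so Tgid and Pid match.
	data, err := os.ReadFile("/proc/self/status")
	if err != nil {
		fmt.Println("cannot read /proc/self/status:", err)
		return
	}
	tgid, pid, err := parseStatus(string(data))
	if err != nil {
		fmt.Println("cannot parse status:", err)
		return
	}
	fmt.Printf("Tgid=%d Pid=%d isProcessID=%t\n", tgid, pid, tgid == pid)
}
```

Pointing the same parser at the status file of a non-leader thread would yield a Pid that differs from the Tgid.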

Many other syscalls are also thread-specific, including mount and setns. The mount syscall changes mount points in the calling thread's mount namespace, while setns moves the calling thread into a different namespace. The nsenter program used in our Go code relies on setns.
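
Because namespace membership is per thread, procfs exposes it per thread as well. The sketch below prints the mount namespace link of the thread group leader (/proc/self) next to that of the calling thread (/proc/thread-self). The helper threadNSPath is a made-up name, and the IDs passed to it in main are simply example values. The program is Linux-only and assumes a kernel that provides /proc/thread-self.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
)

// threadNSPath builds the procfs path for namespace kind ns ("mnt", "net", ...)
// of thread tid in process pid.
func threadNSPath(pid, tid int, ns string) string {
	return fmt.Sprintf("/proc/%d/task/%d/ns/%s", pid, tid, ns)
}

func main() {
	// Pin the goroutine so both reads below happen on the same OS thread.
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	// /proc/self describes the thread group leader, while /proc/thread-self
	// describes the thread that performs the read.
	for _, p := range []string{"/proc/self/ns/mnt", "/proc/thread-self/ns/mnt"} {
		target, err := os.Readlink(p)
		if err != nil {
			fmt.Println(p, "-> error:", err)
			continue
		}
		fmt.Println(p, "->", target)
	}

	// Explicit form of the per-thread path, with example IDs.
	fmt.Println(threadNSPath(1142, 1143, "mnt"))
}
```

If both links resolve to the same mnt:[...] value, the calling thread shares the leader's mount namespace. After an unshare on a non-leader thread, the two values would differ.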

Go concurrency model: MPG

Go concurrency uses the MPG model.

  • Machine (M) represents an OS thread.
  • Logical Processor (P) represents an available processing unit.
  • Goroutine (G) represents a unit of work.

You can see definitions of m, p, and g at runtime/runtime2.go.

Conceptually, the execution of a piece of Go code involves four steps.

  1. The goroutine is prepared and ready to execute the code.
  2. The goroutine is scheduled onto a logical processor.
  3. The logical processor is scheduled onto a machine.
  4. The operating system schedules a machine to carry out the execution.

For example, the following shows different goroutines running on different OS threads.

package main

import (
    "fmt"
    "os"
    "runtime"
    "sync"
    "time"

    "golang.org/x/sys/unix"
)

func main() {
    runtime.GOMAXPROCS(runtime.NumCPU())
    fmt.Printf("NumCPU:%d\n", runtime.NumCPU())

    var wg sync.WaitGroup
    for i := range 3 { // range over an integer requires Go 1.22+
        wg.Add(1)
        go func(id int) {
            defer wg.Done()
            // Pin the goroutine so the thread ID it reports stays valid.
            runtime.LockOSThread()
            defer runtime.UnlockOSThread()
            fmt.Printf("Goroutine%d - Process PID:%d, Thread ID:%d\n", id, os.Getpid(), unix.Gettid())
        }(i)
    }
    wg.Wait()
    // Keep the process alive so /proc/<pid>/task can be inspected.
    time.Sleep(100 * time.Second)
}

% ls -lh /proc/1142/task 
total 0
dr-xr-xr-x 7 peng staff 0 Oct 23 07:32 1142
dr-xr-xr-x 7 peng staff 0 Oct 23 07:32 1143
dr-xr-xr-x 7 peng staff 0 Oct 23 07:32 1144
dr-xr-xr-x 7 peng staff 0 Oct 23 07:32 1145
dr-xr-xr-x 7 peng staff 0 Oct 23 07:32 1146
dr-xr-xr-x 7 peng staff 0 Oct 23 07:32 1147

% cat /proc/1143/status | grep -E "^(Pid|Tgid):"
Tgid:   1142
Pid:    1143

% cat /proc/1142/status | grep -E "^(Pid|Tgid):"
Tgid:   1142
Pid:    1142

The demo creates 3 goroutines to make it easier to see different thread IDs. Even if a Go program never starts a goroutine explicitly with a statement like go foo(), the Go runtime runs additional goroutines beyond the "main" goroutine, for example for garbage collection.
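
One way to observe the runtime's own activity is to count OS threads in a program that never uses the go keyword. The following is a minimal, Linux-only sketch. The helper name threadCount is an assumption, and the number it prints depends on the Go version and the machine, so no particular value should be expected.

```go
package main

import (
	"fmt"
	"os"
)

// threadCount returns the number of OS threads in the current process by
// counting the entries of /proc/self/task.
func threadCount() (int, error) {
	entries, err := os.ReadDir("/proc/self/task")
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	// No goroutine is started explicitly anywhere in this program.
	n, err := threadCount()
	if err != nil {
		fmt.Println("cannot read /proc/self/task:", err)
		return
	}
	fmt.Println("OS threads in this process:", n)
}
```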

With sequential statements line-1; line-2; where neither uses goroutines, the Go runtime guarantees line-1 "happens before" line-2. However, there's no guarantee that line-1 and line-2 run on the same OS thread. For example:

package main

import (
    "fmt"
    "runtime"
    "time"

    "golang.org/x/sys/unix"
)

func main() {
    runtime.GOMAXPROCS(runtime.NumCPU())
    // Background goroutines that keep the scheduler busy.
    for i := 0; i < 10; i++ {
        go func() {
            for {
                time.Sleep(time.Millisecond * 100)
            }
        }()
    }

    fmt.Println("thread ID =", unix.Gettid())
    // The main goroutine may resume on a different OS thread after sleeping.
    time.Sleep(1 * time.Second)
    fmt.Println("thread ID =", unix.Gettid())
}

% go run main.go
thread ID = 17854
thread ID = 17861
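
For contrast, here is a sketch of the same kind of measurement with the goroutine pinned by runtime.LockOSThread. The helper observeTIDs is a made-up name. It uses syscall.Gettid only to stay within the standard library; unix.Gettid from golang.org/x/sys/unix is the equivalent call, and both are Linux-only.

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
	"time"
)

// observeTIDs records the OS thread ID n times, sleeping between samples so
// the scheduler has a chance to move the goroutine. When lock is true, the
// goroutine is pinned to its OS thread for the duration of the call.
func observeTIDs(n int, lock bool) []int {
	if lock {
		runtime.LockOSThread()
		defer runtime.UnlockOSThread()
	}
	tids := make([]int, 0, n)
	for i := 0; i < n; i++ {
		tids = append(tids, syscall.Gettid())
		time.Sleep(10 * time.Millisecond)
	}
	return tids
}

func main() {
	// Unlocked samples may or may not differ from run to run.
	fmt.Println("unlocked:", observeTIDs(5, false))
	// Locked samples are all taken on the same OS thread.
	fmt.Println("locked:  ", observeTIDs(5, true))
}
```

The unlocked samples are not guaranteed to change, which is exactly why this class of bug shows up only occasionally. The locked samples are guaranteed to be identical.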

The bug: Unshare, Mount and Exec do not use the same mount namespace

Now that we understand that the Linux syscalls unshare and mount are thread-specific and that the Go runtime may run consecutive statements on different OS threads, the source of the bug, and why it hit only about 0.5% of workloads, becomes clear. The following table shows one possible execution scenario:

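| Thread   | Code                                                                                          | Mount namespace               |
|----------|-----------------------------------------------------------------------------------------------|-------------------------------|
| thread-1 | syscall.Unshare(syscall.CLONE_NEWNS)                                                          | created a new mount namespace |
| thread-2 | syscall.Mount(resolvPath, "/etc/resolv.conf", "", syscall.MS_BIND, "")                        | root namespace                |
| thread-1 | syscall.Exec("nsenter", "-t", fmt.Sprintf("%d", os.Getpid()), "--mount", "soci-snapshotter")  | root namespace                |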

syscall.Exec replaces the current process with nsenter, which in turn starts soci-snapshotter. The new process uses the mount namespace of os.Getpid(), which is the root namespace.

Fix it: one line!

Since our Go program uses only one goroutine (the main goroutine), we have a simple fix: lock the main goroutine to one OS thread. There's no need to call runtime.UnlockOSThread() since unix.Exec replaces the Go process completely with the soci-snapshotter-grpc program. The fix is indeed just one line!

package main

import (
    "fmt"
    "os"
    "path/filepath"
    "runtime"

    "golang.org/x/sys/unix"
)

// Error handling omitted for clarity.
func main() {
    // The fix: pin the main goroutine to one OS thread so that every syscall
    // below runs on the same thread and sees the same mount namespace.
    runtime.LockOSThread()

    unix.Mount("/etc/resolv.conf", "/etc/resolv.conf", "", unix.MS_BIND, "")
    unix.Mount("", "/etc/resolv.conf", "", unix.MS_PRIVATE, "")
    unix.Unshare(unix.CLONE_NEWNS)
    netns := "some-id"
    resolvPath := filepath.Join("/etc/netns", netns, "resolv.conf")
    unix.Mount(resolvPath, "/etc/resolv.conf", "", unix.MS_BIND, "")
    args := []string{
        "nsenter",
        "-t", fmt.Sprintf("%d", unix.Gettid()),
        "--mount",
        fmt.Sprintf("--net=%s", filepath.Join("/var/run/netns", netns)),
        "/usr/bin/soci-snapshotter-grpc",
    }
    unix.Exec("/usr/bin/nsenter", args, os.Environ())
}

You may have noticed that the fixed code no longer uses the syscall package. The syscall package has been frozen since Go 1.4, and the Go documentation recommends the golang.org/x/sys packages for new code that makes direct syscalls. Learn more at https://golang.org/s/go1.4-syscall.

Are LockOSThread and Gettid usage patterns common?

They're quite common in code dealing with Linux syscalls. Here are two examples from containerd:

  1. mount.mountAt

func mountAt(chdir string, source, target, fstype string, flags uintptr, data string) error {
    // ....
    ch := make(chan error, 1)
    go func() {
        runtime.LockOSThread()

        // Do not unlock this thread.
        // If the thread is unlocked go will try to use it for other goroutines.
        // However it is not possible to restore the thread state after CLONE_FS.
        //
        // Once the goroutine exits the thread should eventually be terminated by go.

        if err := unix.Unshare(unix.CLONE_FS); err != nil {
            ch <- err
            return
        }
        // ....
    }()
    return <-ch
}

  2. netns.getCurrentThreadNetNSPath

// getCurrentThreadNetNSPath copied from pkg/ns
func getCurrentThreadNetNSPath() string {
    // /proc/self/ns/net returns the namespace of the main thread, not
    // of whatever thread this goroutine is running on.  Make sure we
    // use the thread's net namespace since the thread is switching around
    return fmt.Sprintf("/proc/%d/task/%d/ns/net", os.Getpid(), unix.Gettid())
}

Better solution: use runc

Running a process with its own namespace effectively means running it as a container. So why not prepare a runtime bundle and call runc run? In our case, we don't even need a rootfs—just a config.json and an empty rootfs folder would suffice. This approach works well, but that's a topic for another post.

Appendix - Reproduce the bug and the fix

Before releasing the one-line fix runtime.LockOSThread(), I ran the following code mount-nsenter-demo.go to gain more confidence in the fix. It does the following:

  1. Takes a local file path (e.g., /local/hello) and makes it a mount point
  2. In the new mount namespace, mounts a different file (e.g., /local/world) to /local/hello
  3. In the new mount namespace, copies /local/hello to /local/hello-copied

The correct behavior is that after the program finishes, the two files /local/world and /local/hello-copied should have the same content.

package main

import (
    "fmt"
    "log"
    "os"
    "runtime"

    "golang.org/x/sys/unix"
)

func main() {
    if len(os.Args) != 3 {
        log.Fatalf("Usage:%s <helloPath> <worldPath>", os.Args[0])
    }

    helloPath := os.Args[1]
    worldPath := os.Args[2]

    // Remove this line to reproduce the bug.
    runtime.LockOSThread()

    // Make helloPath a private mount point.
    if err := unix.Mount(helloPath, helloPath, "", unix.MS_BIND, ""); err != nil {
        log.Fatal(err)
    }
    if err := unix.Mount("", helloPath, "", unix.MS_PRIVATE, ""); err != nil {
        log.Fatal(err)
    }

    // Move the calling thread into a new mount namespace.
    if err := unix.Unshare(unix.CLONE_NEWNS); err != nil {
        log.Fatal(err)
    }

    // In the new mount namespace, mount worldPath over helloPath.
    if err := unix.Mount(worldPath, helloPath, "", unix.MS_BIND, ""); err != nil {
        log.Fatal(err)
    }

    // Copy helloPath from inside the calling thread's mount namespace.
    args := []string{
        "nsenter",
        "-t", fmt.Sprintf("%d", unix.Gettid()),
        "--mount",
        "/usr/bin/cp",
        helloPath,
        helloPath + "-copied",
    }
    if err := unix.Exec("/usr/bin/nsenter", args, os.Environ()); err != nil {
        log.Fatalf("failed to exec nsenter:%v", err)
    }
}

% cat ~/examples/hello
hello

% cat ~/examples/world
world

% sudo ./mount-nsenter-demo ~/examples/hello ~/examples/world

% cat ~/examples/hello-copied 
world

Now we use a shell script demo.sh to run the demo 1000 times. With runtime.LockOSThread(), /world-{sequence} and /hello-{sequence}-copied have identical content in every run.

% sudo ./demo.sh --work-dir ~/examples
Running 1000 tests in /home/peng/examples
...
=========================================
Test Results:
  PASSED: 1000
  FAILED: 0
  TOTAL:  1000
=========================================
All tests passed!

Without runtime.LockOSThread(), there are two types of issues:

% sudo ./demo.sh --work-dir ~/examples
Running 1000 tests in /home/peng/examples
...

Progress: 690/1000 iterations completed (Passed: 682, Failed: 8)
crashed in the 695th run. 

  1. "FAIL [26]: Content mismatch between /home/peng/examples/hello-26-copied and /home/peng/examples/world-26"

% cat hello-26-copied 
hello-26

This is the bug explained in the section "The bug: Unshare, Mount and Exec do not use the same mount namespace."

  2. "nsenter: cannot open /proc/9676/ns/mnt: No such file or directory"

The 695th run failed because nsenter could not find the namespace of the thread. This looks strange at first, but it's the same Go thread scheduling issue. By the time nsenter runs, the thread identified by unix.Gettid() no longer exists. One likely reason is that execve destroys every thread other than the one that calls it, so a thread ID captured on a different thread does not survive the exec. The Go runtime may also have decided earlier that the OS thread was no longer needed. This is a lesson: without runtime.LockOSThread(), there is no guarantee that the thread that ran line-1 still exists, or is the one in use, when the following line-2 runs.