Sharp edges of errgroup: Lessons from an errgroup and Context mishap

A recent faulty release disrupted service for some customers. The root cause was a concurrency bug involving x/sync/errgroup and context cancellation. This post shares three practices we learned from the incident. These practices will help us catch similar issues during code review or alert us to problems in production.

What does the buggy code do?

I've simplified the program for this post as follows:

  1. One director manages three managers
  2. Each manager processes tasks periodically
  3. The director maintains a control session with an external system
  4. The director monitors for anomalies. When detected, it sends an important message to each manager and terminates the control session

The buggy/main.go code is below. Run it in the Go Playground.

  1package main
  2
  3import (
  4  "context"
  5  "fmt"
  6  "math/rand"
  7  "time"
  8
  9  "golang.org/x/sync/errgroup"
 10)
 11
 12func main() {
 13  group, ctx := errgroup.WithContext(context.Background())
 14
 15  // set up 3 managers.
 16  managers := make([]*Manager, 3)
 17  for i := range 3 {
 18    managers[i] = &Manager{
 19      id:     fmt.Sprintf("m-%d", i),
 20      ctx:    ctx,
 21      taskCh: make(chan func(), 10),
 22      stopCh: make(chan func()),
 23    }
 24    // simulate tasks each manager needs to complete.
 25    for range 5 {
 26      go func() {
 27        managers[i].taskCh <- func() {
 28          time.Sleep(time.Duration(rand.Intn(2)) * time.Second)
 29        }
 30      }()
 31    }
 32  }
 33
 34  // set up the director
 35  d := &Director{
 36    ctx:            ctx,
 37    group:          group,
 38    controlSession: &ControlSession{},
 39    managers:       managers,
 40  }
 41
 42  d.Start()
 43}
 44
 45type Director struct {
 46  ctx            context.Context
 47  group          *errgroup.Group
 48  managers       []*Manager
 49  controlSession *ControlSession
 50}
 51
 52func (d *Director) Start() {
 53  for _, m := range d.managers {
 54    m := m
 55    d.group.Go(func() error {
 56      m.Start()
 57      return nil
 58    })
 59  }
 60
 61  ctx, cancel := context.WithCancel(d.ctx)
 62  d.group.Go(func() error {
 63    d.monitorAnomaly(ctx, cancel)
 64    return nil
 65  })
 66
 67  d.group.Go(func() error {
 68    return d.controlSession.Start(ctx)
 69  })
 70
 71  doneCh := make(chan error)
 72  go func() {
 73    doneCh <- d.group.Wait()
 74  }()
 75
 76  for {
 77    select {
 78    case <-d.ctx.Done():
 79      fmt.Println("Stopped director: context canceled")
 80      return
 81    case err := <-doneCh:
 82      fmt.Printf("Stopped director: %v\n", err)
 83      return
 84    }
 85  }
 86}
 87
 88func (d *Director) monitorAnomaly(ctx context.Context, cancel context.CancelFunc) {
 89  ticker := time.NewTicker(time.Second)
 90  defer ticker.Stop()
 91
 92  for {
 93    select {
 94    case <-ticker.C:
 95      if d.anomalyDetected() {
 96        fmt.Println("Anomaly detected.")
 97        cancel()
 98      }
 99    case <-ctx.Done():
100      d.notifyManagers()
101      return
102    }
103  }
104}
105
106func (d *Director) notifyManagers() {
107  for _, m := range d.managers {
108    m := m
109    m.stopCh <- func() {
110      fmt.Printf("Important message for %s!\n", m.id)
111    }
112  }
113}
114
115func (d *Director) anomalyDetected() bool {
116  return true
117}
118
119// Manager is an actor that processes actions from the actionCh channel.
120type Manager struct {
121  ctx context.Context
122
123  id string
124
125  // a buffered channel for regular tasks
126  taskCh chan func()
127
128  // a unbuffered channel for the function to execute before exit
129  stopCh chan func()
130}
131
132func (m *Manager) Start() {
133  for {
134    select {
135    case f := <-m.taskCh:
136      f()
137    case f := <-m.stopCh:
138      f()
139      fmt.Printf("Stopped manager %s: stop function received.\n", m.id)
140      return
141    case <-m.ctx.Done():
142      fmt.Printf("Stopped manager %s: context canceled.\n", m.id)
143      return
144    }
145  }
146}
147
148// ControlSession represents a session to the control plane.
149type ControlSession struct{}
150
151func (s *ControlSession) Start(ctx context.Context) error {
152  for {
153    select {
154    // For simplicity, we leave out activities of the session.
155    case <-ctx.Done():
156      fmt.Println("Stopped the control session.")
157      return ctx.Err()
158    }
159  }
160}

When running this code, not all managers receive the important messages. Most often, no manager receives the message; occasionally, one does. This non-deterministic behavior results from Go's select statement scheduling.

1Anomaly detected.
2Stopped the control session.
3Stopped director: context canceled
1Anomaly detected.
2Important message for m-0!. Anomaly Detected.
3Stopped manager m-0: stop function received.
4Stopped the control session.
5Stopped director: context canceled

Have you spotted the bug?

The code uses an errgroup with context cancellation to manage goroutines. If any goroutine returns an error, all other goroutines are canceled. Otherwise, the errgroup waits for all goroutines to complete. The bug lies in two places:

1if d.anomalyDetected() {
2  fmt.Println("Anomaly detected.") // line 97
3  cancel()
4}
5
6d.group.Go(func() error {
7  return d.controlSession.Start(ctx) // line 68
8})

When an anomaly is detected, line 97 cancels the context used for the control session. This causes line 68 to return an error (context.Canceled), which triggers errgroup to cancel all goroutines. Consequently, manager goroutines may not have a chance to process the important message from the director.

How to fix the bug with minimal changes?

The simplest fix is to modify the control session goroutine to return nil when its context is canceled. Additionally, I addressed a code smell: passing both the context and cancelFunc to a callee. The caller who owns the context should handle cancellation rather than delegating it to the callee. Below is the diff; you can view
the full code at fixed/main.go. Run it in the Go Playground.

 1diff -u buggy/main.go fixed/main.go
 2--- buggy/main.go  2025-03-23 19:51:26
 3+++ fixed/main.go  2025-03-23 19:58:31
 4@@ -60,12 +60,14 @@
 5 
 6   ctx, cancel := context.WithCancel(d.ctx)
 7   d.group.Go(func() error {
 8-    d.monitorAnomaly(ctx, cancel)
 9+    defer cancel()
10+    d.monitorAnomaly()
11     return nil
12   })
13 
14   d.group.Go(func() error {
15-    return d.controlSession.Start(ctx)
16+    d.controlSession.Start(ctx)
17+    return nil
18   })
19 
20   doneCh := make(chan error)
21@@ -85,7 +87,7 @@
22   }
23 }
24 
25-func (d *Director) monitorAnomaly(ctx context.Context, cancel context.CancelFunc) {
26+func (d *Director) monitorAnomaly() {
27   ticker := time.NewTicker(time.Second)
28   defer ticker.Stop()
29 
30@@ -94,11 +96,9 @@
31     case <-ticker.C:
32       if d.anomalyDetected() {
33         fmt.Println("Anomaly detected.")
34-        cancel()
35+        d.notifyManagers()
36+        return
37       }
38-    case <-ctx.Done():
39-      d.notifyManagers()
40-      return
41     }
42   }
43 }

The fixed version produces this log output, showing that each manager now handles the important message:

1Anomaly detected.
2Important message for m-0!. Anomaly Detected.
3Stopped manager m-0: stop function received.
4Stopped the control session.
5Important message for m-1!. Anomaly Detected.
6Stopped manager m-1: stop function received.
7Important message for m-2!. Anomaly Detected.
8Stopped manager m-2: stop function received.
9Stopped director: context canceled

Lessons we learned

Lesson 1: Using errgroup effectively

Using errgroup is convenient for small scopes and short-lived goroutines, such as synchronizing multiple API calls. However, managing long-running goroutines in a large scope with errgroup requires careful handling of context cancellation and error propagation.

First, errgroup behavior depends on how it's defined and configured:

  1. g := new(errgroup.Group): g.Wait() blocks until all functions passed to g.Go() return, then returns the first non-nil error (if any)
  2. g, ctx := errgroup.WithContext(ctx): In addition to blocking and returning the first non-nil error (if any), it also cancels the ctx immediately when any function returns an error. Often, canceling the context means canceling all other goroutines in the errgroup
  3. g.SetLimit(n): Limits the number of active goroutines in the group to at most n. In this case, you might want to use g.TryGo() instead of g.Go() to avoid unexpected blocking

Second, it's often clearer to synchronize long-running goroutines using Go primitives. errgroup itself is implemented in fewer than 120 lines using sync.WaitGroup, sync.Once, and channels. For long-running daemons, explicit synchronization with channels, goroutines, and sometimes sync.WaitGroup is preferable. Using errgroup for such cases can make the code hard to read and does not save lines of code anyway. For example, you cannot directly select from errgroup's Wait(); you need workarounds like this:

1ch := make(chan error)
2go func() {
3  ch <- g.Wait()
4}()
5select {
6case err := <-ch:
7  // do something 
8}

Lesson 2: Avoid passing both Context and cancelFunc

It's almost always an anti-pattern to pass both a context.Context and its associated cancelFunc as parameters to a function.

  1. Separation of Concerns: The cancelFunc should typically be managed by the caller or scope that created the context. Passing it down blurs separation of concerns, as the function shouldn't be responsible for canceling a context it didn't create.

  2. Unexpected Behavior: If a function receives a cancelFunc, it might call it prematurely, causing unexpected behavior in the caller or other parts still using the same context. This is exactly what caused the disruption for our customers.

While rare cases may require passing both, clearly document such use cases.

Lesson 3: Avoid Storing context.Context in Structs

Storing a context.Context in a struct often leads to poorly designed concurrency code. In this example, both the Director and Manager store a context as a field. The Go documentation for context states:

Do not store Contexts inside a struct type; instead, pass a Context explicitly to each function that needs it.

Contexts are a concurrency pattern introduced for three purposes:

  1. Passing request-scoped values
  2. Propagating cancellation signals to all goroutines involved in one request
  3. Enforcing deadlines across API boundaries for all goroutines involved in one request

From my experience:

  1. If a function needs a context, pass it as the first parameter idiomatically. If a function needs multiple contexts, the function or design likely needs refactoring. Contexts are hierarchical, so it's unusual for a function to operate at multiple hierarchies
  2. If a struct has a context field, how is it initialized? For example, New(ctx context.Context, ...) suggests the New function needs to handle cancellation, which is confusing
  3. If a struct has a context field, does the context apply to all its methods or just some? Some methods likely don't care about the context. Passing context explicitly to methods that need it is clearer
  4. Storing a context in a struct makes testing more difficult. Passing a context explicitly to methods during testing provides greater flexibility and clarity

Therefore, for Go code that doesn't need backward compatibility with pre-context APIs, storing a context in a struct is almost always an anti-pattern. Learn more in Contexts and structs.