Lessons from an errgroup and Context mishap

A recent faulty release disrupted service for some of our customers. The root cause was a bug related to handling Go concurrency using x/sync/errgroup and context cancellation. In this post, I'll share three good practices in Go that we learned from the incident. These practices would either alert us to similar issues in the future or help us catch them during code review.

What does the buggy code do?

For the purpose of this post, I've simplified the program as follows.

  1. There is one director that manages three managers.
  2. Each manager has tasks to complete from time to time.
  3. The director maintains a control session with an external system.
  4. The director monitors the system for anomalies. When an anomaly is detected, the director sends an important message to each manager and terminates the control session.

The full buggy/main.go code is as follows. Run it in the Go Playground.

  1package main
  2
  3import (
  4  "context"
  5  "fmt"
  6  "math/rand"
  7  "time"
  8
  9  "golang.org/x/sync/errgroup"
 10)
 11
 12func main() {
 13  group, ctx := errgroup.WithContext(context.Background())
 14
 15  // set up 3 managers.
 16  managers := make([]*Manager, 3)
 17  for i := range 3 {
 18    managers[i] = &Manager{
 19      id:     fmt.Sprintf("m-%d", i),
 20      ctx:    ctx,
 21      taskCh: make(chan func(), 10),
 22      stopCh: make(chan func()),
 23    }
 24    // simulate tasks each manager needs to complete.
 25    for range 5 {
 26      go func() {
 27        managers[i].taskCh <- func() {
 28          time.Sleep(time.Duration(rand.Intn(2)) * time.Second)
 29        }
 30      }()
 31    }
 32  }
 33
 34  // set up the director
 35  d := &Director{
 36    ctx:            ctx,
 37    group:          group,
 38    controlSession: &ControlSession{},
 39    managers:       managers,
 40  }
 41
 42  d.Start()
 43}
 44
 45type Director struct {
 46  ctx            context.Context
 47  group          *errgroup.Group
 48  managers       []*Manager
 49  controlSession *ControlSession
 50}
 51
 52func (d *Director) Start() {
 53  for _, m := range d.managers {
 54    m := m
 55    d.group.Go(func() error {
 56      m.Start()
 57      return nil
 58    })
 59  }
 60
 61  ctx, cancel := context.WithCancel(d.ctx)
 62  d.group.Go(func() error {
 63    d.monitorAnomaly(ctx, cancel)
 64    return nil
 65  })
 66
 67  d.group.Go(func() error {
 68    return d.controlSession.Start(ctx)
 69  })
 70
 71  doneCh := make(chan error)
 72  go func() {
 73    doneCh <- d.group.Wait()
 74  }()
 75
 76  for {
 77    select {
 78    case <-d.ctx.Done():
 79      fmt.Println("Stopped director: context canceled")
 80      return
 81    case err := <-doneCh:
 82      fmt.Printf("Stopped director: %v\n", err)
 83      return
 84    }
 85  }
 86}
 87
 88func (d *Director) monitorAnomaly(ctx context.Context, cancel context.CancelFunc) {
 89  ticker := time.NewTicker(time.Second)
 90  defer ticker.Stop()
 91
 92  for {
 93    select {
 94    case <-ticker.C:
 95      if d.anomalyDetected() {
 96        fmt.Println("Anomaly detected.")
 97        cancel()
 98      }
 99    case <-ctx.Done():
100      d.notifyManagers()
101      return
102    }
103  }
104}
105
106func (d *Director) notifyManagers() {
107  for _, m := range d.managers {
108    m := m
109    m.stopCh <- func() {
110      fmt.Printf("Important message for %s!\n", m.id)
111    }
112  }
113}
114
115func (d *Director) anomalyDetected() bool {
116  return true
117}
118
119// Manager is an actor that processes actions from the actionCh channel.
120type Manager struct {
121  ctx context.Context
122
123  id string
124
125  // a buffered channel for regular tasks
126  taskCh chan func()
127
128  // a unbuffered channel for the function to execute before exit
129  stopCh chan func()
130}
131
132func (m *Manager) Start() {
133  for {
134    select {
135    case f := <-m.taskCh:
136      f()
137    case f := <-m.stopCh:
138      f()
139      fmt.Printf("Stopped manager %s: stop function received.\n", m.id)
140      return
141    case <-m.ctx.Done():
142      fmt.Printf("Stopped manager %s: context canceled.\n", m.id)
143      return
144    }
145  }
146}
147
148// ControlSession represents a session to the control plane.
149type ControlSession struct{}
150
151func (s *ControlSession) Start(ctx context.Context) error {
152  for {
153    select {
154    // For simplicity, we leave out activities of the session.
155    case <-ctx.Done():
156      fmt.Println("Stopped the control session.")
157      return ctx.Err()
158    }
159  }
160}

When we run the code, not all managers receive the important messages. Most often, no manager receives the message; occasionally, one manager does. This non-determinism comes from the non-determinism of the select statement in Go.

1Anomaly detected.
2Stopped the control session.
3Stopped director: context canceled
1Anomaly detected.
2Important message for m-0!. Anomaly Detected.
3Stopped manager m-0: stop function received.
4Stopped the control session.
5Stopped director: context canceled

Have you spotted the bug?

The code uses an errgroup with context cancellation to manage goroutines. If any goroutine returns an error, all other goroutines are canceled. Otherwise, the errgroup wait for all goroutines to complete. The bug lies at two places:

1if d.anomalyDetected() {
2  fmt.Println("Anomaly detected.") // line 97
3  cancel()
4}
5
6d.group.Go(func() error {
7  return d.controlSession.Start(ctx) // line 68
8})

When an anomaly is detected, line 97 cancels the context used for the control session. This causes line 68 to return an error (context.Canceled), which in turn causes all goroutines in the errgroup to stop. As a result, the manager goroutines may not have a chance to process the important message from the director.

How to fix the bug with minimal changes?

The simplest fix is to modify the control session goroutine to return nil if its context is canceled. Additionally, I addressed a code smell: passing both the context and cancelFunc to a callee. It's almost always better for the caller - who owns the context - to handle context cancellation rather than delegating it to the callee. Below is the diff, and you can view
the full code at fixed/main.go. Run it in Go Playground.

 1diff -u buggy/main.go fixed/main.go
 2--- buggy/main.go  2025-03-23 19:51:26
 3+++ fixed/main.go  2025-03-23 19:58:31
 4@@ -60,12 +60,14 @@
 5 
 6   ctx, cancel := context.WithCancel(d.ctx)
 7   d.group.Go(func() error {
 8-    d.monitorAnomaly(ctx, cancel)
 9+    defer cancel()
10+    d.monitorAnomaly()
11     return nil
12   })
13 
14   d.group.Go(func() error {
15-    return d.controlSession.Start(ctx)
16+    d.controlSession.Start(ctx)
17+    return nil
18   })
19 
20   doneCh := make(chan error)
21@@ -85,7 +87,7 @@
22   }
23 }
24 
25-func (d *Director) monitorAnomaly(ctx context.Context, cancel context.CancelFunc) {
26+func (d *Director) monitorAnomaly() {
27   ticker := time.NewTicker(time.Second)
28   defer ticker.Stop()
29 
30@@ -94,11 +96,9 @@
31     case <-ticker.C:
32       if d.anomalyDetected() {
33         fmt.Println("Anomaly detected.")
34-        cancel()
35+        d.notifyManagers()
36+        return
37       }
38-    case <-ctx.Done():
39-      d.notifyManagers()
40-      return
41     }
42   }
43 }

Here is the log output. As you can see, each manager now handles the important message.

1Anomaly detected.
2Important message for m-0!. Anomaly Detected.
3Stopped manager m-0: stop function received.
4Stopped the control session.
5Important message for m-1!. Anomaly Detected.
6Stopped manager m-1: stop function received.
7Important message for m-2!. Anomaly Detected.
8Stopped manager m-2: stop function received.
9Stopped director: context canceled

Lessons we learned

Lesson 1: Using errgroup effectively

Using errgroup is convenient for small scopes and short-lived goroutines, such as synchronize multiple API calls. However, managing long-running goroutines in a large scope with errgroup requires careful handling of context cancellation and error propagation.

First, errgroup is polymorphic! Its behavior depends on how it's defined and configured. For example:

  1. g := new(errgroup.Group): g.Wait() blocks until all functions passed to g.Go() return, then returns the first non-nil error (if any).
  2. g, ctx := errgroup.WithContext(ctx): In addition to blocking and returning the first non-nil error (if any), g also cancels the ctx immediately when any function returns an error. Often, canceling the context means canceling all other goroutines in the errgroup.
  3. g.SetLimit(n): Limits the number of active goroutines in the group to at most n. In this case, you might want to use g.TryGo() instead of g.Go() to avoid unexpected blocking.

Second, it's often clearer to synchronize long-running goroutines using Go primitives. errgroup itself is implemented in fewer than 120 lines using sync.WaitGroup, sync.Once, and channels. For long-running daemons, explicit synchronization with channels, native "go" routines, and sometimes sync.WaitGroup is preferable. Using errgroup for such cases can make the code hard to read, and not saving lines of code anyway. For example, you cannot directly select from the errgroup's Wait(), you need to use workarounds like this.

1ch := make(chan error)
2go func() {
3  ch <- g.Wait()
4}
5select {
6  case err := <- ch: // do something 
7}

Lesson 2: Avoid passing both Context and cancelFunc

It's almost always an anti-pattern to pass both a context.Context and its associated cancelFunc as parameters to a function.

  1. Separation of Concerns: The cancelFunc is used to cancel the context and should typically be managed by the caller or the scope that created the context. Passing it down to a function blurs the separation of concerns, as the function should not be responsible for canceling a context it didn't create.

  2. Unexpected Behavior: If a function receives a cancelFunc, it might be tempted to call it, leading to premature cancellation of the context. This can cause unexpected behavior in the caller or other parts still using the same context. This is exactly what caused the disruption for our customers.

While there may be rare cases where passing both is necessary, ensure such use cases are clearly documented.

Lesson 3: Avoid Storing context.Context in Structs

Storing a context.Context in a struct often leads to poorly designed concurrency code. As seen in this example, both the Director and Manager store a context as a field. The Go documentation for context states:

Do not store Contexts inside a struct type; instead, pass a Context explicitly to each function that needs it.

Contexts are a concurrency pattern introduced for three purposes:

  1. Passing request-scoped values.
  2. Propagating cancellation signals to all goroutines involved in one request.
  3. Enforcing deadlines across API boundaries to all the goroutines involved in one request.

In my experience:

  1. If a function needs a context, it's idiomatic to pass it as the first parameter. If a function needs more than one context, it’s a sign that the function or design likely needs refactoring. Contexts are hierarchical, so it’s unusual for a function to operate at multiple hierarchies.
  2. If a struct has a context field, how is it initialized? For example, "New(ctx context.Context, ...)" suggests that the New function needs to handle cancellation, which is confusing.
  3. If a struct has a context filed, does the context apply to all its methods or just some? It's likely that some methods don't care about the context. It is less clear than passing context to methods that need a context, isn't it?
  4. Storing a context in a struct makes testing more difficult. Passing a context explicitly to methods during testing provides greater flexibility and clarity.

Therefore, for Go code that doesn't need to preserve backwards compatibility with pre-context APIs, storing a context in a struct is almost always an anti-pattern. Learn more in the Contexts and structs.