Sharp edges of errgroup: Lessons from an errgroup and Context mishap
A recent faulty release disrupted service for some customers. The root cause was a concurrency bug involving x/sync/errgroup and context cancellation. This post shares three practices we learned from the incident. These practices will help us catch similar issues during code review or alert us to problems in production.
What does the buggy code do?
I've simplified the program for this post as follows:
- One director manages three managers
- Each manager processes tasks periodically
- The director maintains a control session with an external system
- The director monitors for anomalies. When detected, it sends an important message to each manager and terminates the control session
The buggy/main.go code is below. Run it in the Go Playground.
1package main
2
3import (
4 "context"
5 "fmt"
6 "math/rand"
7 "time"
8
9 "golang.org/x/sync/errgroup"
10)
11
12func main() {
13 group, ctx := errgroup.WithContext(context.Background())
14
15 // set up 3 managers.
16 managers := make([]*Manager, 3)
17 for i := range 3 {
18 managers[i] = &Manager{
19 id: fmt.Sprintf("m-%d", i),
20 ctx: ctx,
21 taskCh: make(chan func(), 10),
22 stopCh: make(chan func()),
23 }
24 // simulate tasks each manager needs to complete.
25 for range 5 {
26 go func() {
27 managers[i].taskCh <- func() {
28 time.Sleep(time.Duration(rand.Intn(2)) * time.Second)
29 }
30 }()
31 }
32 }
33
34 // set up the director
35 d := &Director{
36 ctx: ctx,
37 group: group,
38 controlSession: &ControlSession{},
39 managers: managers,
40 }
41
42 d.Start()
43}
44
45type Director struct {
46 ctx context.Context
47 group *errgroup.Group
48 managers []*Manager
49 controlSession *ControlSession
50}
51
52func (d *Director) Start() {
53 for _, m := range d.managers {
54 m := m
55 d.group.Go(func() error {
56 m.Start()
57 return nil
58 })
59 }
60
61 ctx, cancel := context.WithCancel(d.ctx)
62 d.group.Go(func() error {
63 d.monitorAnomaly(ctx, cancel)
64 return nil
65 })
66
67 d.group.Go(func() error {
68 return d.controlSession.Start(ctx)
69 })
70
71 doneCh := make(chan error)
72 go func() {
73 doneCh <- d.group.Wait()
74 }()
75
76 for {
77 select {
78 case <-d.ctx.Done():
79 fmt.Println("Stopped director: context canceled")
80 return
81 case err := <-doneCh:
82 fmt.Printf("Stopped director: %v\n", err)
83 return
84 }
85 }
86}
87
88func (d *Director) monitorAnomaly(ctx context.Context, cancel context.CancelFunc) {
89 ticker := time.NewTicker(time.Second)
90 defer ticker.Stop()
91
92 for {
93 select {
94 case <-ticker.C:
95 if d.anomalyDetected() {
96 fmt.Println("Anomaly detected.")
97 cancel()
98 }
99 case <-ctx.Done():
100 d.notifyManagers()
101 return
102 }
103 }
104}
105
106func (d *Director) notifyManagers() {
107 for _, m := range d.managers {
108 m := m
109 m.stopCh <- func() {
110 fmt.Printf("Important message for %s!\n", m.id)
111 }
112 }
113}
114
115func (d *Director) anomalyDetected() bool {
116 return true
117}
118
119// Manager is an actor that processes actions from the actionCh channel.
120type Manager struct {
121 ctx context.Context
122
123 id string
124
125 // a buffered channel for regular tasks
126 taskCh chan func()
127
128 // a unbuffered channel for the function to execute before exit
129 stopCh chan func()
130}
131
132func (m *Manager) Start() {
133 for {
134 select {
135 case f := <-m.taskCh:
136 f()
137 case f := <-m.stopCh:
138 f()
139 fmt.Printf("Stopped manager %s: stop function received.\n", m.id)
140 return
141 case <-m.ctx.Done():
142 fmt.Printf("Stopped manager %s: context canceled.\n", m.id)
143 return
144 }
145 }
146}
147
148// ControlSession represents a session to the control plane.
149type ControlSession struct{}
150
151func (s *ControlSession) Start(ctx context.Context) error {
152 for {
153 select {
154 // For simplicity, we leave out activities of the session.
155 case <-ctx.Done():
156 fmt.Println("Stopped the control session.")
157 return ctx.Err()
158 }
159 }
160}
When running this code, not all managers receive the important messages. Most often, no manager receives the message; occasionally, one does. This non-deterministic behavior results from Go's select statement scheduling.
1Anomaly detected.
2Stopped the control session.
3Stopped director: context canceled
1Anomaly detected.
2Important message for m-0!. Anomaly Detected.
3Stopped manager m-0: stop function received.
4Stopped the control session.
5Stopped director: context canceled
Have you spotted the bug?
The code uses an errgroup with context cancellation to manage goroutines. If any goroutine returns an error, all other goroutines are canceled. Otherwise, the errgroup waits for all goroutines to complete. The bug lies in two places:
1if d.anomalyDetected() {
2 fmt.Println("Anomaly detected.") // line 97
3 cancel()
4}
5
6d.group.Go(func() error {
7 return d.controlSession.Start(ctx) // line 68
8})
When an anomaly is detected, line 97 cancels the context used for the control session. This causes line 68 to return an error (context.Canceled
),
which triggers errgroup to cancel all goroutines. Consequently, manager goroutines may not have a chance to process the important
message from the director.
How to fix the bug with minimal changes?
The simplest fix is to modify the control session goroutine to return nil
when its context is canceled.
Additionally, I addressed a code smell: passing both the context and cancelFunc
to a callee. The caller who owns the
context should handle cancellation rather than delegating it to the callee. Below is the diff; you can view
the full code at fixed/main.go.
Run it in the Go Playground.
1diff -u buggy/main.go fixed/main.go
2--- buggy/main.go 2025-03-23 19:51:26
3+++ fixed/main.go 2025-03-23 19:58:31
4@@ -60,12 +60,14 @@
5
6 ctx, cancel := context.WithCancel(d.ctx)
7 d.group.Go(func() error {
8- d.monitorAnomaly(ctx, cancel)
9+ defer cancel()
10+ d.monitorAnomaly()
11 return nil
12 })
13
14 d.group.Go(func() error {
15- return d.controlSession.Start(ctx)
16+ d.controlSession.Start(ctx)
17+ return nil
18 })
19
20 doneCh := make(chan error)
21@@ -85,7 +87,7 @@
22 }
23 }
24
25-func (d *Director) monitorAnomaly(ctx context.Context, cancel context.CancelFunc) {
26+func (d *Director) monitorAnomaly() {
27 ticker := time.NewTicker(time.Second)
28 defer ticker.Stop()
29
30@@ -94,11 +96,9 @@
31 case <-ticker.C:
32 if d.anomalyDetected() {
33 fmt.Println("Anomaly detected.")
34- cancel()
35+ d.notifyManagers()
36+ return
37 }
38- case <-ctx.Done():
39- d.notifyManagers()
40- return
41 }
42 }
43 }
The fixed version produces this log output, showing that each manager now handles the important message:
1Anomaly detected.
2Important message for m-0!. Anomaly Detected.
3Stopped manager m-0: stop function received.
4Stopped the control session.
5Important message for m-1!. Anomaly Detected.
6Stopped manager m-1: stop function received.
7Important message for m-2!. Anomaly Detected.
8Stopped manager m-2: stop function received.
9Stopped director: context canceled
Lessons we learned
Lesson 1: Using errgroup effectively
Using errgroup is convenient for small scopes and short-lived goroutines, such as synchronizing multiple API calls. However, managing long-running goroutines in a large scope with errgroup requires careful handling of context cancellation and error propagation.
First, errgroup behavior depends on how it's defined and configured:
g := new(errgroup.Group)
:g.Wait()
blocks until all functions passed tog.Go()
return, then returns the first non-nil error (if any)g, ctx := errgroup.WithContext(ctx)
: In addition to blocking and returning the first non-nil error (if any), it also cancels the ctx immediately when any function returns an error. Often, canceling the context means canceling all other goroutines in the errgroupg.SetLimit(n)
: Limits the number of active goroutines in the group to at most n. In this case, you might want to useg.TryGo()
instead ofg.Go()
to avoid unexpected blocking
Second, it's often clearer to synchronize long-running goroutines using Go primitives. errgroup itself is implemented in fewer than 120 lines using sync.WaitGroup
, sync.Once
, and channels.
For long-running daemons, explicit synchronization with channels, goroutines, and sometimes sync.WaitGroup
is preferable. Using errgroup for such cases can make the code hard to read and does not save lines of code anyway. For example, you cannot directly select from errgroup's Wait()
; you need workarounds like this:
1ch := make(chan error)
2go func() {
3 ch <- g.Wait()
4}()
5select {
6case err := <-ch:
7 // do something
8}
Lesson 2: Avoid passing both Context and cancelFunc
It's almost always an anti-pattern to pass both a context.Context
and its associated cancelFunc
as parameters to a function.
-
Separation of Concerns: The
cancelFunc
should typically be managed by the caller or scope that created the context. Passing it down blurs separation of concerns, as the function shouldn't be responsible for canceling a context it didn't create. -
Unexpected Behavior: If a function receives a
cancelFunc
, it might call it prematurely, causing unexpected behavior in the caller or other parts still using the same context. This is exactly what caused the disruption for our customers.
While rare cases may require passing both, clearly document such use cases.
Lesson 3: Avoid Storing context.Context in Structs
Storing a context.Context
in a struct often leads to poorly designed concurrency code. In this example, both the Director and Manager store a context as a field. The Go documentation for context states:
Do not store Contexts inside a struct type; instead, pass a Context explicitly to each function that needs it.
Contexts are a concurrency pattern introduced for three purposes:
- Passing request-scoped values
- Propagating cancellation signals to all goroutines involved in one request
- Enforcing deadlines across API boundaries for all goroutines involved in one request
From my experience:
- If a function needs a context, pass it as the first parameter idiomatically. If a function needs multiple contexts, the function or design likely needs refactoring. Contexts are hierarchical, so it's unusual for a function to operate at multiple hierarchies
- If a struct has a context field, how is it initialized? For example,
New(ctx context.Context, ...)
suggests theNew
function needs to handle cancellation, which is confusing - If a struct has a context field, does the context apply to all its methods or just some? Some methods likely don't care about the context. Passing context explicitly to methods that need it is clearer
- Storing a context in a struct makes testing more difficult. Passing a context explicitly to methods during testing provides greater flexibility and clarity
Therefore, for Go code that doesn't need backward compatibility with pre-context APIs, storing a context in a struct is almost always an anti-pattern. Learn more in Contexts and structs.