Avoid panic on expected errors: lessons from operating journald-to-cwl
We've been using journald-to-cwl to ship journal logs from EC2 instances to CloudWatch Logs. It is lightweight and reliable. However, we recently started receiving false positive alarms, which became annoying. This post covers the changes we made and the key lesson learned: panicking on expected errors in Go is generally a bad idea.
Where do false positive alarms come from?
We run many Go programs in production, monitoring their logs and alarming if the total number of panics is over a threshold.
Recently, we expanded the log monitoring to include journal logs. Shortly after, we started receiving more alarms.
Since the alarm is aggregated per AWS region, operators must sift through logs to identify which program panicked.
About 10% of the panics come from journald-to-cwl, which writes its own logs to the journal. For example:
panic: cannot write events to CWL%!(EXTRA *errors.errorString=too many requests)

operation error CloudWatch Logs: PutLogEvents, https response error StatusCode: 400,
 api error ExpiredTokenException: The security token included in the request is expired.
Why was panic used in the first place?
Simplicity is at the core of journald-to-cwl (JCL). If JCL encounters an issue it cannot handle, such as expired credentials, it simply exits. I took this simplicity even further: for retriable failures like API throttling, JCL retries once and exits if the retry fails.
Since systemd automatically restarts JCL, exiting is a safe operation that doesn’t result in log loss. This works because JCL updates the journal log cursor only after logs are successfully acknowledged by CloudWatch Logs (CWL). Leveraging systemd’s restart mechanism also lets JCL do less work. As for how JCL exits, I chose panic over os.Exit(). In Go, panic not only logs the error but also runs deferred functions and sets a non-zero exit code before exiting. This design has proven to be simple and robust.
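To make the "cursor after acknowledgement" idea concrete, here is a minimal sketch of one delivery step. The Journal and Sink interfaces, the integer cursor, and the in-memory fakes are assumptions made up for illustration; they are not JCL's actual types.

```go
package main

import (
	"errors"
	"fmt"
)

// Journal and Sink are hypothetical interfaces for this sketch only.
type Journal interface {
	// ReadAfter returns the entries after cursor and the cursor that follows them.
	ReadAfter(cursor int) (entries []string, next int, err error)
	SaveCursor(cursor int) error
}

type Sink interface {
	Put(entries []string) error
}

// shipOnce reads one batch, delivers it, and persists the cursor only
// after the sink has acknowledged the batch.
func shipOnce(j Journal, s Sink, cursor int) (int, error) {
	entries, next, err := j.ReadAfter(cursor)
	if err != nil {
		return cursor, err
	}
	if err := s.Put(entries); err != nil {
		// The cursor is untouched, so a restarted process reads the same entries again.
		return cursor, fmt.Errorf("deliver: %w", err)
	}
	if err := j.SaveCursor(next); err != nil {
		return cursor, err
	}
	return next, nil
}

// memJournal is an in-memory fake; saved plays the role of the persisted cursor.
type memJournal struct {
	entries []string
	saved   int
}

func (m *memJournal) ReadAfter(cursor int) ([]string, int, error) {
	return m.entries[cursor:], len(m.entries), nil
}

func (m *memJournal) SaveCursor(c int) error { m.saved = c; return nil }

// flakySink fails the first `failures` calls, then accepts everything.
type flakySink struct {
	failures int
	got      []string
}

func (f *flakySink) Put(entries []string) error {
	if f.failures > 0 {
		f.failures--
		return errors.New("expired token")
	}
	f.got = append(f.got, entries...)
	return nil
}

func main() {
	j := &memJournal{entries: []string{"a", "b"}}
	s := &flakySink{failures: 1}

	// First attempt fails: the persisted cursor stays where it was.
	_, err := shipOnce(j, s, j.saved)
	fmt.Println("first attempt:", err, "saved cursor:", j.saved)

	// Simulated restart: resume from the persisted cursor, nothing is lost.
	_, err = shipOnce(j, s, j.saved)
	fmt.Println("after restart:", err, "saved cursor:", j.saved, "delivered:", s.got)
}
```

The trade-off of this ordering is at-least-once delivery: if the process dies between a successful Put and SaveCursor, the same batch is sent again after the restart.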
How to remove explicit panic?
For retriable errors like API throttling, we now retry indefinitely with a delay. For non-retriable errors like expired credentials, we need to log the error, run deferred functions, exit, and let systemd restart the service to refresh credentials.
The core of JCL follows a "Read → Batch → Deliver" pipeline. The following is the code before the change.
func main() {
	// ... set up resources ...

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	go reader.Read(ctx)

	go batcher.Batch(ctx)

	go writer.Write(ctx)

	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGINT, syscall.SIGTERM)
	<-ch
}
It may not be obvious, but panic was previously handling several things for us:
- If any step in the Read → Batch → Deliver pipeline failed, it would stop the entire pipeline.
- It ran deferred functions before exiting, releasing resources (note that Go only runs the deferred functions of the goroutine that panics).
- It set the exit code to non-zero to indicate failure.
After removing panic and replacing it with explicit returns, we needed to adjust main() to handle exit behavior correctly.
func main() {
	// os.Exit terminates the process immediately without running deferred
	// functions. All the work, and every defer, therefore lives in execute(),
	// and main() only turns its result into the exit code.
	os.Exit(execute())
}

func execute() int {
	// ... set up resources ...

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	go func() {
		defer cancel() // cancel to stop the whole pipeline
		reader.Read(ctx)
	}()

	go func() {
		defer cancel()
		batcher.Batch(ctx)
	}()

	go func() {
		defer cancel()
		writer.Write(ctx)
	}()

	// Same signal handling as before the change.
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGINT, syscall.SIGTERM)

	select {
	case s := <-ch:
		zap.S().Infof("exit signal %v", s)
	case <-ctx.Done():
		return 1 // Ensure a non-zero exit code on failure
	}

	return 0
}
Explicit panic on expected errors has been removed from journald-to-cwl v0.1.1.