<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Practical Engineering</title>
    <link>https://peng.fyi/</link>
    <description>Recent content on Practical Engineering</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <copyright>Peng Zhang</copyright>
    <lastBuildDate>Thu, 26 Feb 2026 00:00:00 +0000</lastBuildDate><atom:link href="https://peng.fyi/index.xml" rel="self" type="application/rss+xml" />
    <item>
      <title>Avoid using 2D map for transition table in Go</title>
      <link>https://peng.fyi/post/avoid-2d-map-for-state-transition-in-go/</link>
      <pubDate>Thu, 26 Feb 2026 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/avoid-2d-map-for-state-transition-in-go/</guid>
      <description>
        
          
            This post is part 1 of a series of learnings from nilaway. nilaway is a static analysis tool that detects potential nil panics in Go code. It does report false positives, but it&#39;s far from naive. One limitation is that nilaway is flow-sensitive (it understands if x != nil) but not value-correlation-sensitive (it doesn&#39;t understand &amp;quot;if y == SomeConst, then x is non-nil&amp;quot;). From my experience, most false positives point to code smells or code that is fragile to maintain.
          
          
        
      </description>
    </item>
    
    <item>
      <title>cached-imds-client: cache static IMDS responses to improve robustness</title>
      <link>https://peng.fyi/post/cache-imds-to-avoid-linklocal-throttle/</link>
      <pubDate>Sun, 15 Feb 2026 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/cache-imds-to-avoid-linklocal-throttle/</guid>
      <description>
        
          
            I recently encountered IMDS (Instance Metadata Service) request failures with this error:
&amp;#34;caller&amp;#34;: &amp;#34;actor/actor.go:101&amp;#34;,
&amp;#34;error&amp;#34;: &amp;#34;operation error ec2imds: GetMetadata, failed to get rate limit token, retry quota exceeded, x available, y requested
The root cause: the aws-sdk-go-v2 IMDS client exhausted its retry token bucket. The AWS SDK for Go v2 implements client-side rate limiting (&amp;quot;retry quotas&amp;quot;) to prevent overwhelming services. The standard strategy works as follows:
Initial attempt: Consumes no tokens, adds 1 token back on success
Retry attempt: Deducts 5 tokens (10 for timeouts)
If tokens available: Retry proceeds
If tokens exhausted: Returns QuotaExceededError
On success: Adds 1 token back (max capacity: 500)
When IMDS requests time out or retry frequently, they drain the token bucket quickly.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Using WaitGroup to Track Work Items, Not Workers: A Multi-threaded BFS Example</title>
      <link>https://peng.fyi/post/waitgroup-track-work-not-workers-multi-thread-bfs/</link>
      <pubDate>Sun, 15 Feb 2026 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/waitgroup-track-work-not-workers-multi-thread-bfs/</guid>
      <description>
        
          
            WaitGroup and channels are two powerful primitives in Go for synchronizing goroutines. A common pattern uses a WaitGroup to wait for goroutines to complete:
wg.Add(1)
go func() {
    defer wg.Done()
    for {
        select {
        case &amp;lt;-done:
            return
        case task := &amp;lt;-tasks:
            handle(task)
        }
    }
}()
wg.Wait()
In this post, we&#39;ll explore an interesting use of WaitGroup and channels where the WaitGroup counts the work items that goroutines handle rather than the goroutines themselves.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Simplify device path on boot with udev</title>
      <link>https://peng.fyi/post/simplify-device-path-on-boot-with-udev/</link>
      <pubDate>Mon, 02 Feb 2026 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/simplify-device-path-on-boot-with-udev/</guid>
      <description>
        
          
            While prototyping Bottlerocket, I discovered it doesn&#39;t recognize additional EBS volumes specified through Block device mappings on Xen. For example, launching the same AMI on t2.medium (Xen) and t3.medium (Nitro) with &amp;quot;DeviceName=/dev/xvdcz&amp;quot;:
On Nitro, the device appears at /dev/nvme1n1 and /dev/disk/by-ebs-id/xvdcz. On Xen, it appears at /dev/xvdcz. In the prototype, the xvdcz volume is formatted, labeled, and mounted, involving format-data-volume.service and mnt-data.mount. The format service hardcodes its dependency on /dev/disk/by-ebs-id/xvdcz.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Use KillMode=process with caution: restart loop could deplete resources</title>
      <link>https://peng.fyi/post/think-again-before-use-killmode-process/</link>
      <pubDate>Fri, 12 Dec 2025 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/think-again-before-use-killmode-process/</guid>
      <description>
        
          
            I recently debugged a resource leak where a systemd service kept restarting while leaving a process behind after each restart. The root cause isn&#39;t particularly interesting: a backward-incompatible third-party dependency upgrade. But the debugging process and lessons learned are.
Thousands of zombie processes from a systemd service
I have a foo.service that runs /usr/bin/start-foo, which spawns a new process to run /usr/bin/foo. A few minutes after boot, CPU and memory utilization reached 100%, growing linearly rather than spiking.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Spawning a New Process for Socket-Activated Daemons is Error-Prone</title>
      <link>https://peng.fyi/post/antipattern-in-socket-activation-spawn-new-process-for-the-daemon/</link>
      <pubDate>Wed, 10 Dec 2025 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/antipattern-in-socket-activation-spawn-new-process-for-the-daemon/</guid>
      <description>
        
          
            I recently debugged a mysterious latency issue: after migrating a systemd service from path-activation to socket-activation, there was a consistent ~1 second time-to-available latency. The culprit was a bad practice—starting the daemon program as a new process in socket-activation. Let&#39;s dive into the details.
Starting soci-snapshotter On-Demand Using Socket Activation
soci-snapshotter is an open-source containerd snapshotter plugin that enables pulling OCI images in lazy loading mode using SOCI indexes. In our container runtime, we start soci-snapshotter only when an image has a SOCI index.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Be careful making thread-aware syscalls in Go: lock the thread</title>
      <link>https://peng.fyi/post/thread-aware-syscall-in-go-lock-the-thread/</link>
      <pubDate>Mon, 20 Oct 2025 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/thread-aware-syscall-in-go-lock-the-thread/</guid>
      <description>
        
          
            A bug caused around 0.5% of container workloads to fail to start during a load test. This post walks through the bug and its fix, an interesting mix of Linux namespaces, Go concurrency, and syscalls.
The need to run a program in its own network namespace and mount namespace
soci-snapshotter is an open-source containerd snapshotter plugin that enables pulling OCI images in lazy loading mode. It downloads the layers necessary to start the container and then downloads the rest while the container is running.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Speed up building Bottlerocket image in AWS CodeBuild</title>
      <link>https://peng.fyi/post/speed-up-building-br-image-in-codebuild/</link>
      <pubDate>Mon, 20 Oct 2025 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/speed-up-building-br-image-in-codebuild/</guid>
      <description>
        
          
            When I first moved Bottlerocket AMI builds from an EC2 host to AWS CodeBuild, I was hit by a very slow build. On EC2, I built both the x86 and Arm versions on x86 instances, and fresh builds finished in 5 minutes. However, on CodeBuild, even with more vCPUs and memory, the build was painfully slow.
The x86 CodeBuild uses a compute of &amp;quot;145 GB memory, 72 vCPUs&amp;quot;. The build finishes in 18 minutes, with the build-variant step taking 97% of the time.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Mysterious Image Pull Failures: &#34;401 Unauthorized&#34; and &#34;Not Found&#34; After Migrating Containerd to v2</title>
      <link>https://peng.fyi/post/mysterious-image-pull-failure-wait-for-interface-ready/</link>
      <pubDate>Sun, 12 Oct 2025 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/mysterious-image-pull-failure-wait-for-interface-ready/</guid>
      <description>
        
          
            Early this year, we migrated containerd from v1.7 to v2.0.5. However, we quickly noticed image pulls from Amazon Elastic Container Registry (ECR) began failing for both public and private ECR repositories. For example:
# public ECR
FATA[0031] failed to resolve reference &amp;#34;public.ecr.aws/aws-cli/aws-cli:2.31.5@sha256:9cb6ab9c8852d7e1e63f43299dca0628e92448d1c7589f7ed40344e7f61aad59&amp;#34;: \
unexpected status from HEAD request to https://public.ecr.aws/v2/aws-cli/aws-cli/blobs/sha256:9cb6ab9c8852d7e1e63f43299dca0628e92448d1c7589f7ed40344e7f61aad59: \
401 Unauthorized

# private ECR
FATA[0031] failed to resolve reference &amp;#34;&amp;lt;account-id&amp;gt;.dkr.ecr.us-west-2.amazonaws.com/xxxx@sha256:1db8db35b1afaa9d2df40f68e35cc0e4f406b5f73667b1ba2d0f73dbb15aed01&amp;#34;: \
&amp;lt;account-id&amp;gt;.dkr.ecr.us-west-2.amazonaws.com/xxxx@sha256:1db8db35b1afaa9d2df40f68e35cc0e4f406b5f73667b1ba2d0f73dbb15aed01: \
not found
The failures occurred at a low frequency of around 0.
          
          
        
      </description>
    </item>
    
    <item>
      <title>EC2 IMDS is Unstable During Early Boot: Always Retry</title>
      <link>https://peng.fyi/post/retry-imds-request/</link>
      <pubDate>Mon, 15 Sep 2025 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/retry-imds-request/</guid>
      <description>
        
          
            In &amp;quot;Detect and fix rare cases where the primary ENI does not serve default traffic&amp;quot;, we used IMDS &amp;quot;meta-data/mac&amp;quot; to get the primary ENI&#39;s MAC address. However, we encountered the following errors in 0.5% of EC2 ARM instance launches:
failed to get IMDS /mac: operation error ec2imds: GetMetadata, exceeded maximum number of attempts, 3,
http response error StatusCode: 401, request to EC2 IMDS failed
failed to get IMDS /mac: operation error ec2imds: GetMetadata, exceeded maximum number of attempts, 3,
request send failed, Get &amp;#34;http://169.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Who Modified My Program in Bottlerocket?</title>
      <link>https://peng.fyi/post/who-moved-my-program/</link>
      <pubDate>Thu, 11 Sep 2025 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/who-moved-my-program/</guid>
      <description>
        
          
            There are a few programs we install in Bottlerocket that cannot be built from source. For these programs, we download the binary from a secure repository and install it using an RPM spec like this:
# foo.spec
Name: %{_cross_os}foo

Source0: foo

%install
install -d %{buildroot}%{_cross_sbindir}
install -D -p -m 0755 %{S:0} %{buildroot}%{_cross_sbindir}
A teammate discovered that the foo binary downloaded from the repository differs from the /sbin/foo installed in the built Bottlerocket AMI:
          
          
        
      </description>
    </item>
    
    <item>
      <title>Introducing bottlerocket-extra-kit: Essential debugging tools for Bottlerocket</title>
      <link>https://peng.fyi/post/debug-on-bottlerocket-and-introduce-bottlerocket-extra-kit/</link>
      <pubDate>Mon, 01 Sep 2025 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/debug-on-bottlerocket-and-introduce-bottlerocket-extra-kit/</guid>
      <description>
        
          
            Bottlerocket is a Linux-based operating system optimized for hosting containers. We use Bottlerocket to run millions of containers each day. There are three key differences between Bottlerocket and common Linux distributions like Amazon Linux 2023:
The rootfs is read-only.
There is no package manager (e.g., yum) in Bottlerocket. Each package in Bottlerocket must be built into the OS variant.
Enforced SELinux with Bottlerocket&#39;s own SELinux policies.
This makes it difficult for developers to debug and experiment.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Tips for Building Bottlerocket AMIs</title>
      <link>https://peng.fyi/post/tips-build-bottlerocket-ami/</link>
      <pubDate>Wed, 20 Aug 2025 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/tips-build-bottlerocket-ami/</guid>
      <description>
        
          
            Bottlerocket is a Linux-based operating system optimized for hosting containers. At work, we migrated from Amazon Linux to Bottlerocket and experienced the following benefits:
Developer-friendly: Easy to understand and fast to build. RPM spec and configuration TOML files are all you need. Every developer can build a Bottlerocket AMI on an EC2 instance in just a few minutes. For example, I can build and register a new Bottlerocket AMI from scratch in less than 4 minutes on a c5.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Working Knowledge of Linux Memory: Concepts</title>
      <link>https://peng.fyi/post/working-knowledge-of-linux-memory-concepts/</link>
      <pubDate>Mon, 04 Aug 2025 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/working-knowledge-of-linux-memory-concepts/</guid>
      <description>
        
          
            I recently dealt with a server livelock issue caused by memory page thrashing. This post refreshes the Linux memory basics I found useful for debugging the issue. Much of the content is from Chapter 7 of Systems Performance: Enterprise and the Cloud.
Virtual Memory
Virtual memory is an abstraction that provides each process and the kernel with its own large, linear, and private address space. Virtual memory is also referred to as process virtual address space.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Detect and fix rare cases where the primary ENI does not serve default traffic</title>
      <link>https://peng.fyi/post/swap-eth-names-when-primary-eni-does-not-serve-default-traffic/</link>
      <pubDate>Sun, 27 Jul 2025 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/swap-eth-names-when-primary-eni-does-not-serve-default-traffic/</guid>
      <description>
        
          
            During testing, we encountered a rare scenario when launching EC2 instances with multiple ENIs: the primary ENI (device index 0) does not serve default network traffic. This occurs in approximately 1 out of 10,000 launches (0.01%).
For example, when configuring two ENIs on an instance—ENI-0 (deviceIndex=0) from subnet-0 and ENI-1 (deviceIndex=1) from subnet-1—Linux may recognize eth0 as being from subnet-1 instead of the expected subnet-0. In my use case, we must ensure default traffic routes through the primary ENI for security compliance.
          
          
        
      </description>
    </item>
    
    <item>
      <title>SELinux Concepts</title>
      <link>https://peng.fyi/post/selinux-part-1-concepts/</link>
      <pubDate>Sun, 15 Jun 2025 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/selinux-part-1-concepts/</guid>
      <description>
        
          
            Security-Enhanced Linux (SELinux) is a mandatory access control (MAC) system that enhances Linux security. &amp;quot;Mandatory&amp;quot; means access control is strictly enforced by predefined policy rules: users and processes cannot modify these rules at will, so security is not left to individual discretion. SELinux is available in major distributions, including Amazon Linux 2023 (AL2023) and Bottlerocket. This post is part 1 of a series on SELinux.
Labels and Contexts
SELinux labels every file and process using a quadruple: (user:role:type:level).
          
          
        
      </description>
    </item>
    
    <item>
      <title>Modern Go idioms</title>
      <link>https://peng.fyi/post/modern-go-idioms/</link>
      <pubDate>Sun, 18 May 2025 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/modern-go-idioms/</guid>
      <description>
        
          
            Go is known for its backward compatibility, simplicity, and six-month release cycle. But that can sometimes lead to code that works yet isn&#39;t as modern as it could be. This post is a living document where I note modern Go idioms I&#39;ve used to improve clarity and maintainability.
Use GOOS and GOARCH in file names for build constraints
When targeting specific operating systems or architectures, Go lets us name files like foo_linux.
          
          
        
      </description>
    </item>
    
    <item>
      <title>A Few Shell Surprises</title>
      <link>https://peng.fyi/post/a-few-shell-surprises/</link>
      <pubDate>Tue, 22 Apr 2025 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/a-few-shell-surprises/</guid>
      <description>
        
          
            Shell scripts are infamous for security issues and surprising behavior, so when possible, it&#39;s better to avoid using shell. For instance, we built a container platform using the Bottlerocket OS, and we didn&#39;t even install a shell. If someone needs to run a shell, it must be run inside a container. That said, shell is still handy for ad hoc scripting. In this post, I&#39;ll share a few surprising behaviors I&#39;ve encountered recently.
          
          
        
      </description>
    </item>
    
    <item>
      <title>x509: certificate signed by unknown authority? Maybe the cert pool is empty</title>
      <link>https://peng.fyi/post/empty-cert-pool-x509-certificate-signed-by-unknown-authority/</link>
      <pubDate>Tue, 15 Apr 2025 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/empty-cert-pool-x509-certificate-signed-by-unknown-authority/</guid>
      <description>
        
          
            I recently worked on getting amazon-ssm-agent to run inside containers on Bottlerocket. During that process, I ran into a TLS issue connecting to amazonaws.com. The root cause turned out to be interesting, and we&#39;ll walk through it in this post.
Running amazon-ssm-agent in a container: why and how?
To enable sessions between a container and the outside world, we followed the same approach as the ECS Execute-Command proposal. The idea is to prepare a directory on the host that contains all the files required by SSM, then bind mount that directory into the container.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Sharp edges of errgroup: Lessons from an errgroup and Context mishap</title>
      <link>https://peng.fyi/post/lessons-from-an-errgroup-and-context-mishap/</link>
      <pubDate>Sun, 23 Mar 2025 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/lessons-from-an-errgroup-and-context-mishap/</guid>
      <description>
        
          
            A recent faulty release disrupted service for some customers. The root cause was a concurrency bug involving x/sync/errgroup and context cancellation. This post shares three practices we learned from the incident. These practices will help us catch similar issues during code review or alert us to problems in production.
What does the buggy code do?
I&#39;ve simplified the program for this post as follows:
One director manages three managers
Each manager processes tasks periodically
The director maintains a control session with an external system
The director monitors for anomalies.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Avoid panic on expected errors: lessons from operating journald-to-cwl</title>
      <link>https://peng.fyi/post/avoid-panic-on-expected-error/</link>
      <pubDate>Sun, 23 Feb 2025 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/avoid-panic-on-expected-error/</guid>
      <description>
        
          
            We&#39;ve been using journald-to-cwl to ship journal logs from EC2 instances to CloudWatch Logs. It is lightweight and reliable. However, we recently started receiving false positive alarms, which became annoying. This blog covers the changes we made and the key lesson learned: panicking on expected errors in Go is generally a bad idea.
Where Do False Positive Alarms Come From?
We run many Go programs in production, monitoring their logs and alarming if the total number of panics exceeds a threshold.
          
          
        
      </description>
    </item>
    
    <item>
      <title>GPG is still in use to verify downloads</title>
      <link>https://peng.fyi/post/gpg-is-still-in-use/</link>
      <pubDate>Sun, 23 Feb 2025 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/gpg-is-still-in-use/</guid>
      <description>
        
          
            This week, I needed to install the Amazon SSM Agent and was surprised to find that GPG (GNU Privacy Guard) was the only way to verify the download. I had assumed that software download verification had largely transitioned to PKI (Public Key Infrastructure). This short post is a refresher on GPG.
OpenPGP is an open standard for encrypting and signing data, originally derived from PGP (Pretty Good Privacy). It defines the format and encryption protocols used for secure communication.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Why does GOMEMLIMIT take up significant physical memory for unused virtual memory?</title>
      <link>https://peng.fyi/post/why-does-gomemlimit-take-up-significant-physical-memory-for-unused-virtual-memory/</link>
      <pubDate>Sun, 19 Jan 2025 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/why-does-gomemlimit-take-up-significant-physical-memory-for-unused-virtual-memory/</guid>
      <description>
        
          
            While debugging memory bloat in a Go application recently, I found that removing the GOMEMLIMIT soft memory limit and disabling transparent huge pages partially mitigated the issue. However, I couldn&#39;t fully explain why these changes worked. So I thought, why not ask the internet?
A simplified memory bloat program
The following Go program, vm-demo.go, demonstrates memory bloat by allocating a 400 MiB slice every second. It saves references to the slices without performing any read or write operations on them.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Don&#39;t Use stderr to Determine Process Failure Because Logs Default to stderr</title>
      <link>https://peng.fyi/post/logs-go-to-stderr-by-default/</link>
      <pubDate>Sat, 30 Nov 2024 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/logs-go-to-stderr-by-default/</guid>
      <description>
        
          
            It&#39;s a beautiful day, and it started with a simple code review:
# tools/foo/main.go
- fmt.Println(&amp;#34;found it&amp;#34;)
+ log.Println(&amp;#34;found it&amp;#34;)
The author explained the advantages of using a logging library over plain printf. The rationale was straightforward, so I approved the change without hesitation. However, two hours later, another code review came through, this time reverting the change because the load test pipeline had failed. Intriguing. Let&#39;s take a closer look.
          
          
        
      </description>
    </item>
    
    <item>
      <title>AL2023 vs. AL2: less disk space with ext4?</title>
      <link>https://peng.fyi/post/al2023-vs-al2-less-disk-space-with-ext4/</link>
      <pubDate>Sun, 17 Nov 2024 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/al2023-vs-al2-less-disk-space-with-ext4/</guid>
      <description>
        
          
            We started migrating from Amazon Linux 2 (AL2) to Amazon Linux 2023 (AL2023) a month ago. While testing workloads on AL2023 in the pre-production environment, I noticed slightly higher disk usage compared to the same workload on AL2. In this post, I&#39;ll share my investigation.
AL2023 Has Less Free Disk Space with ext4, Compared to AL2
Although disk usage metrics increased on AL2023, the &amp;quot;Used&amp;quot; space remained the same. The higher disk usage is due to a decrease in the total ext4 filesystem size.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Ways Go programs die</title>
      <link>https://peng.fyi/post/ways-go-programs-die/</link>
      <pubDate>Sun, 10 Nov 2024 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/ways-go-programs-die/</guid>
      <description>
        
          
            Our Go programs recently triggered an alarm due to excessive panics. Panic is a Go runtime mechanism that halts execution. It got me thinking about different ways a Go program can die. I don&#39;t expect many - not like A Million Ways to Die in the West. In this post, we&#39;ll go through the various ways Go programs die. These fall into two categories: voluntarily choosing to die, and involuntarily being killed by the Go runtime or the operating system.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Missing Container Disk I/O Stats with cgroup v1 on Kernel 6.1</title>
      <link>https://peng.fyi/post/container-disk-io-stats-missing-on-kernel-6.1-cgroupv1/</link>
      <pubDate>Sat, 09 Nov 2024 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/container-disk-io-stats-missing-on-kernel-6.1-cgroupv1/</guid>
      <description>
        
          
            As Amazon Linux 2 (AL2) approaches its End of Life on June 30, 2025, we have started migrating our container platform from AL2 to Bottlerocket. The migration encountered a few speed bumps. In this post, we&#39;ll examine one of them: missing container disk I/O stats.
Why are container I/O dashboards blank?
Since Bottlerocket shares the same kernel used by Amazon Linux 2023 (AL2023), I will use AL2023 for ease of demonstration.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Mind ordering cycles in systemd: how systemd breaks them can brick server startup</title>
      <link>https://peng.fyi/post/systemd-cycle-dependencies/</link>
      <pubDate>Wed, 16 Oct 2024 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/systemd-cycle-dependencies/</guid>
      <description>
        
          
            I&#39;ve been building a service for a month, and the day finally arrived when I had the artifact - an EC2 AMI. The AMI passed my &amp;quot;rigorous&amp;quot; manual tests so I launched 100 EC2 instances. Surprise! Around 28 instances failed to launch.
What&#39;s going on?
All failed instances were stuck in the &amp;quot;initializing&amp;quot; state, and the only way to connect to them was through EC2 Serial Console. There, I noticed something interesting.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Monotonicity: Find 1-3-2 Pattern</title>
      <link>https://peng.fyi/post/monotonicity-subinterval-132/</link>
      <pubDate>Mon, 14 Oct 2024 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/monotonicity-subinterval-132/</guid>
      <description>
        
          
            Given an array of numbers A, find out whether it contains a 1-3-2 pattern. A 1-3-2 pattern is a subsequence of three numbers, A[i], A[j] and A[k] such that i &amp;lt; j &amp;lt; k and A[i] &amp;lt; A[k] &amp;lt; A[j].
For clarity, let&#39;s call the 1-3-2 pattern the Bronze-Gold-Silver pattern. If A[j] is Gold, then we should consider the minimum number from A[0:j) to be Bronze, because it gives us the largest range for picking Silver.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Hoare Partition, one of the simplest and most beautiful algorithms</title>
      <link>https://peng.fyi/post/hoare-partition/</link>
      <pubDate>Mon, 17 Jun 2024 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/hoare-partition/</guid>
      <description>
        
          
            Tony Hoare invented quicksort in 1961. At the time of its publication, the best comparison-based sorting algorithm was merge sort. Merge sort divides an unordered array into two equally sized subarrays, sorts each subarray, and then merges the two subarrays to produce a sorted array. Merge sort is simple to understand. However, quicksort is just as simple as merge sort but more elegant. In quicksort, there is no requirement for the two subarrays to be of equal size, and there is no merging step.
          
          
        
      </description>
    </item>
    
    <item>
      <title>No More Confusion of Upstream and Downstream</title>
      <link>https://peng.fyi/post/upstream-and-downstream/</link>
      <pubDate>Thu, 07 Mar 2024 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/upstream-and-downstream/</guid>
      <description>
        
          
            I often find myself confused by &amp;quot;upstream&amp;quot; and &amp;quot;downstream&amp;quot; in the context of software development. They bother me so much that I avoid using them in my own writing, and I have to pause whenever I see them. In this post, I&#39;ll show a simple rule that helps you remember the difference: downstream adds value to the output of upstream.
Downstream adds value to the output of upstream.
Let&#39;s take a break from software development for a moment and look at the oil industry.
          
          
        
      </description>
    </item>
    
    <item>
      <title>False sharing: a look at cache line and write combining</title>
      <link>https://peng.fyi/post/false-sharing-cacheline-and-write-buffer/</link>
      <pubDate>Fri, 23 Feb 2024 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/false-sharing-cacheline-and-write-buffer/</guid>
      <description>
        
          
            Modern CPUs operate significantly faster than memory. A 4.5 GHz x86_64 CPU operates 30 times faster than 6000 MHz DDR5 memory with CAS Latency 36. When accounting for latencies from the bus and memory coherency protocols, memory can be 100 times slower than registers. To mitigate this speed gap, CPUs use layers of caches organized around cache lines, typically 64 bytes each. However, programming languages operate at the byte level. This mismatch between byte-level abstractions and cache line realities can create false sharing: a false sense of data independence.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Binary Search in Go standard library</title>
      <link>https://peng.fyi/post/binary-search-in-go-sdk/</link>
      <pubDate>Sun, 11 Feb 2024 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/binary-search-in-go-sdk/</guid>
      <description>
        
          
            Given a non-decreasing array and a target value, we can find the target in logarithmic time using binary search. My first programming language was C++, and the C++ Standard Template Library (STL) provides two functions for this task:
iterator lower_bound(first, last, value) returns the smallest index with a value greater than or equal to the target. Put another way, if you were to insert the target, the lower_bound is the smallest index at which to insert it.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Monotonic Stack: Steps to Make Array Non-decreasing</title>
      <link>https://peng.fyi/post/steps-to-make-array-non-decreasing/</link>
      <pubDate>Sun, 28 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/steps-to-make-array-non-decreasing/</guid>
      <description>
        
          
            Problem Given an integer array A, in one step, remove all elements A[i] where A[i-1] &amp;gt; A[i]. Return the number of steps performed until A becomes a non-decreasing array. See examples at LeetCode 2289.
Solution The naive approach executes steps one by one. Store the integers in a linked list. At each step, find all integers that are smaller than their left neighbors and remove them. However, the time complexity is O(n^2) because each step may remove just one integer.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Calculate number of nodes in a linear network using message passing</title>
      <link>https://peng.fyi/post/calculate-linear-network-size-by-message-passing/</link>
      <pubDate>Sat, 20 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/calculate-linear-network-size-by-message-passing/</guid>
      <description>
        
          
            Problem In a connected network consisting of N nodes, each node is connected to either one or two neighbors, forming a line topology. The task is to develop a program that runs on each node, calculating the total number of nodes in the network. Each node is aware of its neighboring nodes and can exchange messages with them. It is important to note that nodes do not share memory; the only means of information exchange between two neighbors is through sending and receiving messages.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Nested Map: Breakdown analysis of events and return result as nested JSON</title>
      <link>https://peng.fyi/post/breakdown-analysis-nested-json/</link>
      <pubDate>Fri, 19 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/breakdown-analysis-nested-json/</guid>
      <description>
        
          
            Problem An event consists of multiple properties, each defined as a key-value pair, where the key is a string and the value is of a primitive type such as a number or a string. Importantly, each event must include a &#39;Name&#39; property. Given a list of events, the task is to count the number of events based on specified properties. To illustrate, let&#39;s consider an example with four events and two analyses:
          
          
        
      </description>
    </item>
    
    <item>
      <title>Factorial Growth of Subqueries When Using Nested WITH Clauses in ClickHouse</title>
      <link>https://peng.fyi/post/factorial-growth-of-clickhouse-with-clause/</link>
      <pubDate>Fri, 12 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/factorial-growth-of-clickhouse-with-clause/</guid>
      <description>
        
          
            ClickHouse is a popular OLAP database. It speaks SQL and has earned a reputation for being &amp;quot;fast and resource efficient&amp;quot;. But its SQL support comes with surprises if you are not careful. In this blog, I show that a simple query of nested WITH clauses in ClickHouse generates a factorial number of subqueries. The query is short, reads nothing, processes nothing, and returns nothing. Yet it uses a lot of CPU and memory just to parse the query.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Maglev Hash: Consistent Hash with Guaranteed Even Distribution</title>
      <link>https://peng.fyi/post/maglev-hash-alternatives-for-ring-hash/</link>
      <pubDate>Sun, 07 Jan 2024 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/maglev-hash-alternatives-for-ring-hash/</guid>
      <description>
        
          
            In distributed systems, because there are too many requests to be handled by a single server reliably, requests are handled by a cluster of servers. In order to get high availability, the technique of distributing requests to servers needs to satisfy the following three requirements.
Even distribution. Each backend takes about M/N requests, where M is the number of requests and N is the number of servers. Low disruption. Adding or removing one server causes about M/N requests to be re-distributed.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Monotonicity: Sliding Window</title>
      <link>https://peng.fyi/post/monotonicity-sliding-window/</link>
      <pubDate>Mon, 27 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/monotonicity-sliding-window/</guid>
      <description>
        
          
            A function is monotonic if it preserves the order of its arguments, i.e., if $x \le y$, then $f(x) \le f(y)$. In this post, we examine a class of problems where the argument is an interval. By identifying monotonic functions, we can reduce the number of intervals to enumerate by an order of magnitude, from $O(n^2)$ to $O(n)$. This algorithm is often known as sliding window, because enumerating intervals is like sliding a window across the data.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Monotonicity: Find Largest Subarrays for Each Array Element</title>
      <link>https://peng.fyi/post/monotonicity-stack/</link>
      <pubDate>Thu, 23 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/monotonicity-stack/</guid>
      <description>
        
          
            Given an array of numbers $A$, for each number $A[i]$, find the largest subarray that contains $A[i]$ and in which $A[i]$ is the minimum. For example, for $A=[2, 0, 3, 5, 1, 1, 0, 2, 1]$, the largest subarray for $A[4]$ is $[3,5,1,1]$. Let&#39;s represent the subarray for $A[i]$ as the left boundary $l[i]$ and right boundary $r[i]$. $$l[i] = \min_{j \le i }\lbrace \forall_{j \le k \le i} A[k] \ge A[i] \rbrace $$ $$r[i] = \max_{j \ge i }\lbrace \forall_{i \le k \le j} A[k] \ge A[i] \rbrace $$
          
          
        
      </description>
    </item>
    
    <item>
      <title>Set GOMAXPROCS for Go programs in containers</title>
      <link>https://peng.fyi/post/gomaxprocs-in-container/</link>
      <pubDate>Wed, 22 Nov 2023 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/gomaxprocs-in-container/</guid>
      <description>
        
          
            Every Go program has a runtime. The runtime implements garbage collection, concurrency, stack management, and other critical features. We can configure the runtime by setting variables. In this post, we will look at GOMAXPROCS, a variable that configures concurrency. You may get a free performance boost by setting GOMAXPROCS when running Go in containers.
What is GOMAXPROCS? The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously.
          
          
        
      </description>
    </item>
    
    <item>
      <title>A simple and customizable load test tool in Go</title>
      <link>https://peng.fyi/post/simple-customizable-loadtest/</link>
      <pubDate>Sun, 15 Oct 2023 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/simple-customizable-loadtest/</guid>
      <description>
        
          
            We plan to load test our product before public Beta, with two goals in mind.
Find bottlenecks: figure out road maps for performance improvement and prepare for on-call. Understand how much workload we can support with fixed resources. This shapes the pricing strategy and determines the number of Beta partners to onboard without burning runway. Because it is complicated to generate meaningful loads to the system, we cannot use general tools like K6.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Monotonicity: Find the K-th number in two sorted arrays</title>
      <link>https://peng.fyi/post/kth-number-in-two-sorted-arrays/</link>
      <pubDate>Tue, 10 Oct 2023 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/kth-number-in-two-sorted-arrays/</guid>
      <description>
        
          
            Given two sorted arrays of integers, $A$ and $B$, find the $K$-th smallest integer from $A$ and $B$ in $O(\min(\log{N}, \log{M}))$ time, where $N$ and $M$ are the sizes of $A$ and $B$ respectively.
Without loss of generality, we can assume A and B are of the same length N. Because both A and B are sorted, we can merge A and B into a sorted array C in $O(N)$ and the answer would be C[K-1].
          
          
        
      </description>
    </item>
    
    <item>
      <title>Priority Map: A hash map with access of the minimum value</title>
      <link>https://peng.fyi/post/priority-map/</link>
      <pubDate>Mon, 09 Oct 2023 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/priority-map/</guid>
      <description>
        
          
            I implemented a job scheduler at work recently. Each job has a unique ID and an expiration time. The job ID is immutable while the expiration time may change. The job scheduler schedules the job with the earliest expiration time. Go comes with heap.Interface and map. My first implementation uses a hashmap and a slice that implements heap.Interface. The hashmap maps each Job to the index of that Job in the heap array.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Gotchas of defer in Go</title>
      <link>https://peng.fyi/post/gotchas-of-defer-in-go/</link>
      <pubDate>Sun, 10 Sep 2023 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/gotchas-of-defer-in-go/</guid>
      <description>
        
          
            A defer statement invokes a function just before the surrounding function returns. Multiple defers within a function are executed in reverse order of their calls, Last In First Out (LIFO). Defer is commonly used to ensure resource cleanup, regardless of the function&#39;s success or failure. For instance,
func (h *Handler) Handle(ctx context.Context) {
    // Use defer to catch and log panics. This prevents web server crash.
    defer func() {
        if r := recover(); r !
          
          
        
      </description>
    </item>
    
    <item>
      <title>CORS error with 504 Gateway timeout</title>
      <link>https://peng.fyi/post/cors-error-with-504/</link>
      <pubDate>Fri, 01 Sep 2023 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/cors-error-with-504/</guid>
      <description>
        
          
            Like many developers, I often leave the browser&#39;s &amp;quot;Developer Tools&amp;quot; open for websites of interest. Last week, while playing with our staging service, I saw repeated CORS errors in the Console tab and 504 Gateway Timeouts in the Network tab. It was my first time seeing these two errors together, so I decided to look into it. In this blog, I will reproduce the issue and share some good practices for handling CORS.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Deep dive in &#34;context canceled&#34; errors on Go web servers</title>
      <link>https://peng.fyi/post/context-cancel-go-web/</link>
      <pubDate>Thu, 29 Jun 2023 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/context-cancel-go-web/</guid>
      <description>
        
          
            At iheartjane, we use a Go web server to serve Ad requests. After some time in production, we noticed a lot of “context canceled” error logs. The following screenshot of a CloudWatch Logs Insights query shows the frequency of “context canceled” errors. It left us puzzled about the underlying causes of these context cancels. Should we worry about it? If yes, how should we reduce context cancels? What are “Context Canceled” in Go?
          
          
        
      </description>
    </item>
    
    <item>
      <title>Simplicity By Default</title>
      <link>https://peng.fyi/post/simplicity-by-default/</link>
      <pubDate>Tue, 23 May 2023 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/simplicity-by-default/</guid>
      <description>
        
          
            I’ve observed quite a few instances where engineers, including myself, have created unnecessarily complex solutions. To cultivate a better R&amp;amp;D organization, it is crucial to choose simplicity by default. In this blog, we’ll look at three don’ts and three do’s that help make simplicity the default.
Don’t invent requirements. Engineers solve problems, which are often made complex due to bloated or unclear requirements. Simplify by reducing requirements to a minimal set that satisfies the product.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Six Options for Generating Distributed Unique IDs</title>
      <link>https://peng.fyi/post/six-options-generating-distributed-ids/</link>
      <pubDate>Thu, 20 Apr 2023 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/six-options-generating-distributed-ids/</guid>
      <description>
        
          
            Identifying unique entities is a frequent requirement in software development. For instance, assigning a unique ID to each Ad impression enables us to link related events for billing and analysis. However, generating unique IDs becomes challenging when dealing with large distributed systems. In this survey, we explore various options and discuss their suitability. We also introduce the snowflake-id service, a distributed system for generating unique IDs based on the snowflake algorithm and utilizing etcd.
          
          
        
      </description>
    </item>
    
    <item>
      <title>Creating test doubles in Go, manual or auto-generated?</title>
      <link>https://peng.fyi/post/creating-test-doubles-in-go-manual-or-generated/</link>
      <pubDate>Thu, 30 Mar 2023 00:00:00 +0000</pubDate>
      
      <guid>https://peng.fyi/post/creating-test-doubles-in-go-manual-or-generated/</guid>
      <description>
        
          
            We set up test doubles when it is difficult or impossible to use real objects due to complexity or external dependencies. In Go, an interface is a collection of method signatures that define a set of behaviors. With interfaces, it is easy to create test doubles in Go. However, should you write test doubles on your own, use a test library that supports mocks, or use a tool that generates mocks automatically?
          
          
        
      </description>
    </item>
    
  </channel>
</rss>
