Practical Engineering
  • Introduce cached-imds-client: cache static IMDS responses to improve robustness

    Feb 15, 2026 · 1 min read · AWS Go

    I recently encountered IMDS (Instance Metadata Service) request failures with this error:

        "caller": "actor/actor.go:101",
        "error": "operation error ec2imds: GetMetadata, failed to get rate limit token, retry quota exceeded, x available, y requested

    The root cause: the aws-sdk-go-v2 IMDS …
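The idea named in the title, caching IMDS responses that never change during an instance's lifetime, can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual cached-imds-client API: `fetchFunc`, `cachedClient`, `newCachedClient`, and `demo` are hypothetical names, and the backend is a stub rather than a real IMDS call.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// fetchFunc stands in for a real IMDS request (hypothetical signature).
type fetchFunc func(ctx context.Context, path string) (string, error)

// cachedClient memoizes successful responses for static metadata paths.
type cachedClient struct {
	fetch fetchFunc
	mu    sync.Mutex
	cache map[string]string
}

func newCachedClient(fetch fetchFunc) *cachedClient {
	return &cachedClient{fetch: fetch, cache: make(map[string]string)}
}

// Get returns the cached value when present. Otherwise it calls the backend
// and stores the result only on success, so a failed request can be retried.
// The mutex is held during the fetch, which also serializes concurrent misses.
func (c *cachedClient) Get(ctx context.Context, path string) (string, error) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.cache[path]; ok {
		return v, nil
	}
	v, err := c.fetch(ctx, path)
	if err != nil {
		return "", err
	}
	c.cache[path] = v
	return v, nil
}

// demo asks for the same static path three times against a stub backend and
// reports the last value plus how many times the backend was actually hit.
func demo() (string, int) {
	calls := 0
	c := newCachedClient(func(ctx context.Context, path string) (string, error) {
		calls++
		return "i-0123456789abcdef0", nil // made-up instance ID
	})
	var id string
	for i := 0; i < 3; i++ {
		id, _ = c.Get(context.Background(), "instance-id")
	}
	return id, calls
}

func main() {
	id, calls := demo()
	fmt.Println(id, "backend calls:", calls)
}
```

Only values that are static for the instance should go through such a cache; anything that rotates, such as credentials, would need to bypass it.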


  • Using WaitGroup to Track Work Items, Not Workers: A Multi-threaded BFS Example

    Feb 15, 2026 · 5 min read · Go Concurrency

    WaitGroup and channels are two powerful primitives in Go for synchronizing goroutines. A common pattern uses a WaitGroup to wait for goroutines to complete:

        wg.Add(1)
        go func() {
            defer wg.Done()
            for {
                select {
                case <-done:
                    return
                case task := <-tasks:
                    handle(task)
                }
            }
        }()
        wg.Wait()

    In this …
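By contrast, the pattern in the title counts outstanding work items instead of workers. The sketch below is one possible shape of it for a concurrent BFS, not the post's own code: `bfsVisit`, the adjacency-map graph, and the buffer-sizing choice are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// bfsVisit visits every node reachable from start using numWorkers goroutines
// and returns how many distinct nodes were visited.
func bfsVisit(graph map[int][]int, start, numWorkers int) int {
	// Each node is enqueued at most once, and every enqueue except the first
	// follows an edge, so 1 + (number of edges) bounds the queue length.
	// With this capacity, sends from workers never block.
	capacity := 1
	for _, neighbors := range graph {
		capacity += len(neighbors)
	}
	tasks := make(chan int, capacity)

	var (
		mu      sync.Mutex
		visited = map[int]bool{start: true}
		wg      sync.WaitGroup // counts pending work items, not workers
	)

	wg.Add(1) // the start node is the first pending item
	tasks <- start

	for i := 0; i < numWorkers; i++ {
		go func() {
			for node := range tasks {
				for _, next := range graph[node] {
					mu.Lock()
					seen := visited[next]
					visited[next] = true
					mu.Unlock()
					if !seen {
						// Add happens while the current item is still
						// pending, so the counter cannot reach zero early.
						wg.Add(1)
						tasks <- next
					}
				}
				wg.Done() // this item is fully processed
			}
		}()
	}

	wg.Wait()    // no item is queued or in flight anymore
	close(tasks) // lets the workers leave their range loops
	return len(visited)
}

func main() {
	graph := map[int][]int{1: {2, 3}, 2: {4}, 3: {4}, 4: {1}, 5: {1}}
	fmt.Println(bfsVisit(graph, 1, 4)) // 4: node 5 is not reachable from 1
}
```

Because the counter tracks items, the workers need no separate done signal: the queue is closed exactly when no item is queued or being processed.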


  • Simplify device path on boot with udev

    Feb 2, 2026 · 4 min read · Linux Bottlerocket

    While prototyping Bottlerocket, I discovered it doesn't recognize additional EBS volumes specified through Block device mappings on Xen. For example, launching the same AMI on t2.medium (Xen) and t3.medium (Nitro) with "DeviceName=/dev/xvdcz": On Nitro, the device appears at /dev/nvme1n1 and …


  • Use KillMode=process with caution: restart loop could deplete resources

    Dec 12, 2025 · 4 min read · Linux systemd

    I recently debugged a resource leak where a systemd service kept restarting while leaving a process behind after each restart. The root cause isn't particularly interesting: a backward-incompatible third-party dependency upgrade. But the debugging process and lessons learned are. Thousands of zombie processes from a …


  • Spawning a New Process for Socket-Activated Daemons is Error-Prone

    Dec 10, 2025 · 4 min read · Linux systemd Container

    I recently debugged a mysterious latency issue: after migrating a systemd service from path-activation to socket-activation, there was a consistent ~1 second time-to-available latency. The culprit was a bad practice—starting the daemon program as a new process in socket-activation. Let's dive into the details. Starting …


  • Be careful making thread-aware syscalls in Go: lock the thread

    Oct 20, 2025 · 10 min read · Go Container Linux

    A bug caused around 0.5% of container workloads to fail to start for one internal customer. This post walks through the bug and its fix, an interesting mix of Linux namespaces, Go concurrency, and syscalls.

    The need to run a program in its own network namespace and mount namespace

    soci-snapshotter is an open-source …
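The technique the title refers to, pinning a goroutine to its OS thread before making thread-scoped syscalls, might look roughly like the sketch below. It is not the fix described in the post: `runOnLockedThread` is a hypothetical helper, and the callback is a placeholder for where calls such as `unshare` or `setns`, which affect only the calling thread, would go.

```go
package main

import (
	"fmt"
	"runtime"
)

// runOnLockedThread runs fn in a new goroutine that is pinned to a single OS
// thread. The goroutine exits without calling runtime.UnlockOSThread, so the
// Go runtime terminates that thread instead of reusing it for other
// goroutines, and any per-thread state changed by fn cannot leak.
func runOnLockedThread(fn func() error) error {
	errCh := make(chan error, 1)
	go func() {
		runtime.LockOSThread()
		// Thread-scoped syscalls would be made inside fn (placeholder).
		errCh <- fn()
	}()
	return <-errCh
}

func main() {
	err := runOnLockedThread(func() error {
		fmt.Println("running on a locked OS thread")
		return nil
	})
	fmt.Println("error:", err)
}
```

Without the lock, the scheduler is free to move the goroutine to another thread between two syscalls, so the second call may run on a thread that never saw the effect of the first.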


  • Speed up building Bottlerocket image in AWS CodeBuild

    Oct 20, 2025 · 4 min read · Bottlerocket Docker

    When I first moved the Bottlerocket AMI build from an EC2 host to AWS CodeBuild, the build became very slow. On EC2, I built both the x86 and Arm versions on x86 instances, and fresh builds finished in 5 minutes. On CodeBuild, despite more vCPUs and memory, the build was painfully slow. The …


  • Mysterious Image Pull Failures: "401 Unauthorized" and "Not Found" After Migrating Containerd to v2

    Oct 12, 2025 · 7 min read · container AWS

    Early this year, we migrated containerd from v1.7 to v2.0.5. However, we quickly noticed image pulls from Amazon Elastic Container Registry (ECR) began failing for both public and private ECR repositories. For example:

        # public ECR
        FATA[0031] failed to resolve reference …


  • EC2 IMDS is Unstable During Early Boot: Always Retry

    Sep 15, 2025 · 2 min read · Linux

    In "Detect and fix rare cases where the primary ENI does not serve default traffic", we used IMDS "meta-data/mac" to get the primary ENI's MAC address. However, we encountered the following errors in 0.5% of EC2 ARM instance launches:

        failed to get IMDS /mac: operation error ec2imds: GetMetadata, exceeded …
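The advice in the title, always retrying, could be implemented with a small application-level retry loop. The sketch below is a generic helper under stated assumptions, not the post's code: `retry` and `demo` are hypothetical, the delays are arbitrary, and the metadata lookup is simulated by a stub that fails twice before succeeding.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retry calls fn up to attempts times, doubling the delay after each failure.
// It stops early if ctx is cancelled while waiting.
func retry(ctx context.Context, attempts int, delay time.Duration,
	fn func() (string, error)) (string, error) {
	if attempts < 1 {
		attempts = 1
	}
	var lastErr error
	for i := 0; i < attempts; i++ {
		v, err := fn()
		if err == nil {
			return v, nil
		}
		lastErr = err
		if i == attempts-1 {
			break // no point sleeping after the final attempt
		}
		select {
		case <-ctx.Done():
			return "", ctx.Err()
		case <-time.After(delay):
		}
		delay *= 2
	}
	return "", fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

// demo simulates a lookup that fails twice during "early boot" and then
// succeeds, returning the value, the number of calls made, and any error.
func demo() (string, int, error) {
	calls := 0
	mac, err := retry(context.Background(), 5, time.Millisecond,
		func() (string, error) {
			calls++
			if calls < 3 {
				return "", errors.New("simulated early-boot failure")
			}
			return "0a:1b:2c:3d:4e:5f", nil // made-up MAC address
		})
	return mac, calls, err
}

func main() {
	mac, calls, err := demo()
	fmt.Println(mac, calls, err)
}
```

In a real boot-time service the stub would be replaced by the actual metadata request, with the attempt count and initial delay tuned to how long the instance can afford to wait.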


  • Who Modified My Program in Bottlerocket?

    Sep 11, 2025 · 2 min read · Linux Bottlerocket

    There are a few programs we install in Bottlerocket that cannot be built from source. For these programs, we download the binary from a secure repository and install it using an RPM spec like this:

        # foo.spec
        Name: %{_cross_os}foo

        Source0: foo

        %install
        install -d %{buildroot}%{_cross_sbindir}
        install -D -p -m …



Peng Zhang

Software Engineer


Copyright 2022-  PENG ZHANG. All Rights Reserved
