Working Knowledge of Linux Memory: Concepts
I recently dealt with a server livelock issue caused by memory page thrashing. This post refreshes the Linux memory basics I found useful for debugging the issue. Much of the content is from Chapter 7 of Systems Performance: Enterprise and the Cloud.
Virtual Memory
Virtual memory is an abstraction that provides each process and the kernel with its own large, linear, and private address space. Virtual memory is also referred to as process virtual address space. The address space is split into areas called segments for storing thread stacks, process executables, libraries, and the heap.
- Executable text: Contains the executable CPU instructions for the process. This is mapped from the text segment (.text in ELF) of the binary program or libraries on the file system. It is read-only.
- Executable data: Contains initialized variables mapped from the data segment (.data in ELF) of the binary program. This has read/write permissions so variables can be modified while the program runs. It also has a private flag so modifications are not flushed to disk.
- Heap: The working memory for the program; it is anonymous memory (no file system location) that grows as needed.
- Stack: Stacks of the running threads, mapped read/write.
Different instruction sets (x86-64 and arm64) have different segment layouts.
Virtual addresses are not physical addresses. In Linux, the real memory is physical memory (RAM), also referred to as main memory.
The CPU cannot use virtual addresses directly. Virtual-to-physical address translation happens in the Memory Management Unit (MMU), and the Translation Lookaside Buffer (TLB) caches recent translations; both are typically part of the CPU. (In the book's figures, I$, D$, and E$ stand for instruction cache, data cache, and external cache, respectively.)
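To make the translation concrete, here is a toy sketch (not tied to any real MMU; the flat dict stands in for the kernel's multi-level page tables) of how a virtual address divides into a virtual page number and an offset with 4KiB pages:

```python
PAGE_SIZE = 4096   # 4KiB pages
OFFSET_BITS = 12   # log2(4096)

def translate(vaddr, page_table):
    """Split a virtual address into page number + offset, then look up the frame."""
    vpn = vaddr >> OFFSET_BITS          # virtual page number
    offset = vaddr & (PAGE_SIZE - 1)    # byte offset within the page
    pfn = page_table[vpn]               # a missing entry would be a "page fault" in this toy model
    return (pfn << OFFSET_BITS) | offset

# Toy page table: virtual page 2 is backed by physical page frame 7.
page_table = {2: 7}
paddr = translate(0x2ABC, page_table)   # vpn=2, offset=0xABC
print(hex(paddr))                       # 0x7abc
```

The offset bits pass through unchanged; only the page-number bits are remapped, which is why translation can be cached per page in the TLB.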

Demand Paging
A page is a unit of memory. It is normally 4KiB, but Linux also supports huge pages: 2MiB and 1GiB on x86_64 (4MiB exists only on 32-bit x86). Often, a "page" means a contiguous chunk of virtual memory, and a "page frame" means a contiguous chunk of physical memory. Contiguous virtual memory is not guaranteed to be contiguous in physical memory. This is why Direct Memory Access (DMA), which works with physical addresses, cannot simply be pointed at virtual memory; DMA must be supported by the device driver and programmed with driver libraries.
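The base page size is easy to query from Python's standard library; a quick check:

```python
import mmap
import os

# Both report the base page size the kernel uses for this process.
print(mmap.PAGESIZE)               # typically 4096 on x86_64
print(os.sysconf("SC_PAGE_SIZE"))  # same value via POSIX sysconf
```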
Paging is the movement of pages between main memory and storage devices. There are two types of pages:
- File-based pages are pages in memory-mapped files, including:
  - Explicit memory mappings, e.g., mmap calls without the MAP_ANONYMOUS flag
  - The page cache in the Linux Virtual File System (VFS), used by most file systems
- Anonymous pages are data private to processes: the process heap and thread stacks
Moving a page into or out of main memory is called page-in or page-out, respectively. Page-out does not necessarily write to the storage device: for clean file system pages, for example, page-out simply drops the page, since a clean copy already exists on disk.
Swapping is a special type of paging that moves anonymous pages between main memory and the swap area. Swapping allows the kernel to oversubscribe main memory. Swap is an on-disk area for paged anonymous data. It may be a dedicated partition on a storage device (swap device) or a file system file (swap file). The swap area can have multiple devices and files. In that case, Linux supports setting priority for each swap device/area, where high-priority swap is used before low-priority swap.
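For example, per-area priorities can be set with the pri mount option; a sketch of an /etc/fstab with a fast swap partition preferred over a fallback swap file (device names and paths are illustrative):

```
# /etc/fstab: higher pri is used first
/dev/nvme0n1p3  none  swap  sw,pri=10  0  0
/swapfile       none  swap  sw,pri=1   0  0
```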
Demand paging maps pages of virtual memory to physical memory on demand.

A few notes about Figure 7.2:
- `malloc` is not a syscall; it's a function in high-level language libraries like `glibc`. How malloc is implemented depends on the configured runtime memory allocator. Popular ones include `tcmalloc` and `jemalloc`.
- "2. store" can also be a `load` on a virtual memory address.
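The on-demand behavior is observable from user space: pages of a fresh anonymous mapping are populated only on first touch, and each first touch shows up as a minor page fault. A sketch for Linux:

```python
import mmap
import resource

N_PAGES = 256
size = N_PAGES * mmap.PAGESIZE

buf = mmap.mmap(-1, size)  # anonymous mapping: allocated but not yet populated

before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
for i in range(0, size, mmap.PAGESIZE):
    buf[i] = 1             # first touch of each page -> minor page fault
after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt

print(after - before)      # roughly N_PAGES on Linux
buf.close()
```

These are minor faults because the kernel only has to allocate and zero a frame; no disk I/O is involved.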
Any page of virtual memory may be in one of the following states:
- A. Unallocated
- B. Allocated, but unmapped (unpopulated and not yet faulted)
- C. Allocated, and mapped to main memory (RAM)
- D. Allocated, and mapped to the physical swap device (disk)
State (D) is reached if the page is paged out due to system memory pressure. A transition from (B) to (C) is a page fault. If it requires disk I/O, it is a major page fault; otherwise, a minor page fault. From these states, two memory usage terms can also be defined:
- Resident set size (RSS): The size of allocated main memory pages (C)
- Virtual memory size (VSZ): The size of all allocated areas (B + C + D)
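On Linux, both metrics are reported in /proc/&lt;pid&gt;/status; a minimal reader (assumes a Linux /proc filesystem):

```python
def mem_usage_kib(pid="self"):
    """Return (VSZ, RSS) in KiB from /proc/<pid>/status."""
    vsz = rss = None
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("VmSize:"):
                vsz = int(line.split()[1])   # virtual memory size (B + C + D)
            elif line.startswith("VmRSS:"):
                rss = int(line.split()[1])   # resident set size (C)
    return vsz, rss

vsz, rss = mem_usage_kib()
print(f"VSZ={vsz} KiB, RSS={rss} KiB")  # VSZ is normally much larger than RSS
```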
Memory Reclaim
Overcommit lets processes allocate more memory than the system can possibly store: more than physical memory and swap devices combined. Overcommit relies on demand paging and on the tendency of applications to use only a fraction of the memory they have allocated. Because of overcommit, memory allocations like malloc almost never fail in practice. Problems arise when allocated memory is actually demanded and the kernel cannot deliver it; this is referred to as memory saturation or memory pressure.
Memory reclaiming at different memory pressure levels is as follows:

The Linux swappiness parameter controls whether to favor freeing memory by swapping or by reclaiming it from the page cache. It is a number between 0 and 100 (default is 60), where higher values favor swapping. In practice, most systems do not use swap, so swappiness is irrelevant. The reason for disabling swap is that in distributed systems, an unavailable instance (OOM kill) is preferable to a slow instance (swap is slower than main memory). Slow instances cause unpredictable and confusing customer experiences, while unavailable instances can be taken out of service and traffic redirected to healthy instances.
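Swappiness is exposed through procfs and sysctl; typical knobs (the value 10 and the drop-in file name are illustrative):

```shell
cat /proc/sys/vm/swappiness              # read the current value (default 60)
sysctl vm.swappiness=10                  # set until next reboot
echo "vm.swappiness = 10" | sudo tee /etc/sysctl.d/99-swappiness.conf  # persist
```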
Reaping mostly involves freeing memory from the kernel slab allocator caches. The kernel slab allocator allocates and deallocates kernel objects and data structures.
The out-of-memory (OOM) killer frees memory by finding a sacrificial process with select_bad_process() and killing it with oom_kill_process(). The oom_score_adj is a number between -1000 and 1000 that can be set per process (/proc/$PID/oom_score_adj); the higher the score, the more likely the process is to be selected for termination.
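A process can inspect its own adjustment through /proc; a sketch for Linux (note that an unprivileged process may raise its own score but not lower it):

```python
def get_oom_score_adj(pid="self"):
    """Read the OOM-kill adjustment for a process from /proc."""
    with open(f"/proc/{pid}/oom_score_adj") as f:
        return int(f.read())

# 0 is the usual default; 1000 makes the process the preferred victim,
# while -1000 exempts it from OOM killing (lowering requires privileges).
print(get_oom_score_adj())
```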
Memory Allocators
Memory allocators allocate memory. The Linux kernel allocates physical memory with the "buddy system," which maintains free lists of power-of-two-sized blocks called orders, where each order is double the size of the previous one. For small, frequent allocations, Linux uses the SLUB slab allocator, which builds on top of the buddy allocator.
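A toy model of the rounding a buddy allocator performs (illustrative, not kernel code): every allocation is rounded up to 2^order pages.

```python
PAGE_SIZE = 4096

def buddy_order(nbytes):
    """Smallest order whose block (2**order pages) holds nbytes."""
    pages = (nbytes + PAGE_SIZE - 1) // PAGE_SIZE  # round up to whole pages
    order = 0
    while (1 << order) < pages:
        order += 1
    return order

# 1 byte -> 1 page  -> order 0
# 9000 B -> 3 pages -> order 2 (a 4-page block)
# 10 pages          -> order 4 (a 16-page block)
print(buddy_order(1), buddy_order(9000), buddy_order(10 * PAGE_SIZE))
```

The power-of-two rounding is what makes splitting and coalescing "buddies" cheap, at the cost of internal fragmentation, which is one reason the slab allocator exists for small objects.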
Applications interact with virtual memory through user-space memory allocators. Popular ones include glibc's malloc, jemalloc, and tcmalloc. Applications can implement their own allocators as well. For example, the Go runtime uses its own allocator designed for Go's unique concurrency model and garbage collection system. See malloc.go, mcentral.go, mcache.go, and mheap.go in the runtime package.
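On Linux, an alternative allocator can often be swapped in without recompiling by preloading it; the library path and binary name below are illustrative and distribution-dependent:

```shell
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 ./myapp
```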
NUMA
Non-Uniform Memory Access (NUMA) is the memory architecture of modern high-performance, multi-socket systems. In NUMA, the memory attached to each CPU is referred to as a memory node, or just a node; a CPU accesses its local node faster than a remote node attached to another CPU, hence "non-uniform." Linux is aware of the node topology and can optimize thread scheduling and memory placement for locality.
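The topology Linux sees is exposed under sysfs; a sketch that lists the memory nodes, falling back to a single node when the NUMA directory is absent:

```python
import os

def numa_nodes():
    """List memory node directories from sysfs (Linux)."""
    path = "/sys/devices/system/node"
    if not os.path.isdir(path):
        return ["node0"]  # treat the system as one node
    return sorted(d for d in os.listdir(path)
                  if d.startswith("node") and d[4:].isdigit())

print(numa_nodes())  # e.g., ['node0'] on a single-socket machine
```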