False sharing: a look at cache line and write combining

Modern CPUs operate significantly faster than memory. A 4.5 GHz x86_64 CPU operates 30 times faster than 6000 MHz DDR5 memory with CAS Latency 36. When accounting for latencies from the bus and memory coherency protocols, memory can be 100 times slower than registers. To mitigate this speed gap, CPUs use layers of caches organized around cache lines, typically 64 bytes each. However, programming languages operate at the byte level. This mismatch between byte-level abstractions and cache line realities can create a false sharing: a false sense of data independence. In this post, we'll look at two examples false sharing caused by cache lines and write combining respectively.

Cache Line

A cache line is the unit of data transferred between CPU caches and memory. Modern architectures have three cache layers: L1, L2, and L3. Each CPU core has its own L1 and L2 caches, while all cores share the L3 cache. Cache coherency protocols ensure cores maintain a consistent view of memory. When two cores read and write different memory addresses within the same cache line, cache line contention occurs.

Experiment 1: Two goroutines write to two 64-bit integers. Each goroutine performs one billion writes. In the control group, the two integers share the same cache line. In the feature group, they reside on different cache lines.

 1// control group: cacheline/shared-write/main.go
 2func main() {
 3  t := time.Now()
 4  // Using array to allocate continuous memory.
 5  var a [2]A
 6  cnt := 1_000_000_000
 7
 8  var wg sync.WaitGroup
 9  wg.Add(2)
10  for i := 0; i < 2; i++ {
11    go func(p *A) {
12      defer wg.Done()
13      for j := 0; j < cnt; j++ {
14        p.x += int64(j) // concurrent write
15      }
16    }(&a[i])
17  }
18  wg.Wait()
19  fmt.Printf("takes %d ns\n", time.Since(t).Nanoseconds())
20}
21
22type A struct {
23  x int64
24}
1// feature group: cacheline/independent-write/main.go
2type A struct {
3  x   int64
4  // Padding 56 bytes ensures two consecutive A structs have their x fields in different cache lines.
5  padding [7]int64 
6}

Running both programs five times shows the control group is consistently 40% slower than the feature group. Latency is measured in microseconds.

avg min max latency
shared-write 2_170_463 2_142_409 2_180_941 [2_175_778, 2_180_095, 2_142_409, 2_180_941, 2_173_094]
independent-write 1_550_810 1_545_963 1_557_014 [1_550_849, 1_550_175, 1_550_051, 1_557_014, 1_545_963]

Padding mitigates cache line contention in real-world scenarios. Examples include:

  1. poolLocal in Go's sync.Pool:
1type poolLocal struct {
2    poolLocalInternal
3
4    // Prevents false sharing on widespread platforms with
5    // 128 mod (cache line size) = 0 .
6    pad [128 - unsafe.Sizeof(poolLocalInternal{})%128]byte
7}
  1. lmax/disruptor/RingBuffer.java from Disruptor, which pads 56 bytes:
 1abstract class RingBufferPad {
 2  // 56 bytes, same size as the [7]int64 in the feature group.
 3  protected byte
 4    p10, p11, p12, p13, p14, p15, p16, p17,
 5    p20, p21, p22, p23, p24, p25, p26, p27,
 6    p30, p31, p32, p33, p34, p35, p36, p37,
 7    p40, p41, p42, p43, p44, p45, p46, p47,
 8    p50, p51, p52, p53, p54, p55, p56, p57,
 9    p60, p61, p62, p63, p64, p65, p66, p67,
10    p70, p71, p72, p73, p74, p75, p76, p77;
11}

Write Combining

Write combining (WC) buffers multiple writes to the same cache line into a single write. According to Intel:

Write combining (WC) improves performance in two ways:

  1. On a write miss to the first-level cache, it allows multiple stores to the same cache line to occur before that cache line is read for ownership (RFO) from further out in the cache/memory hierarchy. The rest of the line is then read, and unwritten bytes are combined with the unmodified bytes in the returned line.
  2. Write combining allows multiple writes to be assembled and written further out in the cache hierarchy as a unit, saving port and bus traffic. This is particularly important for avoiding partial writes to uncached memory.

Assembly/Compiler Coding Rule 51. (High impact, Low generality) If an inner loop writes to more than four arrays (four distinct cache lines), apply loop fission to break up the loop body so only four arrays are written to in each iteration of the resulting loops.

Experiment 2: Write to eight arrays one billion times. Each array contains one million elements of eight bytes each. The control group uses a single loop. The feature group uses two loops, each writing to four arrays.

 1// control group: one-byte-one-loop/main.go
 2type A [1]uint8
 3
 4func main() {
 5  arraySize := 1 << 20
 6  d0 := make([]A, arraySize)
 7  d1 := make([]A, arraySize)
 8  d2 := make([]A, arraySize)
 9  d3 := make([]A, arraySize)
10  d4 := make([]A, arraySize)
11  d5 := make([]A, arraySize)
12  d6 := make([]A, arraySize)
13  d7 := make([]A, arraySize)
14
15  // Write a constant value, 2.
16  var value A = [1]uint8{2}
17  loopCount := 1_000_000_000
18
19  mask := arraySize - 1
20  t := time.Now()
21  for i := 0; i < loopCount; i++ {
22    j := i & mask
23    d0[j] = value
24    d1[j] = value
25    d2[j] = value
26    d3[j] = value
27    d4[j] = value
28    d5[j] = value
29    d6[j] = value
30    d7[j] = value
31  }
32  fmt.Printf("takes %d ns\n", time.Since(t).Nanoseconds())
33}
 1// one-byte-two-loops/main.go 
 2  for i := 0; i < loopCount; i++ {
 3    j := i & mask
 4    d0[j] = value
 5    d1[j] = value
 6    d2[j] = value
 7    d3[j] = value
 8  }
 9  for i := 0; i < loopCount; i++ {
10    j := i & mask
11    d4[j] = value
12    d5[j] = value
13    d6[j] = value
14    d7[j] = value
15  }

Running both programs five times shows the control group is consistently 51% slower than the feature group. Latency is measured in milliseconds.

avg min max latency
one-byte-one-loop 3_828 3_616 3_977 [3_965, 3_616, 3_977, 3_810, 3_774]
one-byte-two-loops 2_523 2_504 2_538 [2_533, 2_527, 2_538, 2_517, 2_504]

Without enough WC buffers for eight arrays, the one-loop version incurs partial writes—writes of less than 64 bytes of modified data. The number of WC buffers is quite small, ranging from six to eight, and one process may not use all available buffers. This experiment demonstrates Intel's "Assembly/Compiler Coding Rule 51," which suggests compilers break up loops writing to more than four arrays into multiple loops that each write to four arrays. The Go compiler doesn't implement this rule, as shown in the assembly code below.

 1// go build -gcflags="-S" one-byte-one-loop/main.go 
 2XORL    R13, R13
 3JMP    337
 4MOVQ    R13, R15
 5ANDL    $1048575, R15
 6MOVB    $1, (SI)(R15*1)
 7MOVB    $1, (R12)(R15*1)
 8MOVB    $1, (R11)(R15*1)
 9MOVB    $1, (R10)(R15*1)
10MOVB    $1, (R9)(R15*1)
11MOVB    $1, (R8)(R15*1)
12MOVB    $1, (DI)(R15*1)
13MOVB    $1, (DX)(R15*1)
14INCQ    R13
15CMPQ    R13, $1000000000
16JLT    284
17
18// go build -gcflags="-S" one-byte-two-loops/main.go 
19XORL    R9, R9
20JMP    297
21MOVQ    R9, R10
22ANDL    $1048575, R10
23MOVB    $1, (SI)(R10*1)
24MOVB    $1, (R8)(R10*1)
25MOVB    $1, (DI)(R10*1)
26MOVB    $1, (DX)(R10*1)
27INCQ    R9
28CMPQ    R9, $1000000000
29JLT    264
30...
31XORL    R9, R9
32JMP    364
33MOVQ    R9, R10
34ANDL    $1048575, R10
35MOVB    $1, (SI)(R10*1)
36MOVB    $1, (R8)(R10*1)
37MOVB    $1, (DI)(R10*1)
38MOVB    $1, (DX)(R10*1)
39INCQ    R9
40CMPQ    R9, $1000000000
41JLT    331

Since WC combines writes to a cache line, it provides no benefit when each write is already 64 bytes. After changing type A from [1]uint8 to [64]uint8, both one-loop and two-loops take nearly the same time.

avg min max latency
64-bytes-one-loop 67_442 66_953 67_877 [67_071, 66_953, 67_447, 67_877, 67_863]
64-bytes-two-loops 65_477 65_293 65_820 [65_293, 65_293, 65_415, 65_820, 65_567]

Conclusion

High-level programming languages abstract away computer architecture details. These two examples show that lack of awareness can lead to a false sense of data independence and performance degradation, especially when compilers cannot optimize automatically.

Further reading

  1. Intel® 64 and IA-32 Architectures Optimization Reference Manual: Volume 1 3.6.9 Write Combining
  2. Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3A: System Programming Guide, Part 1 12.3 Methods Of Caching Available

Thank you for reading! Your feedback is greatly appreciated, feel free to comment.