False Data Independence: A Look at Cache Lines and Write Combining

Modern CPUs operate significantly faster than memory. For instance, a 4.5 GHz x64 CPU operates roughly 30 times faster than 6000 MHz DDR5 memory with a CAS latency of 36. Adding the latencies incurred by the bus and by memory coherency protocols, a CPU access to memory can be 100 times slower than an access to a register. To mitigate this speed gap, CPUs incorporate layers of caches, typically with a cache line size of 64 bytes. At the programming language level, however, the unit of memory is one byte. This discrepancy in memory units, byte versus cache line, can lead programmers to a false sense of data independence. In this blog, we examine two examples that illustrate data dependencies caused by two CPU techniques: cache lines and write combining. While the examples are written in Go, the principles apply to other programming languages as well.
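As a rough sanity check on those numbers (a back-of-the-envelope sketch that ignores row activation and queuing): DDR5-6000 runs its command clock at 3000 MHz, so a CAS latency of 36 cycles amounts to 36 / 3.0 GHz = 12 ns, while a 4.5 GHz core completes a cycle in about 0.22 ns. The CAS phase alone therefore costs tens of CPU cycles, and with bus transfers and coherency traffic on top, a full trip to DRAM approaches the 100-cycle mark.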

Cache Line

A cache line is the unit of data transferred between CPU caches and memory. Modern architectures typically have three layers of caches: L1, L2, and L3. Each CPU core has its own L1 and L2 caches, while all cores share the L3 cache. Cache coherency protocols ensure that cores maintain a consistent view of memory. When two cores read and write different memory addresses on the same cache line, a situation often called false sharing, the cost of cache line contention becomes apparent, as the following experiment shows.

Experiment 1: Two goroutines each write to their own 64-bit integer, performing one billion writes apiece. In the control group, the two integers are on the same cache line. In the feature group, the two integers are on different cache lines.

// control group: cacheline/shared-write/main.go
package main

import (
  "fmt"
  "sync"
  "time"
)

type A struct {
  x int64
}

func main() {
  t := time.Now()
  // Use an array to allocate contiguous memory.
  var a [2]A
  cnt := 1_000_000_000

  var wg sync.WaitGroup
  wg.Add(2)
  for i := 0; i < 2; i++ {
    go func(p *A) {
      defer wg.Done()
      for j := 0; j < cnt; j++ {
        p.x += int64(j) // concurrent write
      }
    }(&a[i])
  }
  wg.Wait()
  fmt.Printf("takes %d ns\n", time.Since(t).Nanoseconds())
}
// feature group: cacheline/independent-write/main.go
type A struct {
  x int64
  // Pad 56 bytes so that two consecutive A values have their x fields on different cache lines.
  pad [7]int64
}

We run each program five times. Consistently, the control group is about 40% slower than the feature group. Latency is measured in microseconds.

                   avg        min        max        latency
shared-write       2_170_463  2_142_409  2_180_941  [2_175_778, 2_180_095, 2_142_409, 2_180_941, 2_173_094]
independent-write  1_550_810  1_545_963  1_557_014  [1_550_849, 1_550_175, 1_550_051, 1_557_014, 1_545_963]

The experiment is contrived, you may say. In real-world code, however, padding is used to mitigate exactly this kind of cache line contention. For example, RingBuffer.java in the LMAX Disruptor (lmax/disruptor/RingBuffer.java) pads 56 bytes:

abstract class RingBufferPad {
  // 56 bytes, the same size as the [7]int64 in the feature group.
  protected byte
    p10, p11, p12, p13, p14, p15, p16, p17,
    p20, p21, p22, p23, p24, p25, p26, p27,
    p30, p31, p32, p33, p34, p35, p36, p37,
    p40, p41, p42, p43, p44, p45, p46, p47,
    p50, p51, p52, p53, p54, p55, p56, p57,
    p60, p61, p62, p63, p64, p65, p66, p67,
    p70, p71, p72, p73, p74, p75, p76, p77;
}
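In Go, the same effect can be achieved without counting pad bytes by hand. One option (a sketch; it assumes the golang.org/x/sys/cpu dependency is available) is the CacheLinePad type, which is sized to the target architecture's cache line:

import "golang.org/x/sys/cpu"

type A struct {
  x int64
  _ cpu.CacheLinePad // one full cache line (64 bytes on amd64) of padding
}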

Write Combining

The second CPU technique is write combining (WC). WC buffers multiple writes to the same cache line and flushes them to the memory hierarchy as a single unit. According to Intel's Optimization Reference Manual:

Write combining (WC) improves performance in two ways:

  1. On a write miss to the first-level cache, it allows multiple stores to the same cache line to occur before that cache line is read for ownership (RFO) from further out in the cache/memory hierarchy. Then the rest of the line is read, and the bytes that have not been written are combined with the unmodified bytes in the returned line.
  2. Write combining allows multiple writes to be assembled and written further out in the cache hierarchy as a unit. This saves port and bus traffic. Saving traffic is particularly important for avoiding partial writes to uncached memory.

Assembly/Compiler Coding Rule 51. (H impact, L generality) If an inner loop writes to more than four arrays (four distinct cache lines), apply loop fission to break up the body of the loop so only four arrays are being written to in each iteration of each of the resulting loops.

We design the following experiment, run on an Intel CPU, to demonstrate how write combining reduces the number of cache line writes.

Experiment 2: Each of one billion loop iterations writes one element in each of eight arrays. Each array contains 1 << 20 (about one million) elements, each element one byte long. The control group writes to all eight arrays in a single loop. The feature group uses two loops, each writing to four arrays.

// control group: one-byte-one-loop/main.go
package main

import (
  "fmt"
  "time"
)

type A [1]uint8

func main() {
  arraySize := 1 << 20
  d0 := make([]A, arraySize)
  d1 := make([]A, arraySize)
  d2 := make([]A, arraySize)
  d3 := make([]A, arraySize)
  d4 := make([]A, arraySize)
  d5 := make([]A, arraySize)
  d6 := make([]A, arraySize)
  d7 := make([]A, arraySize)

  // Write a constant, 2.
  var value A = [1]uint8{2}
  loopCount := 1_000_000_000

  // arraySize is a power of two, so i&mask cycles through all indices.
  mask := arraySize - 1
  t := time.Now()
  for i := 0; i < loopCount; i++ {
    j := i & mask
    d0[j] = value
    d1[j] = value
    d2[j] = value
    d3[j] = value
    d4[j] = value
    d5[j] = value
    d6[j] = value
    d7[j] = value
  }
  fmt.Printf("takes %d ns\n", time.Since(t).Nanoseconds())
}
// feature group: one-byte-two-loops/main.go (setup is identical to the one-loop version)
  for i := 0; i < loopCount; i++ {
    j := i & mask
    d0[j] = value
    d1[j] = value
    d2[j] = value
    d3[j] = value
  }
  for i := 0; i < loopCount; i++ {
    j := i & mask
    d4[j] = value
    d5[j] = value
    d6[j] = value
    d7[j] = value
  }

We run each program five times. Consistently, the control group is about 51% slower than the feature group. Latency is measured in milliseconds.

                    avg    min    max    latency
one-byte-one-loop   3_828  3_616  3_977  [3_965, 3_616, 3_977, 3_810, 3_774]
one-byte-two-loops  2_523  2_504  2_538  [2_533, 2_527, 2_538, 2_517, 2_504]

Because there are not enough WC buffers for eight arrays, the one-loop version incurs partial writes. A partial write is a write of fewer than 64 bytes of modified data. The number of WC buffers is quite small, ranging from six to eight, and a process may not even get to use all of them. You may have noticed that the experiment precisely demonstrates "Assembly/Compiler Coding Rule 51" from the Intel manual quoted earlier: the rule suggests that compilers break up an inner loop writing to more than four arrays into loops that each write to four. Our experiment indicates that the Go compiler does not implement this rule, as the assembly below shows.

// go build -gcflags="-S" one-byte-one-loop/main.go
XORL	R13, R13
JMP	337
MOVQ	R13, R15
ANDL	$1048575, R15
MOVB	$1, (SI)(R15*1)
MOVB	$1, (R12)(R15*1)
MOVB	$1, (R11)(R15*1)
MOVB	$1, (R10)(R15*1)
MOVB	$1, (R9)(R15*1)
MOVB	$1, (R8)(R15*1)
MOVB	$1, (DI)(R15*1)
MOVB	$1, (DX)(R15*1)
INCQ	R13
CMPQ	R13, $1000000000
JLT	284

// go build -gcflags="-S" one-byte-two-loops/main.go
XORL	R9, R9
JMP	297
MOVQ	R9, R10
ANDL	$1048575, R10
MOVB	$1, (SI)(R10*1)
MOVB	$1, (R8)(R10*1)
MOVB	$1, (DI)(R10*1)
MOVB	$1, (DX)(R10*1)
INCQ	R9
CMPQ	R9, $1000000000
JLT	264
...
XORL	R9, R9
JMP	364
MOVQ	R9, R10
ANDL	$1048575, R10
MOVB	$1, (SI)(R10*1)
MOVB	$1, (R8)(R10*1)
MOVB	$1, (DI)(R10*1)
MOVB	$1, (DX)(R10*1)
INCQ	R9
CMPQ	R9, $1000000000
JLT	331

Since WC combines multiple writes to the same cache line, it provides no benefit when every write already fills a full 64-byte line. After changing type A from [1]uint8 to [64]uint8, the one-loop and two-loops versions take almost the same amount of time.
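The only change from Experiment 2 is the element type:

// 64-bytes-one-loop and 64-bytes-two-loops: each element now spans a full cache line.
type A [64]uint8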

                    avg     min     max     latency
64-bytes-one-loop   67_442  66_953  67_877  [67_071, 66_953, 67_447, 67_877, 67_863]
64-bytes-two-loops  65_477  65_293  65_820  [65_293, 65_293, 65_415, 65_820, 65_567]

Conclusion

High-level programming languages abstract away the underlying computer architecture. This blog gives two examples of how a lack of awareness of what that abstraction hides can hurt performance; it is an exercise in mechanical sympathy. In most cases, for most applications, we need not concern ourselves with the underlying hardware. For critical sections where hardware effects may dominate performance, however, benchmark first and carefully weigh the trade-off between code readability and performance.
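Benchmarking such code in Go fits naturally into the standard testing framework. Below is a minimal sketch; the file name and benchmark function are illustrative, reusing the padded type from Experiment 1:

// cacheline_bench_test.go (illustrative)
package main

import (
  "sync"
  "testing"
)

type paddedA struct {
  x   int64
  pad [7]int64 // 56 bytes of padding, as in the feature group
}

func BenchmarkIndependentWrite(b *testing.B) {
  var a [2]paddedA
  var wg sync.WaitGroup
  wg.Add(2)
  for i := 0; i < 2; i++ {
    go func(p *paddedA) {
      defer wg.Done()
      for j := 0; j < b.N; j++ {
        p.x += int64(j) // each goroutine writes its own cache line
      }
    }(&a[i])
  }
  wg.Wait()
}

Running go test -bench=. against this and an unpadded variant should reproduce the gap from Experiment 1.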

Further reading

  1. Intel® 64 and IA-32 Architectures Optimization Reference Manual, Volume 1, Section 3.6.9: Write Combining.
  2. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1, Section 12.3: Methods of Caching Available.