Why is transposing a matrix of 512×512 much slower than transposing a matrix of 513×513?

The explanation comes from Agner Fog in Optimizing software in C++ and it reduces to how data is accessed and stored in the cache.

For terms and detailed info, see the wiki entry on caching, I’m gonna narrow it down here.

A cache is organized in sets and lines. At a time, only one set is used, out of which any of the lines it contains can be used. The memory a line can mirror times the number of lines gives us the cache size.

For a particular memory address, we can calculate which set should mirror it with the formula:

set = ( address / lineSize ) % numberOfsets

This sort of formula ideally gives a uniform distribution across the sets, because each memory address is as likely to be read (I said ideally).

It’s clear that overlaps can occur. In case of a cache miss, the memory is read in the cache and the old value is replaced. Remember each set has a number of lines, out of which the least recently used one is overwritten with the newly read memory.

I’ll try to somewhat follow the example from Agner:

Assume each set has 4 lines, each holding 64 bytes. We first attempt to read the address 0x2710, which goes in set 28. And then we also attempt to read addresses 0x2F00, 0x3700, 0x3F00 and 0x4700. All of these belong to the same set. Before reading 0x4700, all lines in the set would have been occupied. Reading that memory evicts an existing line in the set, the line that initially was holding 0x2710. The problem lies in the fact that we read addresses that are (for this example) 0x800 apart. This is the critical stride (again, for this example).

The critical stride can also be calculated:

criticalStride = numberOfSets * lineSize

Variables spaced criticalStride or a multiple apart contend for the same cache lines.

This is the theory part. Next, the explanation (also Agner, I’m following it closely to avoid making mistakes):

Assume a matrix of 64×64 (remember, the effects vary according to the cache) with an 8kb cache, 4 lines per set * line size of 64 bytes. Each line can hold 8 of the elements in the matrix (64-bit int).

The critical stride would be 2048 bytes, which correspond to 4 rows of the matrix (which is continuous in memory).

Assume we’re processing row 28. We’re attempting to take the elements of this row and swap them with the elements from column 28. The first 8 elements of the row make up a cache line, but they’ll go into 8 different cache lines in column 28. Remember, critical stride is 4 rows apart (4 consecutive elements in a column).

When element 16 is reached in the column (4 cache lines per set & 4 rows apart = trouble) the ex-0 element will be evicted from the cache. When we reach the end of the column, all previous cache lines would have been lost and needed reloading on access to the next element (the whole line is overwritten).

Having a size that is not a multiple of the critical stride messes up this perfect scenario for disaster, as we’re no longer dealing with elements that are critical stride apart on the vertical, so the number of cache reloads is severely reduced.

Another disclaimer – I just got my head around the explanation and hope I nailed it, but I might be mistaken. Anyway, I’m waiting for a response (or confirmation) from Mysticial. 🙂

Leave a Comment