Matrix multiplication: Small difference in matrix size, large difference in timings

Here’s my wild guess: cache

It could be that you can fit 2 rows of 2000 doubles into the cache. Which is slighly less than the 32kb L1 cache. (while leaving room other necessary things)

But when you bump it up to 2048, it uses the entire cache (and you spill some because you need room for other things)

Assuming the cache policy is LRU, spilling the cache just a tiny bit will cause the entire row to be repeatedly flushed and reloaded into the L1 cache.

The other possibility is cache associativity due to the power-of-two. Though I think that processor is 2-way L1 associative so I don’t think it matters in this case. (but I’ll throw the idea out there anyway)

Possible Explanation 2: Conflict cache misses due to super-alignment on the L2 cache.

Your B array is being iterated on the column. So the access is strided. Your total data size is 2k x 2k which is about 32 MB per matrix. That’s much larger than your L2 cache.

When the data is not aligned perfectly, you will have decent spatial locality on B. Although you are hopping rows and only using one element per cacheline, the cacheline stays in the L2 cache to be reused by the next iteration of the middle loop.

However, when the data is aligned perfectly (2048), these hops will all land on the same “cache way” and will far exceed your L2 cache associativity. Therefore, the accessed cache lines of B will not stay in cache for the next iteration. Instead, they will need to be pulled in all the way from ram.

Leave a Comment