Arm Neon Intrinsics vs hand assembly

In my experience, the intrinsics haven’t really been worth the trouble. It’s too easy for the compiler to insert extra register store/load steps (spills and reloads) between your intrinsics. Getting it to stop doing that takes more effort than just writing the code in raw NEON assembly. I’ve seen this kind of stuff in pretty …

Methods to vectorise histogram in SIMD?

Unfortunately, histogramming is almost impossible to vectorise. You can probably optimise the scalar code somewhat, however: a common trick is to use two histograms and combine them at the end. This lets you overlap loads, increments, and stores, and thereby hide some of the serial dependencies and their associated latencies. Pseudo code: init histogram 1 to …
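
The pseudo code above is cut off, so here is a rough C sketch of the two-histogram trick as described, not the answer's original code. It assumes 8-bit input values and 256 bins; the function name `histogram_u8_dual` and the `NUM_BINS` constant are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_BINS 256  /* assumption: one bin per possible 8-bit value */

/* Hypothetical helper: builds a histogram of 8-bit values using two
   partial tables that are summed at the end. */
void histogram_u8_dual(const uint8_t *data, size_t n, uint32_t hist[NUM_BINS])
{
    uint32_t h0[NUM_BINS] = {0};
    uint32_t h1[NUM_BINS] = {0};
    size_t i = 0;

    assert(data != NULL || n == 0);

    /* Process two elements per iteration, each updating its own table.
       Back-to-back increments then never hit the same memory location,
       so the load/increment/store of one can overlap with the other. */
    for (; i + 1 < n; i += 2) {
        h0[data[i]]++;
        h1[data[i + 1]]++;
    }

    /* Handle the leftover element when n is odd. */
    if (i < n) {
        h0[data[i]]++;
    }

    /* Combine the two partial histograms into the result. */
    for (size_t b = 0; b < NUM_BINS; b++) {
        hist[b] = h0[b] + h1[b];
    }
}
```

Whether this actually beats a single-table loop depends on the CPU and on how often neighbouring input values fall into the same bin, so it is worth measuring on your own data.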