Arm Neon Intrinsics vs hand assembly

My experience is that the intrinsics haven’t been worth the trouble. It’s too easy for the compiler to insert extra register spills (a store followed by a reload) between your intrinsics, and getting it to stop is more work than just writing the routine in raw NEON assembly. I’ve seen this even in fairly recent compilers (including clang 3.1).
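To make the complaint concrete, here is a minimal sketch of the kind of intrinsics code in question. The kernel, its name `scale_add`, and its parameters are hypothetical and chosen only for illustration. The NEON path is guarded so the file still builds with a scalar fallback on non-Arm targets.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#else
#define HAVE_NEON 0
#endif

/* Hypothetical kernel: dst[i] = a[i] * scale + b[i] for i in [0, n). */
void scale_add(float *dst, const float *a, const float *b,
               float scale, size_t n)
{
    size_t i = 0;

    assert(dst != NULL && a != NULL && b != NULL);

#if HAVE_NEON
    /* Broadcast the scalar into all four lanes once, outside the loop. */
    float32x4_t vscale = vdupq_n_f32(scale);

    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);          /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);          /* load 4 floats from b */
        float32x4_t vr = vmlaq_f32(vb, va, vscale); /* vb + va * vscale */
        vst1q_f32(dst + i, vr);                     /* store 4 results */
    }
#endif

    /* Scalar tail; also the whole loop when NEON is unavailable. */
    for (; i < n; ++i)
        dst[i] = a[i] * scale + b[i];
}
```

Each `float32x4_t` variable is only a request for a register. The compiler decides the actual allocation and instruction order, which is where the unwanted stores and reloads can creep in. One way to check is to compile with optimization enabled and `-S`, then read the generated assembly for the loop body.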

At this level, I find you really need to control exactly what’s happening. You can get all kinds of pipeline stalls if you issue instructions in just barely the wrong order. Doing it in intrinsics feels like surgery with welder’s gloves on. If the code is so performance critical that I need intrinsics at all, then intrinsics aren’t good enough. Maybe others have different experiences here.
