Efficient 4×4 matrix multiplication (C vs assembly)
4×4 matrix multiplication is 64 multiplications and 48 additions. Using SSE this can be reduced to 16 multiplications and 12 additions (and 16 broadcasts). The following code will do this for you. It only requires SSE (#include <xmmintrin.h>). The arrays A, B, and C need to be 16 byte aligned. Using horizontal instructions such as … Read more