Compare 16-byte strings with SSE

Vector comparison instructions produce their result as a mask: each element of the result is all-1s (true) or all-0s (false), according to the comparison between the corresponding source elements.
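To make the mask idea concrete, here is a minimal sketch that stores the raw result of a byte compare back to memory so you can look at it. The helper name byte_eq_vector is made up for this illustration; the intrinsics are standard SSE2.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Hypothetical helper: writes 0xFF to out[i] where a[i] == b[i],
   and 0x00 where they differ. This is the raw pcmpeqb result. */
static inline void byte_eq_vector(const uint8_t *a, const uint8_t *b,
                                  uint8_t out[16])
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* unaligned 16-byte load */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi8(va, vb);               /* per-byte all-1s / all-0s */
    _mm_storeu_si128((__m128i *)out, eq);
}
```

Each output byte is either 0xFF or 0x00, never anything in between, which is what makes the result usable as a mask for AND / blend or as input to a movemask.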

See https://stackoverflow.com/tags/x86/info for some links that will tell you what those intrinsics do.

The code in the question looks like it should work.

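If you want to find out which elements were non-equal, then use the movemask version (pmovmskb or movmskps) to turn the compare result into an integer bitmask with one bit per element. You can tzcnt / bsf to bit-scan for the first match (or invert the mask first to find the first mismatch), or popcnt to count matches. With pcmpeqb + pmovmskb on 16 bytes, all-equal gives you a 0xffff mask, and none-equal gives you 0.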
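A rough sketch of the movemask approach with intrinsics follows. The function names (eq_mask16, equal16, first_mismatch16, count_equal16) are made up for this example, and __builtin_ctz / __builtin_popcount assume GCC or Clang; other compilers have their own bit-scan and popcount intrinsics.

```c
#include <emmintrin.h>  /* SSE2: pcmpeqb, pmovmskb */

/* Bit i of the result is set if byte i of a equals byte i of b. */
static inline unsigned eq_mask16(const void *a, const void *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    return (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(va, vb));
}

/* Nonzero if all 16 bytes are equal (mask == 0xffff). */
static inline int equal16(const void *a, const void *b)
{
    return eq_mask16(a, b) == 0xFFFFu;
}

/* Index of the first differing byte, or 16 if there is none. */
static inline int first_mismatch16(const void *a, const void *b)
{
    unsigned neq = ~eq_mask16(a, b) & 0xFFFFu;  /* invert: set bits = mismatches */
    return neq ? __builtin_ctz(neq) : 16;       /* ctz is undefined for 0, so guard it */
}

/* Number of byte positions that match. */
static inline int count_equal16(const void *a, const void *b)
{
    return __builtin_popcount(eq_mask16(a, b));
}
```

If you only need a yes/no answer, equal16 is all you need; the bit-scan and popcount versions are for when the position or number of differences matters.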


You might wonder if SSE4.1 ptest is useful here. It’s usable but it’s not actually faster, especially if you’re branching on the result instead of turning it into a boolean 0 / -1.

// slower alternative using SSE4.1 ptest (needs <smmintrin.h>)
__m128i avec = _mm_loadu_si128((const __m128i*)a);
__m128i bvec = _mm_loadu_si128((const __m128i*)b);

__m128i diff = _mm_xor_si128(avec, bvec);  // XOR: all zero only if the 16 bytes at a and b are equal

if (_mm_test_all_zeros(diff, diff)) {
    // equal
} else {
    // not equal
}

In asm, ptest is 2 uops, and can’t macro-fuse with a jcc conditional branch. So the total pxor + ptest + branch is 4 uops for the front-end, and still destroys one of the inputs unless you have AVX to put the xor result in a 3rd register.

pcmpeqb xmm0, xmm1 / pmovmskb eax, xmm0 / cmp/jcc is a total of 3 uops, with the cmp/jcc fusing into 1 uop on Intel and AMD CPUs.

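If you have wider elements, you can use movmskps or movmskpd on the result of pcmpeqd/q to get a 4-bit or 2-bit mask, with one bit per element. This is most useful if you want to bit-scan or popcnt and get an element index or count directly, without dividing a byte-mask result by 4 or 8 bytes per element. (With AVX2 256-bit vectors, that's an 8-bit or 4-bit mask instead of the 32-bit mask from vpmovmskb.)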
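For 32-bit elements, a sketch of the pcmpeqd + movmskps combination could look like this. The names eq_mask_epi32 and first_mismatch_epi32 are hypothetical, and __builtin_ctz again assumes GCC or Clang. The cast intrinsic is free: it only changes the type the compiler sees.

```c
#include <emmintrin.h>  /* SSE2; also provides the SSE1 _mm_movemask_ps */
#include <stdint.h>

/* Bit i (0..3) of the result is set if a[i] == b[i]. */
static inline unsigned eq_mask_epi32(const int32_t *a, const int32_t *b)
{
    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i eq = _mm_cmpeq_epi32(va, vb);                    /* pcmpeqd */
    return (unsigned)_mm_movemask_ps(_mm_castsi128_ps(eq));  /* movmskps: 4-bit mask */
}

/* Element index of the first difference, or 4 if all four match. */
static inline int first_mismatch_epi32(const int32_t *a, const int32_t *b)
{
    unsigned neq = ~eq_mask_epi32(a, b) & 0xFu;
    return neq ? __builtin_ctz(neq) : 4;  /* already an element index, no shift needed */
}
```

With pmovmskb on the same compare result you would get 4 identical bits per element and have to shift the bit-scan result right by 2 to get an element index.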

ptest is only a good idea if you don’t need any extra instruction to build an input for it: test for all-zeros or not, with or without a mask. e.g. to check some bits in every element, or in some elements.
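As one illustration of that case, here is a sketch that checks whether any of 16 bytes has its high bit set. The mask is a constant, so ptest runs directly on the loaded data with no pxor or compare in front of it. The function name all_ascii16 is made up for this example, and the target attribute assumes GCC or Clang; with other setups, compile with SSE4.1 enabled instead (e.g. -msse4.1).

```c
#include <smmintrin.h>  /* SSE4.1: _mm_testz_si128 (ptest) */

/* Hypothetical helper: nonzero if none of the 16 bytes at p has bit 7 set. */
__attribute__((target("sse4.1")))  /* assumption: GCC/Clang per-function target */
static int all_ascii16(const void *p)
{
    __m128i v = _mm_loadu_si128((const __m128i *)p);
    __m128i highbits = _mm_set1_epi8((char)0x80);  /* constant mask: bit 7 of every byte */
    /* _mm_testz_si128 returns 1 if (v & highbits) is all zeros. */
    return _mm_testz_si128(v, highbits);
}
```

Checking only some elements works the same way: zero out the mask bytes for the elements you don't care about.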
