Find the first instance of a character using SIMD

You have the right idea with _mm256_cmpeq_epi8 -> _mm256_movemask_epi8. AFAIK, that’s the optimal way to implement this for Intel CPUs at least. PMOVMSKB r32, ymm is the same speed as the XMM 16-byte version, so it would be a huge loss to unpack the two lanes of a 256b vector and movemask them separately and then recombine the integer results. (Source: Agner Fog’s instruction tables. See other perf links in the tag wiki.)

Make the code inside the loop as efficient as possible by leaving the ffs until after you’ve identified a non-zero result from _mm256_movemask_epi8.

TEST/JCC can macro-fuse into a single uop, but BSF/JCC can’t, so it costs an extra instruction. (And you’d be hard-pressed to get a C compiler to emit BSF/JCC anyway. More likely, branching on the result of ffs would give you some kind of test for the input being non-zero, then BSF, then an add 1, then a compare-and-branch. That’s obviously horrible compared to just testing the movemask result.)

(Update: in C++20, use std::countr_zero. It can compile to a single tzcnt, instead of ffs’s off-by-one result. Since you’ve already checked for the mask being non-zero, it can hopefully still optimize down to a single (rep) bsf instruction if the compiler isn’t sure all CPUs running the code will support tzcnt; the rep prefix is ignored on older CPUs, and bsf gives the same answer as tzcnt for non-zero inputs. If you can assume BMI1 in your target CPUs, which you usually can for AVX2 code, enable it (e.g. with -mbmi or -march=x86-64-v3) so you’ll reliably get an efficient tzcnt.)
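Putting those pieces together, here’s a minimal sketch of the loop structure. The function name is mine, and it assumes 32-byte-aligned input whose length is a multiple of 32; a real version needs startup/tail handling:

    #include <immintrin.h>
    #include <bit>        // std::countr_zero (C++20)
    #include <cstddef>
    #include <cstdint>

    // Hypothetical helper: index of the first byte equal to c, or -1.
    std::ptrdiff_t find_first_byte(const uint8_t *buf, size_t len, uint8_t c)
    {
        const __m256i needle = _mm256_set1_epi8((char)c);
        for (size_t i = 0; i < len; i += 32) {
            __m256i v  = _mm256_load_si256((const __m256i *)(buf + i));
            __m256i eq = _mm256_cmpeq_epi8(v, needle);   // 0xFF where equal
            uint32_t mask = (uint32_t)_mm256_movemask_epi8(eq);
            if (mask != 0)                          // test/jcc: macro-fuses
                return i + std::countr_zero(mask);  // bit-scan only on the hit
        }
        return -1;
    }

Compile with -std=c++20 and -mavx2 (or -march=x86-64-v3, which also enables BMI1 for tzcnt).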

Also note that for similar problems, such as finding the first byte that doesn’t match, comparing the movemask against a constant (e.g. checking that it’s 0xFFFFFFFF) is just as efficient as branching on it being non-zero: cmp/jcc macro-fuses just like test/jcc.
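For example, here’s a sketch of the inverse search (first byte that does not match), under the same simplifying assumptions and includes as the sketch above:

    // Hypothetical helper: index of the first byte NOT equal to c, or -1.
    std::ptrdiff_t find_first_mismatch(const uint8_t *buf, size_t len, uint8_t c)
    {
        const __m256i needle = _mm256_set1_epi8((char)c);
        for (size_t i = 0; i < len; i += 32) {
            __m256i eq = _mm256_cmpeq_epi8(
                _mm256_load_si256((const __m256i *)(buf + i)), needle);
            uint32_t mask = (uint32_t)_mm256_movemask_epi8(eq);
            if (mask != 0xFFFFFFFFu)                 // cmp/jcc, macro-fuses too
                return i + std::countr_zero(~mask);  // first 0 bit = first mismatch
        }
        return -1;
    }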


As Paul R suggested, looking at some strlen, strchr, and memchr implementations may be informative. There are multiple hand-written asm implementations in open-source libc projects and other places (e.g. glibc, and Agner Fog’s asmlib).

Many of glibc’s versions scan up to an alignment boundary, then use an unrolled loop that reads 64B at a time (4 SSE vectors). The AVX2 versions linked below use the same strategy with 256b vectors.

To optimize for long strings, reduce the overhead of testing the compare results by ORing several compare results together and checking the combined vector. If you find a hit, go back and re-test the individual vectors to see which one had the hit.
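A sketch of that idea with AVX2 (my choice of 4 x 32B = 128B per iteration; glibc’s SSE2 loops do 4 x 16B. Same includes and alignment assumptions as earlier, now with len a multiple of 128):

    std::ptrdiff_t find_first_byte_unrolled(const uint8_t *buf, size_t len,
                                            uint8_t c)
    {
        const __m256i needle = _mm256_set1_epi8((char)c);
        for (size_t i = 0; i < len; i += 128) {
            const __m256i *p = (const __m256i *)(buf + i);
            __m256i eq0 = _mm256_cmpeq_epi8(_mm256_load_si256(p + 0), needle);
            __m256i eq1 = _mm256_cmpeq_epi8(_mm256_load_si256(p + 1), needle);
            __m256i eq2 = _mm256_cmpeq_epi8(_mm256_load_si256(p + 2), needle);
            __m256i eq3 = _mm256_cmpeq_epi8(_mm256_load_si256(p + 3), needle);
            // One OR tree + one movemask/test per 128B instead of four.
            __m256i any = _mm256_or_si256(_mm256_or_si256(eq0, eq1),
                                          _mm256_or_si256(eq2, eq3));
            if (_mm256_movemask_epi8(any) != 0) {
                // Rare path: re-test each vector to find which one hit.
                for (int v = 0; v < 4; ++v) {
                    __m256i eq = _mm256_cmpeq_epi8(_mm256_load_si256(p + v),
                                                   needle);
                    uint32_t mask = (uint32_t)_mm256_movemask_epi8(eq);
                    if (mask != 0)
                        return i + 32 * v + std::countr_zero(mask);
                }
            }
        }
        return -1;
    }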

It may be somewhat more efficient to do the ffs on one 64-bit integer that you’ve built up out of multiple movemask results (with shift and OR). I’m not sure whether doing this inside the loop, before testing for zero, is a win; I don’t remember whether one of glibc’s strlen strategies did that.
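Something like this sketch, for example (a hypothetical helper for a 64B chunk; whether it beats per-vector movemask tests is the open question above):

    // Fold two movemask results into one uint64_t so a single tzcnt
    // locates the match.  eq0 covers bytes 0..31 of the chunk, eq1 32..63.
    inline int first_match_in_64(__m256i eq0, __m256i eq1)
    {
        uint64_t m = (uint32_t)_mm256_movemask_epi8(eq0)
                   | (uint64_t)(uint32_t)_mm256_movemask_epi8(eq1) << 32;
        return m ? std::countr_zero(m) : -1;   // -1: no match in this chunk
    }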


Everything I’ve suggested here is stuff that can be seen in asm in various glibc strategies for strlen, memchr, and related functions. Here’s sysdeps/x86_64/strlen.S, but there may be another source file somewhere using more than baseline SSE2. (Or not; I might be thinking of a different function. Maybe there’s nothing to be gained beyond SSE2 until AVX for 3-operand insns, and AVX2 for 256b integer vectors.)

See also:

  • glibc’s strchr-avx2.S (Woboq.org has a nice source browser with a useful search for filenames / symbols).
  • glibc’s memchr-avx2.S

glibc’s memchr uses PMAXUB instead of POR. I’m not sure if that’s useful for some arcane microarchitectural reason, but it runs on fewer ports on most CPUs. Perhaps that’s desired, to avoid resource conflicts with something else? IDK, it seems weird, since it competes with PCMPEQB for the same ports.
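The substitution itself is a drop-in swap: pcmpeqb produces only 0x00 or 0xFF bytes, so an unsigned byte-max of the compare results is non-zero iff any byte matched, exactly like OR. In the unrolled sketch above, that would be:

    // Drop-in replacement for the OR tree in the unrolled sketch:
    __m256i any = _mm256_max_epu8(_mm256_max_epu8(eq0, eq1),
                                  _mm256_max_epu8(eq2, eq3));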
