VS: unexpected optimization behavior with _BitScanReverse64 intrinsic

AFAICT, the intrinsic leaves garbage in index when the input is zero, weaker than the behaviour of the asm instruction. This is why it has a separate boolean return value and integer output operand.

Despite the index arg being taken by reference, the compiler treats it as output-only.

unsigned char _BitScanReverse64 (unsigned __int32* index, unsigned __int64 mask)

Intel’s intrinsics guide documentation for the same intrinsic seems clearer than the Microsoft docs you linked, and sheds some light on what the MS docs are trying to say. But on careful reading, they do both seem to say the same thing, and describe a thin wrapper around the bsr instruction.

Intel documents the BSR instruction as producing an “undefined value” when the input is 0, but setting the ZF in that case. But AMD documents it as leaving the destination unchanged:

AMD’s BSF entry in AMD64 Architecture Programmer’s Manual, Volume 3: General-Purpose and System Instructions:

… If the second operand contains 0, the instruction sets ZF to 1 and does not change the contents of the destination register. …

On current Intel hardware, the actual behaviour matches AMD’s documentation: it leaves the destination register unmodified when the src operand is 0. Perhaps this is why MS describes it as only setting Index when the input is non-zero (and the intrinsic’s return value is non-zero).

On Intel (but maybe not AMD), this goes as far as not even truncating a 64-bit register to 32-bit. e.g. mov rax,-1 ; bsf eax, ecx (with zeroed ECX) leaves RAX=-1 (64-bit), not the 0x00000000ffffffff you’d get from xor eax, 0. But with non-zero ECX, bsf eax, ecx has the usual effect of zero-extending into RAX, leaving for example RAX=3.

IDK why Intel still hasn’t documented it. Perhaps a really old x86 CPU (like original 386?) implements it differently? Intel and AMD frequently go above and beyond what’s documented in the x86 manuals in order to not break existing widely-used code (e.g. Windows), which might be how this started.

At this point it seems unlikely that Intel will ever drop that output dependency and leave actual garbage or -1 or 32 for input=0, but the lack of documentation leaves that option open.

Skylake dropped the false dependency for lzcnt and tzcnt (and a later uarch dropped the false dep for popcnt) while still preserving the dependency for bsr/bsf. (See the related Q&A “Why does breaking the ‘output dependency’ of LZCNT matter?”)

Of course, since MSVC optimized away your index = 0 initialization, presumably it just uses whatever destination register it wants, not necessarily the register that held the previous value of the C variable. So even if you wanted to, I don’t think you could take advantage of the dst-unmodified behaviour even though it’s guaranteed on AMD.

So in C++ terms, the intrinsic has no input dependency on index. But in asm, the instruction does have an input dependency on the dst register, like an add dst, src instruction. This can cause unexpected performance issues if compilers aren’t careful.

Unfortunately, on Intel hardware before Skylake, the lzcnt / tzcnt asm instructions also have a false dependency on their destination, even though the result never depends on it (popcnt kept its false dependency until a later microarchitecture). Compilers work around this now that it’s known, though, so you don’t have to worry about it when using intrinsics, unless your compiler is old enough to predate the discovery.

You need to check the intrinsic’s return value to make sure index is valid, unless you know the input was non-zero. e.g.

if(_BitScanReverse64(&idx, input)) {
    // idx is valid.
    // (MS docs say "Index was set")
} else {
    // input was zero, idx holds garbage.
    // (MS docs don't say Index was even set)
    idx = -1;     // might make sense, one lower than the result for bsr(1)
}
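
As a self-contained version of the pattern above, here is a minimal sketch that wraps the check in a function. The names `scan_reverse64` and `highest_set_bit_or` are made up for illustration, and the `__builtin_clzll` branch is an assumed stand-in so the snippet also builds with GCC/clang, where the MSVC intrinsic may not be available.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>   // _BitScanReverse64 (x64 targets only)
#endif

// Hypothetical helper: returns true and stores the index of the highest set
// bit when mask != 0. When it returns false, *index must not be read.
inline bool scan_reverse64(unsigned long* index, std::uint64_t mask) {
#if defined(_MSC_VER)
    return _BitScanReverse64(index, mask) != 0;
#else
    // Assumed fallback for GCC/clang: __builtin_clzll is undefined for 0,
    // so test for zero first.
    if (mask == 0) return false;
    *index = 63ul - static_cast<unsigned long>(__builtin_clzll(mask));
    return true;
#endif
}

// Hypothetical helper: highest set bit index, or `fallback` for mask == 0.
// The fallback is assigned explicitly instead of relying on idx's old value.
inline int highest_set_bit_or(std::uint64_t mask, int fallback) {
    unsigned long idx;
    if (scan_reverse64(&idx, mask)) {
        assert(idx < 64);              // a valid bit position
        return static_cast<int>(idx);
    }
    return fallback;
}
```

Passing the fallback explicitly keeps the zero case visible at the call site, e.g. `highest_set_bit_or(input, -1)`, rather than depending on whatever the output variable held before the call.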

If you want to avoid this extra check branch, you can use the lzcnt instruction via different intrinsics if you’re targeting new enough hardware (e.g. Intel Haswell, or AMD K10 and later IIRC). It “works” even when the input is all-zero, returning the operand size (64 for the 64-bit form), and it actually counts leading zeros instead of returning the index of the highest set bit.
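
A minimal sketch of that alternative, assuming a leading-zero count that returns 64 for a zero input, as the 64-bit lzcnt does. `lzcnt64` and `highest_set_bit` are hypothetical names. The MSVC path uses `__lzcnt64` from `<intrin.h>`; the GCC/clang path is an assumed portable stand-in, which compilers can typically turn into a single lzcnt when building for a target that has it.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>   // __lzcnt64
#endif

// Hypothetical helper: leading-zero count defined for every input,
// returning 64 for x == 0 (the behaviour of the 64-bit lzcnt instruction).
inline unsigned lzcnt64(std::uint64_t x) {
#if defined(_MSC_VER)
    // Caveat: on a CPU without lzcnt this encoding executes as bsr
    // and gives different results, so check CPU support first.
    return static_cast<unsigned>(__lzcnt64(x));
#else
    // Assumed stand-in for GCC/clang; __builtin_clzll(0) is undefined.
    return x ? static_cast<unsigned>(__builtin_clzll(x)) : 64u;
#endif
}

// Hypothetical helper: index of the highest set bit, or -1 when x == 0.
// 63 - 64 == -1, so this function needs no special case for zero.
inline int highest_set_bit(std::uint64_t x) {
    const unsigned lz = lzcnt64(x);
    assert(lz <= 64);
    return 63 - static_cast<int>(lz);
}
```

The -1 result for zero lines up with the `idx = -1` choice in the bsr version: one lower than the result for an input of 1.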
