Bit popcount for large buffer, with Core 2 CPU (SSSE3)

See a 32 bit version in the AMD Software Optimization guide, page 195 for one implementation.
This gives you assembly code for an x86 directly.

See a variant at Stanford bit-twiddling hacks
The Stanford version looks like the best one to me.
It looks very easy to code as x86 asm.

Neither of these use branch instructions.

These can be generalized to 64 bit versions.

With the 32 or 64 bit versions, you might consider doing a SIMD version.
SSE2 will do 4 double-words or two quadwords (either way 128 bits)
at once. What you want to do is implement the popcount for 32
or 64 bits in each of the 2 or 4 registers available.
You’ll end up with 2 or 4 sets of popcounts in the XMM registers
when you are done; final step is to store and add those
popcounts together to get the final answer. Guessing,
I’d expect you do so slightly better doing 4 parallel 32
bit popcounts rather than 2 parallel 64 bit popcounts,
as the latter is likely to take 1 or 2 additional instructions
in each iteration, and its easy to add 4, 32 bit values together
the end.

Leave a Comment