The fastest way on recent x86 processors is probably to use the PMOVMSKB family of instructions, which extract the MSB of each byte in a SIMD register and pack those bits into a normal integer register.
I fear SIMD intrinsics are not really my thing, but something along these lines ought to work if you’ve got an AVX2-equipped processor:
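#include <immintrin.h>
#include <stdbool.h>
#include <stdint.h>

uint32_t bitpack(const bool array[32]) {
    __m256i tmp = _mm256_loadu_si256((const __m256i *) array);
    /* Turn every nonzero byte into 0xFF so that its MSB is set. */
    tmp = _mm256_cmpgt_epi8(tmp, _mm256_setzero_si256());
    return (uint32_t) _mm256_movemask_epi8(tmp);
}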
This assumes sizeof(bool) == 1. For older SSE2 systems you will have to string together a pair of 128-bit operations instead. Aligning the array on a 32-byte boundary should save another cycle or so.
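For the SSE2 case, a minimal sketch of the two-halves approach might look like the following. The function name bitpack_sse2 is just an illustrative choice, and the code makes the same assumptions as the AVX2 version: sizeof(bool) is 1 and each true element is stored as the byte value 1. I have not benchmarked it.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: packs 32 bools into one word, element i -> bit i.
   Assumes sizeof(bool) == 1 and that true is stored as the byte 1. */
static uint32_t bitpack_sse2(const bool array[32]) {
    const __m128i zero = _mm_setzero_si128();

    /* Unaligned loads of the lower and upper 16 bytes. */
    __m128i lo = _mm_loadu_si128((const __m128i *) array);
    __m128i hi = _mm_loadu_si128((const __m128i *) (array + 16));

    /* Turn every nonzero byte into 0xFF so that its MSB is set. */
    lo = _mm_cmpgt_epi8(lo, zero);
    hi = _mm_cmpgt_epi8(hi, zero);

    /* Each movemask yields 16 bits; combine them into 32. */
    uint32_t lo_bits = (uint32_t) _mm_movemask_epi8(lo);
    uint32_t hi_bits = (uint32_t) _mm_movemask_epi8(hi);
    return lo_bits | (hi_bits << 16);
}
```

If the array is known to be 16-byte aligned, _mm_load_si128 could be swapped in for the unaligned loads.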