Near-constant-time rotate that does not violate the standards

I’ve linked to this answer for the full details from several other “rotate” questions, including this community-wiki question, which should be kept up to date with best practices.

I found a blog post about this issue, and it looks like it’s finally a solved problem (with new-enough compiler versions).

John Regehr at the University of Utah recommends version “c” of his attempts at making a rotate function. I replaced his assert with a bitwise AND, and found that it still compiles to a single rotate insn.

#include <stdint.h>   // uint32_t
#include <limits.h>   // CHAR_BIT
#include <assert.h>

typedef uint32_t rotwidth_t;  // parameterize for comparing compiler output with various sizes

rotwidth_t rotl (rotwidth_t x, unsigned int n)
{
  const unsigned int mask = (CHAR_BIT*sizeof(x)-1);  // e.g. 31

  assert ( (n<=mask) && "rotate by type width or more");
  n &= mask;  // avoid undefined behaviour with NDEBUG.  0 overhead for most types / compilers
  return (x<<n) | (x>>( (-n)&mask ));  // (-n)&mask maps n==0 to a shift of 0, not the full width (which would be UB)
}

rotwidth_t rot_const(rotwidth_t x)
{
  return rotl(x, 7);
}

This could be templated on x’s type, but for real use it probably makes more sense to have the width in the function name (like rotl32). Usually when you’re rotating, you know what width you want, and that matters more than the size of the variable you happen to be storing the value in.
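
For example, fixed-width wrappers might look like this (a minimal sketch using the same masking idiom; the names rotl32 / rotl64 are illustrative, not from Regehr’s code):

static inline uint32_t rotl32 (uint32_t x, unsigned int n)
{
  n &= 31;  // same masking idiom as above, specialized for a 32-bit width
  return (x<<n) | (x>>( (-n)&31 ));
}

static inline uint64_t rotl64 (uint64_t x, unsigned int n)
{
  n &= 63;
  return (x<<n) | (x>>( (-n)&63 ));
}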

Also make sure to only use this with unsigned types. Right-shift of a signed type does an arithmetic shift, shifting in copies of the sign bit. (Technically the behaviour for negative values is implementation-defined, but everything uses 2’s complement now.)
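
To illustrate (a quick sketch, not from the original answer): on a typical 2’s complement implementation, the right shift of a negative signed value smears the sign bit instead of shifting in zeros, which corrupts the rotate result:

uint32_t u = 0x80000000u;
int32_t  s = (int32_t)u;                // negative value on typical implementations
uint32_t logical = u >> 4;              // 0x08000000: zeros shifted in
uint32_t arith   = (uint32_t)(s >> 4);  // typically 0xF8000000: copies of the sign bit shifted in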

Pabigot independently came up with the same idea before I did, and posted it on github. His version has C++ static_assert checking to make it a compile-time error to use a rotate count outside the range for the type.

I tested mine with gcc.godbolt.org, with NDEBUG defined, for variable and compile-time-const rotate counts:

  • gcc: optimal code with gcc >= 4.9.0, non-branching neg+shifts+or with earlier.
    (compile-time const count: gcc 4.4.7 is fine)
  • clang: optimal code with clang >= 3.5.0, non-branching neg+shifts+or with earlier.
    (compile-time const rotate count: clang 3.0 is fine)
  • icc 13: optimal code.
    (compile-time const count with -march=native: generates slower shld $7, %edi, %edi. Fine without -march=native)

New-enough compiler versions can also handle the commonly-given code from wikipedia (included in the godbolt sample) without generating a branch or cmov. John Regehr’s version has the advantage of avoiding undefined behaviour when the rotate count is 0.

There are some caveats with 8- and 16-bit rotates (see the sketch below), but compilers seem fine with 32 or 64, when n is uint32_t. See the comments in the code on the godbolt link for some notes from my testing of various widths of uint*_t. Hopefully this idiom will be better recognized by all compilers for more combinations of type widths in the future. Sometimes gcc will uselessly emit an AND insn on the rotate count, even though the x86 ISA defines the rotate insns with that exact AND as the first step.
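
For the narrower case, a sketch of an 8-bit variant under the same masking idiom (compilers may or may not recognize it as a single rotate insn, per the caveats above):

static inline uint8_t rotl8 (uint8_t x, unsigned int n)
{
  n &= 7;
  // x promotes to int, so the shifts themselves are well-defined;
  // the final cast truncates back to 8 bits.
  return (uint8_t)( (x<<n) | (x>>( (-n)&7 )) );
}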

“optimal” means as efficient as:

# gcc 4.9.2 rotl(unsigned int, unsigned int):
    movl    %edi, %eax
    movl    %esi, %ecx
    roll    %cl, %eax
    ret
# rot_const(unsigned int):
    movl    %edi, %eax
    roll    $7, %eax
    ret

When inlined, the compiler should be able to arrange for values to be in the right registers in the first place, resulting in just a single rotate.

With older compilers, you’ll still get ideal code when the rotate count is a compile-time constant. Godbolt lets you test with ARM as a target, and it used a rotate there, too. With variable counts on older compilers, you get a bit of code bloat, but no branches or major performance problems, so this idiom should be safe to use in general.

BTW, I modified John Regehr’s original to use CHAR_BIT*sizeof(x), and gcc / clang / icc emit optimal code for uint64_t as well. However, I did notice that changing x to uint64_t while the function return type is still uint32_t makes gcc compile it to shifts/or. So if you want the low 32 bits of a 64-bit rotate, be careful to do the cast to 32 bits in a separate statement: assign the result to a 64-bit variable, then cast/return it. icc still generates a rotate insn, but gcc and clang don’t, for

// generates slow code with gcc/clang: do the cast in a separate statement instead
uint32_t r = (uint32_t)( (x<<n) | (x>>( (-n)&(CHAR_BIT*sizeof(x)-1) )) );
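
The workaround described above, as a sketch (assuming x is uint64_t and the variable names are illustrative): rotate at full width first, then truncate separately:

uint64_t rot64 = (x<<n) | (x>>( (-n)&(CHAR_BIT*sizeof(x)-1) ));
uint32_t r = (uint32_t)rot64;   // low 32 bits of the 64-bit rotate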

If anyone can test this with MSVC, it would be useful to know what happens there.
