UPD: This is a compiler optimization issue, obviously. While MinGW used only one div
instruction in loop body, both GCC on Linux and MSVC failed to reuse the quotient from previous iteration.
I think the best we could do is explicitly define quo
and rem
and calculate them in the same basic instruction block, to show the compiler we want both quotient and remainder.
int is_prime(uint64_t n)
{
uint64_t p = 3, quo, rem;
if (!(n & 1) || n < 2) return n == 2;
quo = n / p;
for (; p <= quo; p += 2){
quo = n / p; rem = n % p;
if (!(rem)) return 0;
}
return 1;
}
I tried your code from http://coliru.stacked-crooked.com/a/69497863a97d8953 on a MinGW-w64 compiler, case 1
is faster than case 2
.
So I guess you are compiling targeted to a 32-bit architecture and used uint64_t
type. Your assembly shows it doesn’t use any 64-bit register.
If I got it right, there is the reason.
On 32-bit architecture, 64-bit numbers is represented in two 32-bit registers, your compiler will do all concatenation works. It’s simple to do 64-bit addition, subtraction and multiplication. But modulo and division is done by a small function call which named as ___umoddi3
and ___udivdi3
in GCC, aullrem
and aulldiv
in MSVC.
So actually you need one ___umoddi3
and one ___udivdi3
for each iteration in case 1
, one ___udivdi3
and one concatenation of 64-bit multiplication in case 2
. That’s why case 1
seems twice slower than case 2
in your test.
What you really get in case 1
:
L5:
addl $2, %esi
adcl $0, %edi
movl %esi, 8(%esp)
movl %edi, 12(%esp)
movl %ebx, (%esp)
movl %ebp, 4(%esp)
call ___udivdi3 // A call for div
cmpl %edi, %edx
ja L6
jae L21
L6:
movl %esi, 8(%esp)
movl %edi, 12(%esp)
movl %ebx, (%esp)
movl %ebp, 4(%esp)
call ___umoddi3 // A call for modulo.
orl %eax, %edx
jne L5
What you really get in case 2
:
L26:
addl $2, %esi
adcl $0, %edi
movl %esi, %eax
movl %edi, %ecx
imull %esi, %ecx
mull %esi
addl %ecx, %ecx
addl %ecx, %edx
cmpl %edx, %ebx
ja L27
jae L41
L27:
movl %esi, 8(%esp)
movl %edi, 12(%esp)
movl %ebp, (%esp)
movl %ebx, 4(%esp)
call ___umoddi3 // Just one call for modulo
orl %eax, %edx
jne L26
MSVC failed to reuse the result of div
. The optimization is broken by return
.
Try these code:
__declspec(noinline) int is_prime_A(unsigned int n)
{
unsigned int p;
int ret = -1;
if (!(n & 1) || n < 2) return n == 2;
/* comparing p*p <= n can overflow */
p = 1;
do {
p += 2;
if (p >= n / p) ret = 1; /* Let's return latter outside the loop. */
if (!(n % p)) ret = 0;
} while (ret < 0);
return ret;
}
__declspec(noinline) int is_prime_B(unsigned int n)
{
unsigned int p;
if (!(n & 1) || n < 2) return n == 2;
/* comparing p*p <= n can overflow */
p = 1;
do {
p += 2;
if (p > n / p) return 1; /* The common routine. */
if (!(n % p)) return 0;
} while (1);
}
The is_prime_B
will be twice slower than is_prime_A
on MSVC / ICC for windows.