How slow (how many cycles) is calculating a square root?

Based on Agner Fog’s instruction tables for the 65nm Core 2, SSE performance is about equal to x87 when you compare square root, division, multiplication and addition (FSQRT, FDIV, FMUL and FADD). SSE looks faster only because it can’t do 80-bit math. SSE does, however, have very fast approximate reciprocal and approximate reciprocal square root instructions.
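As a rough illustration of that approximate reciprocal square root, here is a minimal sketch in C using the SSE intrinsics from xmmintrin.h. The function names (sqrt_exact, rsqrt_approx, rsqrt_refined, sqrt_fast) are made up for this example, and the single Newton-Raphson step is a common refinement trick, not something taken from the tables. It assumes an x86 compiler with SSE enabled.

```c
#include <assert.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_sqrt_ss, _mm_rsqrt_ss */

/* Full-precision single-precision square root (SQRTSS). */
static float sqrt_exact(float x)
{
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
}

/* Approximate 1/sqrt(x) (RSQRTSS), good to roughly 12 bits. */
static float rsqrt_approx(float x)
{
    assert(x > 0.0f);  /* this sketch only handles positive inputs */
    return _mm_cvtss_f32(_mm_rsqrt_ss(_mm_set_ss(x)));
}

/* One Newton-Raphson step on the estimate: y' = y * (1.5 - 0.5 * x * y * y). */
static float rsqrt_refined(float x)
{
    float y = rsqrt_approx(x);
    return y * (1.5f - 0.5f * x * y * y);
}

/* sqrt(x) = x * (1 / sqrt(x)), built from the refined estimate. */
static float sqrt_fast(float x)
{
    return x * rsqrt_refined(x);
}
```

Whether sqrt_fast actually beats sqrt_exact depends on the CPU and on how much precision you need: the refinement adds several multiplies, so it is worth measuring before relying on it.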

On the 45nm Core 2, FSQRT and FDIV got faster, while FADD and FMUL haven’t changed. Once again, SSE performance is about the same as x87.

Intel Core 2 (Merom, 65nm)

Instruction   Operands    Latency      Reciprocal throughput
FSQRT                     6 – 69
FADD(P)       r           3            1
FMUL(P)       r           5            2
FDIV(R)(P)    r           6 – 38 d     5 – 37 d
ADDSS/D       xmm, xmm    3            1
ADDPS/D       xmm, xmm    3            1
MULSS         xmm, xmm    4            1
MULSD         xmm, xmm    5            1
MULPS         xmm, xmm    4            1
MULPD         xmm, xmm    5            1
DIVSS         xmm, xmm    6 – 18 d     5 – 17 d
DIVSD         xmm, xmm    6 – 32 d     5 – 31 d
DIVPS         xmm, xmm    6 – 18 d     5 – 17 d
DIVPD         xmm, xmm    6 – 32 d     5 – 31 d
SQRTSS/PS     xmm, xmm    6 – 29       6 – 29
SQRTSD/PD     xmm, xmm    6 – 58       6 – 58
RSQRTSS/PS    xmm, xmm    3            2

Intel Core 2 (Wolfdale, 45nm)

Instruction   Operands    Latency      Reciprocal throughput
FSQRT                     6 – 20
FADD(P)       r           3            1
FMUL(P)       r           5            2
FDIV(R)(P)    r           6 – 21 d     5 – 20 d
ADDSS/D       xmm, xmm    3            1
ADDPS/D       xmm, xmm    3            1
MULSS         xmm, xmm    4            1
MULSD         xmm, xmm    5            1
MULPS         xmm, xmm    4            1
MULPD         xmm, xmm    5            1
DIVSS         xmm, xmm    6 – 13 d     5 – 12 d
DIVSD         xmm, xmm    6 – 21 d     5 – 20 d
DIVPS         xmm, xmm    6 – 13 d     5 – 12 d
DIVPD         xmm, xmm    6 – 21 d     5 – 20 d
SQRTSS/PS     xmm, xmm    6 – 13       5 – 12
SQRTSD/PD     xmm, xmm    6 – 20       5 – 19
RSQRTSS/PS    xmm, xmm    3            2
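The tables list the scalar and packed forms (SQRTSS/PS) on one row with the same figures, which suggests the packed instruction produces four single-precision roots for roughly the cost of one. A minimal sketch of using it from C, assuming an x86 compiler with SSE enabled; sqrt_packed is a hypothetical helper name, and the sketch requires the element count to be a multiple of 4:

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_sqrt_ps */

/* Square roots of n floats, four per SQRTPS instruction. */
static void sqrt_packed(const float *in, float *out, size_t n)
{
    assert(n % 4 == 0);  /* no scalar tail handling in this sketch */
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(in + i);          /* unaligned load of 4 floats */
        _mm_storeu_ps(out + i, _mm_sqrt_ps(v));   /* 4 square roots at once */
    }
}
```

The unaligned load and store are used so the sketch works on any float array; aligned variants exist but need 16-byte aligned pointers.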

Notes from Agner Fog’s instruction tables:

The figures in the instruction tables represent the results of my measurements rather than the official values published by microprocessor vendors. Some values in my tables are higher or lower than the values published elsewhere.

Latency: This is the delay that the instruction generates in a dependency chain. The numbers are minimum values. Cache misses, misalignment, and exceptions may increase the clock counts considerably. Floating point operands are presumed to be normal numbers. Denormal numbers, NAN’s and infinity increase the delays very much, except in XMM move, shuffle and Boolean instructions. Floating point overflow, underflow, denormal or NAN results give a similar delay. The time unit used is core clock cycles, not the reference clock cycles given by the time stamp counter.

Reciprocal throughput: The average number of core clock cycles per instruction for a series of independent instructions of the same kind in the same thread.
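To make the difference between the two columns concrete, here is a minimal sketch of the two loop shapes involved. The names sqrt_ss, sqrt_chain and sqrt_independent are invented for this example. Timing the first loop and dividing by n would approximate latency; doing the same for the second would approximate reciprocal throughput. The timing itself is left out because it depends on the platform's cycle counter.

```c
#include <assert.h>
#include <stddef.h>
#include <xmmintrin.h>  /* SSE intrinsics: _mm_sqrt_ss */

/* One scalar single-precision square root (SQRTSS). */
static float sqrt_ss(float x)
{
    return _mm_cvtss_f32(_mm_sqrt_ss(_mm_set_ss(x)));
}

/* Dependency chain: every square root waits for the previous result,
 * so the loop is bound by latency. Repeated roots drift toward 1.0,
 * which may count as a "round" operand and run faster than typical data. */
static float sqrt_chain(float x, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        x = sqrt_ss(x);
    return x;
}

/* Independent operations: no iteration needs another's result,
 * so the CPU can overlap them and the loop is bound by throughput. */
static void sqrt_independent(const float *in, float *out, size_t n)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; ++i)
        out[i] = sqrt_ss(in[i]);
}
```

On the Wolfdale figures above the two numbers for SQRTSS are nearly identical, so the two loops should take about the same time there; the gap is much wider for instructions like MULSS (latency 4, reciprocal throughput 1).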

d: Round divisors or low precision give low values.
