Fast Reciprocal Square Root

In some areas of computation, e.g. computer graphics or numerics, one has to evaluate the reciprocal value of a square root of a number:

x = 1 / sqrt(y)
This involves computing the square root and a division. Considering the number of cycles a modern processor needs for these simple tasks, e.g. 38 for each operation on a Pentium4, the wish might arise to speed things up. This even more applies to processors which do not compute the square root in hardware but by microcode, e.g. PowerPC CPUs.

Newton iteration

One possible alternative for computing the above is by Newtons method applied to

f(x) = y - 1/x^2
The first derivative is then
f'(x) = 2/x^3
which, together with the Newton step
x_i+1 = x_i - f(x_i) / f'(x_i)
yields the iteration
newton iteration.

Note, that the last formula only involves addition and multiplication. All divisions have been eliminated. Hence this can be computed very fast. All one needs now is a good start value x0 for the iteration.

On some processors, e.g. with SSE, 3DNow or Altivec instructions, this can be achieved by using the rsqrt command, which computes an approximation of the above up to some CPU dependend precision, e.g. 2-12 with SSE. If this is all you need, you can even stop after that, or you can iterate according to the above rule. For single precision one Newton step and for double precision two steps are enough.

An alternative to the rsqrt instruction is an algorithm based on a lookup table presented in Graphic Gems V by Ken Turkowski. It uses the higher order bits of the mantissa to determine an intial value for the Newton iteration.

Benchmarks

In the following table the results from some benchmarks are shown. The benchmark itself just consists of computing several millions of square roots.

Processor   MHz   libm single frsqrt single libm double frsqrt double
UltraSparcIII 900 31.46 s 7.96 s 14.58 s 7.83 s
Athlon MP 1666 6.84 s 3.90 s 9.57 s 4.41 s
Opteron 248 2200 5.18 s 2.94 s 7.28 s 3.32 s
Pentium2 400 79.85 s 32.25 s 78.65 s 30.38 s
Pentium3 1266 15.63 s 5.59 s 25.47 s 5.53 s
PentiumM 1700 18.85 s 3.77 s
Pentium4 3066 6.49 s 2.75 s 7.45 s 6.35 s
POWER4 1300
PowerPC 7450 (G4) 1000 70.90 s 9.40 s 68.73 s 6.99 s
PowerPC 970 (G5) 1800 3.94 (*) 4.91 5.72 5.02 (*)
PA-RISC 8700 875 11.35 s 7.65 s 12.41 s 8.06 s
Via C3 1000 29.69 s 27.55 s 52.18 s 39.67 s
Time to evaluate 3*108 square roots.

In the following table the above results are normalised to a processor speed of 1 GHz. This allows some insights into the instructions-per-cycle characteristics of the CPUs.

Processor   MHz   libm single frsqrt single libm double frsqrt double
UltraSparcIII 900 28.31 s 7.16 s 13.12 s 7.05 s
Athlon MP 1666 11.68 s 6.50 s 15.95 s 7.35 s
Opteron 248 2200 11.40 s 6.47 s 16.02 s 7.30 s
Pentium2 400 31.94 s 12.90 s 31.46 s 12.15 s
Pentium3 1266 19.80 s 7.08 s 32.22 s 7.01 s
PentiumM 1700 32.05 s 6.41 s
Pentium4 3066 19.89 s 8.42 s 22.86 s 19.46 s
POWER4 1300
PowerPC 7450 (G4) 1000 70.90 s 9.40 s 68.73 s 6.99 s
PowerPC 970 (G5) 1800 7.09 s 8.84 s 10.30 s 9.04 s
PA-RISC 8700 875 9.93 s 6.69 s 10.86 s 7.05 s
Via C3 1000 29.69 s 27.55 s 52.18 s 39.67 s
Time to evaluate 3*108 square roots (normalised).

(*) IBM VisualAge compiler without handcoded AltiVec.

Code

The above description also found their way into some code:

Have fun with it.