Fast Reciprocal Square Root
In some areas of computation, e.g. computer graphics or numerics, one has to evaluate the reciprocal value of a square root of a number:
Newton iteration
One possible alternative for computing the above is by Newtons method applied to
.
Note, that the last formula only involves addition and multiplication. All divisions have been eliminated. Hence this can be computed very fast. All one needs now is a good start value x0 for the iteration.
On some processors, e.g. with SSE, 3DNow or Altivec instructions, this can be achieved by using the rsqrt command, which computes an approximation of the above up to some CPU dependend precision, e.g. 2-12 with SSE. If this is all you need, you can even stop after that, or you can iterate according to the above rule. For single precision one Newton step and for double precision two steps are enough.
An alternative to the rsqrt instruction is an algorithm based on a lookup table presented in Graphic Gems V by Ken Turkowski. It uses the higher order bits of the mantissa to determine an intial value for the Newton iteration.
Benchmarks
In the following table the results from some benchmarks are shown. The benchmark itself just consists of computing several millions of square roots.
| Processor | MHz | libm single | frsqrt single | libm double | frsqrt double |
|---|---|---|---|---|---|
| UltraSparcIII | 900 | 31.46 s | 7.96 s | 14.58 s | 7.83 s |
| Athlon MP | 1666 | 6.84 s | 3.90 s | 9.57 s | 4.41 s |
| Opteron 248 | 2200 | 5.18 s | 2.94 s | 7.28 s | 3.32 s |
| Pentium2 | 400 | 79.85 s | 32.25 s | 78.65 s | 30.38 s |
| Pentium3 | 1266 | 15.63 s | 5.59 s | 25.47 s | 5.53 s |
| PentiumM | 1700 | 18.85 s | 3.77 s | ||
| Pentium4 | 3066 | 6.49 s | 2.75 s | 7.45 s | 6.35 s |
| POWER4 | 1300 | ||||
| PowerPC 7450 (G4) | 1000 | 70.90 s | 9.40 s | 68.73 s | 6.99 s |
| PowerPC 970 (G5) | 1800 | 3.94 (*) | 4.91 | 5.72 | 5.02 (*) |
| PA-RISC 8700 | 875 | 11.35 s | 7.65 s | 12.41 s | 8.06 s |
| Via C3 | 1000 | 29.69 s | 27.55 s | 52.18 s | 39.67 s |
In the following table the above results are normalised to a processor speed of 1 GHz. This allows some insights into the instructions-per-cycle characteristics of the CPUs.
| Processor | MHz | libm single | frsqrt single | libm double | frsqrt double |
|---|---|---|---|---|---|
| UltraSparcIII | 900 | 28.31 s | 7.16 s | 13.12 s | 7.05 s |
| Athlon MP | 1666 | 11.68 s | 6.50 s | 15.95 s | 7.35 s |
| Opteron 248 | 2200 | 11.40 s | 6.47 s | 16.02 s | 7.30 s |
| Pentium2 | 400 | 31.94 s | 12.90 s | 31.46 s | 12.15 s |
| Pentium3 | 1266 | 19.80 s | 7.08 s | 32.22 s | 7.01 s |
| PentiumM | 1700 | 32.05 s | 6.41 s | ||
| Pentium4 | 3066 | 19.89 s | 8.42 s | 22.86 s | 19.46 s |
| POWER4 | 1300 | ||||
| PowerPC 7450 (G4) | 1000 | 70.90 s | 9.40 s | 68.73 s | 6.99 s |
| PowerPC 970 (G5) | 1800 | 7.09 s | 8.84 s | 10.30 s | 9.04 s |
| PA-RISC 8700 | 875 | 9.93 s | 6.69 s | 10.86 s | 7.05 s |
| Via C3 | 1000 | 29.69 s | 27.55 s | 52.18 s | 39.67 s |
(*) IBM VisualAge compiler without handcoded AltiVec.
Code
The above description also found their way into some code:
Have fun with it.