By the "integer square root"
function, we mean the function To extend its range of application and to avoid deciding what to do
with a negative argument, we assume x is
unsigned. Thus, 0
x
232 - 1.
For floating-point numbers, the square root is almost universally computed by Newton's method. This method begins by somehow obtaining a starting estimate g₀ of √a. Then, a series of more accurate estimates is obtained from

   gₙ₊₁ = (gₙ + a/gₙ)/2.

The iteration converges quadratically; that is, if at some point gₙ is accurate to n bits, then gₙ₊₁ is accurate to 2n bits. The program must have some means of knowing when it has iterated enough, so it can terminate.
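To make the process concrete, here is a minimal sketch of such an iteration for a double. This is our illustration, not from the text: the crude starting estimate g0 = a and the fixed iteration cap are assumptions, and real library routines use far better starting estimates.

int isqrt(unsigned x);           // Defined later in this section.

double newton_sqrt(double a) {   // Assumes a >= 0.
   if (a == 0.0) return 0.0;
   double g = a;                 // Crude starting estimate g0 (assumed).
   for (int i = 0; i < 64; i++) {   // Cap guards against 1-ulp oscillation.
      double next = 0.5*(g + a/g);  // g(n+1) = (g(n) + a/g(n))/2.
      if (next == g) break;         // Iterated enough; terminate.
      g = next;
   }
   return g;
}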
It is a pleasant surprise that Newton's method works fine in the domain of integers. To see this, we need the following theorem:

THEOREM. Let gₙ₊₁ = ⌊(gₙ + ⌊a/gₙ⌋)/2⌋, where a ≥ 0 and gₙ ≥ 1 are integers. Then
(a) if gₙ > ⌊√a⌋, then ⌊√a⌋ ≤ gₙ₊₁ < gₙ, and
(b) if gₙ = ⌊√a⌋, then gₙ₊₁ is either ⌊√a⌋ or ⌊√a⌋ + 1.
That is, if we have an integral guess gₙ to √a that is too high, then the next guess gₙ₊₁ will be strictly less than the preceding one, but not less than ⌊√a⌋. Therefore, if we start with a guess that's too high, the sequence converges monotonically. If the guess gₙ = ⌊√a⌋, then the next guess is either equal to gₙ or is 1 larger. This provides an easy way to determine when the sequence has converged: If we start with g₀ ≥ ⌊√a⌋, convergence has occurred when gₙ₊₁ ≥ gₙ, and then the result is precisely gₙ.
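For example (our arithmetic, applying the formula above), take a = 100 and g₀ = 16: g₁ = ⌊(16 + ⌊100/16⌋)/2⌋ = 11, g₂ = ⌊(11 + ⌊100/11⌋)/2⌋ = 10, and g₃ = ⌊(10 + ⌊100/10⌋)/2⌋ = 10. Because g₃ ≥ g₂, convergence has occurred, and the result is g₂ = 10 = ⌊√100⌋.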
The case a = 0, however, must be treated specially, because this procedure would lead to dividing 0 by 0.
Proof. (a) Because gₙ is an integer,

   gₙ₊₁ = ⌊(gₙ + ⌊a/gₙ⌋)/2⌋ = ⌊(gₙ + a/gₙ)/2⌋.

Because gₙ > ⌊√a⌋ and gₙ is an integer, gₙ ≥ ⌊√a⌋ + 1, so that gₙ > √a. Define ε by gₙ = (1 + ε)√a. Then ε > 0 and

   gₙ₊₁ = ⌊((1 + ε)√a + √a/(1 + ε))/2⌋ = ⌊(1 + ε²/(2(1 + ε)))√a⌋ ≥ ⌊√a⌋.

Furthermore, because gₙ > √a, we have a/gₙ < gₙ, so that gₙ₊₁ ≤ (gₙ + a/gₙ)/2 < gₙ.

(b) Because gₙ = ⌊√a⌋, we have gₙ² ≤ a < (gₙ + 1)², so that gₙ ≤ a/gₙ < gₙ + 2 + 1/gₙ. Hence we have gₙ ≤ ⌊a/gₙ⌋ ≤ gₙ + 2, and therefore gₙ₊₁ = ⌊(gₙ + ⌊a/gₙ⌋)/2⌋ is either gₙ or gₙ + 1.
The difficult part of using Newton's method to calculate ⌊√x⌋ is getting the first guess. The procedure of Figure 11-1 sets the first guess g₀ equal to the least power of 2 that is greater than or equal to √x. For example, for x = 4, g₀ = 2, and for x = 5, g₀ = 4.
int isqrt(unsigned x) {
   unsigned x1;
   int s, g0, g1;

   if (x <= 1) return x;
   s = 1;
   x1 = x - 1;
   if (x1 > 65535) {s = s + 8; x1 = x1 >> 16;}
   if (x1 > 255)   {s = s + 4; x1 = x1 >> 8;}
   if (x1 > 15)    {s = s + 2; x1 = x1 >> 4;}
   if (x1 > 3)     {s = s + 1;}

   g0 = 1 << s;                // g0 = 2**s.
   g1 = (g0 + (x >> s)) >> 1;  // g1 = (g0 + x/g0)/2.

   while (g1 < g0) {           // Do while approximations
      g0 = g1;                 // strictly decrease.
      g1 = (g0 + (x/g0)) >> 1;
   }
   return g0;
}
Because the first guess g0 is a power of 2, it is not necessary to do a real division to get g1; instead, a shift right suffices.
Because the first guess is accurate to about 1 bit, and Newton's method converges quadratically (the number of bits of accuracy doubles with each iteration), one would expect the procedure to converge within about five iterations (on a 32-bit machine), which requires four divisions (because the first iteration substitutes a shift right). An exhaustive experiment reveals that the maximum number of divisions is five, or four for arguments up to 16,785,407.
If number of leading zeros is available, then getting the first guess is very simple: Replace the first seven executable lines in the procedure above with
if (x <= 1) return x;
s = 16 - nlz(x - 1)/2;
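Here nlz denotes the number of leading zeros function. As a hedged sketch, on toolchains that provide the builtin (an assumption about the environment, not part of the original text), it can be realized as:

// nlz via a GCC/Clang builtin (toolchain assumption). __builtin_clz is
// undefined for a zero argument, but the x <= 1 test above guarantees
// that x - 1 >= 1 here.
static inline int nlz(unsigned x) {
   return __builtin_clz(x);
}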
Another alternative, if number of leading zeros is not available, is to compute s by means of a binary search tree. This method permits getting a slightly better value of g₀: the least power of 2 that is greater than or equal to ⌊√x⌋. For some values of x, this gives a smaller value of g₀ than does the method of Figure 11-1, but one still large enough that the convergence criterion of the theorem holds. The difference in these schemes is illustrated below.
Range of x for Figure 11-1   Range of x for Figure 11-2          First Guess g₀
0                            0                                   0
1                            1 to 3                              1
2 to 4                       4 to 8                              2
5 to 16                      9 to 24                             4
17 to 64                     25 to 80                            8
65 to 256                    81 to 288                           16
...                          ...                                 ...
2^28 + 1 to 2^30             (2^14 + 1)^2 to (2^15 + 1)^2 - 1    2^15
2^30 + 1 to 2^32 - 1         (2^15 + 1)^2 to 2^32 - 1            2^16
This procedure is shown in Figure 11-2. It is convenient there to treat small values of x (0 ≤ x ≤ 24) specially, so that no divisions are done for them.
int isqrt(unsigned x) {
   int s, g0, g1;

   if (x <= 4224)
      if (x <= 24)
         if (x <= 3) return (x + 3) >> 2;
         else if (x <= 8) return 2;
         else return (x >> 4) + 3;
      else if (x <= 288)
         if (x <= 80) s = 3; else s = 4;
      else if (x <= 1088) s = 5; else s = 6;
   else if (x <= 1025*1025 - 1)
      if (x <= 257*257 - 1)
         if (x <= 129*129 - 1) s = 7; else s = 8;
      else if (x <= 513*513 - 1) s = 9; else s = 10;
   else if (x <= 4097*4097 - 1)
      if (x <= 2049*2049 - 1) s = 11; else s = 12;
   else if (x <= 16385*16385 - 1)
      if (x <= 8193*8193 - 1) s = 13; else s = 14;
   else if (x <= 32769*32769 - 1) s = 15; else s = 16;

   g0 = 1 << s;                // g0 = 2**s.
   // Continue as in Figure 11-1.
The worst-case execution time of the algorithm of Figure 11-1, on the basic RISC, is about 26 + (D + 6)n cycles, where D is the divide time in cycles and n is the number of times the while-loop is executed. The worst-case execution time of Figure 11-2 is about 27 + (D + 6)n cycles, assuming (in both cases) that the branch instructions take one cycle. The table below gives the average number of times the loop is executed by the two algorithms, for x uniformly distributed in the indicated range.
                Average Number of Loop Iterations
Range of x      Figure 11-1   Figure 11-2
0 to 9          0.80          0
0 to 99         1.46          0.83
0 to 999        1.58          1.44
0 to 9999       2.13          2.06
0 to 2^32 - 1   2.97          2.97
If we assume a divide time of 20 cycles and x ranging uniformly from 0 to 9999, then both algorithms execute in about 81 cycles.
Because the algorithms based on Newton's method start out with a sort of binary search to obtain the first guess, why not do the whole computation with a binary search? This method would start out with two bounds, perhaps initialized to 0 and 216. It would make a guess at the midpoint of the bounds. If the square of the midpoint is greater than the argument x, then the upper bound is changed to be equal to the midpoint. If the square of the midpoint is less than the argument x, then the lower bound is changed to be equal to the midpoint. The process ends when the upper and lower bounds differ by 1, and the result is the lower bound.
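A minimal sketch of this plain binary search follows; the function name and the choice to raise the lower bound when the squares are equal are our assumptions, and the refined version is Figure 11-3.

int isqrt_bs(unsigned x) {
   unsigned lo = 0, hi = 65536;     // Invariants: lo**2 <= x < hi**2.
   while (hi - lo > 1) {
      unsigned m = (lo + hi) >> 1;  // Midpoint; m <= 65535, so m*m
      if (m*m > x) hi = m;          // cannot overflow 32 bits.
      else lo = m;                  // m*m <= x: raise the lower bound.
   }
   return lo;                       // lo**2 <= x < (lo + 1)**2.
}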
This avoids division, but requires quite a few multiplications: 16 if 0 and 2^16 are used as the initial bounds. (The method gets one more bit of precision with each iteration.) Figure 11-3 illustrates a variation of this procedure, which uses initial values for the bounds that are slight improvements over 0 and 2^16. The procedure shown in Figure 11-3 also saves a cycle in the loop, for most RISC machines, by altering a and b in such a way that the comparison is b ≥ a rather than b - a ≥ 1.
int isqrt(unsigned x) {
   unsigned a, b, m;            // Limits and midpoint.

   a = 1;
   b = (x >> 5) + 8;            // See text.
   if (b > 65535) b = 65535;
   do {
      m = (a + b) >> 1;
      if (m*m > x) b = m - 1;
      else         a = m + 1;
   } while (b >= a);
   return a - 1;
}
The predicates that must be maintained at the beginning of each iteration are a ≤ ⌊√x⌋ + 1 and b ≥ ⌊√x⌋. The initial value of b should be something that's easy to compute and close to ⌊√x⌋. Reasonable initial values are x, x ÷ 4 + 1, x ÷ 8 + 2, x ÷ 16 + 4, x ÷ 32 + 8, x ÷ 64 + 16, and so on. (Each bound of the form x ÷ 2^(k+2) + 2^k is at least √x, because (√x - 2^(k+1))² ≥ 0.) Expressions near the beginning of this list are better initial bounds for small x, and those near the end are better for larger x. (The value x ÷ 2 + 1 is acceptable, but probably not useful, because x ÷ 4 + 1 is everywhere a better or equal bound.)
Seven variations on the procedure shown in Figure 11-3 can be more or less mechanically generated by substituting a + 1 for a, or b - 1 for b, or by changing m = (a + b) ÷ 2 to m = (a + b + 1) ÷ 2, or some combination of these substitutions.
The execution time of the procedure shown in Figure 11-3 is about 6 + (M + 7.5)n, where M is the multiplication time in cycles and n is the number of times the loop is executed. The table below gives the average number of times the loop is executed, for x uniformly distributed in the indicated range.
Range of x      Average Number of Loop Iterations
0 to 9          3.00
0 to 99         3.15
0 to 999        4.68
0 to 9999       7.04
0 to 2^32 - 1   16.00
If we assume a multiplication time of 5 cycles and x ranging uniformly from 0 to 9999, the algorithm runs in about 94 cycles. The maximum execution time (n = 16) is about 206 cycles.
If number of leading zeros is available, the initial bounds can be set from
b = (1 << (33 - nlz(x))/2) - 1;
a = (b + 3)/2;
That is, for x ≥ 1 with n = 32 - nlz(x) significant bits, b = 2^⌈n/2⌉ - 1 and a = 2^(⌈n/2⌉ - 1) + 1. These are very good bounds for small values of x (one loop iteration for 0 ≤ x ≤ 15), but only a moderate improvement, for large x, over the bounds calculated in Figure 11-3. For x in the range 0 to 9999, the average number of iterations is about 5.45, which gives an execution time of about 74 cycles, using the same assumptions as above.
There is a shift-and-subtract algorithm for computing the square root that is quite similar to the hardware division algorithm described in Figure 9-2 on page 149. Embodied in hardware on a 32-bit machine, this algorithm employs a 64-bit register that is initialized to 32 0-bits followed by the argument x. On each iteration, the 64-bit register is shifted left two positions, and the current result y (initially 0) is shifted left one position. Then 2y + 1 is subtracted from the left half of the 64-bit register. If the result of the subtraction is nonnegative, it replaces the left half of the 64-bit register, and 1 is added to y (this does not require an adder, because y ends in 0 at this point). If the result of the subtraction is negative, then the 64-bit register and y are left unaltered. The iteration is done 16 times.
This algorithm was described in 1945 [JVN].
Perhaps surprisingly, this process runs in about half the time of the 64 ÷ 32 ⇒ 32 hardware division algorithm cited, because it does half as many iterations and each iteration is about equally complex in the two algorithms.
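A direct C simulation of this hardware process may make it concrete; the function name and the use of a 64-bit integer to model the register are our illustrative assumptions.

#include <stdint.h>

unsigned isqrt_hw(uint32_t x) {
   uint64_t r = x;                  // 32 0-bits followed by x.
   uint32_t y = 0;                  // Current result.
   for (int i = 0; i < 16; i++) {   // Iterate 16 times.
      r = r << 2;                   // Shift register left two positions.
      y = y << 1;                   // Shift result left one position.
      uint64_t t = r - ((uint64_t)(2*y + 1) << 32);  // Left half minus (2y + 1).
      if ((int64_t)t >= 0) {        // If nonnegative, keep it, and
         r = t;
         y = y | 1;                 // add 1 to y (y ends in 0 here).
      }
   }
   return y;                        // Remainder is in the left half of r.
}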
To code this algorithm in software, it is probably best to avoid the use of a doubleword shift register, which requires about four instructions to shift. The algorithm in Figure 11-4 [GLS1] accomplishes this by shifting y and a mask bit m to the right. It executes in about 149 basic RISC instructions (average). The two expressions y | m could also be y + m.
int isqrt(unsigned x) {
   unsigned m, y, b;

   m = 0x40000000;
   y = 0;
   while (m != 0) {              // Do 16 times.
      b = y | m;
      y = y >> 1;
      if (x >= b) {
         x = x - b;
         y = y | m;
      }
      m = m >> 2;
   }
   return y;
}
The operation of this algorithm is similar to the grade-school method. It is illustrated below, for finding ⌊√179⌋ on an 8-bit machine.

  1011 0011   x0   Initially, x = 179 (0xB3).
-  1          b1
  ─────────
  0111 0011   x1   0100 0000   y1
-  101        b2   0010 0000   y2
  ─────────
  0010 0011   x2   0011 0000   y2
-   11 01     b3   0001 1000   y3
  ─────────
  0010 0011   x3   0001 1000   y3   (Can't subtract.)
-    1 1001   b4   0000 1100   y4
  ─────────
  0000 1010   x4   0000 1101   y4
The result is 13 with a remainder of 10 left in register x.
It is possible to eliminate the if x >= b test by the usual trickery involving shift right signed 31. It can be proved that the high-order bit of b is always zero (in fact, b ≤ 5 · 2^28), which simplifies the x >= b predicate (see page 22). The result is that the if statement group can be replaced with
t = (int)(x | ~(x - b)) >> 31;  // -1 if x >= b, else 0.
x = x - (b & t);
y = y | (m & t);
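To see why the first line works (a brief justification of the cited simplification): because b < 2^31, if the high-order bit of x is 1, then x >= b holds and x supplies a 1 in the sign position. If the high-order bit of x is 0, then both x and b are less than 2^31, so x - b has its sign bit off, and hence ~(x - b) has its sign bit on, exactly when x >= b. Either way, the sign bit of x | ~(x - b) is 1 precisely when x >= b, and the signed shift by 31 propagates it to all positions of t.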
This replaces an average of three cycles with seven, assuming the machine has the or not instruction, but it might be worthwhile if a conditional branch in this context takes more than five cycles.
Somehow it seems that it should be easier than some hundred cycles to compute an integer square root in software. Toward this end, we offer the expressions below to compute it for very small values of the argument. These can be useful to speed up some of the algorithms given above, if the argument is expected to be small.
The expression                is correct in the range   and uses this many instructions (full RISC)
x                             0 to 1                    0
x > 0                         0 to 3                    1
(x + 3) >> 2                  0 to 3                    2
(lost)                        0 to 3                    2
(lost)                        0 to 5                    2
(lost)                        1 to 8                    2
(lost)                        4 to 15                   2
(x > 0) + (x > 3)             0 to 8                    3
(x > 0) + (x > 3) + (x > 8)   0 to 15                   5
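Claims like these are easy to check exhaustively. A throwaway harness along the following lines (the helper f and the counting loop are ours, for illustration) verifies the last row:

#include <stdio.h>

static unsigned f(unsigned x) {     // Candidate from the last table row.
   return (x > 0) + (x > 3) + (x > 8);
}

int main(void) {
   for (unsigned x = 0; x <= 15; x++) {
      unsigned r = 0;               // Reference isqrt by counting up.
      while ((r + 1)*(r + 1) <= x) r++;
      if (f(x) != r) printf("mismatch at x = %u\n", x);
   }
   return 0;
}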
Ah, the elusive square root,
It should be a cinch to compute.
But the best we can do
Is use powers of two
And iterate the method of Newt!