LISTING 2. PIC version of fast square root,
built on macros for 32-bit operations.
clrf32 root ; root = 0
clrf16 trial ; trial = 0x80000000
clrf trial+2 ; "
movlw 0x80 ; "
movwf trial+3 ; "
bcf STATUS,C; Clear C flag for following rotate
rrf32 trial ; trial >>= 1
ior32 root,trial ; root != trial
cmp32 root,arg; if (arg < root) - C flag set if arg < root
bnc RootFits; "
; Didn't fit, remove trial bit
xor32 root,trial ; root ^= trial
arg,root; arg -= root
root,trial ; root += trial
STATUS,C; Clear C flag for following rotate
root ; root >>= 1 - Always leaves C flag clear
trial ; trial >>= 1
RootBit ; if trial bit shifted out, we're done
; Compute rounding. Note that root always fits in 16 bits here,
; but it could round up to 17 bits.
movf arg+2,f ; see if bits 17-23 are non-zero
bnz IncResult ; if non-zero, arg is big, always > root
cmp16 arg,root; if (arg > root) - C flag set if root < arg
bnc Done ; "
inc24 root ; root++
The function starts by initializing
the variable trial which contains the
current trial bit. The function is
designed to be easily targeted for
16-bit or for 32-bit integers (or any
other size), which explains the somewhat complicated-looking expression
that initializes trial. I’ll assume we’re
using 32-bit integers, in which case trial
gets initialized to 0x80000000 — a 1 in
the MSB and the rest 0. Note that the
first thing that happens inside the loop
is that trial is shifted right one bit, so
the first trial bit value is 0x40000000.
In the main loop, the variable
named root actually keeps the value of
two times the current root. Combining
trial into root gives us the amount to
subtract from the current remainder
(called arg) if it fits. If it doesn’t fit, the
exclusive-OR operation is used to clear
the trial bit in root. If it does fit, we need
to double the trial bit so that, as part of
the new root, it will be root times two.
At the bottom of the loop, root
and trial are each shifted right one bit.
This moves them to the next bit
position for which we’ll be computing
the root. Looking back at Figure 4, you
can see how the subtraction for each
bit is shifted right compared to the
previous bit. Conveniently, when it’s
time to end the loop, this right shift of
I have never seen any description of
this algorithm published before. I don’t
mean that it’s original with me — in fact, I
believe this algorithm is widely used in
floating-point hardware. I became aware
of the algorithm many years ago when
reading about a particular computer
architecture that included hardware
I have an old Pentium manual that
lists the time to perform a floating-point
divide as 39 clock cycles, and the time to
perform square root as 70 clock cycles –
less than twice as long. The only way that
could be happening is if the floating-point unit implements this algorithm in
For a processor that has hardware
divide but no hardware square root, the
most common way to compute square
root is with Newton-Raphson iteration.
This algorithm is very simple:
1) Make a guess at the root.
40 SERVO 01.2008
Square Root Algorithms
2) Divide the guess into the argument.
3) Average the quotient with the guess to
make the next guess.
This method converges very fast;
every iteration doubles the number of
digits (or bits) of accuracy. The trick is
to come up with a good initial guess.
This is difficult for integer or fixed-point arithmetic, where it may take one
iteration to get one bit correct. Then the
subsequent iterations give you 2, 4, 8,
and 16 bits of accuracy — typically five
total iterations for a 16-bit root of a 32-bit
When using floating point arithmetic,
it’s easy to get an accurate initial guess
because of the “normalized” format of
the numbers. With an eight-bit multiply
and addition, you can come up with a
guess accurate to almost five bits. Then
only three iterations are needed to
achieve single-precision accuracy.
I used this technique when I wrote
the x87 emulator for Microsoft in 1991.
This program provided floating-point
arithmetic back in the days before it was
built into every processor; it hasn’t been
needed since the 486SX. It is still widely
distributed, included as WIN87EM.DLL
with every 32-bit operating system
Microsoft has shipped.
While researching for this article, I
found another square root algorithm
on the Microchip website posted in
Application Note TB040 for PICs that
have hardware multiply. The general
idea is to set a trial bit in the root,
square the root, and see if it was too
big. The algorithm calculates the result
bit by bit, with one square (multiply)
operation per bit. According to the
app note, with a 32-bit input this
method would take 200 microseconds
at 20 MHz, and the code size appeared
to be over 200 bytes. So it is both
bigger and slower than the algorithm I