A routine which I have used before (I don't know if it's a "proper" one or not) is a divide-and-conquer approach.
You start off with an arbitrary upper and lower value (say 5 and 0 respectively - the highest and lowest square roots you want to find) and find the mid-point between them. Square that value.
If the squared value is greater than your target, move the upper bound down to the mid-point; if it's lower, move the lower bound up to the mid-point.
Repeat until either the squared mid-point matches your target value, or you have run enough iterations to be as accurate as you like.
Here's a little version I have knocked together in Perl:
#!/usr/bin/perl
use strict;
use warnings;

my $val = shift;         # the number to take the square root of
my $max = 5;             # arbitrary upper bound on the root
my $min = 0;             # lower bound
my $iterations = 0;
my $maxiter    = 40;

while (($max > $min) and ($iterations < $maxiter)) {
    $iterations++;
    my $mid    = $min + ($max - $min) / 2;   # mid-point of the current interval
    my $square = $mid * $mid;
    if ($square == $val) {
        print "Square root found at $mid\n";
        print "$iterations iterations\n";
        exit(0);
    } elsif ($square > $val) {
        $max = $mid;    # root lies in the lower half of the interval
    } else {
        $min = $mid;    # root lies in the upper half
    }
}
my $mid = $min + ($max - $min) / 2;
print "Approximate square root after $iterations iterations: $mid\n";
This of course uses floating point, but could easily be adapted to fixed point. You can vary the accuracy by changing the iteration limit: each iteration halves the search interval, so every pass is slightly more accurate than the one before.
e.g. finding the square root of 9:
Approximate square root after 40 iterations: 2.99999999999955
- or -
Approximate square root after 10 iterations: 3.00048828125
- or -
Approximate square root after 5 iterations: 3.046875
If it had found the value 3 it would have stopped early of course.
Give it enough iterations (the run below used a higher $maxiter than the 40 in the listing) and it should get very accurate:
./sqrt.pl 0.00284
Square root found at 0.0532916503778969
59 iterations
Best Answer
A colleague of mine benchmarked this and came to the conclusion that FPGAs would outperform a PC once you had more than about 100 independent, integer tasks that would fit in the FPGA. For floating point tasks, GPGPU beat FPGA throughout. For narrow multithreading or SIMD operation, CPUs are extremely optimised and run at higher clock speeds than FPGAs typically achieve.
The other caveats: tasks must be independent. If there are data dependencies between tasks, the critical path through those dependencies limits how much can actually run in parallel. FPGAs are good for boolean evaluation and integer maths, as well as hardware low-latency interfaces, but not for memory-dependent workloads or floating point.
If you have to keep the workload in DRAM then that will be the bottleneck rather than the processor.