I haven't done this for double precision FP, but the same principles apply as for single precision, for which I have implemented division (as multiply by reciprocal).
What these FPGAs do have, instead of FPUs, is hardwired DSP/multiplier blocks, each capable of implementing an 18*18 or (Virtex-5) 18*25 multiplication in a single cycle. The larger devices have around a thousand of these, and even the Spartan-3 and Spartan-6 families reach 126 and 180, respectively, at the top end.
So you can decompose a large multiplication into smaller operations spread across several of these blocks (2 for the Virtex-5 doing single precision), using the DSPs' adders or FPGA fabric to sum the partial products.
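As a rough software illustration of that decomposition, the sketch below multiplies two 24-bit single-precision significands using two narrower multiplies and one add. The 17-bit split point (treating the signed 18-bit port as 17 unsigned bits) and the names `narrow_mul` and `mul24` are assumptions made for this example, not the actual hardware mapping used in the design described here.

```python
# Sketch: 24x24-bit significand multiply built from two 24x17-bit multiplies.
# The split width and helper names are assumptions for illustration only.

CHUNK_BITS = 17                        # assumed unsigned width of the narrow port
CHUNK_MASK = (1 << CHUNK_BITS) - 1


def narrow_mul(a: int, b_chunk: int) -> int:
    """Stand-in for one hardware multiplier: 24-bit by 17-bit, unsigned."""
    assert 0 <= a < (1 << 24)
    assert 0 <= b_chunk < (1 << CHUNK_BITS)
    return a * b_chunk


def mul24(a: int, b: int) -> int:
    """48-bit product of two 24-bit significands from two partial products."""
    b_lo = b & CHUNK_MASK              # low 17 bits of b
    b_hi = b >> CHUNK_BITS             # remaining high 7 bits of b
    p_lo = narrow_mul(a, b_lo)         # first multiplier block
    p_hi = narrow_mul(a, b_hi)         # second multiplier block
    # The adder stage sums the partial products after aligning the high one.
    return p_lo + (p_hi << CHUNK_BITS)
```

Double precision follows the same pattern with a 53-bit significand, which simply needs more chunks and a deeper adder tree.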
You will get an answer in a few cycles - 3 or 4 for SP, maybe 5 for DP - depending on how you compose the adder tree (and sometimes, where the synth tools insist on adding pipeline registers!).
However, that is the latency; because the multiplier is pipelined, throughput is one result per clock cycle.
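The difference between latency and throughput is easy to put into numbers. The helper below is a minimal sketch, assuming a fully pipelined unit that accepts one new operation every cycle; the function names are made up for this illustration.

```python
def pipelined_cycles(n_ops: int, latency: int) -> int:
    """Cycles until the last of n_ops results leaves a fully pipelined unit."""
    if n_ops <= 0:
        return 0
    # The first result appears after `latency` cycles, then one more per cycle.
    return latency + (n_ops - 1)


def unpipelined_cycles(n_ops: int, latency: int) -> int:
    """Cycles if each operation had to finish before the next could start."""
    return max(n_ops, 0) * latency
```

For 1000 multiplications at a latency of 4 cycles, this gives 1003 cycles pipelined against 4000 unpipelined, so the latency is almost entirely hidden once the pipeline is full.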
For division, I approximated a reciprocal operator using a lookup table followed by quadratic interpolation. This was accurate to better than single-precision and would extend (with more hardware) to DP if I wanted. In Spartan-6 it takes 2 BlockRams and 4 DSP/multipliers, and a couple of hundred LUT/FF pairs.
Its latency is 8 cycles, but again the throughput is single-cycle, so by combining it with the above multiplier, you get one division per clock cycle.
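The table-plus-quadratic idea can be sketched in software as follows. This is not the hardware design described above: the table size (256 segments over [1, 2)), the use of Taylor coefficients around each segment midpoint, and double-precision arithmetic in place of fixed-point are all assumptions chosen to keep the example short.

```python
# Sketch: reciprocal of a significand x in [1, 2) via table lookup followed by
# quadratic interpolation. Table size and coefficient choice are assumptions.

SEG_BITS = 8                           # assumed: top 8 fraction bits index the table
SEGMENTS = 1 << SEG_BITS


def build_table():
    """Per segment: midpoint m and the Taylor coefficients of 1/x around m."""
    table = []
    for i in range(SEGMENTS):
        m = 1.0 + (i + 0.5) / SEGMENTS
        table.append((m, 1.0 / m, -1.0 / m**2, 1.0 / m**3))
    return table


TABLE = build_table()


def reciprocal(x: float) -> float:
    """Approximate 1/x for 1 <= x < 2."""
    assert 1.0 <= x < 2.0
    i = int((x - 1.0) * SEGMENTS)      # table index from the leading fraction bits
    m, c0, c1, c2 = TABLE[i]
    d = x - m                          # offset from the segment midpoint
    return c0 + d * (c1 + d * c2)      # quadratic, evaluated in Horner form


def divide(a: float, b: float) -> float:
    """a / b as a multiply by the reciprocal, for b in [1, 2)."""
    return a * reciprocal(b)
```

With these assumed parameters the relative error of `reciprocal` stays below 2^-24, which is in line with the "better than single precision" claim. A real implementation would also handle the exponent and sign separately and would likely use fitted rather than Taylor coefficients.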
It should exceed 100MHz in Spartan-3. In Spartan-6 the synthesis estimate is 185MHz but that's with 1.6ns on a single routing path, so 200MHz is within reason.
In Virtex-5 it reached 200MHz without effort, as did its square root twin. I had a couple of summer students attempt to re-pipeline it: with fewer than 12 cycles of latency they got close to 400MHz, i.e. one square root every 2.5 ns.
But remember that you have maybe a hundred to a thousand DSP units. That gives you one or two orders of magnitude more processing power than a single FP unit.
Are you running code from RAM or from flash? ARM processors that run code from flash often require wait states in at least some circumstances; such processors often include hardware which can eliminate most of the wait states in common code, but such hardware may be as simple as a single-line buffer which allows an access to the same line of flash as the previous access to avoid the wait state. If the branch target is the last word of a flash line, then the flash would require two or three cycles to fetch that word, and two or three cycles to fetch the following word. If one of the cycles is performed concurrently with some other CPU operation, that would leave a three-cycle penalty.
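A toy model makes the alignment effect visible. It assumes a single-line buffer, a 4-word flash line, and 3 cycles to read a new line; all three numbers are placeholders, so check the reference manual of the actual part. It also ignores any prefetching or overlap with other CPU activity.

```python
WORDS_PER_LINE = 4     # assumed flash line width, in words
FLASH_CYCLES = 3       # assumed cycles to read a new line from flash


def fetch_cycles(start_word: int, n_words: int) -> int:
    """Cycles to fetch n_words sequential words right after a branch.

    The buffer is treated as empty at the branch target, so the first
    access always pays the full flash read time.
    """
    buffered_line = None
    cycles = 0
    for addr in range(start_word, start_word + n_words):
        line = addr // WORDS_PER_LINE
        if line == buffered_line:
            cycles += 1                # hit in the single-line buffer
        else:
            cycles += FLASH_CYCLES     # new line: pay the flash access time
            buffered_line = line
    return cycles
```

Under these assumptions, `fetch_cycles(0, 2)` (target at the start of a line) gives 4 cycles, while `fetch_cycles(3, 2)` (target at the last word of a line) gives 6, because both words need a fresh flash read.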
Best Answer
1 and 2: there's no hardware floating point unit on the M0, so it depends on your compiler alone. Expect on the order of tens to possibly low hundreds of cycles for single precision, with full IEEE compatibility. As for double precision, you're probably looking at high hundreds, maybe even breaking the thousand-cycle barrier, again assuming full IEEE compatibility.
3: single cycle.