Electronic – Why does hardware division take much longer than multiplication

algorithmarithmetic-divisionhardwaremathmicrocontroller

Why does hardware division take so much longer than multiplication on a microcontroller? E.g., on a dsPIC, a division takes 19 cycles, while multiplication takes only one clock cycle.

I went through some tutorials, including Division algorithm and Multiplication algorithm on Wikipedia. Here is my reasoning.

A division algorithm, like a slow division method with restoring on Wikipedia, is a recursive algorithm. This means that (intermediate) results from step k are used as inputs to step k+1, which means that these algorithms cannot be parallelized. Therefore, it takes at least n cycles to complete the division, whereas n is a number of bits in a dividend. For 16-bit dividends, this is equal to at least 16 cycles.

A multiplication algorithm doesn't need to be recursive, which means that it is possible to parallelize it. However, there are many different multiplication algorithms, and I don't have a clue which one may be used by microcontrollers. How does multiplication work on a hardware/microcontroller?

I've found a Dadda multiplier algorithm, which is supposed to take only one clock cycle to finish. However, what I don't get here is that Dadda's algorithm proceeds in three steps, whereas results from step 1 are used in step 2, etc. According to this, this would take at least three clock cycles to finish.

Best Answer

A divider maps much less elegantly to typical hardware. Take Lattice ICE40 FPGAs as examples.

Let us compare two cases: this 8x8 bit to 16 bit multiplier:

module multiply (clk, a, b, result);
   input clk;
   input [7:0]a;
   input [7:0]b;
   output [15:0]result;
   always @(posedge clk)
     result = a * b;
endmodule // multiply

and this divider that reduces 8 and 8 bit operands to 8 bit result:

module divide(clk, a, b, result);
   input clk;
   input [7:0] a;
   input [7:0] b;
   output [7:0] result;
   always @(posedge clk)
     result = a / b;
endmodule // divide

(Yes, I know, the clock doesn't do anything)

An overview of the generated schematic when mapping the multiplier to an ICE40 FPGA can be found here and the divider here.

The synthesis statistics from Yosys are:

multiply

Number of wires: 155
Number of wire bits: 214
Number of public wires: 4
Number of public wire bits: 33
Number of memories: 0
Number of memory bits: 0
Number of processes: 0
Number of cells: 191
- SB_CARRY 10
- SB_DFF 16
- SB_LUT4 165

divide

Number of wires: 145
Number of wire bits: 320
Number of public wires: 4
Number of public wire bits: 25
Number of memories: 0
Number of memory bits: 0
Number of processes: 0
Number of cells: 219
- SB_CARRY 85
- SB_DFF 8
- SB_LUT4 126

It's worth noting that the size of the generated verilog for a full-width multiplier and a maximally-dividing divider aren't that extreme. However, if you'll look at the pictures below, you'll notice the multiplier has maybe a depth of 15, whereas the divider looks more like 50 or so; the critical path (i.e. the longest path that can occur during operation) is what defines the speed!

You won't be able to read this, anyway, to get a visual impression. I think the differences in complexity are possible to spot. These are single cycle multiplier/dividers!

Multiply

Multiply on an ICE40 (warning: ~100 Mpixel image)

Divide

(Divide on an ICE40) (warning: ~100 Mpixel image)

Related Solutions

Electronic – hardware implementation of division algorithm

Flowcharts are often not a very precise way of indicating what hardware is doing, since flowcharts often imply the existence of a single execution process, whereas hardware often does many overlapping and simultaneous operations.

The portions of the diagram circled in red seem a bit odd. It seems odd to latch A with the value after subtracting B, and then re-add B. More natural would simply be to not bother latching the lower part of the subtraction result. I think the flowchart might be clearer if "named values" were separated into "registers" and "values", and each step either computed values or registers. Thus, for example, one could have something like (assuming 16-bit registers)

C:T[15..0] = (A[14..0]:Q[15]) + ~B-1
if (C or A[15])
  A[15..0] = (A[14..0]:Q[15])
  Q[15..1] = Q[14..0]
  Q[0] = 1
Else
  A[15..0] = T[15..0]
  Q[15..1] = Q[14..0]
  Q[0] = 0
Endif

Every step that updates registers would represent a system clock. Events that merely compute values would not require a clock edge, but would be processed asynchronously.

Electronic – FPU vs. Fixed point multiplication/division speed

Floating point is more complicated to implement in silicon than fixed point, so usually has more delay. However, every implementation is different, and there are various ways to trade off speed with the number of transistors used.

Your question can therefore only be answered for specific units by looking at their datasheets or running benchmarks. In general, fixed point multiplication and addition will probably be faster than floating point given a fixed point. FP division could possibly be faster because there might be fewer bits to divide. Divide algorithms are often per-bit loops.

Again, the only way to know for sure is to compare the specs or run benchmarks.