The double-dabble technique converts binary to BCD by repeated shifting. Each shift halves the remaining binary number and doubles the accumulating BCD number; once the complete binary value has been shifted in, the result is ready. After each shift a correction is applied to each 4-bit BCD column (or at least to those that have had more than 3 bits shifted in by that point). This correction looks for digits that would 'BCD overflow' from decimal 9 to 10 on the next shift and patches them by adding three.
Why three? BCD digits in the range zero to four (0, 1, 2, 3, 4) double naturally to 0, 2, 4, 6, 8 after the shift. Examining 5 (binary 0101): it would shift to 1010 (0xA), which is not a valid BCD digit. 5 is therefore corrected to 5 + 3 = 8 (binary 1000), which the shift doubles to decimal 16 (0x10), representing a carry of 1 out into the next digit and the expected zero left behind.
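The correction rule is easy to check with a quick Python model (my own illustration, not taken from any of the implementations discussed; the helper name correct is hypothetical):

```python
def correct(digit):
    """Double-dabble correction: add 3 to any BCD digit of 5 or more."""
    return digit + 3 if digit >= 5 else digit

# Every corrected digit doubles cleanly under the shift:
for d in range(10):
    shifted = correct(d) << 1            # the shift doubles the digit
    carry, new_digit = shifted >> 4, shifted & 0xF
    assert 10 * carry + new_digit == 2 * d
```

For d = 5 this reproduces the example above: correct(5) is 8, and 8 shifted left is 0x10, i.e. a carry of 1 and a digit of 0.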
Implementations repeat this process either sequentially in time, using a shift register and n clock cycles for an n-bit input, or in space, by cascading the correction logic stages and performing the shifts in the wiring between them. There is a carry path right through every digit, and the correction logic does not map onto FPGA (binary) carry-chain primitives, so the spatial implementation generally gives unacceptable timing results for large inputs. A typical engineering trade-off.
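The shift-and-correct iteration itself can be sketched in a few lines of Python (again my own model, not either hardware implementation):

```python
def double_dabble(value, n_bits=8):
    """Binary-to-BCD by n shift-and-correct cycles; packed BCD result."""
    bcd = 0
    for i in range(n_bits - 1, -1, -1):
        # Correct every BCD digit of 5 or more before the shift.
        for shift in range(0, n_bits + 1, 4):
            if (bcd >> shift) & 0xF >= 5:
                bcd += 3 << shift
        # Shift the next binary MSB into the BCD register.
        bcd = (bcd << 1) | ((value >> i) & 1)
    return bcd

print(hex(double_dabble(243)))  # 0x243: one decimal digit per nibble
```

In hardware the outer loop is either n clock cycles (the serial version) or n unrolled layers of logic (the spatial version).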
For a parallel (asynchronous) conversion
For narrow values like yours, Dr. John Loomis's site has a guide to the logic structure required to implement it in hardware. Modern reprogrammable logic can manage 8 bits wide at maybe 100 MHz after aggressive synthesis. The module add3 takes a 4-bit input and passes it through verbatim, or adds three if it is greater than four:
module add3(in, out);
  input  [3:0] in;
  output [3:0] out;
  reg    [3:0] out;

  always @ (in)
    case (in)
      4'b0000: out <= 4'b0000; // 0 -> 0
      4'b0001: out <= 4'b0001;
      4'b0010: out <= 4'b0010;
      4'b0011: out <= 4'b0011;
      4'b0100: out <= 4'b0100; // 4 -> 4
      4'b0101: out <= 4'b1000; // 5 -> 8
      4'b0110: out <= 4'b1001;
      4'b0111: out <= 4'b1010;
      4'b1000: out <= 4'b1011;
      4'b1001: out <= 4'b1100; // 9 -> 12
      default: out <= 4'b0000;
    endcase
endmodule
Chaining these add3 modules together, with the shifts done in the wiring between them, gives the output.
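The case table reduces to "add three when the input exceeds four", which a one-line Python model (my illustration) makes easy to check against the Verilog entry by entry:

```python
def add3(nibble):
    """Model of the add3 block: pass 0-4 through, add 3 to 5-9."""
    return nibble + 3 if nibble > 4 else nibble

# Matches the Verilog case table above, entry by entry:
assert [add3(n) for n in range(10)] == [0, 1, 2, 3, 4, 8, 9, 10, 11, 12]
```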
For a sequential (multi-cycle, pipelined) variant
For wide signals, a serial technique described in Xilinx application note XAPP 029 runs one bit per cycle, probably at 300 MHz+.
If anyone knows a good hybrid technique, I'd be interested to hear about it. I modelled both in Verilog with test benches in my verilog-utils collection.
You can manipulate arbitrarily wide numbers using a finite width ALU. However, multiple operations are required when the number is wider than the ALU.
To illustrate this, I'll use a PIC 18 as an example. This architecture can manipulate only 8-bit numbers directly. To add the 8-bit quantities VAR1 and VAR2 and put the result into SUM would take code like this (assuming banking isn't needed or the bank is already properly set up):
movf var1, w ;get VAR1 into accumulator
addwf var2, w ;add VAR2 to it
movwf sum ;save the result
If instead VAR1, VAR2, and SUM were 32 bit variables, the code would be:
movf var1+0, w ;make byte 0 (low byte)
addwf var2+0, w
movwf sum+0
movf var1+1, w ;make byte 1
addwfc var2+1, w
movwf sum+1
movf var1+2, w ;make byte 2
addwfc var2+2, w
movwf sum+2
movf var1+3, w ;make byte 3 (high byte)
addwfc var2+3, w
movwf sum+3
Each byte is handled separately. Note the use of ADDWFC instead of ADDWF for all but the first byte. This instruction adds the two bytes but also adds the carry from the previous ALU operation.
Hopefully you can see that this approach can be used to add two arbitrarily large numbers. Wide arithmetic is possible; it just takes longer and requires multiple trips through the ALU to perform the whole operation.
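The byte-at-a-time addition above can be modelled in Python (my own sketch, with hypothetical names; it mirrors how ADDWF/ADDWFC thread the carry flag between passes):

```python
def wide_add(var1, var2):
    """Add two little-endian byte lists through an 8-bit 'ALU'.
    The final carry out is discarded, as in the PIC code above."""
    carry = 0
    total = []
    for b1, b2 in zip(var1, var2):
        s = b1 + b2 + carry          # ADDWF for byte 0, ADDWFC after that
        total.append(s & 0xFF)       # MOVWF sum+n
        carry = s >> 8               # carry flag feeds the next pass
    return total

# 0x000001FF + 0x00000001 = 0x00000200, low byte first:
print(wide_add([0xFF, 0x01, 0, 0], [0x01, 0, 0, 0]))  # [0, 2, 0, 0]
```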
Operations like AND, OR, XOR, etc, are bit-wise independent. Each output bit can be computed given only the corresponding input bits. Such operations scale linearly with the width of the values being processed. Addition requires the two input bits but also the result of the carry from the next lower bit. That requires one bit of state to be kept between successive operations, but still scales linearly with the width of the number.
Multiply scales with the square of the width of the number, since each chunk of one number must be multiplied by every chunk of the other number. Multiplying two 32-bit numbers, which can produce up to a 64-bit result, takes a lot longer than 8 × 8 on an 8-bit ALU, but it's all possible.
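The quadratic cost is visible in a schoolbook multiply modelled the same way (again my own sketch, not PIC library code): the two nested loops mean every byte of one operand is multiplied by every byte of the other.

```python
def wide_mul(a, b):
    """Schoolbook multiply of little-endian byte lists: O(n*m) passes."""
    out = [0] * (len(a) + len(b))    # product can be as wide as both inputs
    for i, x in enumerate(a):
        carry = 0
        for j, y in enumerate(b):
            s = out[i + j] + x * y + carry   # one 8 x 8 -> 16 bit step
            out[i + j] = s & 0xFF
            carry = s >> 8
        out[i + len(b)] += carry
    return out

# 0xFFFF * 0xFFFF = 0xFFFE0001, low byte first:
print(wide_mul([0xFF, 0xFF], [0xFF, 0xFF]))  # [1, 0, 254, 255]
```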
Use two 74LS148 or 74HC148 priority encoders, driving a quad NAND gate (74HC00), to get your 4-bit output. The inputs are active low, and so are the outputs, hence the need for the NAND gates.
Wire the 'EI' input of the MSB encoder (the IC with inputs 8-15) to ground, and its 'EO' output to the 'EI' input of the next stage, which takes inputs 0-7. The NAND gates combine the two encoders' 3-bit outputs, and the 'GS' (group select) output of the MSB encoder, which goes low whenever any of inputs 8-15 is active, drives the 4th NAND gate to produce the MSB of the 4-bit result. The unused pin of gate 4 should be tied to logic high. Now you have 3 ICs to get the output you want.
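A behavioral model of the cascade (my own Python sketch, abstracting the '148's active-low truth table) shows how the enable daisy-chain and the NAND gates produce the 4-bit code; note the MSB gate is driven from the encoder's GS (group select) line, which is low whenever any of inputs 8-15 is active:

```python
def encoder_148(inputs, ei):
    """Behavioral 74HC148: 8 active-low inputs, active-low outputs.
    Returns (a2, a1, a0, gs, eo)."""
    if ei:                            # EI high: device disabled
        return (1, 1, 1, 1, 1)
    for n in range(7, -1, -1):        # input 7 has highest priority
        if inputs[n] == 0:
            a = ~n & 7                # active-low binary code
            return ((a >> 2) & 1, (a >> 1) & 1, a & 1, 0, 1)
    return (1, 1, 1, 1, 0)            # enabled, no input active

def nand(a, b):
    return 0 if (a and b) else 1

def encode16(inputs):
    """Two cascaded '148s plus four NAND gates -> active-high 4-bit code."""
    hi = encoder_148(inputs[8:16], 0)     # EI of the MSB stage grounded
    lo = encoder_148(inputs[0:8], hi[4])  # its EO enables the LSB stage
    b2, b1, b0 = (nand(h, l) for h, l in zip(hi[:3], lo[:3]))
    b3 = nand(hi[3], 1)                   # GS of MSB stage; spare input high
    return (b3 << 3) | (b2 << 2) | (b1 << 1) | b0

inputs = [1] * 16
inputs[13] = 0                       # assert input 13 (active low)
print(encode16(inputs))              # 13
```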
EDIT: Converting this binary output (which is not latched) to decimal requires a simple PIC MCU and LED numeric displays. If you can find an old TIL311 hexadecimal display on the market, it will show you the hex value at the NAND gate outputs.
The decimal display requires software to convert to decimal format: first latch in the newest hex value entered, then convert it to a format compatible with the LEDs; or you can output 2 BCD digits and have them drive two 7447 or 74LS47 BCD to 7-segment display drivers.
You would need 10 to 15 ICs to latch, compare, convert to decimal, and then drive the 7-segment LED output. With software you only need 1 to 3 ICs to do the same.