"However when a device attempts to use the bus the line will be pulled to 0V."
That's it exactly: the bus is high only if all devices set it high. Just like in an AND gate the output is high only if all inputs are high.
Tri-state logic is not the way to control the bus; if one device sets it high, and another sets it low, you have a short circuit. Usually there's a passive pullup, which keeps the bus at its high level. Each device controls an open drain FET to pull it low.
Like you said, in this setup there's no way to determine how many devices are pulling it low simultaneously.
This is indeed active high logic: for the AND function the bus is active (high) if all inputs are active (high). If a device pulls the bus low it's just putting a low level on it.
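To make the wired-AND behaviour concrete, here is a minimal Python sketch of an open-drain bus with a passive pull-up. The function name `bus_level` and the list-of-booleans interface are just illustration choices of mine, not part of any real bus API.

```python
def bus_level(pulling_low):
    """Return the level of an open-drain bus: 1 = high, 0 = low.

    pulling_low: one boolean per device; True means that device's
    open-drain FET is switched on and pulls the line to ground.
    """
    # The pull-up keeps the line high unless at least one FET conducts.
    return 0 if any(pulling_low) else 1


# Wired-AND: the bus is high only if every device leaves it high.
idle = bus_level([False, False, False])
one_low = bus_level([True, False, False])
three_low = bus_level([True, True, True])
```

`one_low` and `three_low` are both 0: the level alone doesn't tell you how many devices are pulling the line down.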
The external data-bus width doesn't always agree with the processor's internal structure. A well-known example is the old Intel 8088 processor, which was identical to the 16-bit 8086 internally, but had an 8-bit external bus.
Data-bus width is not a real indicator of the processor's power, though a narrower bus may limit data throughput. The actual power of a processor is determined mostly by the CPU's ALU (Arithmetic and Logic Unit). 8-bit microcontrollers have 8-bit ALUs, which can process data in the range 0..255. That's enough for text processing: the ASCII character table only needs 7 bits. The ALU can do some basic arithmetic, but for larger numbers you'll need software help. If you want to add 100500 + 120760 then the 8-bit ALU can't do that directly; not even a 16-bit ALU can, since both numbers are larger than 65535. So the compiler will split the numbers, do separate calculations on the parts, and recombine the result later.
Suppose you have a decimal ALU which can process numbers of up to 3 decimal digits. The compiler will split 100500 into 100 and 500, and 120760 into 120 and 760. The CPU calculates 500 + 760 = 1260, which it stores as 260 plus an overflow (carry) of 1. It takes the carry and adds it to 100 + 120, so that sum is 221. It then recombines the two parts to get the final result 221260. This way you can do anything. The three-digit limit was no obstacle to processing 6-digit numbers, and you can write algorithms for processing 10-digit numbers or more. Of course the calculation will take longer than with an ALU which can do 10-digit calculations natively, but it can be done.
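The split / add-with-carry / recombine steps can be sketched in a few lines of Python. This is only an illustration of the idea, not what any particular compiler emits; the names `LIMB_BASE`, `split_limbs`, `add_limbs` and `join_limbs` are made up for this example.

```python
LIMB_BASE = 1000  # each "limb" holds three decimal digits, like the example ALU


def split_limbs(n):
    """Split a non-negative integer into 3-digit limbs, least significant first."""
    limbs = []
    while True:
        n, limb = divmod(n, LIMB_BASE)
        limbs.append(limb)
        if n == 0:
            return limbs


def add_limbs(a, b):
    """Add two limb lists using only 3-digit additions plus a carry."""
    result = []
    carry = 0
    for i in range(max(len(a), len(b))):
        x = a[i] if i < len(a) else 0
        y = b[i] if i < len(b) else 0
        # The ALU yields the low three digits; anything above is the carry.
        carry, digits = divmod(x + y + carry, LIMB_BASE)
        result.append(digits)
    if carry:
        result.append(carry)
    return result


def join_limbs(limbs):
    """Recombine limbs (least significant first) into one integer."""
    value = 0
    for limb in reversed(limbs):
        value = value * LIMB_BASE + limb
    return value


parts = add_limbs(split_limbs(100500), split_limbs(120760))
total = join_limbs(parts)
```

Here `parts` is `[260, 221]` and `total` is 221260, matching the hand calculation. Longer numbers just mean more limbs and more passes through the loop.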
Any computer can simulate any other computer.
The humble 8-bit processor can do exactly what a supercomputer can, given the necessary resources, and the time. Lots of time :-).
A concrete example is the arbitrary-precision calculator. Most (software) calculators have something like 15 decimal digits of precision; if numbers have more significant digits they round them and possibly switch to mantissa + exponent form to store and process them. Arbitrary-precision calculators, however, expand on the example calculation I gave earlier, and they allow you to multiply
\$ 44402958666307977706468954613 \times 595247981199845571008922762709 \$
for example, two numbers (they're both prime) which would need a wider data bus than my PC's 64 bits. Extreme example: Mathematica gives you \$\pi\$ to 100000 digits in 1/10th of a second. Calculating \$e^{\pi \sqrt{163}}\$ \$^{(1)}\$ to 100000 digits takes about half a second. So, while you would expect working with data wider than the data bus to be taxing, it's often not really a problem. For a PC running at 3 GHz this may not be surprising, but microcontrollers are getting faster as well: an ARM Cortex-M3 may run at speeds greater than 100 MHz, and for the same money you get a 32-bit bus too.
\$^{(1)}\$ About 262537412640768743.99999999999925007259, and it's not a coincidence that it's nearly an integer!
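As a rough illustration of how an arbitrary-precision multiply can be built from small operations, here is a schoolbook multiplication in Python on 3-digit limbs. It's a sketch only: real libraries use much larger limbs and faster algorithms, and the names (`to_limbs`, `multiply_limbs`, `from_limbs`) are invented for this example. Note the assumption that the ALU can deliver the full 6-digit product of two 3-digit limbs.

```python
LIMB_BASE = 1000  # three decimal digits per limb


def to_limbs(n):
    """Split a non-negative integer into limbs, least significant first."""
    limbs = []
    while True:
        n, limb = divmod(n, LIMB_BASE)
        limbs.append(limb)
        if n == 0:
            return limbs


def from_limbs(limbs):
    """Recombine limbs (least significant first) into one integer."""
    value = 0
    for limb in reversed(limbs):
        value = value * LIMB_BASE + limb
    return value


def multiply_limbs(a, b):
    """Schoolbook multiplication: every limb of a times every limb of b."""
    product = [0] * (len(a) + len(b))
    for i, x in enumerate(a):
        carry = 0
        for j, y in enumerate(b):
            # Assumes a widening multiply: x * y can have up to six digits.
            carry, product[i + j] = divmod(product[i + j] + x * y + carry, LIMB_BASE)
        product[i + len(b)] += carry
    return product


p = 44402958666307977706468954613
q = 595247981199845571008922762709
result = from_limbs(multiply_limbs(to_limbs(p), to_limbs(q)))
```

Python's own integers are already arbitrary precision, so `result == p * q` serves as a cross-check of the limb arithmetic.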
Best Answer
It looks as though the design is supposed to perform a 48x51-bit multiply in 48 steps, with each step either adding the "A" register to the product register or not. It also appears to shift the "A" register, which isn't necessary. If you want to load the B register, start the machine, and then have a result ready to be read, your product register needs to be large enough to hold the entire product (its width is the sum of the two multiplicands' widths); the adder will have to be that same width if you shift "A" as you go along. If, instead of shifting the "A" register, you have the product register take either (Product >> 1) or (Product >> 1)+(A << 47) on each step as bits shift out of the "B" register, then the adder only needs to add two 51-bit numbers for a 52-bit result.
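A quick way to convince yourself that the shift-the-product scheme works is to model it in software. The Python sketch below is a behavioural model only, not HDL; it assumes unsigned operands and the register widths mentioned above, and the names are my own.

```python
A_BITS = 51  # width of the "A" register (multiplicand)
B_BITS = 48  # width of the "B" register (multiplier); one step per bit


def shift_add_multiply(a, b):
    """Multiply by shifting the product right instead of shifting A left."""
    assert 0 <= a < (1 << A_BITS) and 0 <= b < (1 << B_BITS)
    product = 0
    for _ in range(B_BITS):
        product >>= 1  # the bit dropped here is always 0, so nothing is lost
        if b & 1:
            # A is always added at the same fixed position, so the adder
            # only ever touches the upper bits of the product register.
            product += a << (B_BITS - 1)
        b >>= 1  # next bit of B shifts out
    return product


a_max = (1 << A_BITS) - 1
b_max = (1 << B_BITS) - 1
small = shift_add_multiply(3, 5)
largest = shift_add_multiply(a_max, b_max)
```

A bit of B examined at step k contributes A << 47, which is then shifted right 47 - k more times and so ends up as A << k, exactly as in the ordinary shift-A-left version.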
Note also that for a small increase in complexity, you can double the speed of your multiplier by having the ALU choose among five operations on each step: (Product >> 2), (Product >> 2)+(A << 46), (Product >> 2)+(A << 47), (Product >> 2)-(A << 46), or (Product >> 2)-(A << 47). Look up "Booth's Algorithm" for more information.
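The two-bits-per-step idea can be modelled the same way. The Python sketch below uses the usual radix-4 Booth recoding, where each step looks at two bits of B plus the top bit of the previous pair. Treat it as an assumption-laden illustration: it assumes unsigned operands, and because plain Booth recoding interprets B as a signed number, I've added a final correction step for when B's top bit is set. That correction is my own addition and not part of the five operations listed above.

```python
A_BITS = 51  # width of the "A" register
B_BITS = 48  # width of the "B" register; two bits are consumed per step


def booth_radix4_multiply(a, b):
    """Unsigned multiply in B_BITS // 2 steps using radix-4 Booth digits."""
    assert 0 <= a < (1 << A_BITS) and 0 <= b < (1 << B_BITS)
    top_bit = b >> (B_BITS - 1)
    product = 0
    prev_bit = 0  # implicit bit to the right of B's least significant bit
    for _ in range(B_BITS // 2):
        low, high = b & 1, (b >> 1) & 1
        digit = prev_bit + low - 2 * high  # one of -2, -1, 0, +1, +2
        product >>= 2  # arithmetic shift; the two dropped bits are always 0
        # digit = +/-1 adds or subtracts (A << 46); +/-2 gives (A << 47).
        product += digit * (a << (B_BITS - 2))
        prev_bit = high
        b >>= 2
    # Assumption: B is unsigned, so undo the signed interpretation of its top bit.
    if top_bit:
        product += a << B_BITS
    return product


a_max = (1 << A_BITS) - 1
b_max = (1 << B_BITS) - 1
small = booth_radix4_multiply(3, 5)
largest = booth_radix4_multiply(a_max, b_max)
```

Python's `>>` on a negative integer is an arithmetic shift, which matches what a sign-extending product register would do when an intermediate result goes negative.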