Jan's Razor: In a chip multiprocessor design, strive to leave out all but the minimal kernel set of features from each processing element, so as to maximize processing elements per die.
-- Jan Gray
If your application really does need to do lots of work with 32-bit numbers, then the "minimal kernel set of features" for that application might need to include 32-bit operations.
On the other hand, as Chris Stratton pointed out, if you need to do lots of work where 8 bits are adequate and only rarely need 32-bit numbers, then lots of 8-bit cores will likely give you higher net performance than a few 32-bit cores.
I see you are currently considering:
- one or a few 32-bit cores
- several 16-bit cores
- many 8-bit cores.
There are several other possibilities that in some situations give better performance than any of the above:
- One 32-bit core, and many smaller cores
- Dynamic reconfiguration: reconfigure the soft microprocessors in an FPGA to get one or a few 32-bit processors at times when lots of 32-bit calculations are necessary, and reconfigure the FPGA to get lots of 8-bit processors when that gives adequate precision and better performance.
- Processing elements with lanes narrower than 8 bits.
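Lanes narrower than 8 bits work because arithmetic can be done bit-serially: a 1-bit processing element handles one bit per cycle, trading time for area. A minimal sketch of the idea (plain Python, not modelling any particular machine):

```python
def bit_serial_add(a, b, width=32):
    """Add two unsigned integers one bit per 'cycle',
    the way a 1-bit processing element would."""
    carry = 0
    result = 0
    for i in range(width):          # one loop iteration = one clock cycle
        abit = (a >> i) & 1
        bbit = (b >> i) & 1
        s = abit ^ bbit ^ carry     # full-adder sum bit
        carry = (abit & bbit) | (carry & (abit ^ bbit))  # full-adder carry
        result |= s << i
    return result & ((1 << width) - 1)
```

A 32-bit add costs 32 cycles on a 1-bit lane, but the lane itself is tiny, so you can pack a huge number of them on one die.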
While there are many multicore systems that include only 32-bit cores, and many that include only 8-bit cores, I see that the Wikipedia "multi-core processor" article mentions many chips that include both a 32-bit processor and a bunch of 16-bit or 8-bit processors.
As I mentioned earlier (Cheapest FPGAs?):
Simple (i.e., without an MMU) 32-bit CPUs require about 4 times the FPGA resources of an 8-bit CPU.
Full-fledged Linux requires a CPU with an MMU (such as the NIOS II/f). A 32-bit CPU with an MMU requires about 4 times the FPGA resources of a 32-bit CPU without one.
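Taking those two "about 4x" ratios at face value (rough rules of thumb, not vendor figures), the area trade works out like this:

```python
# Relative FPGA resource cost, normalized to one 8-bit soft CPU.
CPU_8BIT      = 1
CPU_32BIT     = 4 * CPU_8BIT    # simple 32-bit core, no MMU: ~4x an 8-bit core
CPU_32BIT_MMU = 4 * CPU_32BIT   # Linux-capable core (NIOS II/f class): ~4x again

# In the area of one Linux-capable core you could instead fit roughly:
cores_8bit_per_mmu_core = CPU_32BIT_MMU // CPU_8BIT
```

So, very roughly, one Linux-capable core occupies the area of about sixteen 8-bit cores, which is why the "one big core plus many small cores" layouts above are attractive.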
By the way, 8 bits is not the "minimum".
You may be surprised to learn that all computers built before the 1951 Whirlwind operated on less than 8 bits at once.
Most of the early massively parallel processors operated on less than 8 bits at a time -- the Goodyear Massively Parallel Processor, the Connection Machine CM-2, the 2003 VIRAM1 chip, etc.
The most recent report I've seen shows that 4-bit CPUs still outsell (by volume) 32-bit CPUs. Have you seen a more recent report?
( Do 4-bit CPUs still outsell 32-bit CPUs in unit volume? )
The 8085 has different instructions for accessing main memory and I/O 'memory'. In addition to the standard memory-interface pins, the 8085 also provides a pin (IO/M) that identifies whether a given bus cycle is accessing main memory or I/O. This extra line is used in the select logic of both main and I/O 'memory'.
But there is no law that I/O can only be accessed by I/O instructions: in a small system the highest address line can be used to distinguish between memory (a15=0) and I/O (a15=1), giving 32 KB for real memory (ROM and RAM) and 32 KB for I/O.
Note that it is even possible to use the I/O addresses to access RAM, but that is less useful, because there are only 256 I/O addresses and the addressing modes available for them are very limited.
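The select logic described above can be sketched in a few lines (illustrative Python model of the decode, not any real toolchain; the function name is mine):

```python
def decode_8085_access(addr, io_m):
    """Model a small 8085 system's address-select logic.

    io_m -- state of the 8085 IO/M pin: 1 during IN/OUT instructions,
            0 during ordinary memory instructions.
    With the a15 trick, memory instructions can reach I/O too.
    """
    if io_m:                  # IN/OUT: only 256 port addresses exist
        return ("io", addr & 0xFF)
    if addr & 0x8000:         # a15 = 1 -> the 32 KB I/O region
        return ("io", addr)
    return ("memory", addr)   # a15 = 0 -> 32 KB of ROM/RAM
```

Memory-mapped I/O via a15 gets the full set of memory addressing modes, while the IN/OUT path is stuck with 256 fixed port numbers.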
When the 8088/86 came out, 8-bit memories were the norm and affordable. To have a 16-bit bus you needed two parts, so it cost you more either way: half-sized parts but two of them, or two same-sized parts at twice the cost. So the design allows the 8088 to use 8-bit parts and do two bus transfers, while the 8086 is capable of moving the same data in one transfer.
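That difference can be sketched as byte-splitting on the narrower bus (a minimal Python sketch, assuming the usual x86 little-endian byte order; the function name is mine):

```python
def bus_transfers_16bit(value, bus_width_bits):
    """Return the list of bus cycles needed to move one 16-bit value.

    8088: 8-bit external bus -> two cycles, low byte first.
    8086: 16-bit external bus -> one cycle.
    """
    value &= 0xFFFF
    if bus_width_bits == 8:
        return [value & 0xFF, value >> 8]   # little-endian: low byte first
    return [value]
```

Same instruction, same 16-bit result inside the CPU; only the number of external bus cycles differs.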
External and internal bus widths are not related. Many, if not most, of the instructions from the 8086 are present today, where the busses are much wider, and from the edge of the processor core to what is now DRAM (then SRAM) the data goes through a number of busses and transfer-size changes.

The DRAM can be implemented using x8, x16, or x32 parts. A full-sized motherboard bus is typically 72 bits wide, sometimes 64. So if you have, say, 8 or 9 parts on one side of the DIMM, they are 8-bit-wide parts; if both sides are populated with 8 or 9, they are still 8-bit-wide parts, just two ranks; 4 or 5 parts on a side means x16, and so on. Because of the number of busses in between, the external memory geometry, then as now, doesn't matter so long as it conforms to the bus interface the memory is connected to. It doesn't define what the processor's internals are in any way, shape, or form.
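The DIMM arithmetic there is just division (a quick sketch; 72 bits means 64 data + 8 ECC, and the helper name is mine):

```python
def part_width(bus_width_bits, parts_per_rank):
    """Width of each DRAM chip, given the bus width and chips per rank."""
    assert bus_width_bits % parts_per_rank == 0, "chips must tile the bus"
    return bus_width_bits // parts_per_rank
```

So a 72-bit ECC bus with 9 chips per side is built from x8 parts, and a 64-bit bus with 4 chips per side from x16 parts; populating both sides of the DIMM adds a second rank without changing the part width.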
As each new DDR generation comes out (DDR2, DDR3, DDR4, etc.), you often find it starting with x8 parts being the most affordable, then x16 not long after, and so on: density vs. yield.
Now, phone memory (LPDDR4, etc.): those busses are more like 16 bits wide, and you just do more transfers per cache line than you would with a 64/72-bit-wide bus. It is the same instruction set inside, still "byte addressable".
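The transfer count scales inversely with bus width; a quick sketch (the 64-byte cache line is an illustrative common size, not a claim about any specific chip):

```python
def beats_per_cache_line(line_bytes, bus_width_bits):
    """Number of bus transfers ('beats') needed to move one cache line."""
    return (line_bytes * 8) // bus_width_bits
```

Moving a 64-byte line takes 8 beats on a 64-bit bus but 32 beats on a 16-bit LPDDR channel; the software on top never sees the difference.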
It's kind of like how the same house you live in could be on a one-way single-lane road, a two-way road, a two-way road with a median, four lanes, four lanes with a median, or sitting right next to a big parking lot and not on a road at all. None of that in any way affects the sheets on your bed or the towels in your bathroom. The two things are not related; the overall system design just tries to get affordable memory, in quantity, that is not horribly slow for the processor in question.