Building and using large content-addressable memories is expensive and difficult. The most common design for a CAM is to have a number of addressable memory cells that compare their contents to a search value that is passed in; if they match, they trigger a set of associated cells to produce output on a shared bus. This means two things:
As well as the value to output, we need memory cells for the addressable content, plus additional logic to compare that content with the value being searched for. Also, because there is no single predetermined location where each value can be stored, some mechanism for allocating space for new entries is required[1]. Taken together, these make a CAM much larger than a traditional memory that stores the same amount of data.
Because the units output to a shared bus, they must be arranged so that only one at a time can match. This means that before an item can be added to a CAM the CAM must first be searched to ensure that it won't conflict with existing entries. If there is a conflict, the existing entry must be deleted before the new entry can be added. If the data is required to persist indefinitely (i.e. the CAM isn't being used as a cache, as most current applications of CAMs in processors are) then this means the addressable content must always be unique -- this requirement usually means that the size of addressable content needs to be quite large, further adding to the cost of implementation.
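The two constraints above can be sketched in software. This toy C model is only an illustration of CAM semantics, not how the hardware works: a real CAM performs all the tag comparisons in parallel, while this loop does them one at a time. All names and sizes here are illustrative.

```c
/* Toy model of CAM semantics. A real CAM compares every valid entry's
   tag against the search key simultaneously in hardware; the loops
   below stand in for that parallel comparison. */
#include <stdbool.h>

#define CAM_SIZE 8

struct cam_entry { bool valid; unsigned tag; unsigned value; };
struct cam_entry cam[CAM_SIZE];   /* zero-initialized: all entries invalid */

/* returns true and sets *out if some entry's tag matches key */
bool cam_search(unsigned key, unsigned *out) {
    for (int i = 0; i < CAM_SIZE; i++)      /* hardware: all cells at once */
        if (cam[i].valid && cam[i].tag == key) {
            *out = cam[i].value;
            return true;
        }
    return false;
}

/* insertion must search first, so that two entries never match the same
   key and fight over the shared output bus */
bool cam_insert(unsigned key, unsigned value) {
    unsigned dummy;
    if (cam_search(key, &dummy))
        return false;             /* conflict: existing entry must be deleted first */
    for (int i = 0; i < CAM_SIZE; i++)
        if (!cam[i].valid) {      /* the allocation mechanism from point [1] */
            cam[i] = (struct cam_entry){ true, key, value };
            return true;
        }
    return false;                 /* no free slot */
}
```

Note how much state and logic each entry carries beyond its payload (`tag`, `valid`, a comparator per cell): that overhead is exactly why a CAM is so much larger than a plain RAM of the same capacity.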
There is also a development-process reason why architectures that work like this are likely to remain rare for the foreseeable future: a large proportion of CPU development work these days is performed using FPGAs to run the processor being designed. FPGAs do not usually provide the kind of shared bus arrangement that CAMs are built on. To select among multiple possible source values, an FPGA uses multiplexers instead, and large trees of multiplexers are slower than such a shared bus. Large CAMs in FPGAs are therefore not usually very fast, which makes it particularly difficult to develop architectures that rely on them.
For these reasons, while it is possible to use CAMs to accelerate applications that use small amounts of memory, building an entire modern general purpose computer using them for all storage seems to be unjustifiably tricky.
[1] - this can be mitigated a little by the use of set-associative CAMs, but that's only really useful for caches as it makes it more likely for conflicts to arise that require an element to be deleted from the memory when it is still needed, which is an acceptable tradeoff for a cache but not for the kind of use proposed here.
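To make the tradeoff in [1] concrete, here is a minimal sketch of set-associative placement; the set and way counts are illustrative, not taken from any real part:

```c
/* Sketch of set-associative placement: the key selects a set, so a new
   entry has only WAYS candidate slots to search and allocate among.
   The catch: two live keys landing in the same full set force an
   eviction. Sizes are illustrative. */
#include <stdbool.h>

#define SETS 16
#define WAYS 4

struct slot { bool valid; unsigned tag; } cam_sets[SETS][WAYS];

/* try to place key; returns false when its set is already full,
   i.e. something still-needed would have to be evicted */
bool set_assoc_insert(unsigned key) {
    struct slot *set = cam_sets[key % SETS];  /* allocation is now trivial */
    for (int w = 0; w < WAYS; w++)
        if (!set[w].valid) {
            set[w].valid = true;
            set[w].tag = key;
            return true;
        }
    return false;   /* a cache would evict here; persistent storage cannot */
}
```

A cache can shrug off that `false` case by evicting a line and refetching it later; storage that must persist indefinitely cannot, which is why this mitigation only really helps caches.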
Your question is a little confused, but perhaps this will help clear it up.
There are two areas to consider:
- whether the execution core can directly operate on items in memory
- the speed of operations on memory
Small embedded microcontrollers
These small microcontrollers have no external RAM. All RAM is internal, but some of it is used for specific things like registers.
For example, the Microchip PICs you mention have a "W" register. This is just in normal RAM like everything else, but instructions with two operands usually require one of them to be in the W register.
This greatly simplifies the design of the microcontroller at the electronics level and keeps costs/power low. It also has other benefits like predictable timing (in cycles) for instructions.
This is why you will see instructions that load W with a value, operate on it and then copy it back from W to elsewhere in memory. The compiler uses the register because it has to.
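That load/operate/store pattern can be modeled in a few lines of C. This is an illustrative sketch of a single-accumulator machine, not real PIC code; the helper names merely echo the PIC's MOVF/ADDWF/MOVWF mnemonics.

```c
/* Toy model of a single-accumulator (W-register) machine: two-operand
   arithmetic must route through W, so adding two RAM cells takes a
   load, an add, and a store. Names echo PIC mnemonics but are
   illustrative only. */
#include <stdint.h>

uint8_t ram[256];   /* the microcontroller's internal RAM */
uint8_t W;          /* the single working register */

void movf_w(uint8_t addr)  { W = ram[addr]; }                  /* load W from RAM  */
void addwf_w(uint8_t addr) { W = (uint8_t)(W + ram[addr]); }   /* W += RAM cell    */
void movwf(uint8_t addr)   { ram[addr] = W; }                  /* store W to RAM   */

/* ram[2] = ram[0] + ram[1] has to be expressed as three steps: */
void add_two_cells(void) {
    movf_w(0);     /* load first operand into W        */
    addwf_w(1);    /* add second operand; result in W  */
    movwf(2);      /* copy the result back out of W    */
}
```

There is no instruction that adds one RAM cell directly to another, so the compiler has no choice but to bounce everything through W, just as described above.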
Larger processors
Other processors (CPUs) such as x86/x64 have external RAM, which is a big difference. Notice that "register" now means something very different, because we have several distinct types of memory.
External to the CPU are large quantities of RAM; internal to the CPU are a number of smaller blocks of memory. Some of these are storage registers that each hold an amount of data, usually the same as the data width of the architecture. So for a 32-bit Intel processor the registers (such as EAX, EBX etc.) are 32 bits wide.
These processors have more complicated instructions that can often operate on either registers or external RAM, so data for an instruction does not always need to be in a register. Why, then, would we bother with registers at all? The answer is speed. Where there is a choice, the compiler will use registers to reduce execution time.
These complicated processors have different access times for different types of memory. Registers that are on the CPU die are very quick to access. So if you have a variable which is in constant use throughout some code it makes sense to load it into a register, operate on it repeatedly and then copy it back to external RAM when finished.
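As a sketch of this pattern, consider summing an array. The comments describe what an optimizing compiler for a large CPU will typically do; exact register allocation is up to the compiler, not something this C code dictates.

```c
#include <stddef.h>

/* `total` is touched on every iteration, so a typical optimizing
   compiler keeps it in a CPU register for the whole loop and only
   materializes the final value at the end, rather than reading and
   writing a RAM location on every pass. */
long sum_array(const long *a, size_t n) {
    long total = 0;              /* held in a register throughout */
    for (size_t i = 0; i < n; i++)
        total += a[i];           /* register add each pass; only the
                                    array element comes from RAM */
    return total;                /* single result written out once */
}
```

The array elements still have to come from external RAM (or cache), but the running total never does, which is exactly the "load once, operate repeatedly, copy back when finished" strategy described above.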
Best Answer
At minimum, you need a branching instruction which can determine whether the previous addition generated a carry. Also, performance will be improved enormously if you expand your register set to three or four registers.
Start with one operand in R0 and the other in R1. Start with R2 zero.
Repeat the following sequence eight times (first ADD may be skipped on first pass)
Note that if the instruction set includes an add-with-carry instruction, using it for the second instruction would cause this instruction sequence to yield a 16-bit result in R0:R2. Further, if R2 starts non-zero, its value times 256 will be added to the outgoing result.
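To show what those eight passes compute, here is a hedged C model of the shift-and-add technique the answer describes. The mapping of C variables to R0/R1/R2 and the exact per-pass instruction ordering are assumptions; this models the arithmetic, not the precise listing.

```c
#include <stdint.h>

/* Shift-and-add 8x8 -> 16 bit multiply, eight passes: each pass shifts
   the partial result left (an ADD of the accumulator to itself, which
   is a no-op on the first pass while it is still zero) and, when the
   carry out of shifting the multiplier is set, adds the multiplicand. */
uint16_t mul8x8(uint8_t multiplicand, uint8_t multiplier) {
    uint16_t acc = 0;                /* if this started non-zero, its
                                        initial value times 256 would be
                                        added to the result, as noted */
    for (int i = 0; i < 8; i++) {
        acc <<= 1;                   /* first ADD: shift partial result */
        if (multiplier & 0x80)       /* branch on carry out of the shift */
            acc += multiplicand;     /* conditional add of multiplicand */
        multiplier <<= 1;            /* consume the next multiplier bit */
    }
    return acc;                      /* 16-bit product (e.g. R0:R2 pair) */
}
```

The carry-testing branch is doing the work of the "branch on carry" instruction called for at the start of the answer; with a real add-with-carry instruction the two halves of the 16-bit accumulator can be shifted as a pair in two instructions.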