You are not going to be able to come up with a conclusive result; it is not possible. Benchmarking is always subjective and is often used to produce whatever result is desired (A is better, B is better, C is better, etc.).
The number of instructions is not relevant, any more than the number of registers would be. The number of transistors is interesting, but then you may be comparing a single-chip SoC against a processor that requires external chips to provide the same functionality. One chip may have large blocks powered down at any given time relative to the other, or may power down a large block to complete the benchmark; or one may have more transistors but switch them less frequently than the other, which has fewer, possibly leading to different power consumption.
Intel makes and sells chips which happen to have (much of) their own stuff inside. ARM does not make chips; it sells IP. Just as how fast a program in source form runs varies widely with compiler options, processor, etc., the same IP can consume widely different amounts of power depending on the foundry, cell library, and process used to implement it. Same architecture, same clock rate, different power consumption. So right off you are comparing apples to oranges in yet another way. I can't think of a real case where an ARM core is all that is in the chip; generally you wrap the ARM core with a lot of stuff, stuff that with the other processor would be off chip. The proper comparison would be the whole system, not just the power of the processor.
This takes you into clocking differences: one processor may be far more efficient than another and can perform the same benchmark at a different clock rate, or otherwise using less hardware or power. It is very easy to write a benchmark that runs on a small battery-powered microcontroller board using less power than an x86 computer, even if the x86 is (or could be) grossly underclocked. It is just as easy to write a benchmark that runs lightning fast on an x86 but takes that microcontroller an eternity to finish, even if you could clock them the same.
Compilers alone make the same computer run the same source code at vastly different speeds. It is simply not possible to compare two systems in this way except by stating the benchmark exactly: this specific code, compiled for speed using this compiler, hand checked to produce this quality of optimized code, ran on this specific system with the system consuming this much power; this other system, using this compiler hand checked to produce similar optimization, required this clock rate and this much power to execute in about the same time. Repeat for each of the infinite number of possible benchmark applications users might be interested in.
The MIPS/MHz comparison relies heavily on the compiler and the application; there are big variations in MIPS on the same system with no hardware changes. You cannot really compare two processors with this method. Published MIPS-to-MHz figures are just marketing fluff; ignore them. Likewise, you can put as much faith in published power consumption as in the MIPS/MHz numbers: it was based on some benchmark, and if your application is not that benchmark, what good is it?
You will need to build a number of systems (laying out each board design specific to the benchmark) and attempt to reduce the number of variables, or ideally take the approach of making the bare-minimum system, with maximum optimization, capable of running the benchmark in exactly X amount of time. Repeat for the other system, then compare the power consumption of the whole system for the duration of that benchmark's execution. Repeat for the millions of different benchmarks; even then, to get a fair and general comparison, it may not be possible to reduce the results in any conclusive way.
For an architecture comparison you ideally want the processors built at the same foundry using the same cell library and the same process. If you are willing and able to license competing cores, populate the chips to the same level, use similar bus rates and as much similar external hardware as possible (the system buses are no doubt different; making a common bus from them might give one an unfair advantage), and use the same amount of cache with the same advantages, you might have a better chance at a comparison that actually looks real. This would be the only way to come up with something plausible: the same benchmark run on different architectures made at the same foundry with the same cell library, same process, same cache size, same DRAM, etc. Even then, the benchmarks can still be manipulated to make either one the fastest or the lowest power consumer.
What would be more interesting is an empirical comparison. Take or create benchmarks one at a time and look at the various ways to generate code from the compiler. Examine the buses that you can examine and get a feel for fetch sizes. With fixed versus variable word-length instructions, can you tell from the buses where the variable-length decisions are made? The first byte tells you that you might need to examine the second byte; the second byte may make you realize you need four more bytes for the immediate; only then can you execute. How much has to be jammed in near the decoder to make this efficient? How much do you have to discard and re-fetch on a branch, and how fast does this happen?

You have to look at the quantity of code needed to perform similar tasks. Due to the different number of registers (real or virtual), how often does the code have to spill registers to the stack? (It is very easy to write benchmarks that punish one architecture relative to another for this.) x86, or at least many x86 implementations, is microcoded; ARM is not. x86 can store more program in the same size cache than ARM, but ARM is more deterministic in decoding that code. x86 incurs more alignment penalties than ARM, as it lends itself to unaligned code where ARM is either encouraged or forced to align.

Can you construct benchmarks that show an advantage for each instruction set? It should be very easy to make a loop whose x86 instructions fit within a cache of some size but whose ARM instructions do not fit in a cache of the same size. It might be just as easy to write a benchmark that branches heavily and shows ARM's advantages, or at least pits one branch predictor against another. Clocks and power are still out of the picture, but performance you can see at some layer at least, and you can then project that into the cache and DRAM responses to finish the understanding.
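The sequential dependence in variable-length decoding described above can be sketched in code. The encoding below is a made-up toy for illustration, not real x86 or ARM: the point is only that each byte examined can change how many more bytes must be fetched before the instruction can execute, whereas a fixed-length fetch knows the instruction boundary up front.

```python
def decode_variable(stream, pc):
    """Toy variable-length ISA: length is discovered byte by byte."""
    opcode = stream[pc]                        # first byte: opcode class
    if opcode < 0x80:
        return ("short_op", pc + 1)            # 1-byte instruction
    modifier = stream[pc + 1]                  # second byte decides more
    if modifier & 0x01:
        imm = int.from_bytes(stream[pc + 2:pc + 6], "little")
        return (f"long_op imm={imm}", pc + 6)  # 4 more immediate bytes
    return ("mid_op", pc + 2)

def decode_fixed(stream, pc):
    """Fixed 4-byte instructions: length known before any byte is read."""
    word = int.from_bytes(stream[pc:pc + 4], "little")
    return (f"insn {word:#010x}", pc + 4)
```

The hardware consequence is the point made above: a variable-length decoder needs enough bytes queued near it to resolve lengths quickly, and a mispredicted branch discards partially decoded work.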
Anyway, that was a tangent. You cannot compare two processors in this way and have those in the know accept the results as anything meaningful. The masses may be fooled, but not those who know what is going on. Empirically demonstrating advantages and disadvantages might be more doable and interesting to all. Comparing OpenCores designs in the same FPGA might be interesting as well, but one commercial processor chip on a commercial board, compared against IP that can be implemented many different ways on many different boards, just won't be plausible.
Best Answer
Generally speaking, a cache is a layer which abstracts the access to memory. When a piece of information is needed, it is specified by its address. All entries in the cache are tagged with the memory address of the datum that they hold. When the processor requests a datum, the cache control circuitry searches the cache for a matching address.
If the cache is fully associative, then the entire address (except for the least significant bits) is matched against the entire cache. This matching is not a linear search, but an associative lookup: the cache entries compare themselves to the address in parallel, and one of them announces itself as a match.
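A fully associative lookup can be modeled in a few lines of Python; the dict lookup stands in for the hardware's parallel tag comparison, and the line size here is an arbitrary illustrative choice:

```python
LINE_SIZE = 16  # bytes per cache line (illustrative)

class FullyAssociativeCache:
    def __init__(self):
        self.lines = {}  # tag -> cached data for that line

    def lookup(self, address):
        tag = address // LINE_SIZE  # drop the least significant bits
        return self.lines.get(tag)  # hit returns data, miss returns None

    def fill(self, address, data):
        self.lines[address // LINE_SIZE] = data
```

Any address can live in any entry; the cost is that every entry needs its own comparator.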
If the cache is set associative, then some of the address bits are used to directly select a bucket. For instance, if there are 16 buckets, four bits from the address can be taken as a bucket address from 0 to 15. An associative lookup for the address then takes place within just that bucket. This means that for any given memory address, we know which cache bucket it maps to, but not which specific cache line within that bucket.
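The bit-slicing that picks the bucket looks like this; the sizes are illustrative (16-byte lines, 16 sets), not tied to any particular processor:

```python
LINE_SIZE = 16   # bytes per line -> low 4 bits are the byte offset
NUM_SETS  = 16   # 16 buckets    -> next 4 bits are the set index

def split_address(address):
    """Decompose an address into (tag, set_index, offset)."""
    offset = address % LINE_SIZE                     # byte within the line
    set_index = (address // LINE_SIZE) % NUM_SETS    # picks the bucket
    tag = address // (LINE_SIZE * NUM_SETS)          # matched within it
    return tag, set_index, offset
```

Only the tag is stored in and compared against the bucket's entries; the set index never needs comparators because it selected the bucket directly.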
If a cache is direct mapped, then some of the address bits are used to select a single cache line, which either holds data for that address or not, so there is no associative lookup. Each address is mapped to just a single cache line. (If a program alternately accesses two items at different addresses that map to the same cache line, performance is bad. This is the worst/cheapest kind of cache.)
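A direct-mapped sketch, again with illustrative sizes, shows both the single tag comparison and the pathological conflict described above: two addresses that map to the same line keep evicting each other.

```python
LINE_SIZE = 16
NUM_LINES = 64

class DirectMappedCache:
    def __init__(self):
        self.lines = [None] * NUM_LINES  # one (tag, data) slot per line

    def lookup(self, address):
        index = (address // LINE_SIZE) % NUM_LINES  # the one possible line
        entry = self.lines[index]
        if entry is not None and entry[0] == address // (LINE_SIZE * NUM_LINES):
            return entry[1]  # hit: single tag comparison
        return None          # miss, or a conflicting address occupies it

    def fill(self, address, data):
        index = (address // LINE_SIZE) % NUM_LINES
        self.lines[index] = (address // (LINE_SIZE * NUM_LINES), data)
```

Here addresses 0x000 and 0x400 are 1024 bytes apart (64 lines of 16 bytes), so they collide on line 0.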
When there is a cache hit, then the item can be quickly supplied to the requesting circuit out of the cache. If there is a miss, then a memory access cycle has to be executed. The data is not only given to the requesting circuit, but also installed into the cache (replacing something else that has not recently been accessed).
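The hit/miss flow above, including "replacing something else that has not recently been accessed", can be sketched as a least-recently-used set; the capacity and the dict standing in for DRAM are illustrative assumptions:

```python
from collections import OrderedDict

CAPACITY = 4  # lines per set (illustrative)

class Set:
    def __init__(self, memory):
        self.memory = memory        # stand-in for DRAM
        self.lines = OrderedDict()  # tag -> data, least recent first

    def read(self, tag):
        if tag in self.lines:               # hit: supply quickly
            self.lines.move_to_end(tag)     # mark as recently used
            return self.lines[tag]
        data = self.memory[tag]             # miss: run a memory cycle
        if len(self.lines) == CAPACITY:
            self.lines.popitem(last=False)  # evict least recently used
        self.lines[tag] = data              # install into the cache
        return data
```

Real replacement policies vary (pseudo-LRU, random, etc.); this is just the shape of the mechanism.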
Instruction caches tend to be specialized, to take advantage of the access patterns and the structure of the data. The cache may work at a higher level, combined with the instruction decoding. The requesting circuit asks not simply for an instruction opcode, but it demands a decoded instruction. The combined caching and decoding circuitry provides it. The idea is the same. Take the address and find a decoded instruction for that address. If it's not found in the cache, then it must be fetched and decoded.
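The decoded-instruction idea reduces to memoizing the decode step: index by address, and a hit skips decoding entirely. The `fetch` and `decode` callables below are hypothetical placeholders, not a real decoder.

```python
class DecodedInstructionCache:
    def __init__(self, fetch, decode):
        self.fetch = fetch    # address -> raw instruction bytes
        self.decode = decode  # raw bytes -> decoded form
        self.entries = {}     # address -> decoded instruction
        self.decodes = 0      # count real decode work, for illustration

    def get(self, address):
        if address not in self.entries:       # miss: fetch and decode
            raw = self.fetch(address)
            self.entries[address] = self.decode(raw)
            self.decodes += 1
        return self.entries[address]          # hit: already decoded
```

This is the same take-an-address, find-the-item contract as a data cache; only what is stored differs.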
So the answer to the question "how does the processor know" is that the processor is divided into logical units, and these units provide services to each other. The units which request data from memory do not have to be aware of the cache; that responsibility is put into the cache control circuitry. I.e., inside the overall processor there is effectively a smaller processor which in fact does know that the data is in a cache.