You are not going to be able to come up with a conclusive result; it is not possible. Benchmarking is always subjective, and it is often used to produce whatever result is desired (A is better, B is better, C is better, and so on).
Number of instructions is no more relevant than number of registers would be. Transistor count is interesting, but you would be comparing a single-chip SoC against a processor that requires external chips to provide the same functionality. One chip may have large blocks powered down at any given time relative to the other, or may power down a large block for the duration of the benchmark; or one may have more transistors but switch them less frequently than the other, leading to different power consumption.
Intel makes and sells chips that happen to contain (much of) its own IP. ARM does not make chips; it sells IP. Just as how fast a program in source form runs varies widely with compiler options, processor, and so on, the same ARM IP can consume widely different amounts of power depending on the foundry, cell library, and process used to implement it. Same architecture, same clock rate, different power consumption. So right away you are comparing apples to oranges in yet another way. I can't think of a real case where an ARM core is all that is in the chip; generally you wrap the ARM core with a lot of other logic, logic that with the other processor would be off-chip. The proper comparison would be the whole system, not just the power of the processor.
That takes you into clocking differences: one processor may be far more efficient than another and may complete the same benchmark at a different clock rate, or otherwise using less hardware or power. It is very easy to write a benchmark that runs on a small battery-powered microcontroller board using less power than an x86 computer, even if the x86 is (or could be) grossly underclocked. It is just as easy to write a benchmark that runs lightning fast on an x86 but takes that microcontroller an eternity to finish, even if you clocked them the same (or could).
Compilers alone make the same computer run the same source code at vastly different speeds. It is simply not possible to compare two systems in this way except by stating the benchmark precisely: this specific code, compiled for speed with this compiler, hand-checked to produce this quality of optimized code, ran on this specific system while the system consumed this much power; this other system, using this compiler, hand-checked for similar optimization, required this clock rate and this much power to execute in about the same time. Repeat for each of the infinite number of possible benchmark applications users might be interested in.
The MIPS/MHz comparison relies heavily on the compiler and the application; you will see big variations in MIPS on the same system with no hardware changes at all. You cannot really compare two processors with this method. Published MIPS-per-MHz figures are just marketing fluff; ignore them. Likewise, you can put as much faith in published power consumption as in the MIPS/MHz numbers: it was based on some benchmark, and if your application is not that benchmark, what good is it?
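To see why MIPS/MHz is a soft number, here is a minimal sketch with entirely made-up figures: the same task, compiled two different ways on the same hypothetical 100 MHz system, executes a different number of instructions and so reports a different MIPS/MHz with no hardware change whatsoever.

```python
# Hypothetical numbers only: same task, same 100 MHz hardware,
# two different compilations. The "MIPS" figure moves anyway.
CLOCK_MHZ = 100.0

def mips(instructions_executed, seconds):
    """Millions of instructions per second for one measured run."""
    return instructions_executed / seconds / 1e6

# Build A: compiler favored compact code. Build B: unrolled/inlined code
# that executes more instructions yet finishes the same task sooner.
build_a = mips(50_000_000, 1.0)   # 50 MIPS
build_b = mips(80_000_000, 0.8)   # 100 MIPS

print(build_a / CLOCK_MHZ)  # 0.5 MIPS/MHz
print(build_b / CLOCK_MHZ)  # 1.0 MIPS/MHz
```

Note that build B both "scores" twice the MIPS/MHz and finishes faster, yet nothing about the processor changed; the metric measured the compiler.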
You would need to build a number of systems (laying out each board design specifically for the benchmark) and try to reduce the number of variables, or ideally take the approach of building the bare-minimum system, with maximum optimization, capable of running the benchmark in exactly X amount of time. Repeat for the other system, then compare the power consumption of each whole system over the duration of that benchmark. Repeat for the millions of different benchmarks; even then, to get a fair and general comparison, it may not be possible to reduce the results in any conclusive way.
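The comparison described above reduces to whole-system energy for the benchmark, i.e. average power times runtime. A tiny worked example with invented measurements shows why "faster" and "more efficient" are different questions:

```python
# Hypothetical measurements: whole-system average power (watts) and
# wall-clock runtime (seconds) for the SAME benchmark on two systems.
# Energy (joules) = power * time.
def energy_joules(avg_power_w, runtime_s):
    return avg_power_w * runtime_s

system_a = energy_joules(avg_power_w=2.5, runtime_s=40.0)   # slow, low power
system_b = energy_joules(avg_power_w=65.0, runtime_s=2.0)   # fast, high power

print(system_a)  # 100.0 J - finishes last but uses less total energy
print(system_b)  # 130.0 J - finishes 20x sooner but costs more energy
```

With these (made-up) numbers the slow system wins on energy and loses on time; a different benchmark, or different idle behavior, could flip either conclusion.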
For an architecture comparison you ideally want the processors built at the same foundry, using the same cell library and the same process. If you are willing and able to license competing cores, populate the chips to the same level, use similar bus rates and as much similar external hardware as possible (the system buses are no doubt different; adapting them to a common bus might give one an unfair advantage), and use the same amount of cache with the same characteristics, you might have a better chance at a comparison that actually looks real. That would be the only way to come up with something plausible: the same benchmark run on different architectures made at the same foundry, with the same cell library, same process, same cache size, same DRAM, and so on. Even then, you can still manipulate the benchmarks to make either one the fastest or the lowest power consumer.
What would be more interesting is an empirical comparison. Take or create benchmarks one at a time, and look at the various ways to generate code from the compiler. Examine the buses you can examine and get a feel for fetch sizes. With fixed versus variable word-length instructions, can you tell from the buses where the variable-length decisions are made? The first byte tells you that you may need to examine the second byte; the second byte may make you realize you need four more bytes for the immediate; only then can you execute. How much logic has to be jammed in near the decoder to make this efficient? How much do you have to discard and re-fetch on a branch, and how fast does that happen?

You also have to look at the quantity of code required to perform similar tasks. Due to the different number of registers (real or virtual) (x86 is microcoded, or many implementations are; ARM is not microcoded), how often does the code have to spill registers to the stack? (It is very easy to write benchmarks that punish one architecture relative to the other on this point.) x86 can store more program in the same size cache than ARM, but ARM is more deterministic in decoding that code. x86 incurs more alignment penalties than ARM, as it lends itself to unaligned code where ARM either encourages or forces alignment.

Can you construct benchmarks that show an advantage for each instruction set? It should be very easy to make a loop whose x86 instructions fit within a cache of some size but whose ARM instructions do not fit in a cache of the same size. It might be just as easy to write a benchmark that branches heavily and shows ARM's advantages, or at least pits one branch predictor against the other. Clocks and power are still out of the picture, but performance at some layer you can at least see, and you can then project that into the cache and DRAM responses to finish the understanding.
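The byte-by-byte decision process described above can be sketched as a toy decoder. This is NOT real x86 encoding; the opcode ranges and lengths are invented purely to show how the first byte must be examined before the decoder knows whether a second opcode byte or a four-byte immediate has to be fetched:

```python
# Toy variable-length instruction decoder (invented encoding, not x86).
# The first byte selects the form; only after reading it do we know how
# many more bytes must arrive before the instruction can execute.
def decode(stream, pos=0):
    """Return (form, total_length_in_bytes) for the instruction at pos."""
    op = stream[pos]
    if op < 0x40:                       # one-byte instruction
        return ("short", 1)
    elif op < 0x80:                     # opcode + 4-byte immediate
        return ("imm32", 1 + 4)
    else:                               # two-byte opcode; second byte may
        sub = stream[pos + 1]           # demand a 4-byte immediate too
        return ("ext", 2 + (4 if sub & 0x01 else 0))

code = bytes([0x10, 0x41, 1, 2, 3, 4, 0x90, 0x01, 9, 9, 9, 9])
pos = 0
while pos < len(code):
    form, length = decode(code, pos)
    print(form, length)                 # short 1 / imm32 5 / ext 6
    pos += length
```

A fixed-length ISA skips this serial dependency entirely: every instruction boundary is known before any byte is inspected, which is exactly the decoder-determinism trade-off discussed above.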
Anyway, that was a tangent. You cannot compare two processors in this way and have those in the know accept the results as anything meaningful. The masses may be fooled, but not those who know what is going on. Empirically demonstrating advantages and disadvantages might be more doable, and interesting to everyone. Comparing OpenCores designs in the same FPGA might be interesting as well, but one commercial processor chip on a commercial board, compared against IP that can be implemented many different ways on many different boards, just won't be plausible.
The main reason ARM processors are not clocked at 4 GHz is power consumption. Architecture, fabrication, and so on do play a big role, but the reality is that a tablet or mobile phone needs to last as long as it can off a battery, so all of those factors are designed to minimize power consumption. When going for lower power consumption, you sacrifice performance because of design choices in the node, architecture, etc. Higher frequency is a battery killer because:
P = C·V²·f
where C is the switched capacitance, V is the supply voltage, and f is the frequency. Power varies linearly with frequency, which is why frequency scaling is so prevalent, even in laptops.
All microprocessors, and indeed all synchronous digital circuits, work at what is called the "register transfer level". Basically, all any microprocessor does is load values into registers from different sources. Those sources can be memory, other registers, or the ALU (Arithmetic-Logic Unit, a calculator inside the processor). Some of the registers are simple registers inside the processor; some can be special-function registers located around the CPU, in 'peripherals' such as I/O ports, the memory management unit, the interrupt unit, and so on.
In this model, 'instructions' are basic sequences of register transfers. Normally it doesn't make sense to give the programmer the ability to control each register transfer individually, because not all possible register-transfer combinations are meaningful, so allowing the programmer to express them all would be wasteful in terms of memory consumption. So each processor declares a set of register-transfer sequences that the programmer is allowed to request, and these are called 'instructions'.
For example, ADD A, B, C might be an operation where the sum of registers A and B is placed into register C. Internally, that would be three register transfers: load the adder's left input from A, load the adder's right input from B, then load C from the adder's output. Additionally, the processor makes the transfers needed to fetch the instruction itself: load the memory address register from the program counter, load the instruction register from the memory data bus, and finally load the program counter from the program-counter incrementer.
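The three transfers described above can be sketched as a minimal register-transfer model. The register names and the machine itself are hypothetical, purely to show that an "instruction" is just a canned sequence of moves between registers:

```python
# Minimal register-transfer sketch (hypothetical machine, not the 8086).
# One "instruction" is a named sequence of register-to-register moves.
regs = {"A": 2, "B": 3, "C": 0,
        "ALU_L": 0, "ALU_R": 0, "ALU_OUT": 0}

def transfer(dst, src):
    """One register transfer: copy the value in src into dst."""
    regs[dst] = regs[src]

def add_a_b_c():
    """ADD A, B, C as register transfers around the adder."""
    transfer("ALU_L", "A")        # load adder left input from A
    transfer("ALU_R", "B")        # load adder right input from B
    regs["ALU_OUT"] = regs["ALU_L"] + regs["ALU_R"]   # combinational add
    transfer("C", "ALU_OUT")      # load C from adder output

add_a_b_c()
print(regs["C"])  # 5
```

Whether such a sequence comes out of a microcode ROM (as on the 8086) or hardwired control logic, the CPU's job is the same: turn the instruction into this fixed pattern of transfers.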
The 8086 used an internal ROM look-up table to determine which register transfers make up each instruction. The contents of that ROM were quite freely programmable by the designers of the 8086, so they chose instruction sequences that seemed useful to the programmer, rather than sequences that would be simple and fast for the machine to execute. Remember that in those days most software was written in assembly language, so it made sense to make that as easy as possible for the programmer. Later, Intel designed the 80286, in which they made what now seems a critical error: they had some unused microcode memory left, figured they might as well fill it with something, and came up with a bunch of instructions just to fill the microcode. This bit them in the end, as all those extra instructions had to be supported by the 386, 486, Pentium, and later processors, which no longer used microcode.
ARM is a much newer processor design than the 8086, and the ARM people took a different design route. By then, computers were common and plenty of compilers were available. So instead of designing an instruction set that is nice for the programmer, they chose an instruction set that is fast for the machine to execute and easy for a compiler to generate code for. And for a while, x86 and ARM genuinely differed in the way they executed instructions.
Time goes by and CPUs become more and more complex, and microprocessors are now designed using computers rather than pencil and paper. Nobody uses microcode any more; all processors have a hardwired (pure-logic) execution control unit. All have multiple integer units and multiple data buses. All translate their incoming instructions, reschedule them, and distribute them among the processing units. Even old RISC instruction sets are translated into new internal operation sets. So the old question of RISC versus CISC doesn't really exist any more. We are back at the register transfer level: programmers ask CPUs to perform operations, and CPUs translate them into register transfers. And whether that translation is done by a microcode ROM or by hardwired digital logic really isn't that interesting any more.