Every processor I've worked on does comparison by subtracting one of the operands from the other, discarding the result, and setting the processor's flags (zero, negative, etc.) based on that result. Because the subtraction is performed as a single fixed-time operation, the values of the operands don't matter.
The best way to answer the question for sure is to compile your code into assembly and consult the target processor's documentation for the instructions generated. For current Intel CPUs, that would be the Intel 64 and IA-32 Architectures Software Developer’s Manual.
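For instance, you could write a tiny comparison function and ask the compiler for its assembly (the file and function names here are made up; the exact instructions you get depend on compiler, options, and target):

    /* compare.c -- a minimal function whose generated code we want to inspect */
    int is_less(int a, int b)
    {
        return a < b;
    }

Compiling it with gcc -O2 -S compare.c (or clang -O2 -S compare.c) writes the assembly to compare.s; on x86-64 you would typically find a cmp instruction followed by a setl or a conditional jump in that output.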
The description of the CMP ("compare") instruction is in volume 2A, page 3-126 (page 618 of the PDF), and describes its operation as:
temp ← SRC1 − SignExtend(SRC2);
ModifyStatusFlags; (* Modify status flags in the same manner as the SUB instruction *)
This means the second operand is sign-extended if necessary, subtracted from the first operand, and the result placed in a temporary area in the processor. Then the status flags are set the same way as they would be for the SUB ("subtract") instruction (page 1492 of the PDF).
There's no mention in the CMP or SUB documentation that the values of the operands have any bearing on latency, so any value you use is safe.
It probably refers to pipelining, that is, parallel (or semi-parallel) execution of instructions. That's the only scenario I can think of where it does not really matter how long something takes, as long as you can have enough of them running in parallel.
So, the CPU may fetch one instruction (step 1 in the table above), and then, as soon as it proceeds to step 2 for that instruction, it can at the same time (in parallel) start step 1 for the next instruction, and so on.
Let's call our two consecutive instructions A and B. The CPU executes step 1 (fetch) for instruction A. Now, when the CPU proceeds to step 2 for instruction A, it cannot yet start step 1 for instruction B, because the program counter has not advanced yet. So, it has to wait until it has reached step 3 for instruction A before it can get started with step 1 for instruction B. This is the time it takes to start another instruction, and we want to keep it at a minimum (start instructions as quickly as possible) so that we can have as many instructions as possible executing in parallel.
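A back-of-the-envelope model shows why that start-up delay matters (toy numbers invented for illustration, not measurements of any real CPU): if every instruction needs 5 one-cycle steps, the total time for 1000 instructions is dominated by how often a new instruction can be started, not by how long a single instruction takes.

    #include <stdio.h>

    /* Toy model: n instructions, each needing 'stages' one-cycle steps.
       issue_delay = how many cycles pass before the next instruction can
       start (small if the PC can be advanced right after the fetch,
       larger if the fetch has to wait for partial decoding). */
    static long cycles(long n, long stages, long issue_delay)
    {
        return stages + (n - 1) * issue_delay;   /* overlapped execution */
    }

    int main(void)
    {
        long n = 1000, stages = 5;
        printf("no overlap        : %ld cycles\n", n * stages);          /* 5000 */
        printf("start every 3rd   : %ld cycles\n", cycles(n, stages, 3)); /* 3002 */
        printf("start every cycle : %ld cycles\n", cycles(n, stages, 1)); /* 1004 */
        return 0;
    }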
CISC architectures have instructions of varying lengths: some instructions are only one byte long, others are two bytes long, and yet others are several bytes long. This does not make it easy to increment the program counter immediately after fetching one instruction, because the instruction has to be decoded to a certain degree in order to figure out how many bytes long it is. On the other hand, one of the primary characteristics of RISC architectures is that all instructions have the same length, so the program counter can be incremented immediately after fetching instruction A, meaning that the fetching of instruction B can begin immediately afterwards. That's what the author means by starting instructions quickly, and that's what increases the number of instructions that can be executed per second.
In the above table, step 2 says "Change the program counter to point to the following instruction" and step 3 says "Determine the type of instruction just fetched." These two steps can be in that order only on RISC machines. On CISC machines, you have to determine the type of instruction just fetched before you can change the program counter, so step 2 has to wait. This means that on CISC machines the next instruction cannot be started as quickly as it can be started on a RISC machine.
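To make that difference concrete, here is a small sketch of the two fetch styles (the fixed 4-byte length and the length_of rule are invented for the example; real encodings are more involved):

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    /* RISC-style fetch: every instruction is 4 bytes, so the program
       counter can be advanced immediately, before any decoding. */
    static uint32_t fetch_fixed(const uint8_t *mem, size_t *pc)
    {
        uint32_t insn;
        memcpy(&insn, mem + *pc, 4);
        *pc += 4;                        /* next fetch can start right away */
        return insn;
    }

    /* CISC-style fetch: the length depends on the opcode, so part of the
       decode has to happen before the PC can be advanced.
       (This length rule is made up purely for illustration.) */
    static size_t length_of(uint8_t opcode) { return 1 + (opcode & 0x03); }

    static const uint8_t *fetch_variable(const uint8_t *mem, size_t *pc)
    {
        const uint8_t *insn = mem + *pc;
        *pc += length_of(insn[0]);       /* must look at the opcode first */
        return insn;
    }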
The % usage you see, for example, in the Windows task manager, is an average value over a certain time period. And indeed, processing on a one-CPU machine works basically the way you already sketched in your question: the operating system assigns each process (and each thread inside it) a certain time slice for executing instructions, and then switches to another thread or process. Doing this many times per second creates the illusion of parallel processing even with only one CPU core. The part of the operating system which does this is called the scheduler.
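The basic time-slicing idea can be sketched as a toy round-robin loop (the task names, work amounts, and slice length are invented; a real scheduler is driven by timer interrupts and far more state):

    #include <stdio.h>

    /* Toy round-robin scheduler: give each runnable task a fixed slice
       of "work units" in turn until every task is finished. */
    int main(void)
    {
        const char *name[] = { "browser", "editor", "backup" };
        int remaining[]    = { 5, 3, 8 };      /* work left per task  */
        const int slice    = 2;                /* time slice per turn */
        const int ntasks   = 3;
        int done = 0;

        while (done < ntasks) {
            for (int i = 0; i < ntasks; i++) {
                if (remaining[i] == 0)
                    continue;
                int run = remaining[i] < slice ? remaining[i] : slice;
                remaining[i] -= run;
                printf("run %-7s for %d unit(s)\n", name[i], run);
                if (remaining[i] == 0)
                    done++;
            }
        }
        return 0;
    }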
But beware: this is a very simplified point of view; in reality, things are more complicated:
- Different processes/threads may have different priorities, so the processes with higher priority are likely to get more instruction cycles than ones with lower priority.
- Processes can willingly "wait" for certain events, and hand the execution over to other processes until that event (like a timer or I/O event) occurs.
As you can see in the Wikipedia article from the above link, different scheduling algorithms exist, and different operating systems implement different variants of them.
If the machine has multiple CPU cores, using one core at 100% will show up as 100 / (number of CPU cores) percent of the total available CPU usage, and the scheduler will have to distribute all processes and threads among all available cores.
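As a quick worked instance of that formula (the core count is just an assumed value): on a 4-core machine, a single thread keeping one core fully busy is reported as about 25% overall.

    #include <stdio.h>

    int main(void)
    {
        int cores = 4;                    /* assumed core count     */
        double one_core_busy = 100.0;     /* one core fully loaded  */
        /* Task-manager-style total: 100 / number of cores percent. */
        printf("%.0f%%\n", one_core_busy / cores);   /* prints 25% */
        return 0;
    }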