It depends.
Some (CISC) CPUs have byte-wise loads that address individual bytes directly, so the byte of interest arrives as the low-order 8 bits on the bus and the remaining bits are masked off.
Many RISC CPUs do a word load followed by a barrel shift, others do a word load followed by bit-by-bit shifts, and in between are ones that do a word load followed by a byte shift.
Some CPUs will do consecutive word-loads when a two-byte value spans a 32-bit boundary, shifting and masking the words together.
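As a rough illustration of the word-load, shift and mask approach, here is a small Python sketch. It is a software model only, not how any particular CPU is wired; the 32-bit word size, the little-endian byte order and the function names are all assumptions made for the example.

```python
WORD_BYTES = 4  # assumed 32-bit bus width


def load_word(memory: bytes, word_index: int) -> int:
    """Aligned 32-bit load (little-endian byte order is an assumption)."""
    start = word_index * WORD_BYTES
    return int.from_bytes(memory[start:start + WORD_BYTES], "little")


def load_byte(memory: bytes, address: int) -> int:
    """Word load, then shift the wanted byte down and mask off the rest."""
    word = load_word(memory, address // WORD_BYTES)
    shift = 8 * (address % WORD_BYTES)
    return (word >> shift) & 0xFF


def load_halfword(memory: bytes, address: int) -> int:
    """Two-byte load that may straddle a 32-bit boundary."""
    offset = address % WORD_BYTES
    low = load_word(memory, address // WORD_BYTES)
    if offset <= WORD_BYTES - 2:
        # Both bytes live in the same word: one load, one shift, one mask.
        return (low >> (8 * offset)) & 0xFFFF
    # The value spans two words: load both, then shift and mask them together.
    high = load_word(memory, address // WORD_BYTES + 1)
    return ((low >> 24) & 0xFF) | ((high & 0xFF) << 8)


memory = bytes([0x10, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17])
print(hex(load_byte(memory, 2)))      # 0x12
print(hex(load_halfword(memory, 3)))  # 0x1413, assembled from two word loads
```

The straddling case costs a second word load, which is one reason performance can vary with alignment.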
Different processor models within the same CPU family may implement this differently. That explains why there is no description of the implementation; it's a decision only the vendor cares about.
As for performance, you will just have to test it on the particular CPU and memory configurations you care about.
It probably refers to pipelining, that is, parallel (or semi-parallel) execution of instructions. That's the only scenario I can think of where it does not really matter how long something takes, as long as you can have enough of them running in parallel.
So, the CPU may fetch one instruction (step 1 in the table above), and then, as soon as it proceeds to step 2 for that instruction, it can at the same time (in parallel) start step 1 for the next instruction, and so on.
In practice there is a constraint on how soon the next instruction can start. Let's call our two consecutive instructions A and B. The CPU executes step 1 (fetch) for instruction A. When the CPU proceeds to step 2 for instruction A, it cannot yet start step 1 for instruction B, because the program counter has not advanced yet. So it has to wait until it has reached step 3 for instruction A before it can start step 1 for instruction B. This is the time it takes to start another instruction, and we want to keep it to a minimum (start instructions as quickly as possible) so that we can be executing as many instructions in parallel as possible.
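To see why the start interval matters more than the length of a single instruction, here is a back-of-the-envelope sketch in Python. The numbers are made up for illustration (five steps per instruction is an assumption, not a figure from the book), and the model ignores stalls and branches.

```python
def total_steps(instructions: int, steps_per_instruction: int, start_interval: int) -> int:
    """Steps until the last instruction finishes in an idealized pipeline.

    A new instruction is started every `start_interval` steps, and each
    instruction takes `steps_per_instruction` steps from start to finish.
    """
    if instructions == 0:
        return 0
    return steps_per_instruction + (instructions - 1) * start_interval


# Hypothetical figures: 1000 instructions.
print(total_steps(1000, 5, 1))   # 1004: a new instruction starts every step
print(total_steps(1000, 10, 1))  # 1009: instructions twice as long, almost no difference
print(total_steps(1000, 5, 2))   # 2003: start interval doubled, total nearly doubles
print(total_steps(1000, 5, 5))   # 5000: no overlap at all
```

In this simplified model, doubling how long each instruction takes barely changes the total, while doubling the start interval nearly doubles it.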
CISC architectures have instructions of varying lengths: some instructions are only one byte long, others are two bytes long, and yet others are several bytes long. This makes it hard to increment the program counter immediately after fetching an instruction, because the instruction has to be decoded to a certain degree in order to figure out how many bytes long it is. On the other hand, one of the primary characteristics of RISC architectures is that all instructions have the same length, so the program counter can be incremented immediately after fetching instruction A, meaning that the fetching of instruction B can begin immediately afterwards. That's what the author means by starting instructions quickly, and that's what increases the number of instructions that can be executed per second.
In the above table, step 2 says "Change the program counter to point to the following instruction" and step 3 says "Determine the type of instruction just fetched." These two steps can be in that order only on RISC machines. On CISC machines, you have to determine the type of instruction just fetched before you can change the program counter, so step 2 has to wait. This means that on CISC machines the next instruction cannot be started as quickly as it can be started on a RISC machine.
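The difference in how the next program counter value is obtained can be sketched as follows. The 4-byte fixed length, the opcode-to-length table and the sample program are all hypothetical and do not correspond to any real instruction set.

```python
FIXED_LENGTH = 4  # assumed fixed instruction size in bytes


def next_pc_fixed(pc: int) -> int:
    """Fixed-length case: no need to look at the instruction at all."""
    return pc + FIXED_LENGTH


# Hypothetical variable-length encoding: the first byte (opcode)
# determines how many bytes the whole instruction occupies.
LENGTH_BY_OPCODE = {0x01: 1, 0x02: 2, 0x03: 4}


def next_pc_variable(memory: bytes, pc: int) -> int:
    """Variable-length case: the opcode must be fetched and partly decoded first."""
    opcode = memory[pc]
    return pc + LENGTH_BY_OPCODE[opcode]


def instruction_starts(memory: bytes, count: int) -> list:
    """Walk a variable-length program, decoding each opcode to find the next start."""
    pc, starts = 0, []
    for _ in range(count):
        starts.append(pc)
        pc = next_pc_variable(memory, pc)
    return starts


program = bytes([0x02, 0xAA, 0x01, 0x03, 0xBB, 0xCC, 0xDD, 0x01])
print(instruction_starts(program, 4))  # [0, 2, 3, 7]
print(next_pc_fixed(0))                # 4, known before anything is decoded
```

Note that `next_pc_fixed` never touches memory, which is what lets the fetch of the next instruction begin right away.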
Lots of machine architectures have memory-memory instructions.
The IBM System/360 and its successors have a whole set of instructions that operate on two locations in memory (the Storage-Storage (SS) group). The "Move Character" (MVC) instruction copies up to 256 bytes from one memory location to another, and even has a clear definition for when the source and destination ranges overlap. Similarly, there are Compare Logical Character (CLC), which does a string comparison, and OR Character (OC), AND Character (NC), and XOR Character (XC), which are bitwise logical operators, etc. These machines also have a set of decimal arithmetic instructions, which operate only on memory; there aren't any registers for decimal math.
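To make the overlap behaviour concrete, here is a small Python model. It assumes the usual description of MVC as behaving as if bytes were moved one at a time, left to right; the function names and the use of a `bytearray` as "storage" are just conveniences for the sketch.

```python
def mvc(storage: bytearray, dest: int, src: int, length: int) -> None:
    """Model of a memory-to-memory move: byte by byte, left to right."""
    if not 1 <= length <= 256:
        raise ValueError("a single move covers 1 to 256 bytes")
    for i in range(length):
        # Each byte is stored before the next one is fetched, so an
        # overlapping destination can see bytes written earlier in the move.
        storage[dest + i] = storage[src + i]


def xc(storage: bytearray, dest: int, src: int, length: int) -> None:
    """Model of a memory-to-memory exclusive OR over `length` bytes."""
    for i in range(length):
        storage[dest + i] ^= storage[src + i]


# Destination starts one byte after the source: the first byte propagates.
field = bytearray(b"*ABCDEFG")
mvc(field, 1, 0, 7)
print(field)  # bytearray(b'********')

# XOR of a field with itself clears it.
area = bytearray(b"ABCD")
xc(area, 0, 0, 4)
print(area)   # bytearray(b'\x00\x00\x00\x00')
```

Both results rely on the operations working directly on memory, with no register holding the data in between.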
Then there are the memory-immediate instructions, which have one operand in memory and the other in the instruction itself. The DEC PDP-10 had Add One to Storage (AOS) and Subtract One from Storage (SOS). The IBM S/360 family had a wide range of Storage Immediate (SI) instructions, in which one operand was a memory location and the other was an 8-bit quantity in the instruction.
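A minimal sketch of the memory-immediate idea, again as a Python model. The function names mirror the mnemonics only loosely: the real AOS can also load an accumulator and skip conditionally, which is left out here, and the 36-bit wraparound reflects the PDP-10 word size.

```python
WORD_MASK_36 = (1 << 36) - 1  # PDP-10 words are 36 bits wide


def aos(memory: list, address: int) -> None:
    """Add one to a word in storage (skip and accumulator variants omitted)."""
    memory[address] = (memory[address] + 1) & WORD_MASK_36


def oi(storage: bytearray, address: int, immediate: int) -> None:
    """OR a byte in storage with an 8-bit value carried in the instruction."""
    storage[address] |= immediate & 0xFF


def ni(storage: bytearray, address: int, immediate: int) -> None:
    """AND a byte in storage with an 8-bit immediate."""
    storage[address] &= immediate & 0xFF


def xi(storage: bytearray, address: int, immediate: int) -> None:
    """Exclusive-OR a byte in storage with an 8-bit immediate."""
    storage[address] ^= immediate & 0xFF


words = [41]
aos(words, 0)
print(words[0])         # 42

flags = bytearray([0b10101010])
oi(flags, 0, 0x0F)      # set the low four bits
ni(flags, 0, 0xF0)      # clear them again
print(hex(flags[0]))    # 0xa0
```

In each case the only operands are a memory address and a constant embedded in the instruction.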