CPU – How Does the CPU’s Machine Check Architecture Work?

computer-architecture cpu hardware

Modern CPUs can alert the OS when they themselves are malfunctioning, i.e. behaving in a logically incorrect way, and apparently this is supported by a hardware diagnostic feature called Machine Check Architecture (MCA). I can imagine how this works for instruction fetch: if the fetched machine instruction falls outside the bit patterns allowed by the ISA, something is definitely off. How is this mechanism implemented at the hardware level in the other modules inside the CPU?

P.S. I know the CPU cannot possibly detect every error if it is itself starting to function incorrectly, e.g., due to heat or overclocking, and that it can only catch a small set of errors that can be detected without knowing what the correct runtime result is. The underlying logic must be that if even this subset of easy-to-detect errors starts to show up in rapid succession, then the OS should go to the Blue Screen of Death. No need to explain this. I'm just curious how large this subset of easy-to-detect errors actually is.

Best Answer

I know the CPU cannot possibly detect every error if it is itself starting to function incorrectly, e.g., due to heat or overclocking, and that it can only catch a small set of errors

Correct.

I'm just curious how large this subset of easy-to-detect errors actually is.

In the Intel 64 and IA-32 Architectures Software Developer’s Manual, Chapter 15.1: Machine-Check Architecture contains an overview of the functionality of MCA and of machine-check exceptions (MCE).

The first source of errors is the internal modules and subsystems within the CPU. If they're found to be in an invalid state, an error can be generated. Apparently many of these are related to the CPU's cache coherency protocol (understandable, since all memory accesses must pass through it). Sometimes, watchdog timers are used to detect that an operation cannot complete because of a lockup.

Next, a common source of detectable errors is the CPU's internal L1/L2/L3 caches and buses. They're often parity- or ECC-protected, so a bit error is easy to detect. If the external DRAM is ECC-protected, the MCA mechanism is also used to report DRAM ECC errors to the operating system.
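To illustrate why a parity-protected structure can flag an error without knowing the correct value, here is a minimal Python sketch. The `store` and `load` helpers are hypothetical names for this example only and do not model any real cache. It also shows the limitation of plain parity: an even number of flipped bits goes unnoticed, which is one reason ECC codes with more check bits are used where stronger protection is needed.

```python
from typing import Tuple


def parity_bit(word: int) -> int:
    """Even-parity bit: 1 if the number of set bits in word is odd."""
    return bin(word).count("1") & 1


def store(word: int) -> Tuple[int, int]:
    """Model a protected cell: keep the data together with its parity bit."""
    return word, parity_bit(word)


def load(cell: Tuple[int, int]) -> int:
    """Re-check parity on read; a mismatch is what hardware would report."""
    word, stored_parity = cell
    if parity_bit(word) != stored_parity:
        raise ValueError("parity error detected")
    return word


cell = store(0b1011_0010)
flipped_once = (cell[0] ^ 0b0000_0100, cell[1])   # one bit flipped: detected
flipped_twice = (cell[0] ^ 0b0000_0110, cell[1])  # two bits flipped: missed

for name, c in [("clean", cell), ("1 flip", flipped_once), ("2 flips", flipped_twice)]:
    try:
        load(c)
        print(name, "-> read OK")
    except ValueError as exc:
        print(name, "->", exc)
```

The check never needs the original data, only the stored parity bit, which is what makes this class of error cheap to detect in hardware.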

Another large source of errors is I/O errors on a CPU's internal and external interconnects. A modern CPU contains numerous data buses and interconnects. Within the CPU, a ring or mesh interconnect is used for communication between cores. Outside the CPU, many Intel CPUs use a QPI/UPI bus to connect one CPU socket to another. A DMI bus connects the CPU to the PCH (a.k.a. the southbridge), and the DDR bus connects the CPU to DRAM. All of these interconnects can generate transmission errors due to invalid coding, invalid framing, or timeouts. In some CPUs, unrecoverable PCIe transmission errors can also generate an MCE. Another example is that modern CPUs use the SVID protocol to communicate with the power management IC for voltage adjustments. It's a very simple serial protocol, yet even transmission errors on this bus can generate MCEs in some CPU models.
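As a rough illustration of how a link receiver can classify such transmission errors on its own, the Python sketch below checks a received frame for a timeout, a wrong length, and a checksum mismatch. Everything here is an assumption made for the example: the frame layout, `FRAME_LEN`, `TIMEOUT_CYCLES`, and the use of CRC-32 from `zlib` are not taken from QPI/UPI, DMI, or any other real interconnect, and the checksum mismatch only loosely stands in for the "invalid coding" case.

```python
import zlib
from enum import Enum
from typing import Optional


class LinkError(Enum):
    NONE = "ok"
    FRAMING = "invalid framing"
    CODING = "invalid coding"
    TIMEOUT = "timeout"


FRAME_LEN = 8          # hypothetical fixed payload size in bytes
TIMEOUT_CYCLES = 100   # hypothetical deadline for a frame to arrive


def make_frame(payload: bytes) -> bytes:
    """Sender side: append a 4-byte CRC-32 to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def check_frame(frame: Optional[bytes], cycles_waited: int) -> LinkError:
    """Receiver side: classify what, if anything, went wrong."""
    if frame is None or cycles_waited > TIMEOUT_CYCLES:
        return LinkError.TIMEOUT
    if len(frame) != FRAME_LEN + 4:
        return LinkError.FRAMING
    payload, crc = frame[:FRAME_LEN], frame[FRAME_LEN:]
    if zlib.crc32(payload).to_bytes(4, "big") != crc:
        return LinkError.CODING
    return LinkError.NONE


good = make_frame(b"ABCDEFGH")
corrupted = bytearray(good)
corrupted[0] ^= 0x01   # flip a single bit in transit

print(check_frame(good, 10))
print(check_frame(bytes(corrupted), 10))
print(check_frame(good[:-1], 10))
print(check_frame(None, 10))
```

As with parity, none of these checks requires knowing what the sender meant to transmit; the receiver only verifies that the frame is internally consistent and arrived in time.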

Summarizing all the error codes here would be a huge task, so for more information, please refer to the Intel manual mentioned above.
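To give a feel for how these error codes reach software: each error-reporting bank exposes a 64-bit status register, IA32_MCi_STATUS, whose top bits are flags and whose low 16 bits hold the MCA error code. The Python sketch below only splits such a value into its architectural fields; `decode_mci_status` is a hypothetical helper, the sample value is made up for illustration, and interpreting the error code itself still requires the tables in the manual.

```python
def decode_mci_status(status: int) -> dict:
    """Split a 64-bit IA32_MCi_STATUS value into its architectural fields."""
    def bit(n: int) -> bool:
        return bool((status >> n) & 1)

    return {
        "valid": bit(63),            # VAL: the register holds a logged error
        "overflow": bit(62),         # OVER: an error occurred while VAL was already set
        "uncorrected": bit(61),      # UC: hardware could not correct the error
        "enabled": bit(60),          # EN: reporting of this error was enabled
        "misc_valid": bit(59),       # MISCV: IA32_MCi_MISC holds extra information
        "addr_valid": bit(58),       # ADDRV: IA32_MCi_ADDR holds an address
        "context_corrupt": bit(57),  # PCC: processor context may be corrupted
        "model_specific_code": (status >> 16) & 0xFFFF,
        "mca_error_code": status & 0xFFFF,
    }


# Made-up sample value: VAL, UC, EN and PCC set, with an arbitrary error code.
sample = (1 << 63) | (1 << 61) | (1 << 60) | (1 << 57) | 0x0135

for field, value in decode_mci_status(sample).items():
    print(f"{field:20} {value}")
```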