Understanding “discrete” sensors in IPMI

Can anyone describe how a discrete sensor works in the IPMI world?

In the IPMI specification there are "sensors" for processor and memory that are of discrete type.

Is there really a software/firmware entity monitoring memory per se for ECC errors and then generating an event if something occurs? If so, is IPMI doing the actual testing to find ECC errors? I'm trying to grasp what is occurring under the covers of such a sensor.

Best Answer

There are generally two types of sensors in IPMI: threshold and discrete. A threshold sensor is essentially an analog sensor that measures things like temperatures, voltages or fan speeds. A discrete sensor reports binary states, each of which is either asserted or not, e.g. on/off, present/absent or NoError/Error. Up to 15 of these states are grouped into a single 16-bit value that must be interpreted as a bit field. And yes, this naming sucks because it suggests a very different meaning of the term "discrete".
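To make the bit-field interpretation concrete, here is a minimal Python sketch that decodes such a 16-bit value into a list of asserted states. The function name `decode_discrete` and the `MEMORY_STATES` table are made up for illustration. The offset names follow my reading of the Memory sensor type in the IPMI v2.0 specification, and the sketch assumes bit 15 is reserved, so check both against the spec and your BMC before relying on them.

```python
# Hypothetical offset-to-name table for a Memory-type discrete sensor.
# Only a few offsets are listed; verify them against the IPMI spec.
MEMORY_STATES = {
    0: "Correctable ECC",
    1: "Uncorrectable ECC",
    2: "Parity",
    6: "Presence detected",
}


def decode_discrete(mask, names):
    """Return the names of all asserted states in a 16-bit discrete reading."""
    mask &= 0x7FFF  # assumption: bit 15 is reserved and carries no state
    asserted = []
    for offset in range(15):
        if mask & (1 << offset):
            # Fall back to the raw offset when no name is known.
            asserted.append(names.get(offset, f"offset {offset}"))
    return asserted


if __name__ == "__main__":
    # Bits 0 and 6 set: a present module that reported a correctable error.
    print(decode_discrete(0x0041, MEMORY_STATES))
```

Each bit is an independent two-state flag, which is why several states can be asserted in the same reading.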

How that sensor actually works is of course dependent on the measured item and the specific implementation, but in the case of ECC RAM, IPMI will not (and could not!) check for errors itself. Instead, one approach to detecting this error would be for the management controller to watch the signal lines between the RAM modules and the memory controller that report an ECC error. If it detects a signal on those lines, the management controller could generate an IPMI error event that is independent of the error handling the primary hardware and OS will perform. Another approach would be to have the memory controller actively report that error to the management controller.
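The second approach can be pictured with a toy model in Python. This is not real BMC firmware and not an IPMI API: `ToyBmc`, `report_memory_error` and the state offsets are all hypothetical names chosen for this sketch. The point is only that the management side records what the hardware reports to it and never tests the memory itself.

```python
from dataclasses import dataclass, field

# State offsets assumed for illustration only.
CORRECTABLE_ECC = 0
UNCORRECTABLE_ECC = 1


@dataclass
class ToyBmc:
    """Toy stand-in for a management controller with one memory sensor."""
    state_mask: int = 0                        # discrete sensor bit field
    event_log: list = field(default_factory=list)

    def report_memory_error(self, dimm, offset):
        """Called by the (simulated) memory controller when it sees an error.

        The BMC does no checking of its own: it sets the matching state bit
        and appends an event, independent of what the OS does with the error.
        """
        self.state_mask |= 1 << offset
        self.event_log.append((dimm, offset))


if __name__ == "__main__":
    bmc = ToyBmc()
    bmc.report_memory_error("DIMM_A1", CORRECTABLE_ECC)
    print(hex(bmc.state_mask), bmc.event_log)
```

The first approach would look the same from the sensor's point of view. Only the trigger differs: the controller would sample an error signal line rather than being called by the memory controller.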