I have a desktop running as an Ubuntu server at another office. Lately it's been shutting itself down once in a while, and I'm unsure how to diagnose this. The syslog looks like this:
May 20 15:42:35 hostname sensord: Chip: coretemp-isa-0000
May 20 15:42:35 hostname sensord: Adapter: ISA adapter
May 20 15:42:35 hostname sensord: Core 0: 67.0 C
May 20 15:42:35 hostname sensord: Core 1: 66.0 C
May 20 15:42:35 hostname sensord: Core 2: 61.0 C
May 20 15:42:35 hostname sensord: Core 3: 58.0 C
May 20 16:04:16 hostname kernel: [ 5243.049529] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
May 20 16:04:16 hostname kernel: [ 5243.050011] CPU0: Core temperature/speed normal
May 20 16:05:48 hostname kernel: [ 5335.083540] CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
May 20 16:05:48 hostname kernel: [ 5335.084028] CPU2: Core temperature/speed normal
May 21 16:06:52 hostname kernel: [ 5399.816039] mce: [Hardware Error]: Machine check events logged
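To see how often each core throttles over a longer period, the kernel lines in a log like the one above could be tallied with a small script. This is only a rough sketch: the function name `count_throttle_events` and the regular expression are my own, and they assume the kernel message wording stays exactly as shown here.

```python
import re
from collections import Counter

# Matches kernel lines such as:
#   "... kernel: [ 5243.049529] CPU0: Core temperature above threshold, ..."
# The pattern is an assumption based on the log excerpt in this post.
THROTTLE_RE = re.compile(
    r"kernel: \[\s*[\d.]+\] (CPU\d+): Core temperature above threshold"
)


def count_throttle_events(lines):
    """Count 'temperature above threshold' messages per CPU."""
    counts = Counter()
    for line in lines:
        match = THROTTLE_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


sample = [
    "May 20 16:04:16 hostname kernel: [ 5243.049529] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)",
    "May 20 16:04:16 hostname kernel: [ 5243.050011] CPU0: Core temperature/speed normal",
    "May 20 16:05:48 hostname kernel: [ 5335.083540] CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)",
    "May 20 16:05:48 hostname kernel: [ 5335.084028] CPU2: Core temperature/speed normal",
]

print(dict(count_throttle_events(sample)))  # {'CPU0': 1, 'CPU2': 1}
```

In practice the lines would come from the syslog file itself, for example by iterating over an open file handle instead of the `sample` list.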
At first I suspected a broken fan or some other thermal problem, so I enabled sensord. But the temperatures seem stable over time.
I've installed mcelog and the daemon is running. I'm pretty much waiting for it to happen again to see whether the mcelog output makes any sense.
The mcelog output indicates that it's a thermal issue. I have entries like the one below, and their times match the GitLab server backup cron job.
MCE 0
CPU 0 THERMAL EVENT TSC 16ec0aadec3a0
TIME 1401260314 Wed May 28 08:58:34 2014
Processor 0 heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
STATUS 88020003 MCGSTATUS 0
MCGCAP 806 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 15
Hardware event. This is not a software error.
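To line these entries up against the cron schedule, the epoch value after `TIME` can be converted to a timestamp. The snippet below is a minimal sketch; `parse_thermal_event` is a hypothetical helper of mine, and it assumes every thermal record contains `CPU <n>`, `THERMAL EVENT` and `TIME <epoch>` as in the entry above.

```python
import re
from datetime import datetime, timezone


def parse_thermal_event(record):
    """Return (cpu, utc_datetime) for an mcelog thermal record, else None."""
    if "THERMAL EVENT" not in record:
        return None
    cpu = re.search(r"\bCPU (\d+)\b", record)
    when = re.search(r"\bTIME (\d+)\b", record)
    if not (cpu and when):
        return None
    # TIME is seconds since the Unix epoch; convert it explicitly to UTC.
    stamp = datetime.fromtimestamp(int(when.group(1)), tz=timezone.utc)
    return int(cpu.group(1)), stamp


record = (
    "MCE 0 CPU 0 THERMAL EVENT TSC 16ec0aadec3a0 "
    "TIME 1401260314 Wed May 28 08:58:34 2014 "
    "Processor 0 heated above trip temperature. Throttling enabled."
)

cpu, stamp = parse_thermal_event(record)
print(cpu, stamp.isoformat())  # 0 2014-05-28T06:58:34+00:00
```

The result is in UTC, so the 08:58:34 printed by mcelog is presumably local time, two hours ahead. That offset needs to be kept in mind when comparing against crontab entries.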
I also stress-tested the system today with
stress -c 4 -i 1 -m 1 -t 120
and the CPU temperature very quickly reaches 100 °C:
coretemp-isa-0000
Adapter: ISA adapter
Core 0:       +100.0°C  (high = +84.0°C, crit = +100.0°C)
Core 1:        +96.0°C  (high = +84.0°C, crit = +100.0°C)
Core 2:        +85.0°C  (high = +84.0°C, crit = +100.0°C)
Core 3:        +79.0°C  (high = +84.0°C, crit = +100.0°C)
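For repeated stress runs, the readings could be checked against the reported `high` and `crit` limits automatically rather than by eye. The following is a sketch under assumptions: `classify_cores` and its regular expression are my own, and they only handle coretemp lines formatted like the output above.

```python
import re

# One coretemp line looks like:
#   "Core 0: +100.0°C (high = +84.0°C, crit = +100.0°C)"
CORE_RE = re.compile(
    r"(Core \d+):\s*\+?([\d.]+)°C\s*"
    r"\(high = \+?([\d.]+)°C, crit = \+?([\d.]+)°C\)"
)


def classify_cores(sensors_output):
    """Map each core name to (temperature, status) using its own limits."""
    result = {}
    for name, temp, high, crit in CORE_RE.findall(sensors_output):
        temp, high, crit = float(temp), float(high), float(crit)
        if temp >= crit:
            status = "critical"
        elif temp >= high:
            status = "high"
        else:
            status = "ok"
        result[name] = (temp, status)
    return result


sample = """coretemp-isa-0000
Adapter: ISA adapter
Core 0: +100.0°C (high = +84.0°C, crit = +100.0°C)
Core 1: +96.0°C (high = +84.0°C, crit = +100.0°C)
Core 2: +85.0°C (high = +84.0°C, crit = +100.0°C)
Core 3: +79.0°C (high = +84.0°C, crit = +100.0°C)
"""

for core, (temp, status) in classify_cores(sample).items():
    print(core, temp, status)
```

With the readings above, Core 0 counts as critical, Cores 1 and 2 as high, and only Core 3 stays below its `high` limit.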
I suspect that the heatsink isn't properly mounted, and I'll check this when I find the time.
As a quick fix, I'll check the CPU's thermal paste and heatsink.
I got hold of a used Dell PowerEdge R200 to replace this server, and I will try to set it up next week. Thank you very much for the advice.