Centos – Finding Source of 100% CPU Usage

centosdell-poweredge

I recently had a crash on a Dell Poweredge 2850 that I traced back to a bad RAID memory card. I replaced the card and reset the battery and got the server to boot again.

After booting up I noticed that one of the CPUs always goes to 100%. It is usually CPU 1 (2nd CPU) but out of about 10 boots it was CPU 3 (4th CPU) once.

The process that is causing the high load is events/1 (or events/3 the one time it happened on core 3). I've looked through dmesg and didn't find anything abnormal. Does anyone have any suggestions as to how I may be able to find what is actually causing the CPU usage?

I also noticed that when I plug in a monitor at boot that on the CentOS loading screen the loading bars get to around half way then the screen blacks out (no login screen is shown). Otherwise everything starts up and runs normally.

Server info:

CentOS release 6.9 (Final)

CPU Info:

processor   : 1
vendor_id   : GenuineIntel
cpu family  : 15
model       : 4
model name  : Intel(R) Xeon(TM) CPU 3.00GHz
stepping    : 3
microcode   : 5
cpu MHz     : 3000.000
cache size  : 2048 KB
physical id : 3
siblings    : 2
core id     : 0
cpu cores   : 1
apicid      : 6
initial apicid  : 6
fpu     : yes
fpu_exception   : yes
cpuid level : 5
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc pebs bts pni dtes64 monitor ds_cpl cid cx16 xtpr
bogomips    : 5985.27
clflush size    : 64
cache_alignment : 128
address sizes   : 36 bits physical, 48 bits virtual
power management:

Please add a comment if you want to see any specific config files or outputs.

UPDATE 1:

cat /proc/interrupts

            CPU0       CPU1       CPU2       CPU3       
   0:        133          0          0          1   IO-APIC-edge      timer
   1:          0          0          0          2   IO-APIC-edge      i8042
   4:          0          0          0          2   IO-APIC-edge    
   8:          0          0          0          1   IO-APIC-edge      rtc0
   9:          0          0          0          0   IO-APIC-fasteoi   acpi
  12:          0          0          0          4   IO-APIC-edge      i8042
  14:          0          0          0        147   IO-APIC-edge      ata_piix
  15:          0          0          0          0   IO-APIC-edge      ata_piix
  16:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb2
  18:          0          0          0        301   IO-APIC-fasteoi   uhci_hcd:usb4, radeon
  19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
  23:          0          0          0         49   IO-APIC-fasteoi   ehci_hcd:usb1
  46:          0          0       3804       4767   IO-APIC-fasteoi   megaraid
  64:          0        288          0        104   IO-APIC-fasteoi   eth0
 NMI:          0          1          0          0   Non-maskable interrupts
 LOC:      24325      76909      25269      31039   Local timer interrupts
 SPU:          0          0          0          0   Spurious interrupts
 PMI:          0          1          0          0   Performance monitoring interrupts
 IWI:          0          0          0          0   IRQ work interrupts
 RES:       2295        703       1357        886   Rescheduling interrupts
 CAL:       3986        421        156        175   Function call interrupts
 TLB:        526         95        803       3519   TLB shootdowns
 TRM:          0          0          0          0   Thermal event interrupts
 THR:          0          0          0          0   Threshold APIC interrupts
 MCE:          0          0          0          0   Machine check exceptions
 MCP:          1          1          1          1   Machine check polls
 ERR:          0
 MIS:          0

sar

Linux 2.6.32-696.16.1.el6.x86_64 (HOSTNAME)     12/30/2017  _x86_64_    (4 CPU)

09:57:37 AM       LINUX RESTART

10:00:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
10:10:01 AM     all      0.10      0.07     21.09      1.49      0.00     77.25
10:20:01 AM     all      0.15      0.00     21.00      0.00      0.00     78.85
10:30:01 AM     all      0.11      0.00     20.92      0.00      0.00     78.97
10:40:01 AM     all      0.09      0.00     20.81      0.01      0.00     79.09
Average:        all      0.11      0.02     20.96      0.37      0.00     78.54

12:35:32 PM       LINUX RESTART

top

Tasks: 164 total,   2 running, 162 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.2%us, 20.8%sy,  0.0%ni, 78.9%id,  0.0%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:   8058904k total,   453272k used,  7605632k free,    22240k buffers
Swap:  8191996k total,        0k used,  8191996k free,   174064k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                                                                               
   20 root      20   0     0    0    0 R 99.9  0.0   5:50.67 events/1                                                                                                                                              

UPDATE 2:

Once I regained physical access to the box I completely swapped out the PERC controller with one from a parts server. I reseated the memory card and the battery. Since RAID config did not match due to the new hardware I restored it from disk. After booting up I got the same 100% CPU usage.

I reset the BIOS/CMOS by pulling the CMOS battery and holding the power button down 10 seconds. Rebooted and set up RAID to read from hard drive again. CPU still at 100%.

I ran yum update and rebooted. Still 100%. Below is top showing individual CPUs.

top

top - 11:59:19 up 21 min,  1 user,  load average: 1.00, 0.97, 0.72
Tasks: 164 total,   2 running, 162 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,100.0%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8058904k total,   456996k used,  7601908k free,    22480k buffers
Swap:  8191996k total,        0k used,  8191996k free,   173792k cached

sar

Linux 2.6.32-696.16.1.el6.x86_64 (HOSTNAME)     01/04/2018  _x86_64_    (4 CPU)

10:40:45 AM       LINUX RESTART

10:50:01 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
11:00:01 AM     all      0.08      0.00     20.86      0.00      0.00     79.06
11:40:01 AM     all      0.00      0.00      0.00      0.00      0.00      0.00
11:50:01 AM     all      0.08      0.00     20.87      0.02      0.00     79.03
12:00:01 PM     all      0.08      0.00     20.89      0.00      0.00     79.02
Average:        all      0.00      0.00     20.83      0.00      0.00     79.78

cat /proc/interrupts

            CPU0       CPU1       CPU2       CPU3       
   0:        133          0          0          6   IO-APIC-edge      timer
   1:          0          0          0          2   IO-APIC-edge      i8042
   4:          0          0          0          2   IO-APIC-edge    
   8:          0          0          0          1   IO-APIC-edge      rtc0
   9:          0          0          0          0   IO-APIC-fasteoi   acpi
  12:          0          0          0          4   IO-APIC-edge      i8042
  14:          0          0          0        147   IO-APIC-edge      ata_piix
  15:          0          0          0          0   IO-APIC-edge      ata_piix
  16:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb2
  18:          0          0        302        302   IO-APIC-fasteoi   uhci_hcd:usb4, radeon
  19:          0          0          0          0   IO-APIC-fasteoi   uhci_hcd:usb3
  23:          0          0          0         53   IO-APIC-fasteoi   ehci_hcd:usb1
  46:          0          0       4074       4912   IO-APIC-fasteoi   megaraid
  64:          0       4917          0        108   IO-APIC-fasteoi   eth0
 NMI:          0          0          0         28   Non-maskable interrupts
 LOC:     197497     401002     148354    1361329   Local timer interrupts
 SPU:          0          0          0          0   Spurious interrupts
 PMI:          0          0          0         28   Performance monitoring interrupts
 IWI:          0          0          0          0   IRQ work interrupts
 RES:       5891       1183       2828       8249   Rescheduling interrupts
 CAL:       3641       1441        156        184   Function call interrupts
 TLB:        837       3324        833        202   TLB shootdowns
 TRM:          0          0          0          0   Thermal event interrupts
 THR:          0          0          0          0   Threshold APIC interrupts
 MCE:          0          0          0          0   Machine check exceptions
 MCP:          6          6          6          6   Machine check polls
 ERR:          0
 MIS:          0

UPDATE 3:

I added the noapic and nolapic arguments to the Kernel command in GRUB. Here are the results from top and cat /proc/interrupts

top

top - 14:55:01 up 5 min,  1 user,  load average: 1.76, 1.27, 0.58
Tasks: 111 total,   2 running, 109 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.4%us, 99.6%sy,  0.0%ni,  0.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:   8059152k total,   442016k used,  7617136k free,    22252k buffers
Swap:  8191996k total,        0k used,  8191996k free,   173556k cached

cat /proc/interrupts

          CPU0       
  0:     447518    XT-PIC-XT-PIC    timer
  1:          2    XT-PIC-XT-PIC    i8042
  2:          0    XT-PIC-XT-PIC    cascade
  3:          1    XT-PIC-XT-PIC  
  4:          4    XT-PIC-XT-PIC  
  5:         50    XT-PIC-XT-PIC    ehci_hcd:usb1
  7:       8825    XT-PIC-XT-PIC    uhci_hcd:usb4, radeon, megaraid
  8:          1    XT-PIC-XT-PIC    rtc0
  9:          0    XT-PIC-XT-PIC    acpi
 10:          0    XT-PIC-XT-PIC    uhci_hcd:usb3
 11:       1586    XT-PIC-XT-PIC    uhci_hcd:usb2, eth0
 12:          4    XT-PIC-XT-PIC    i8042
 14:        148    XT-PIC-XT-PIC    ata_piix
 15:          0    XT-PIC-XT-PIC    ata_piix
NMI:          0   Non-maskable interrupts
LOC:          0   Local timer interrupts
SPU:          0   Spurious interrupts
PMI:          0   Performance monitoring interrupts
IWI:          0   IRQ work interrupts
RES:          0   Rescheduling interrupts
CAL:          0   Function call interrupts
TLB:          0   TLB shootdowns
TRM:          0   Thermal event interrupts
THR:          0   Threshold APIC interrupts
MCE:          0   Machine check exceptions
MCP:          2   Machine check polls
ERR:          0
MIS:          0

I also tried booting to another much older version of the Kernel (Centos 6.7) which yielded the same result as before: 100% CPU usage on a random core.

UPDATE 4:

I got distracted by another project and left the server on for a few hours. I checked top before shutting it down and noticed that the CPU usage had dropped back down to normal (less than 1% per core). I restarted to see if the problem would re-emerge and it did not. I want to know what caused this and am willing to continue trying different things to figure it out if anyone has any suggestions. The only thing I noticed out of the ordinary was a message in /var/spool/mail/root:

Invalid system activity file: /var/log/sa//sa04

This was generated before I checked top.

UPDATE 5:

I found the source of the problem! When I took a break to work on my other project I unplugged the monitor and took it with me. When I checked back in (via SSH) the CPU usage was normal. When I thought back to what may have changed the only thing I could think of was the monitor. To test the theory I rebooted with monitor plugged in. Voila! 100% CPU usage. I unplugged the monitor and CPU usage instantly dropped.

So now I am left wondering what is causing the CPU usage when a monitor is plugged in?

UPDATE 6:

lspci

00:00.0 Host bridge: Intel Corporation E7520 Memory Controller Hub (rev 09)
00:02.0 PCI bridge: Intel Corporation E7525/E7520/E7320 PCI Express Port A (rev 09)
00:04.0 PCI bridge: Intel Corporation E7525/E7520 PCI Express Port B (rev 09)
00:05.0 PCI bridge: Intel Corporation E7520 PCI Express Port B1 (rev 09)
00:06.0 PCI bridge: Intel Corporation E7520 PCI Express Port C (rev 09)
00:1d.0 USB controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #1 (rev 02)
00:1d.1 USB controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #2 (rev 02)
00:1d.2 USB controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB UHCI Controller #3 (rev 02)
00:1d.7 USB controller: Intel Corporation 82801EB/ER (ICH5/ICH5R) USB2 EHCI Controller (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev c2)
00:1f.0 ISA bridge: Intel Corporation 82801EB/ER (ICH5/ICH5R) LPC Interface Bridge (rev 02)
00:1f.1 IDE interface: Intel Corporation 82801EB/ER (ICH5/ICH5R) IDE Controller (rev 02)
01:00.0 PCI bridge: Intel Corporation 80332 [Dobson] I/O processor (A-Segment Bridge) (rev 06)
01:00.2 PCI bridge: Intel Corporation 80332 [Dobson] I/O processor (B-Segment Bridge) (rev 06)
02:0e.0 RAID bus controller: Dell PowerEdge Expandable RAID controller 4 (rev 06)
05:00.0 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge A (rev 09)
05:00.2 PCI bridge: Intel Corporation 6700PXH PCI Express-to-PCI Bridge B (rev 09)
06:07.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet Controller (rev 05)
07:08.0 Ethernet controller: Intel Corporation 82541GI Gigabit Ethernet Controller (rev 05)
09:0d.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] RV100 [Radeon 7000 / Radeon VE]

UPDATE 7:

Adding noacpi and nomodeset to the boot options made the CPU usage problem disappear. CentOS also booted to a login screen instead of blacking out the monitor mid loading screen. What does this indicate?

Best Answer

It seems an IRQ storm caused by the megaraid driver (interrupt 46). This can be caused by an improper BIOS/driver setup or by a faulty hardware.

You can try to:

  • update your Dell PERC drivers and firmware
  • use a different (older/newer) kernel version
  • reset server CMOS/BIOS and/or update its firmware
  • change the affected hardware.