CentOS – kernel exception action 0x6 frozen failed command: FLUSH CACHE

centos, hard drive, kernel

For the past few months, our server has periodically hung for a minute or two. The logs show these errors:

May 15 20:01:02 www kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 15 20:01:02 www kernel: ata2.00: failed command: FLUSH CACHE
May 15 20:01:02 www kernel: ata2.00: cmd e7/00:00:00:00:00/00:00:00:00:00/a0 tag 0
May 15 20:01:02 www kernel:         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
May 15 20:01:02 www kernel: ata2.00: status: { DRDY }
May 15 20:01:02 www kernel: ata2: hard resetting link
May 15 20:01:03 www kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
May 15 20:01:03 www kernel: ata2.00: configured for UDMA/133
May 15 20:01:03 www kernel: ata2.00: retrying FLUSH 0xe7 Emask 0x4
May 15 20:01:03 www kernel: ata2.00: device reported invalid CHS sector 0
May 15 20:01:03 www kernel: ata2: EH complete

The timing of these errors is peculiar: they always occur a few minutes past the hour:

May 15 00:06:02 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 15 10:05:02 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 15 20:01:02 www kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 16 00:04:01 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 16 04:01:02 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 16 07:02:02 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 16 07:03:03 www kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 16 11:02:02 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 16 12:06:02 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 16 13:06:01 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 16 20:04:02 www kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 17 06:03:01 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 17 09:06:02 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 17 14:04:02 www kernel: ata3.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 17 17:03:01 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 18 02:02:01 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 18 10:03:01 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 18 11:05:03 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 18 13:03:02 www kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 18 16:06:01 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 18 18:02:01 www kernel: ata1.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
May 19 00:01:02 www kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
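
To check the timing pattern over a longer period than the excerpt above, the exception lines can be tallied by minute-of-hour and by ATA port. The following is only a minimal sketch: the function name `tally` is hypothetical, and it assumes the syslog line format shown above (it is not tied to any particular log file location).

```python
import re
from collections import Counter

# Matches syslog lines of the form shown above, for example:
#   May 15 20:01:02 www kernel: ata2.00: exception Emask 0x0 ... frozen
# Assumption: the timestamp/host/"kernel:" prefix matches this exact layout.
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+(?P<hour>\d{2}):(?P<minute>\d{2}):\d{2}\s+\S+\s+kernel:\s+"
    r"(?P<port>ata\d+)\.\d+: exception"
)


def tally(lines):
    """Count ATA exception lines by minute-of-hour and by ATA port.

    Lines that are not "exception" lines (e.g. "failed command") are ignored.
    """
    by_minute = Counter()
    by_port = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match is None:
            continue
        by_minute[int(match.group("minute"))] += 1
        by_port[match.group("port")] += 1
    return by_minute, by_port
```

Passing an open log file (any iterable of lines) to `tally` returns the two counters. For the lines listed above, every event falls between minute 1 and minute 6, and all three ports (ata1, ata2, ata3) appear, with ata1 the most frequent.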

I ran smartctl, but the overall health test reported PASSED, and the SMART error log is empty:

SMART Error Log Version: 1
No Errors Logged

The RAID status shows this:

cat /proc/mdstat
Personalities : [linear] [raid0] [raid1] [raid10] [raid6] [raid5] [raid4] [multipath] [faulty]
md1 : active raid1 sda1[0] sdb1[1] sdc1[2]
  20478912 blocks [3/3] [UUU]
md2 : active raid1 sda2[0] sdb2[1] sdc2[2]
  96211904 blocks [3/3] [UUU]

Any ideas what to do? The errors seem hardware-related, but the timing suggests a software problem to me.

Best Answer

Most likely your system runs a cron job every hour, and that job sometimes touches data located on a bad sector, which triggers the error message.
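
To see which jobs could be responsible, one option is to list the crontab entries that fire once per hour. The sketch below is an illustration only: `hourly_entries` is a hypothetical helper, it understands just the common `minute hour ...` and `@hourly` forms, and it does not look at scripts dropped into a directory such as /etc/cron.hourly.

```python
def hourly_entries(crontab_text):
    """Return crontab lines that run once per hour.

    A line qualifies if it starts with "@hourly", or if its minute field is a
    fixed number and its hour field is "*". Comments, blank lines and
    environment assignments (e.g. SHELL=/bin/bash) are skipped.
    """
    hits = []
    for line in crontab_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            continue
        fields = stripped.split()
        if fields[0] == "@hourly":
            hits.append(stripped)
        elif len(fields) >= 6 and fields[0].isdigit() and fields[1] == "*":
            hits.append(stripped)
    return hits
```

Feeding it the contents of /etc/crontab and the files under /etc/cron.d is a reasonable starting point; the scripts in /etc/cron.hourly itself still need to be reviewed by hand.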

You should run smartctl against each hard disk. It is part of the smartmontools package (the package has the same name on Debian/Ubuntu and on CentOS). Its output includes the drive's error log, which should show any errors the drive itself has recorded.
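
Beyond the overall health verdict and the error log, the attribute table printed by `smartctl -A` can hint at bad sectors. The sketch below is a rough illustration under stated assumptions: the helper names are hypothetical, the parser assumes the usual ten-column ATA attribute table, and attributes 5, 197 and 198 are the conventional IDs for reallocated, pending and offline-uncorrectable sectors (the labels are mine; vendors may name or use them differently).

```python
import subprocess

# Conventional sector-health attribute IDs; the labels are illustrative only.
WATCHED = {
    5: "Reallocated_Sector_Ct",
    197: "Current_Pending_Sector",
    198: "Offline_Uncorrectable",
}


def parse_attributes(text):
    """Return {attribute_id: raw_value} parsed from `smartctl -A` output.

    Assumes the table layout ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED
    WHEN_FAILED RAW_VALUE; only the first token of RAW_VALUE is kept.
    """
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[int(fields[0])] = fields[9]
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw value is a non-zero integer."""
    found = {}
    for attr_id, label in WATCHED.items():
        raw = attrs.get(attr_id)
        if raw is not None and raw.isdigit() and int(raw) > 0:
            found[label] = int(raw)
    return found


def read_attributes(device):
    """Run `smartctl -A` on one device (needs root and smartmontools)."""
    # smartctl's exit status is a bitmask, so a non-zero code is not treated
    # as a failure here.
    proc = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True, check=False
    )
    return parse_attributes(proc.stdout)
```

For the three RAID members in the question this would be called as `suspicious(read_attributes("/dev/sda"))`, and likewise for `/dev/sdb` and `/dev/sdc`. An empty result does not prove the disk is healthy; a long self-test (`smartctl -t long`, results via `smartctl -l selftest`) is a more thorough check.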