I have a server on which resides a fairly visited web app.
It has a raid1 of 2 HDDs, 64MB Buffer, 7200 RPM.
Today it started throwing out errors like:
kernel: ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
kernel: ata2.00: cmd b0/d0:01:00:4f:c2/00:00:00:00:00/00 tag 0 pio 512 in
kernel: res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
kernel: ata2.00: status: { DRDY }
kernel: ata2: hard resetting link
kernel: ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
kernel: ata2.00: max_sectors limited to 256 for NCQ
kernel: ata2.00: max_sectors limited to 256 for NCQ
kernel: ata2.00: configured for UDMA/133
kernel: sd 1:0:0:0: timing out command, waited 7s
kernel: ata2: EH complete
kernel: SCSI device sda: 976773168 512-byte hdwr sectors (500108 MB)
kernel: sda: Write Protect is off
kernel: SCSI device sda: drive cache: write back
All day long it has been in load higher than 10-15.
I am monitoring it with atop and it gives some bizarre readings:
DSK | sda | busy 100% | read 2 | write 208 | KiB/r 16 | KiB/w 32 | MBr/s 0.00 | MBw/s 0.65 | avq 86.17 | avio 47.6 ms |
DSK | sdb | busy 1% | read 10 | write 117 | KiB/r 17 | KiB/w 5 | MBr/s 0.02 | MBw/s 0.07 | avq 4.86 | avio 1.04 ms |
I frankly don't understand why only sda is taking all the hit. I do have one process that is constantly writing with 1-2megs but what the hell.. 100% iowait?
Update:
smartctl -A /dev/sda
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x002f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0027 239 239 021 Pre-fail Always - 8050
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 22
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x002e 200 200 000 Old_age Always - 0
9 Power_On_Hours 0x0032 100 100 000 Old_age Always - 595
10 Spin_Retry_Count 0x0032 100 253 000 Old_age Always - 0
11 Calibration_Retry_Count 0x0032 100 253 000 Old_age Always - 0
12 Power_Cycle_Count 0x0032 100 100 000 Old_age Always - 21
192 Power-Off_Retract_Count 0x0032 200 200 000 Old_age Always - 20
193 Load_Cycle_Count 0x0032 200 200 000 Old_age Always - 22
194 Temperature_Celsius 0x0022 118 106 000 Old_age Always - 32
196 Reallocated_Event_Count 0x0032 200 200 000 Old_age Always - 0
197 Current_Pending_Sector 0x0032 200 200 000 Old_age Always - 0
198 Offline_Uncorrectable 0x0030 100 253 000 Old_age Offline - 0
199 UDMA_CRC_Error_Count 0x0032 200 200 000 Old_age Always - 0
200 Multi_Zone_Error_Rate 0x0008 100 253 000 Old_age Offline - 0
Best Answer
It looks like the drive may be failing - the system is resetting the drive / connection trying to get through to it. I'm guessing since you can see the status of both sda and sdb it is a software raid so you should be able to check /proc/mdstat to see what may be going on with the software raid.
The IOWait is because the software raid is being held up waiting to write to sda. This in turn is holding up processes and causing the high load number.
You should replace your sda ASAP.