The following scenario has happened twice, with different RAID controllers: one was an LSI MegaRAID running RAID5, the other an HP Smart Array E200i running RAID1. At first the server runs smoothly for a few years. Then people start complaining about performance. It then turns out that it is not just an "application problem", because simple disk operations (such as ls on a directory with 20-30 files) can take up to 5 seconds. Here is what vmstat reports during a heavy workload:
procs -----------memory------------ ---swap-- -----io---- -system-- ----cpu----
r   b   swpd    free   buff   cache   si   so    bi    bo   in   cs us sy id wa
1   8   8944  126004     20 1597500    0    0  1666  5935  282  833 10  3  0 86
1  16   8944  122276     20 1599636    0    0   612  6300  314  615 10  3  0 87
1  12   8944  123740     20 1599332    0    0   811  5103  188  794  2  2  0 96
0  19   8944  121916     20 1600808    0    0   150  7299  163  858  1  1  0 97
0  16   8944  239244     20 1612256    0    0   647  2522  156  798  0  1  0 99
0   6   8944  215308     20 1643712    0    0  3030  3060  201  956 33  5  0 62
1  13   8944  186352     20 1672540    0    0   143  6173  166  931 14  8  0 78
8   2   8944  137368     20 1710432    0    0   111  6425  171  833 48  4  0 48
1  11   8944  122500     20 1725892    0    0   306  5222  153  746 69  4  0 27
24 13   8944  128444     20 1729680    0    0   380  5210  170 4484 16  6  8 70
0   4   8944  124956     20 1731228    0    0   389  4933  272  761  4  2  0 93
0   6   8944  123004     20 1735780    0    0    15  7856  209  682  1  2  7 90
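To put rough numbers on the sample above, here is a minimal Python sketch that averages the blocked-process column (b), the I/O wait column (wa) and the blocks-written column (bo). The function name summarize_vmstat and the VMSTAT_SAMPLE constant are made up for illustration; the data is copied from the output above. On this sample it gives an average of 10.5 blocked processes and 77.75% I/O wait, which suggests the CPUs spend most of their time waiting on the disk subsystem.

```python
# Sketch: summarize the vmstat rows shown above.
# VMSTAT_SAMPLE and summarize_vmstat are illustrative names, not a standard tool.
VMSTAT_SAMPLE = """\
r b swpd free buff cache si so bi bo in cs us sy id wa
1 8 8944 126004 20 1597500 0 0 1666 5935 282 833 10 3 0 86
1 16 8944 122276 20 1599636 0 0 612 6300 314 615 10 3 0 87
1 12 8944 123740 20 1599332 0 0 811 5103 188 794 2 2 0 96
0 19 8944 121916 20 1600808 0 0 150 7299 163 858 1 1 0 97
0 16 8944 239244 20 1612256 0 0 647 2522 156 798 0 1 0 99
0 6 8944 215308 20 1643712 0 0 3030 3060 201 956 33 5 0 62
1 13 8944 186352 20 1672540 0 0 143 6173 166 931 14 8 0 78
8 2 8944 137368 20 1710432 0 0 111 6425 171 833 48 4 0 48
1 11 8944 122500 20 1725892 0 0 306 5222 153 746 69 4 0 27
24 13 8944 128444 20 1729680 0 0 380 5210 170 4484 16 6 8 70
0 4 8944 124956 20 1731228 0 0 389 4933 272 761 4 2 0 93
0 6 8944 123004 20 1735780 0 0 15 7856 209 682 1 2 7 90
"""


def summarize_vmstat(text):
    """Average the b, wa and bo columns of vmstat output (header row first)."""
    lines = [line.split() for line in text.strip().splitlines()]
    header, rows = lines[0], lines[1:]
    # Map each column name to the list of its integer values.
    cols = {name: [int(row[i]) for row in rows] for i, name in enumerate(header)}
    n = len(rows)
    return {
        "samples": n,
        "avg_blocked": sum(cols["b"]) / n,   # processes in uninterruptible sleep
        "avg_iowait": sum(cols["wa"]) / n,   # percent of CPU time waiting for I/O
        "avg_bo": sum(cols["bo"]) / n,       # blocks written out per second
    }


if __name__ == "__main__":
    for key, value in summarize_vmstat(VMSTAT_SAMPLE).items():
        print(key, value)
```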
So the server is withdrawn from production and tested with bonnie++ while being monitored with vmstat, which gives pretty much the same results. It would therefore seem that the disks are faulty. However, querying the RAID controller shows that both the logical drive and the physical disks are OK. The kernel logs also contain no messages that would suggest a problem with disk operations.
So my question is: how do I debug this problem further? Do I have to replace the controller and the disks one at a time and simply see which replacement improves the situation? Or is there some command whose output I can study to pinpoint the exact location of the problem?
Best Answer
Could it be that the write cache was turned off? Maybe the battery has died and the controller switched from write-back to write-through?
Some cheap hardware RAID controllers that have a cache but no battery enable the cache only for reads by default. Could it be that you had configured it to use the write cache too, and the controller 'lost' the settings?
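One way to check this on the LSI controller is to compare the default and current cache policy that MegaCli prints for each logical drive (for example via MegaCli -LDInfo -Lall -aALL; the binary may be called MegaCli64 on some systems). The Python sketch below assumes the output contains "Default Cache Policy" and "Current Cache Policy" lines in the form shown in the example text; the exact wording can differ between versions, and the function name and sample text are made up for illustration. Battery state can be queried separately with MegaCli -AdpBbuCmd -GetBbuStatus -aALL, and on the HP controller hpacucli ctrl all show status reports cache and battery status.

```python
import re

# Illustrative sample only, assumed to resemble MegaCli -LDInfo output.
LDINFO_EXAMPLE = """\
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAheadNone, Direct, No Write Cache if Bad BBU
"""


def cache_policy_downgraded(ldinfo_text):
    """Return True if any logical drive defaults to WriteBack but currently
    runs in WriteThrough (e.g. because the battery is reported as bad)."""
    default = re.findall(r"Default Cache Policy\s*:\s*(\w+)", ldinfo_text)
    current = re.findall(r"Current Cache Policy\s*:\s*(\w+)", ldinfo_text)
    # Pair the policies in order of appearance, one pair per logical drive.
    return any(d == "WriteBack" and c == "WriteThrough"
               for d, c in zip(default, current))


if __name__ == "__main__":
    print(cache_policy_downgraded(LDINFO_EXAMPLE))
```

If this reports a downgrade, the battery is the first thing to look at, since a controller configured with "No Write Cache if Bad BBU" falls back to write-through on its own.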
Besides that, maybe one of the drives is faulty? Try looking at the RAID logs (the MegaCli command-line tool should help).
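Apart from the event log, the per-drive error counters printed by MegaCli -PDList -aALL are worth a look, because a drive can still be reported as online while accumulating errors. The Python sketch below assumes the output contains "Slot Number", "Media Error Count", "Other Error Count" and "Predictive Failure Count" lines as in the example text; the function name and the sample values are hypothetical and only illustrate the parsing.

```python
import re

# Hypothetical sample, assumed to resemble MegaCli -PDList output.
PDLIST_EXAMPLE = """\
Slot Number: 0
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Slot Number: 1
Media Error Count: 37
Other Error Count: 0
Predictive Failure Count: 0
"""

COUNTER_RE = re.compile(
    r"\s*(Media Error Count|Other Error Count|Predictive Failure Count)\s*:\s*(\d+)"
)
SLOT_RE = re.compile(r"\s*Slot Number\s*:\s*(\d+)")


def drives_with_errors(pdlist_text):
    """Return {slot: {counter_name: value}} for every nonzero error counter."""
    results = {}
    slot = None
    for line in pdlist_text.splitlines():
        slot_match = SLOT_RE.match(line)
        if slot_match:
            slot = int(slot_match.group(1))  # counters that follow belong to this slot
            continue
        counter_match = COUNTER_RE.match(line)
        if counter_match and slot is not None:
            value = int(counter_match.group(2))
            if value > 0:
                results.setdefault(slot, {})[counter_match.group(1)] = value
    return results


if __name__ == "__main__":
    print(drives_with_errors(PDLIST_EXAMPLE))
```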