I have a system that is a few years old that is running Fedora 7 (I know, it's old too).
Now I have PS3 Media Server running (and some other things, like Vmware and Plone) and I noticed that I was getting unexpected interruptions in the streamed movies.
Normally this system reads and writes at speeds in the 25-30 MiB/sec. I found out (see below) that at some moments everything stops for 15-30 seconds.
I looked over the other questions and answers here but I was unable to find any hints what could be wrong (software wise).
What I did here was open a movie and seek to a later point. My PS3 uses the media server to get thumbnails for every minute to allow me to seek to the desired scene.
What I see below is that this simple operation grinds the entire system to a halt. The whole system halts waiting for IO. Also the number of context switches drops dramatically to about 650. I checked to see if the system was swapping or otherwise reporting any errors or warnings in the logfiles. I have not been able to spot anything that seems related.
I used iotop to get more info and I just disccovered that all the running applications that were doing a handfull of KiB/sec of IO were suddenly 99% IOWAIT instead of the 0.01% that they were doing normally.
At this moment my best guess is failing hardware (harddisk).
My question to you guys: How do I find out what the real problem is?
System info:
Intel Pentium D CPU 2.80GHz
Linux storage 2.6.22.1-27.fc7 #1 SMP Tue Jul 17 17:13:26 EDT 2007 i686 i686 i386 GNU/Linux
Sample of the output from dstat -f on my system.
-------cpu0-usage--------------cpu1-usage------ --dsk/sda-----dsk/sdb-----dsk/sdc-----dsk/sdd-- ---paging-- ---system-->
usr sys idl wai hiq siq:usr sys idl wai hiq siq| read writ: read writ: read writ: read writ| in out | int csw >
3 12 84 1 0 0: 2 14 83 1 0 0|4096B 68k: 32k 0 : 0 0 : 0 0 | 0 0 | 125 4903 >
0 6 94 0 0 0: 1 8 91 0 0 0| 0 8192B: 16k 0 : 0 0 : 0 0 | 0 0 | 109 3734 >
0 2 96 2 0 0: 0 1 96 0 1 2|4096B 28k:1040k 0 : 0 0 : 0 0 | 0 0 | 998 3969 >
0 2 98 0 0 0: 1 0 97 2 0 0| 0 0 : 32k 0 : 0 0 : 0 0 | 0 0 | 115 3805 >
1 1 98 0 0 0: 0 4 96 0 0 0| 0 0 : 32k 0 : 0 0 : 0 0 | 0 0 | 121 3850 >
1 3 96 0 0 0: 2 4 80 0 2 12| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 |8110 5076 >
1 3 40 55 0 1: 1 2 32 57 1 7| 0 716k:2656k 0 : 0 0 : 0 0 | 0 0 |4213 4562 >
0 0 0 100 0 0: 0 0 0 100 0 0| 0 56k: 0 0 : 0 0 : 0 0 | 0 0 | 78 833 >
0 0 85 15 0 0: 0 0 0 100 0 0| 0 16k: 0 0 : 0 0 : 0 0 | 0 0 | 74 623 >
0 0 100 0 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 65 572 >
0 0 14 86 0 0: 0 0 0 100 0 0| 0 56k: 0 0 : 0 0 : 0 0 | 0 0 | 82 717 >
1 0 0 99 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 69 583 >
0 0 0 100 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 69 576 >
0 0 0 100 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 73 586 >
1 0 0 99 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 70 587 >
0 0 0 100 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 70 593 >
0 0 0 100 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 70 578 >
0 0 0 100 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 64 586 >
0 0 0 100 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 66 586 >
1 0 11 88 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 63 575 >
0 1 99 0 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 64 568 >
0 0 79 21 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 65 571 >
0 0 0 100 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 64 574 >
0 0 0 100 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 65 581 >
1 0 0 99 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 65 579 >
0 0 0 100 0 0: 0 1 0 99 0 0| 0 228k: 0 0 : 0 0 : 0 0 | 0 0 | 115 1230 >
0 0 0 100 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 66 579 >
0 0 0 100 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 65 574 >
0 0 0 100 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 65 583 >
1 0 0 99 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 64 581 >
0 0 0 100 0 0: 0 0 0 100 0 0| 0 16k: 0 0 : 0 0 : 0 0 | 0 0 | 67 595 >
0 0 0 100 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 64 583 >
0 0 0 100 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 65 584 >
0 0 0 100 0 0: 0 0 0 100 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 65 584 >
10 17 8 66 0 0: 0 0 0 100 0 0| 0 4096B: 0 0 : 0 0 : 0 0 | 0 0 | 67 595 >
1 3 17 76 1 2: 1 3 19 72 1 4| 0 3336k:3488k 0 : 0 0 : 0 0 | 0 0 |3386 4124 >
2 3 86 9 0 0: 1 4 72 7 1 15| 0 0 : 11M 0 : 0 0 : 0 0 | 0 0 |8925 5683 >
0 3 94 3 0 0: 1 3 80 4 2 10| 0 0 :6624k 0 : 0 0 : 0 0 | 0 0 |5744 4896 >
1 4 92 3 0 0: 0 7 81 6 1 5|8192B 412k: 448k 0 : 0 0 : 0 0 | 0 0 |2927 4535 >
1 1 97 1 0 0: 1 3 96 0 0 0|8192B 72k: 32k 0 : 0 0 : 0 0 | 0 0 | 131 3931 >
1 2 97 0 0 0: 0 2 98 0 0 0| 0 24k: 0 0 : 0 0 : 0 0 | 0 0 | 112 3889 >
1 16 83 0 0 0: 2 21 74 2 0 1| 12k 32k: 32k 0 : 0 0 : 0 0 | 0 0 | 130 4828 >
1 2 97 0 0 0: 0 2 98 0 0 0| 0 0 : 48k 0 : 0 0 : 0 0 | 0 0 | 127 3838 >
0 1 99 0 0 0: 1 2 97 0 0 0|4096B 44k: 0 0 : 0 0 : 0 0 | 0 0 | 113 3925 >
0 3 95 2 0 0: 0 1 99 0 0 0|8192B 44k: 16k 0 : 0 0 : 0 0 | 0 0 | 136 3973 >
1 1 96 1 0 1: 0 1 99 0 0 0|4096B 40k: 32k 0 : 0 0 : 0 0 | 0 0 | 124 3926 >
0 2 98 0 0 0: 1 1 98 0 0 0| 0 0 : 0 0 : 0 0 : 0 0 | 0 0 | 107 3792 >
Additional info:
iotop normally shows this:
Total DISK READ: 3.82 K/s | Total DISK WRITE: 26.72 K/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
13516 be/2 root 3.82 K/s 11.45 K/s 0.00 % 1.44 % vmware-vmx -C /personal/VMWare
13514 be/2 root 0.00 B/s 7.63 K/s 0.00 % 0.01 % vmware-vmx -C /personal/VMWare
13517 be/2 root 0.00 B/s 7.63 K/s 0.00 % 0.01 % vmware-vmx -C /personal/VMWare
13519 be/4 root 0.00 B/s 26.72 K/s 0.00 % 0.00 % vmware-vmx -C /personal/VMWare
1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init [5]
2 be/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kthreadd]
3 rt/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/0]
4 be/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/0]
5 rt/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/0]
6 rt/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/1]
7 be/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/1]
8 rt/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/1]
9 be/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [events/0]
When blocked it shows this:
Total DISK READ: 0.00 B/s | Total DISK WRITE: 0.00 B/s
TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND
3013 be/4 nobody 0.00 B/s 0.00 B/s 0.00 % 99.99 % httpd.vmware -DSSL -DSSL_ONLY
9828 be/4 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % vmware-vmx -C /personal/VMWare/
16464 be/4 niels 0.00 B/s 0.00 B/s 0.00 % 99.99 % java -Xmx768M -Djava.encoding=UT~e/niels/pms-linux-1.10.5/pms.jar
14013 be/4 niels 0.00 B/s 0.00 B/s 0.00 % 99.99 % java -Xmx768M -Djava.encoding=UT~e/niels/pms-linux-1.10.5/pms.jar
13516 be/2 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % vmware-vmx -C /personal/VMWare/
13555 be/2 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % vmware-vmx -C /personal/VMWare/
1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init [5]
2 be/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kthreadd]
3 rt/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/0]
4 be/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/0]
5 rt/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/0]
6 rt/3 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/1]
So there is no actual IO to wait for and still everything is waiting for their IO.
Best Answer
Have you tried iotop? It lists processes trying to do io.
Also try smartctl -a /dev/sd* to see if there are any S.M.A.R.T. errors.