A Hudson server used by developers is having some timeout issues.
Whenever Java is compiling (git is checking the modules, java classes are compiled, tested and copied to Nexus), something goes wrong in the process, and everything becomes unresponsive.
I am able to watch iotop for example during the first 15-30 seconds, it peaks at 3Mb/s at most, and then, all of a sudden, top stops refreshing and I can't do anything else. Idem dito for SNMP, every SNMP requests results in a time out.
I am also unable to log in to ssh, and typing a simple command like "w" in an open terminal takes ages (5+ minutes) to come up with output.
This is a Debian server (pretty default installation, nothing odd running on it).
This is a VM running on a XenServer host. It has 4Gb memory allocated and 8 cores (4 physical cores + HT).
XenCenter is reporting 13% CPU load (which is 100% on one core). Disk usage is low, and memory isn't peaking either (it stays at about 3Gb out of 4Gb used, swap stays empty).
This is 'w' output just after it becomes responsive again:
$ w
12:02:31 up 10 days, 19:45, 4 users, load average: 32.13, 25.42, 15.32
So yes, the average system load is quite high… But even 3200 doesn't seem that high.
It also doesn't seem to happen each time it's building something.
My question: is there a way to sample data and see what's causing everything to hang, and even if Java is monopolising one core, the server has 7 unused cores, why does it fail to respond to any input (as IO-usage seems low too, and it's not running out of memory)?
Edit:
vmstat 1 output:
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
3 0 940 725336 235388 1444280 0 0 0 0 2048 578 14 8 79 0
5 0 940 702704 235392 1444296 0 0 0 200 3088 1691 8 8 81 3
2 0 940 701480 235392 1444296 0 0 0 4 2116 658 13 7 80 0
3 0 940 699480 235392 1444296 0 0 0 68 2688 1108 10 7 82 1
2 0 940 657732 235396 1444292 0 0 0 244 4639 3412 11 5 81 2
0 2 940 648872 235408 1451840 0 0 0 6616 5283 4573 11 3 79 6
0 2 940 645696 235412 1455296 0 0 0 4436 1257 1134 3 1 74 22
1 1 940 635064 235424 1465320 0 0 0 6940 3117 2653 6 1 87 6
0 1 940 632792 235448 1466592 0 0 0 9236 610 405 2 1 81 16
1 1 940 631464 235448 1466660 0 0 0 9524 1580 852 4 3 73 20
1 1 940 633428 235448 1466660 0 0 0 9120 2931 1250 1 6 82 11
3 0 940 620844 235540 1467496 0 0 0 48 2249 1368 16 4 80 0
3 1 940 590080 235544 1467552 0 0 0 1632 2322 918 9 10 76 6
4 0 940 538072 235544 1467552 0 0 0 28 2159 2364 15 7 77 1
5 0 940 489588 235548 1467836 0 0 0 1412 1745 663 16 5 76 2
7 0 940 402368 235548 1467996 0 0 0 0 2573 882 16 7 77 0
3 0 940 371424 235560 1468128 0 0 0 20 2235 744 14 7 79 0
4 0 940 265660 235568 1467840 0 0 0 0 2181 826 13 6 82 0
4 0 940 257360 235604 1468588 0 0 0 0 1718 705 14 6 81 0
3 0 940 193800 235628 1468124 0 0 0 1312 2164 1096 12 5 80 3
** hangs after this **
PS just when it became responsive again.
develop@Yong:/opt/develop/tomcat7-develop/bin$ ps -HAf
UID PID PPID C STIME TTY TIME CMD
root 2 0 0 Jan21 ? 00:00:00 [kthreadd]
root 3 2 0 Jan21 ? 00:00:00 [migration/0]
root 4 2 0 Jan21 ? 00:00:00 [ksoftirqd/0]
root 5 2 0 Jan21 ? 00:00:00 [watchdog/0]
root 6 2 0 Jan21 ? 00:00:00 [migration/1]
root 7 2 0 Jan21 ? 00:00:00 [ksoftirqd/1]
root 8 2 0 Jan21 ? 00:00:00 [watchdog/1]
root 9 2 0 Jan21 ? 00:00:00 [migration/2]
root 10 2 0 Jan21 ? 00:00:00 [ksoftirqd/2]
root 11 2 0 Jan21 ? 00:00:00 [watchdog/2]
root 12 2 0 Jan21 ? 00:00:00 [migration/3]
root 13 2 0 Jan21 ? 00:00:00 [ksoftirqd/3]
root 14 2 0 Jan21 ? 00:00:00 [watchdog/3]
root 15 2 0 Jan21 ? 00:00:01 [migration/4]
root 16 2 0 Jan21 ? 00:00:00 [ksoftirqd/4]
root 17 2 0 Jan21 ? 00:00:00 [watchdog/4]
root 18 2 0 Jan21 ? 00:00:02 [migration/5]
root 19 2 0 Jan21 ? 00:00:00 [ksoftirqd/5]
root 20 2 0 Jan21 ? 00:00:00 [watchdog/5]
root 21 2 0 Jan21 ? 00:00:01 [migration/6]
root 22 2 0 Jan21 ? 00:00:00 [ksoftirqd/6]
root 23 2 0 Jan21 ? 00:00:00 [watchdog/6]
root 24 2 0 Jan21 ? 00:00:00 [migration/7]
root 25 2 0 Jan21 ? 00:00:00 [ksoftirqd/7]
root 26 2 0 Jan21 ? 00:00:00 [watchdog/7]
root 27 2 0 Jan21 ? 00:00:00 [events/0]
root 28 2 0 Jan21 ? 00:01:04 [events/1]
root 29 2 0 Jan21 ? 00:00:00 [events/2]
root 30 2 0 Jan21 ? 00:00:00 [events/3]
root 31 2 0 Jan21 ? 00:00:00 [events/4]
root 32 2 0 Jan21 ? 00:00:00 [events/5]
root 33 2 0 Jan21 ? 00:00:00 [events/6]
root 34 2 0 Jan21 ? 00:00:00 [events/7]
root 35 2 0 Jan21 ? 00:00:00 [cpuset]
root 36 2 0 Jan21 ? 00:00:00 [khelper]
root 37 2 0 Jan21 ? 00:00:00 [netns]
root 38 2 0 Jan21 ? 00:00:00 [async/mgr]
root 39 2 0 Jan21 ? 00:00:00 [pm]
root 40 2 0 Jan21 ? 00:00:00 [xenwatch]
root 41 2 0 Jan21 ? 00:00:00 [xenbus]
root 42 2 0 Jan21 ? 00:00:00 [sync_supers]
root 43 2 0 Jan21 ? 00:00:00 [bdi-default]
root 44 2 0 Jan21 ? 00:00:00 [kintegrityd/0]
root 45 2 0 Jan21 ? 00:00:00 [kintegrityd/1]
root 46 2 0 Jan21 ? 00:00:00 [kintegrityd/2]
root 47 2 0 Jan21 ? 00:00:00 [kintegrityd/3]
root 48 2 0 Jan21 ? 00:00:00 [kintegrityd/4]
root 49 2 0 Jan21 ? 00:00:00 [kintegrityd/5]
root 50 2 0 Jan21 ? 00:00:00 [kintegrityd/6]
root 51 2 0 Jan21 ? 00:00:00 [kintegrityd/7]
root 52 2 0 Jan21 ? 00:00:00 [kblockd/0]
root 53 2 0 Jan21 ? 00:00:00 [kblockd/1]
root 54 2 0 Jan21 ? 00:00:00 [kblockd/2]
root 55 2 0 Jan21 ? 00:00:00 [kblockd/3]
root 56 2 0 Jan21 ? 00:00:00 [kblockd/4]
root 57 2 0 Jan21 ? 00:00:00 [kblockd/5]
root 58 2 0 Jan21 ? 00:00:00 [kblockd/6]
root 59 2 0 Jan21 ? 00:00:00 [kblockd/7]
root 60 2 0 Jan21 ? 00:00:00 [kseriod]
root 69 2 0 Jan21 ? 00:00:00 [kondemand/0]
root 70 2 0 Jan21 ? 00:00:00 [kondemand/1]
root 71 2 0 Jan21 ? 00:00:00 [kondemand/2]
root 72 2 0 Jan21 ? 00:00:00 [kondemand/3]
root 73 2 0 Jan21 ? 00:00:00 [kondemand/4]
root 74 2 0 Jan21 ? 00:00:00 [kondemand/5]
root 75 2 0 Jan21 ? 00:00:00 [kondemand/6]
root 76 2 0 Jan21 ? 00:00:00 [kondemand/7]
root 77 2 0 Jan21 ? 00:00:00 [khungtaskd]
root 78 2 0 Jan21 ? 00:00:00 [kswapd0]
root 79 2 0 Jan21 ? 00:00:00 [ksmd]
root 80 2 0 Jan21 ? 00:00:00 [aio/0]
root 81 2 0 Jan21 ? 00:00:00 [aio/1]
root 82 2 0 Jan21 ? 00:00:00 [aio/2]
root 83 2 0 Jan21 ? 00:00:00 [aio/3]
root 84 2 0 Jan21 ? 00:00:00 [aio/4]
root 85 2 0 Jan21 ? 00:00:00 [aio/5]
root 86 2 0 Jan21 ? 00:00:00 [aio/6]
root 87 2 0 Jan21 ? 00:00:00 [aio/7]
root 88 2 0 Jan21 ? 00:00:00 [crypto/0]
root 89 2 0 Jan21 ? 00:00:00 [crypto/1]
root 90 2 0 Jan21 ? 00:00:00 [crypto/2]
root 91 2 0 Jan21 ? 00:00:00 [crypto/3]
root 92 2 0 Jan21 ? 00:00:00 [crypto/4]
root 93 2 0 Jan21 ? 00:00:00 [crypto/5]
root 94 2 0 Jan21 ? 00:00:00 [crypto/6]
root 95 2 0 Jan21 ? 00:00:00 [crypto/7]
root 98 2 0 Jan21 ? 00:00:00 [khvcd]
root 408 2 0 Jan21 ? 00:00:06 [kjournald]
root 906 2 0 Jan21 ? 00:00:03 [flush-202:16]
root 1111 2 0 Jan21 ? 00:00:00 [kslowd000]
root 1112 2 0 Jan21 ? 00:00:00 [kslowd001]
root 1119 2 0 Jan21 ? 00:00:00 [cifsd]
root 1 0 0 Jan21 ? 00:00:00 init [2]
root 480 1 0 Jan21 ? 00:00:00 udevd --daemon
daemon 1129 1 0 Jan21 ? 00:00:00 /sbin/portmap
statd 1140 1 0 Jan21 ? 00:00:00 /sbin/rpc.statd
root 1236 1 0 Jan21 ? 00:00:01 /usr/sbin/rsyslogd -c3
root 1283 1 0 Jan21 ? 00:00:27 /bin/bash /usr/sbin/xe-daemon
root 22213 1283 0 13:57 ? 00:00:00 sleep 60
root 1358 1 0 Jan21 ? 00:00:00 /usr/sbin/sshd
root 18625 1358 0 13:33 ? 00:00:00 sshd: yeri [priv]
yeri 18628 18625 0 13:33 ? 00:00:00 sshd: yeri@pts/0
yeri 18629 18628 0 13:33 pts/0 00:00:00 -bash
root 18647 18629 0 13:33 pts/0 00:00:00 su -
root 18648 18647 0 13:33 pts/0 00:00:00 -su
root 19357 18648 0 13:45 pts/0 00:00:00 vmstat 1
root 19124 1358 0 13:44 ? 00:00:00 sshd: develop [priv]
develop 19127 19124 0 13:44 ? 00:00:00 sshd: develop@pts/1
develop 19128 19127 0 13:44 pts/1 00:00:00 -bash
develop 22174 19128 0 13:57 pts/1 00:00:00 ps -HAf
root 22163 1358 0 13:53 ? 00:00:00 sshd: andy [priv]
andy 22166 22163 0 13:54 ? 00:00:00 sshd: andy@pts/2
andy 22167 22166 0 13:54 pts/2 00:00:00 -bash
andy 22216 22167 0 13:57 pts/2 00:00:00 -bash
andy 22217 22216 0 13:57 pts/2 00:00:00 [sed]
andy 22218 22216 0 13:57 pts/2 00:00:00 -bash
root 1677 1 0 Jan21 ? 00:00:00 nginx: master process /usr/sbi
www-data 1678 1677 0 Jan21 ? 00:00:00 nginx: worker process
snmp 1686 1 0 Jan21 ? 00:02:15 /usr/sbin/snmpd -Lsd -Lf /dev/
ntp 1700 1 0 Jan21 ? 00:00:00 /usr/sbin/ntpd -p /var/run/ntp
daemon 1718 1 0 Jan21 ? 00:00:00 /usr/sbin/atd
root 1739 1 0 Jan21 ? 00:00:02 /usr/sbin/cron
root 22159 1739 0 13:50 ? 00:00:00 /USR/SBIN/CRON
root 22161 22159 0 13:50 ? 00:00:00 /usr/bin/perl -w /usr/bin/
root 22214 22161 0 13:57 ? 00:00:00 [rateup]
daemon 1763 1 0 Jan21 ? 00:00:03 /usr/sbin/uptimed
root 1765 1 0 Jan21 tty2 00:00:00 /sbin/getty 38400 tty2
root 1766 1 0 Jan21 tty3 00:00:00 /sbin/getty 38400 tty3
root 1767 1 0 Jan21 tty4 00:00:00 /sbin/getty 38400 tty4
root 1768 1 0 Jan21 tty5 00:00:00 /sbin/getty 38400 tty5
root 1769 1 0 Jan21 tty6 00:00:00 /sbin/getty 38400 tty6
104 7240 1 0 Jan24 ? 00:00:00 /usr/bin/dbus-daemon --system
avahi 7269 1 0 Jan24 ? 00:00:00 avahi-daemon: running [Yong.lo
avahi 7270 7269 0 Jan24 ? 00:00:00 avahi-daemon: chroot helper
101 423 1 0 Jan31 ? 00:00:00 /usr/sbin/exim4 -bd -q30m
root 6799 1 0 Jan31 hvc0 00:00:00 /sbin/getty 38400 hvc0
root 1915 1 0 06:25 ? 00:00:01 /usr/bin/atop -a -w /var/log/a
root 19181 1 0 13:45 pts/0 00:00:00 /bin/sh /usr/bin/mysqld_safe
mysql 19218 19181 0 13:45 pts/0 00:00:03 /usr/sbin/mysqld --basedir=/
root 19219 19181 0 13:45 pts/0 00:00:00 logger -p daemon.err -t mysq
develop 19304 1 99 13:45 pts/1 66-17:02:10 /opt/develop/java/jdk/bin/j
develop 19457 19304 99 13:46 pts/1 373-12:31:17 /opt/develop/java/jdk1.6
develop 22219 19457 0 13:57 pts/1 00:00:00 [java]
develop 21193 19304 99 13:47 pts/1 26-16:29:39 /opt/develop/java/jdk1.6.
develop 22145 19304 0 13:48 pts/1 00:00:00 git fetch -t ssh://git@git.m
develop 22148 22145 0 13:48 pts/1 00:00:00 ssh git@git.miiatech.com g
develop 22215 22145 0 13:57 pts/1 00:00:00 [git]
Best Answer
the output of "w" doesn't help here very much. Please post the output of "vmstat 1" during the first 15-30 seconds, it gives a much deeper insight. How many files are in your java project? Is everything available locally? Or is hudson checking the stuff out from elsewhere or copying something from the network?