Ubuntu – Is slow I/O on the cloud server causing the slow server restarts?


I'm running two servers on Rackspace cloud: one for the web app and one for the db and redis instances. The web server has 1 GB of RAM and a single core. Nginx sits in front of unicorn, which is running 2 workers. I also have a sidekiq instance running. This configuration runs great and the servers generally hum along at very low CPU, as the app hasn't been launched yet.

However, when I do a unicorn restart, let alone a full app deploy, all hell breaks loose. It looks something like this:
[Graph of the restart; caption: "This isn't a great graph"]

Basically my server gets wiped out for 3 minutes. It's somewhat responsive at times, but monitoring triggers downtime alerts all over the place (and this is just a zero-downtime restart).

If I do a full deploy, the graph spans about 8 minutes, even though I precompile assets locally and upload them, so nothing is compiled on the server.

The interesting part for me is that I have an exact duplicate of this server setup running on DigitalOcean. I can completely restart that entire server (shutdown -r) and be up and serving pages in 50 seconds. With the Rackspace server I daren't restart it even as a test, because it would mean very significant downtime for a production server.

I'm not a Linux server admin, so I'm wondering if people can tell me if this is par for the course with Rackspace cloud servers. I've had a decade of experience running a few dedicated Windows boxes and never had any issues like this.

hdparm results for both servers, Rackspace first (/dev/xvdc), then DigitalOcean (/dev/vda):


$ sudo hdparm -Tt /dev/xvdc
Timing cached reads:   5066 MB in  1.99 seconds = 2541.54 MB/sec
Timing buffered disk reads: 238 MB in  3.00 seconds =  79.32 MB/sec 


$ sudo hdparm -Tt /dev/vda 
Timing cached reads:   15612 MB in  1.99 seconds = 7828.02 MB/sec
Timing buffered disk reads: 1416 MB in  3.00 seconds = 471.89 MB/sec

Obviously the DO server is outstripping the RS server by a significant margin. Interestingly enough, the DO server is actually staging two apps, so it is doing more work than the RS one. Both hdparm runs were done with about the same server load (i.e. very little). Is this purely slow disk speed, or is something else going on here?
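To put a number on "significant margin", the hdparm figures above can be divided directly. This is only a rough sketch: the ratio helper is a hypothetical name, not part of hdparm, and a single hdparm run on each host is a coarse measurement at best.

```shell
#!/bin/sh
# ratio A B: print A divided by B, rounded to one decimal place.
# "ratio" is a hypothetical helper written for this comparison.
ratio() {
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

# Figures copied from the hdparm output above (MB/sec), DO divided by RS.
ratio 471.89 79.32      # buffered disk reads -> 5.9
ratio 7828.02 2541.54   # cached reads        -> 3.1
```

So on these runs the DO box reads from disk roughly six times faster, and from cache roughly three times faster.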

top for both servers

Rackspace


 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND                                                                                       
 9832 xxxxxxxx  20   0  525m 214m 4372 S  0.0 21.6   1:31.61 ruby                                                                                           
 9829 xxxxxxxx  20   0  443m 205m 3312 S  0.0 20.6   1:27.67 ruby                                                                                           
15597 xxxxxxxx  20   0  554m 176m 1268 S  0.0 17.8   4:59.36 ruby                                                                                           
 9780 xxxxxxxx  20   0  443m  63m 1088 S  0.0  6.4   0:28.80 ruby                                                                                           
  787 root      20   0  193m  17m 2608 S  2.0  1.7 350:43.06 driveclient                                                                                    
 1556 xxxxxxxx  20   0 77876  11m 1020 S  0.0  1.1  18:54.78 remote_syslog                                                                                  
17415 root      20   0 73096 3364 2608 S  0.0  0.3   0:00.03 sshd

DigitalOcean

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND                                                                
20921 xxxxxxxx  20   0  240m 191m 5328 S   0.0 19.1   0:29.62 ruby                                                                   
21009 xxxxxxxx  20   0  204m 178m 5356 S   0.0 17.8   0:20.82 ruby                                                                   
21194 xxxxxxxx  20   0  204m 174m 1724 S   0.0 17.4   0:00.10 ruby                                                                   
21206 xxxxxxxx  20   0  204m 174m 1656 S   0.0 17.4   0:00.10 ruby                                                                   
21181 xxxxxxxx  20   0 98.3m  89m 2184 S   0.3  8.9   0:03.04 ruby                                                                   
 1426 xxxxxxxx  20   0  117m  40m 2272 S   0.0  4.1   1:09.02 ruby                                                                   
 1429 xxxxxxxx  20   0  117m  29m 2180 S   0.0  3.0   1:09.64 ruby                                                                   
 1422 xxxxxxxx  20   0  117m 4652 1172 S   0.0  0.5   0:08.08 ruby                                                                   
22066 xxxxxxxx  20   0  7188 3456 1512 S   0.0  0.3   0:00.09 bash                                                                   
22008 root      20   0 10008 3320 2664 S   0.0  0.3   0:00.03 sshd

Should I be ditching Rackspace?

Deploy graph (excluding the file upload and decompression of precompiled assets)

[Graph of the deploy; caption: "This is not the graph of a happy deploy"]

Edit: vmstat

$ vmstat -S M 1 10
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 1  0    380     67     13    109    4    4    13    10   10   17  1  1 97  0
 0  0    380     67     13    109    0    0     0     0  650 1011  0  1 99  0
 0  0    380     67     13    109    0    0     0     0  675 1008  0  1 99  0
 0  0    380     67     13    109    0    0     0     0  659 1009  0  0 100  0
 1  0    380     67     13    109    0    0     0    68  661 1027  0  0 99  1
 0  0    380     67     13    109    0    0     0     0  667 1014  0  0 100  0
 1  0    380     67     13    109    0    0     0     0  671 1016  1  0 99  0
 0  0    380     67     13    109    0    0     0     0  668 1008  0  0 99  0
 0  0    380     67     13    109    0    0     0     0  671 1022  0  0 100  0
 0  0    380     67     13    109    0    0     0     0  783 1112  9  3 89  0     

Best Answer

I work at Rackspace and we would like to help you resolve this issue. If you could give us a call at 1-800-961-4454, we can check the health of the host that your server is on and move the server to a new host if this does appear to be a noisy neighbor issue. I would also be interested in seeing the output of 'vmstat -S M 1 10', 'sar -b' (after some time has passed) and perhaps 'iostat -x /dev/xvdc 2 6' while the issue is occurring.
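Since those commands need to run while the problem is happening, it may help to have them ready in one place. The following is a minimal sketch, not anything Rackspace supplies: capture_io_stats and the ./io-stats directory are hypothetical names, and it assumes vmstat, iostat and sar (the last two usually come from the sysstat package) are installed. A tool that is missing is noted in its output file rather than treated as an error.

```shell
#!/bin/sh
# Hypothetical helper: run the three diagnostics requested above and save
# each tool's output to its own file.
capture_io_stats() {
    dev="${1:-/dev/xvdc}"     # device named in the answer
    out="${2:-./io-stats}"    # hypothetical output directory
    mkdir -p "$out" || return 1

    # run NAME CMD [ARGS...]: write CMD's output to $out/NAME.txt,
    # or a short note if CMD is not installed.
    run() {
        name="$1"; shift
        if command -v "$1" >/dev/null 2>&1; then
            "$@" >"$out/$name.txt" 2>&1 || true
        else
            echo "$1 not installed, skipped" >"$out/$name.txt"
        fi
    }

    # Sample memory/swap and per-device I/O at the same time.
    run vmstat vmstat -S M 1 10 &
    run iostat iostat -x "$dev" 2 6 &
    wait

    # sar -b reports I/O history already collected by sysstat, if any.
    run sar sar -b
    return 0
}
```

Calling capture_io_stats from a second terminal just as the unicorn restart begins should leave vmstat.txt, iostat.txt and sar.txt in ./io-stats after roughly ten seconds of sampling.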

