Linux – Severe write performance problem

debian, linux, lsi, performance, raid

[Sorry, I tried to stay short but it just was not possible]

I'm running Linux 2.6.22.19 with linux-vserver 2.3.0.34 on a Fujitsu Siemens Primergy RX200 S3 for development purposes: an Intel(R) Xeon(R) 5140 @ 2.33GHz with 4GB RAM (of which usually around 500MB is still free). The server has two hot-plug 250GB disks in a mirrored RAID configuration:

dev:~# mpt-status --newstyle
ioc:0 vol_id:0 type:IM raidlevel:RAID-1 num_disks:2 size(GB):231 state: OPTIMAL flags: ENABLED
ioc:0 phys_id:0 scsi_id:1 vendor:ATA      product_id:WDC WD2500JS-55N revision:2E01 size(GB):232 state: ONLINE flags: NONE sync_state: 100 ASC/ASCQ:0xff/0xff SMART ASC/ASCQ:0xff/0xff
ioc:0 phys_id:1 scsi_id:8 vendor:ATA      product_id:WDC WD2500JS-55N revision:2E01 size(GB):232 state: ONLINE flags: NONE sync_state: 100 ASC/ASCQ:0xff/0xff SMART ASC/ASCQ:0xff/0xff
scsi_id:0 100%
scsi_id:1 100%

I'm running LVM and ext3.
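
To give an idea of the layering: the RAID-1 volume the controller exports is used as an LVM physical volume, and the ext3 file systems live on logical volumes on top of it. The usual tools show it (output omitted; the actual volume names don't matter for the problem):

dev:~# pvs    # physical volume on top of the controller's RAID-1 disk
dev:~# vgs    # volume group built from it
dev:~# lvs    # logical volumes carrying the ext3 file systems
dev:~# df -T  # ext3 on the mounted logical volumes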

We've been using this machine since around Jul/Aug 2007, and aside from a broken RAM module which was replaced the same day, there were no problems. Basically, everything was better than the machine we had before.

However, performance problems were first noticed around Aug 2008, and it was not until recently that I gained at least some confidence about where the problem comes from. An average of seven vservers is running nowadays (three MySQL machines, two Tomcats, three Apaches, Hudson, CruiseControl, MediaWiki, Samba, etc.). But to avoid a false impression: we're a small company in terms of developers and other people accessing the server, so there's not that much going on (browsing MediaWiki, Hudson automation runs in the evening, most Apache/PHP apps serve mostly static content).

Once I installed Munin I started to see interesting things, especially during the night. Since a find job was running within every vserver (at the same time, everywhere), the load shot up to unreal numbers like 15, 17 or 20.

But ultimately the problem didn't just show up during the night (I started disabling the find jobs, they weren't used anyway) but also during the day, especially when we recently started on a project where we had to work with a database from a 500MB MySQL dump file.

The first time I imported the dump file during working hours (mysql < dump.sql, within one of our vservers where the MySQL instance was running), the timed output was:

real    71m28.630s
user    0m15.477s
sys     0m10.185s
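
For reference, that timing is just the shell's time wrapped around the plain client import, roughly along these lines (user and database name are placeholders):

$ time mysql -u root -p somedatabase < dump.sql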

Since I wasn't paying attention and was in a meeting, it was my co-workers who asked me what was going on with the server, because it was terribly slow.

I repeated the test during the night, installed a vanilla Debian MySQL on the host (not inside a guest; all guests were shut down), and got the following numbers:

real    48m33.067s
user    0m15.397s
sys     0m13.517s

and I was still thinking, "Yeah, fine, it's a 500MB dump file; loading it into the InnoDB tablespace takes up around 1GB, that's quite some data." While such an import was running I did some tests, like writing a single line to a file with vim and capturing it with strace:

   0.000048 write(1, "\33[?25l\"foo\" \33[78;7H\33[K", 22) = 22
   0.000033 stat64("foo", 0xbfda1f04) = -1 ENOENT (No such file or directory)
   0.000033 open("foo", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
   0.000035 write(4, "thios isthis ia a testa\n", 24) = 24
   0.000144 fsync(4)            = 0
7737.631089 stat64("foo", {st_mode=S_IFREG|0664, st_size=24, ...}) = 0
   0.000084 close(4)            = 0
   0.000117 write(1, "\33[78;7H[New] 1L, 24C written", 28) = 28
   0.000051 lseek(3, 0, SEEK_SET) = 0
   0.000022 write(3, "b0VIM 7.0\0\0\0\0\20\0\0\0\0\0\0\0\0\0\0!\21\0\0mark"..., 4096) = 4096
   0.000052 select(1, [0], NULL, [0], {0, 0}) = 1 (in [0], left {0, 0})
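
The trace was captured by running vim under strace. I don't have the exact invocation at hand anymore, but it was something along these lines; -r produces the relative timestamps you see in the first column, and the trace file name is arbitrary:

dev:~$ strace -r -o vim.trace vim foo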

That was an unbelievable, yet very real, number for me. It seems the stat64 syscall was forced to wait until the dump operation had finished. Not to mention that serving MediaWiki pages during such a dump takes minutes, too.

Anyway, I arranged a test window with our hosting company to run the test on our production server during the night, and I got a completely different picture:

real    7m4.063s
user    0m2.690s
sys     0m30.500s

I was blown away. I also got approval to test on Amazon EC2, and got even lower numbers (around 5 minutes on an m1.large instance where the MySQL data was written to the EBS volume so that it persists).

Additionally, while importing the data, every other operation gets so slow that things just become unusable, and the load quickly rises to 5 or 7 (although not much really seems to be going on; it looks like the processes start to block each other for no reason).

I then started to run bonnie++ tests, which looked like this (I actually still had a test run from last year which I had almost forgotten about). So here is Oct 2008:

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
vserver-host     8G 12332  21 12600   3 10290   2 48519  74 52431   6 235.8   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++ +++++ +++
vserver-host,8G,12332,21,12600,3,10290,2,48519,74,52431,6,235.8,0,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++

and here's the current one from Apr 2009:

Version  1.03       ------Sequential Output------ --Sequential Input- --Random-
                    -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
vserver-host     8G  9705  16  7268   1  4823   1 29620  45 41336   5  73.1   0
                    ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16 11678  15 +++++ +++ +++++ +++ +++++ +++ +++++ +++ 27739  31
vserver-host,8G,9705,16,7268,1,4823,1,29620,45,41336,5,73.1,0,16,11678,15,+++++,+++,+++++,+++,+++++,+++,+++++,+++,27739,31
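
Both runs came from an invocation roughly like the one below; the 8GB file size and the 16 (×1024) files for the create tests are what the output shows, the target directory is a placeholder, and -u is needed when running as root:

dev:~# bonnie++ -d /var/tmp/bonnie -s 8192 -n 16 -u nobody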

I don't know what to tune where to get rid of the problems, as I'm not completely sure yet where/what the real problem is. I guess I've started failing to see the wood for the trees.

Update 1:

Thanks for the feedback so far. I've started with the things I can easily test without scheduling a maintenance window.

First I tested the MySQL dump against /dev/null:

dev01:~$ time mysqldump -u root -h db01 database -p >/dev/null
Enter password:

real    0m18.871s
user    0m12.713s
sys     0m0.512s

The system load was barely noticeable:

 10:27:18 up 77 days,  9:10,  7 users,  load average: 0.16, 0.75, 0.67
[...]
 10:27:33 up 77 days,  9:10,  7 users,  load average: 0.52, 0.79, 0.69

Also the sar output during this test didn't reveal anything special:

12:11:45 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
12:11:46 PM    dev8-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:11:46 PM  dev254-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:11:46 PM  dev254-1      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00

12:11:46 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
12:11:47 PM    dev8-0      5.00      0.00    200.00     40.00      0.18     36.00     20.00     10.00
12:11:47 PM  dev254-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:11:47 PM  dev254-1     25.00      0.00    200.00      8.00      0.74     29.60      4.00     10.00
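
The per-device numbers above (and further below) come from sysstat; I was sampling with something along the lines of the following, i.e. block-device statistics at one-second intervals:

dev:~# sar -d 1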

Then I ran the test again, this time dumping to a file:

dev01:~$ time mysqldump -u root -h db01 database -p >foo.sql

real    1m7.527s
user    0m13.497s
sys     0m2.724s

The time it took doesn't look that unusual to me; the dump ended up as a 570MB file.

However, the load was quite interesting …

 10:30:49 up 77 days,  9:14,  7 users,  load average: 0.76, 0.89, 0.75
 10:30:57 up 77 days,  9:14,  7 users,  load average: 1.34, 1.01, 0.79
 10:31:05 up 77 days,  9:14,  7 users,  load average: 2.13, 1.19, 0.85
 10:31:13 up 77 days,  9:14,  7 users,  load average: 2.68, 1.32, 0.89
 10:31:21 up 77 days,  9:14,  7 users,  load average: 3.79, 1.60, 0.99
 10:31:29 up 77 days,  9:14,  7 users,  load average: 4.05, 1.69, 1.02
 10:31:37 up 77 days,  9:14,  7 users,  load average: 4.82, 1.93, 1.10
 10:31:45 up 77 days,  9:15,  7 users,  load average: 4.54, 1.97, 1.12
 10:31:53 up 77 days,  9:15,  7 users,  load average: 4.89, 2.08, 1.16
 10:31:57 up 77 days,  9:15,  7 users,  load average: 5.30, 2.22, 1.21
 10:32:01 up 77 days,  9:15,  7 users,  load average: 5.12, 2.23, 1.22
 10:32:05 up 77 days,  9:15,  7 users,  load average: 5.03, 2.26, 1.24
 10:32:13 up 77 days,  9:15,  7 users,  load average: 4.62, 2.22, 1.23

… as was the sar output (dev8-0 is /dev/sda, i.e. the RAID volume the controller presents; the dev254-* devices are presumably the LVM/device-mapper volumes):

12:16:47 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
12:16:48 PM    dev8-0    116.00      0.00  21712.00    187.17    134.04    559.31      8.62    100.00
12:16:48 PM  dev254-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:16:48 PM  dev254-1   3369.00      0.00  26952.00      8.00   3271.74    481.27      0.30    100.00

12:16:48 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
12:16:49 PM    dev8-0    130.00      0.00  17544.00    134.95    143.78    752.89      7.69    100.00
12:16:49 PM  dev254-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:16:49 PM  dev254-1   1462.00      0.00  11696.00      8.00   2749.12   1407.78      0.68    100.00

12:16:49 PM       DEV       tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz     await     svctm     %util
12:16:50 PM    dev8-0    128.00      0.00  18400.00    143.75    143.68   1593.91      7.84    100.40
12:16:50 PM  dev254-0      0.00      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:16:50 PM  dev254-1    810.00      0.00   6480.00      8.00   1598.99   4911.94      1.24    100.40

During these tests I quickly fired up vim to do a simple write test. Loading vim itself was sluggish too, but I wasn't able to distill that information properly from the strace output. Only the pending stat64 call was clearly visible again:

0.000050 stat64("bla", 0xbf98caf4) = -1 ENOENT (No such file or directory)
0.000038 open("bla", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 4
0.000053 write(4, "fooo\n", 5) = 5
0.000051 fsync(4)            = 0
5.385200 stat64("bla", {st_mode=S_IFREG|0664, st_size=5, ...}) = 0
0.000073 close(4)            = 0

I'll append further file system tests ASAP.

Update 2 / Solution:

Based on the feedback from Mihai about MPT devices and the RAID configuration, I dug into that topic a bit more. Through lspci I retrieved the RAID controller information:

dev:~$ sudo lspci|grep -i lsi
05:05.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1068 PCI-X Fusion-MPT SAS (rev 01)

Out of the blue I gave http://www.ask.com/web?q=linux+problem+performance+SAS1068 a shot and found a write performance problem with the LSI Logic SAS1068. The bug description further linked to a blog post about a DELL PE860 performance problem, which also uses the SAS1068.

The article mentions the use of LSIUtil, which is available from lsi.com, and also contains an extensive walkthrough of the lsiutil menus describing how to turn on write caching.
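
From memory, the flow in lsiutil is roughly the following; the exact menu numbers and labels depend on the lsiutil version, so take this as a sketch rather than a transcript:

dev:~# lsiutil
# 1. select the SAS1068 port when prompted
# 2. enter the RAID actions menu
# 3. pick the option for changing volume settings and enable the write cache
# 4. exit, then verify with mpt-status and a dd test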

Not surprisingly, it turns out this has a major effect on performance. Before activating the write cache:

dev:~# time (dd if=/dev/zero of=bigfile count=1024 bs=1M; sync)
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 180.902 seconds, 5.9 MB/s

real    3m8.472s
user    0m0.004s
sys     0m3.340s

After enabling the write cache:

dev:~# time (dd if=/dev/zero of=bigfile count=1024 bs=1M; sync)
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 43.7899 seconds, 24.5 MB/s

real    0m46.413s
user    0m0.000s
sys     0m3.124s

This is like day and night, black and white. Sometimes feeling stupid is OK. I probably should have known that the RAID controller ships with write caching disabled by default.
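
As an aside: with a sufficiently recent coreutils, dd can flush the data itself via conv=fdatasync, so the MB/s figure it reports already includes the physical write and the separate sync isn't needed; something like:

dev:~# dd if=/dev/zero of=bigfile bs=1M count=1024 conv=fdatasync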

Update 3:

Rather by accident I found a Munin plugin called diskstat which gives detailed I/O, read/write/wait, etc. graphs for all block devices. There's also a very nice and detailed blog post from the author, titled Graphing Linux Disk I/O statistics with Munin.
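
Enabling it follows the usual Munin plugin routine; the paths and the plugin file name below may differ depending on where you fetch it from:

dev:~# cp diskstat /usr/share/munin/plugins/
dev:~# chmod +x /usr/share/munin/plugins/diskstat
dev:~# ln -s /usr/share/munin/plugins/diskstat /etc/munin/plugins/diskstat
dev:~# /etc/init.d/munin-node restart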

Best Answer

There are some good suggestions here from other posters about ruling out the software side and tweaking your RAID performance. It's worth mentioning that if your workload is write-heavy, then going for hardware with a battery-backed write cache is likely to be the right thing to do if you're considering replacing your kit.