HP DL380p Gen8 (p420i controller) I/O oddity on XFS partitions

Tags: centos6, hp, hp-smart-array, rhel6, xfs

On DL380p Gen8 servers using XFS on top of LVM on top of RAID 1+0 with 6 disks, an identical workload results in a ten-fold increase in disk writes on RHEL 6 compared to RHEL 5, making applications unusable.

Note that I'm not looking to optimize the CentOS 6 system as much as possible, but to understand why it behaves so wildly differently, and to fix that.

vmstat/iostat

We have a MySQL replication setup using MySQL 5.5. MySQL slaves on Gen8 servers running RHEL 6 perform badly; inspection with vmstat and iostat shows that these servers do ten times the page-out activity and ten times the amount of writes to the disk subsystem. blktrace shows that these writes are not initiated by MySQL, but by the kernel.
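For reference, the writes were attributed to the kernel rather than mysqld roughly like this (a minimal sketch; the device name is an example, substitute your own):

# capture block-layer events live and look at the issuing process names
blktrace -d /dev/sda -o - | blkparse -i -
# the process column shows kernel writeback/flush threads rather than mysqld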

CentOS 5:

[dkaarsemaker@co5 ~]$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 3  0     12 252668 102684 10816864    0    0     8   124    0    0  9  1 90  0  0
 1  0     12 251580 102692 10817116    0    0    48  2495 3619 5268  6  1 93  0  0
 3  0     12 252168 102692 10817848    0    0    32  2103 4323 5956  6  1 94  0  0
 3  0     12 252260 102700 10818672    0    0   128  5212 5365 8142 10  1 89  0  0

[dkaarsemaker@co5 ~]$ iostat 1
Linux 2.6.18-308.el5 (bc290bprdb-01.lhr4.prod.booking.com)  02/28/2013

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           8.74    0.00    0.81    0.25    0.00   90.21

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
cciss/c0d0      277.76       399.60      5952.53 2890574849 43058478233
cciss/c0d0p1      0.01         0.25         0.01    1802147      61862
cciss/c0d0p2      0.00         0.01         0.00     101334      32552
cciss/c0d0p3    277.75       399.34      5952.52 2888669185 43058383819
dm-0             32.50        15.00       256.41  108511602 1854809120
dm-1            270.24       322.97      5693.34 2336270565 41183532042

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           7.49    0.00    0.79    0.08    0.00   91.64

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
cciss/c0d0      300.00        32.00      4026.00         32       4026
cciss/c0d0p1      0.00         0.00         0.00          0          0
cciss/c0d0p2      0.00         0.00         0.00          0          0
cciss/c0d0p3    300.00        32.00      4026.00         32       4026
dm-0              0.00         0.00         0.00          0          0
dm-1            300.00        32.00      4026.00         32       4026

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           4.25    0.00    0.46    0.21    0.00   95.09

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
cciss/c0d0      507.00       160.00     10370.00        160      10370
cciss/c0d0p1      0.00         0.00         0.00          0          0
cciss/c0d0p2      0.00         0.00         0.00          0          0
cciss/c0d0p3    507.00       160.00     10370.00        160      10370
dm-0              0.00         0.00         0.00          0          0
dm-1            507.00       160.00     10370.00        160      10370

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           5.33    0.00    0.50    0.08    0.00   94.09

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
cciss/c0d0      318.00        64.00      4559.00         64       4559
cciss/c0d0p1      0.00         0.00         0.00          0          0
cciss/c0d0p2      0.00         0.00         0.00          0          0
cciss/c0d0p3    319.00        64.00      4561.00         64       4561
dm-0              0.00         0.00         0.00          0          0
dm-1            319.00        64.00      4561.00         64       4561

And on CentOS 6, a ten-fold increase in page-out activity and disk writes:

[root@co6 ~]# vmstat 1
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 361044  52340 81965728    0    0    19  1804   36  110  1  1 98  0  0  
 0  0      0 358996  52340 81965808    0    0   272 57584 1211 3619  0  0 99  0  0  
 2  0      0 356176  52348 81966800    0    0   240 34128 2121 14017  1  0 98  0  0 
 0  1      0 351844  52364 81968848    0    0  1616 29128 3648 3985  1  1 97  1  0  
 0  0      0 353000  52364 81969296    0    0   480 44872 1441 3480  1  0 99  0  0  

[root@co6 ~]# iostat 1
Linux 2.6.32-279.22.1.el6.x86_64 (bc291bprdb-01.lhr4.prod.booking.com)  02/28/2013  _x86_64_    (32 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           1.08    0.00    0.67    0.27    0.00   97.98

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             373.48      1203.02    115203.05   11343270 1086250748
dm-0             63.63        74.92       493.63     706418    4654464
dm-1            356.48      1126.72    114709.47   10623848 1081596740

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.25    0.00    0.19    0.06    0.00   99.50

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             330.00        80.00     77976.00         80      77976
dm-0              0.00         0.00         0.00          0          0
dm-1            328.00        64.00     77456.00         64      77456

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.38    0.00    0.19    0.63    0.00   98.81

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             570.00      1664.00    128120.00       1664     128120
dm-0              0.00         0.00         0.00          0          0
dm-1            570.00      1664.00    128120.00       1664     128120

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
           0.66    0.00    0.47    0.03    0.00   98.84

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda             317.00       448.00     73048.00        448      73048
dm-0             34.00         0.00       272.00          0        272
dm-1            309.00       448.00     72776.00        448      72776

Narrowing down

Gen8 servers running RHEL 5, and Gen7 servers running RHEL 5 or 6, do not show
this problem. Furthermore, RHEL 6 with ext3 as the filesystem instead of our
default XFS does not show the problem. The problem really seems to sit somewhere between XFS, Gen8 hardware and CentOS 6; RHEL 6 shows it as well.

Edit 29/04: we added QLogic HBAs to the Gen8 machine. Using XFS on fibre channel storage does not show the problem, so it's definitely somewhere in the interaction between xfs/hpsa/p420i.

XFS

The newer XFS in RHEL 6 seems to be able to detect the underlying stripe width, but
only on p420i controllers using the hpsa driver, not on p410i controllers using
cciss.

xfs_info output:

[root@co6 ~]# xfs_info /mysql/bp/
meta-data=/dev/mapper/sysvm-mysqlVol isize=256    agcount=16, agsize=4915136 blks
         =                       sectsz=512   attr=2
data     =                       bsize=4096   blocks=78642176, imaxpct=25
         =                       sunit=64     swidth=192 blks
naming   =version 2              bsize=4096   ascii-ci=0
log      =internal               bsize=4096   blocks=38400, version=2
         =                       sectsz=512   sunit=64 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

sunit/swidth are both 0 in all the setups marked as OK above. On the problem setup, sunit=64 filesystem blocks (256 KiB) matches the controller's 256K strip size, and swidth=192 blocks (768 KiB) corresponds to the three mirrored pairs of the 6-disk RAID 1+0. We seem to be unable to change this, either at mkfs time or with the noalign mount option (an attempt is sketched below), and we don't know whether it is the cause.
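For completeness, this is the kind of thing we tried (a sketch; the values are illustrative and the device path is the LVM volume from the xfs_info output above):

# try to create the filesystem without stripe alignment
mkfs.xfs -f -d sunit=0,swidth=0 /dev/mapper/sysvm-mysqlVol
# or mount the existing filesystem with alignment disabled
mount -o noalign /dev/mapper/sysvm-mysqlVol /mysql/bp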

Hugepages

Other people having XFS problems on RHEL 6 say that disabling hugepages, and
especially transparent hugepages, can be beneficial. We disabled both; the
problem did not go away.
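For reference, transparent hugepages were switched off along these lines (a sketch of the usual procedure; on RHEL/CentOS 6 the path may be /sys/kernel/mm/redhat_transparent_hugepage/ instead):

# disable transparent hugepages at runtime
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag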

We have already tried and observed many things; none of the following helped:

  • Using numactl to influence memory allocations. We noticed that Gen7 and Gen8 have a different NUMA layout; no effect was seen.
  • Newer kernels (as new as 3.6) did not seem to solve this. Neither did using Fedora 17.
  • iostat does not report a ten-fold increase in write transactions, only in number of bytes written
  • Using different I/O schedulers has no effect.
  • Mounting the relevant filesystem with noatime/nodiratime/nobarrier did not help
  • Changing /proc/sys/vm/dirty_ratio had no effect (the scheduler and dirty_ratio changes are sketched after this list)
  • This happens both on systems based on 2640 and 2670 CPUs
  • hpsa-3.2.0 doesn't fix the problem
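For reference, the scheduler and dirty_ratio changes were along these lines (a sketch; the device name and values are illustrative):

# switch the I/O scheduler at runtime (cfq is the RHEL 6 default)
echo deadline > /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler
# lower the writeback threshold (value is illustrative)
sysctl -w vm.dirty_ratio=10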

Best Answer

XFS and EL6 have fallen into an ugly state... I've abandoned XFS on EL6 systems for the time being due to several upstream features/changes slipping into the Red Hat kernel...

This one was a surprise and caused some panic: Why are my XFS filesystems suddenly consuming more space and full of sparse files?

Since November 2012, the XFS version shipping in kernels newer than 2.6.32-279.11.1.el6 has had an annoying load and performance issue stemming from Red Hat Bugzilla 860787. With it, I've seen unpredictable performance and higher run queues than average.

For new systems, I'm using ZFS or just ext4. For older systems, I'm freezing them at 2.6.32-279.11.1.el6.

Try rolling back to that version with:

yum install kernel-2.6.32-279.11.1.el6.x86_64
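Once that kernel is installed and set as the default in /boot/grub/grub.conf, you can keep yum from pulling newer kernels back in (a sketch; the versionlock plugin is an alternative):

# prevent yum from installing newer kernels on this box
echo "exclude=kernel*" >> /etc/yum.conf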

In addition to the above, due to the type of RAID controller you're using, the typical optimizations are in order:

Mount your XFS filesystems noatime. You should also leverage the Tuned framework with:

tuned-adm profile enterprise-storage

to set readahead, nobarrier and I/O elevator to a good baseline.
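A quick way to confirm the profile took effect (a sketch; the device name is an example):

# show the active tuned profile and the settings it influences
tuned-adm active
blockdev --getra /dev/sda            # readahead
cat /sys/block/sda/queue/scheduler   # I/O elevator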


Edit:

There are plenty of recommendations surrounding XFS filesystem optimization. I've used the filesystem exclusively for the past decade and have had to occasionally adjust parameters as underlying changes to the operating system occurred. I have not experienced a dramatic performance decrease such as yours, but I also do not use LVM.

I think it's unreasonable to expect EL5 to act the same way as EL6, given the different kernel generation, compiled-in defaults, schedulers, packages, etc.

What would I do at this point?

  • I would examine the mkfs.xfs parameters and how you're building the systems. Are you using XFS partitioning during installation or creating the partitions after the fact? I do the XFS filesystem creation following the main OS installation because I have more flexibility in the given parameters.

  • My mkfs.xfs creation parameters are simple: mkfs.xfs -f -d agcount=32 -l size=128m,version=2 /dev/sdb1 for instance.

  • My mount options are: noatime,logbufs=8,logbsize=256k,nobarrier. I would allow the XFS dynamic preallocation to run natively and not constrain it as you have here; my performance improved with it. (An example fstab line using these options is sketched after this list.)

  • I don't use LVM, especially on top of hardware RAID, and especially on HP Smart Array controllers, where there are some LVM-like functions native to the device. With LVM, you also don't have access to fdisk for raw partition creation. One thing that changed from EL5 to EL6 is the partition alignment in the installer and changes to fdisk to align the starting sector on a megabyte boundary rather than a cylinder boundary.

  • Make sure you're running your HP Smart Array controllers and drives at the current revision level. At that point, it makes sense to update the entire server to the current HP Service Pack for ProLiant firmware revision. This is a bootable DVD that will upgrade all detected components in the system.

  • I'd check RAID controller settings. Pastebin the output of hpacucli ctrl all show config detail. Here's mine. You want a cache ratio biased towards writes versus reads. 75:25 is the norm. The default strip size of 256K should be fine for this application.

  • I'd potentially try this without LVM.

  • What are your sysctl.conf parameters?
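To make the mount-options bullet above concrete, an /etc/fstab line using those options might look like this (a sketch only; the device and mount point are taken from the question's xfs_info output):

# XFS mount with the options suggested above
/dev/mapper/sysvm-mysqlVol  /mysql/bp  xfs  noatime,logbufs=8,logbsize=256k,nobarrier  0 0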