Performance Improvements for Large Filesystems and High IOWAIT on Ubuntu 16.04

ext4, performance-tuning, ubuntu-16.04

I have an Ubuntu 16.04 backup server with 8x 10TB HDDs attached via a SATA 3.0 backplane. The eight hard disks are assembled into a RAID6 array, with an ext4 filesystem on top. This filesystem stores a huge number of small files, a workload with very many seek operations but low I/O throughput. The files come from many different servers and are snapshotted via rsnapshot every day (rsnapshot creates hard links, so multiple directory entries point to the same inodes). Performance has been very poor since the filesystem (60 TB net) exceeded 50% usage. At the moment usage is at 75%, and a

du -sch /backup-root/

takes several days(!). The machine has 8 cores and 16 GB of RAM. The RAM is completely consumed by the OS filesystem cache, and 7 of the 8 cores are always idle because of iowait.

Filesystem details:

Filesystem volume name:   <none>
Last mounted on:          /
Filesystem UUID:          5af205b0-d622-41dd-990e-b4d660c12bd9
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr dir_index filetype needs_recovery extent 64bit flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash 
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              912203776
Block count:              14595257856
Reserved block count:     0
Free blocks:              4916228709
Free inodes:              793935052
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         2048
Inode blocks per group:   128
RAID stride:              128
RAID stripe width:        768
Flex block group size:    16
Filesystem created:       Wed May 31 21:47:22 2017
Last mount time:          Sat Apr 14 18:48:25 2018
Last write time:          Sat Apr 14 18:48:18 2018
Mount count:              9
Maximum mount count:      -1
Last checked:             Wed May 31 21:47:22 2017
Check interval:           0 (<none>)
Lifetime writes:          152 TB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               256
Required extra isize:     28
Desired extra isize:      28
Journal inode:            8
First orphan inode:       513933330
Default directory hash:   half_md4
Directory Hash Seed:      5e822939-cb86-40b2-85bf-bf5844f82922
Journal backup:           inode blocks
Journal features:         journal_incompat_revoke journal_64bit
Journal size:             128M
Journal length:           32768
Journal sequence:         0x00c0b9d5
Journal start:            30179

I'm lacking experience with this kind of filesystem usage. What options do I have to tune it? Which filesystem would perform better in this scenario? Are there any ways to use RAM for caching beyond the OS built-in page cache?

How do you handle very large numbers of small files on large RAID arrays?

Thanks,
Sebastian

Best Answer

I have a similar (albeit smaller) setup, with 12x 2TB disks in a RAID6 array, used for the very same purpose (rsnapshot backup server).

First, it is perfectly normal for du -hs to take that much time on such a large, actively used filesystem. Moreover, du accounts for hard links, which causes considerable and bursty CPU load in addition to the obvious I/O load.

Your slowness is due to the filesystem metadata being located in very distant (in LBA terms) blocks, causing many seeks. As a normal 7.2K RPM disk provides only about 100 random IOPS, you can see how hours, if not days, are needed to load all the metadata.
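As a rough sanity check using the dumpe2fs output above: about 912M total inodes minus about 794M free leaves roughly 118 million inodes in use. Even under the generous assumption that each file needs just one random read to fetch its metadata, a single 100 IOPS disk would need

    118,000,000 reads / 100 IOPS ≈ 1,180,000 s ≈ 13.7 days

Readahead, caching and the parallelism of the array cut that down considerably, but the order of magnitude matches the multi-day du run you are seeing.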

Some things you can try to (non-destructively) ameliorate the situation:

  • make sure mlocate/slocate is not indexing /backup-root/ (you can use the prune facility in /etc/updatedb.conf to avoid that), or metadata cache thrashing will severely impair your backup times (see the updatedb sketch after this list);
  • for the same reason, avoid running du on /backup-root/. If needed, run du only on the specific subfolder of interest;
  • lower vfs_cache_pressure from its default value (100) to a more conservative one (10 or 20). This instructs the kernel to prefer metadata caching over data caching, which should in turn speed up the rsnapshot/rsync discovery phase (see the sysctl sketch below);
  • you can try adding a writethrough metadata caching device, for example via lvmcache or bcache. This caching device should obviously be an SSD (see the lvmcache sketch below);
  • increase your available RAM;
  • as you are using ext4, be aware of inode allocation issues (read here for an example). This is not directly related to performance, but it is an important factor when you have so many files on an ext-based filesystem (see the df -i check below).
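For the first point, a minimal sketch of the mlocate configuration on Ubuntu; the stock entries in PRUNEPATHS vary by release, so append your backup tree to whatever is already on that line:

    # /etc/updatedb.conf (excerpt)
    # keep updatedb/mlocate away from the backup tree
    PRUNEPATHS="/tmp /var/spool /media /backup-root"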
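For the vfs_cache_pressure tweak, a sketch of trying the value at runtime and then persisting it (the file name under /etc/sysctl.d/ is just an example):

    # try it live first
    sudo sysctl -w vm.vfs_cache_pressure=20

    # persist across reboots (file name is arbitrary)
    echo 'vm.vfs_cache_pressure = 20' | sudo tee /etc/sysctl.d/90-backup-tuning.conf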
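For the caching device, a hedged lvmcache sketch. It assumes your array is already an LVM logical volume (the names vg0 and backup are hypothetical) and that /dev/sdX is the SSD; note that lvmcache caches hot blocks in general, not metadata specifically, but on this workload the hot blocks will mostly be metadata. With writethrough mode, a failing cache device loses no data:

    # add the SSD to the volume group (vg0 and /dev/sdX are placeholders)
    sudo pvcreate /dev/sdX
    sudo vgextend vg0 /dev/sdX

    # create a cache pool on the SSD and attach it in writethrough mode
    sudo lvcreate --type cache-pool -L 100G -n cpool vg0 /dev/sdX
    sudo lvconvert --type cache --cachepool vg0/cpool --cachemode writethrough vg0/backup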
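Checking how far inode allocation has progressed is cheap and safe:

    df -i /backup-root/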

Other things you can try, but be aware that these are destructive operations:

  • use XFS, created with both the ftype=1 and finobt=1 mkfs options (see the mkfs.xfs sketch below);
  • use ZFS on Linux (ZoL) with compressed ARC and the primarycache=metadata setting (and, maybe, an L2ARC as a read-only cache); see the ZFS sketch below.
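A sketch of the XFS route; this reformats the device, and /dev/md0 is a placeholder for your RAID6 array (on recent xfsprogs both options are on by default):

    # DESTRUCTIVE: reformats the array; /dev/md0 is a placeholder
    sudo mkfs.xfs -n ftype=1 -m finobt=1 /dev/md0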
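And a hedged ZoL sketch: a raidz2 pool (the ZFS analogue of RAID6) over eight placeholder disks, with lz4 compression (the ARC itself is compressed automatically on ZoL 0.7+) and metadata-only ARC caching. Keep in mind that primarycache=metadata means file data is no longer cached in RAM at all:

    # DESTRUCTIVE: device names are placeholders, adapt to your system
    sudo zpool create -o ashift=12 backup raidz2 /dev/sd[b-i]
    sudo zfs set compression=lz4 backup
    sudo zfs set primarycache=metadata backup

    # optional: add an SSD as L2ARC read cache
    sudo zpool add backup cache /dev/sdX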