Filesystem performance degraded during RAID rebuilding

Tags: filesystems, performance, raid

So quick question – our RAID6 array is currently rebuilding and there is a VERY noticeable filesystem performance hit (home directories are NFS mounted on the array).

I'd sort of expect that, given you're rebuilding the array so there's massive read/write burden on the controller, but it occurred to me I don't really have anything to compare this to.

Are serious freezes (5-10 seconds, pretty frequently) expected behavior during a RAID rebuild coupled with heavy read/write usage? Performance takes a noticeable hit during backups and when users are downloading large (multi-GB) files via FTP.

Any thoughts on this would be appreciated. This is hardware RAID6 (LSI 9266-i8) on a 40TB array mounted over NFS locally (i.e. the server is physically very close to the workstations).

Best Answer

First, here is a great resource that outlines rebuild times: RAID rebuilds and how they work pre and post failure.

Now, as for my thoughts about the rebuild: we do know that rebuilds make for some really sluggish performance, and rightfully so. As you will see from my link above, a post-failure rebuild has to reconstruct the failed disk's contents from the data and parity on the known-good disks and write them to the replacement, all while the server keeps servicing its normal reads and writes. Another thing to keep in mind is that routine operations that normally take no time and very few resources now take more resources than usual and tax an already taxed server.

A pre-failure rebuild is a little better on performance, but not much. Here you can get lucky: a drive (logical or physical) starts to fail and the RAID rebuilds before end users even know anything had a problem. (As an SA you should hopefully have some sort of alerting system, so you shouldn't be surprised by it.)

The 5-10 second freezes you see are definitely normal. They will be more noticeable if the server you are rebuilding is any kind of database server that has heavy reads and writes by default, e.g. a SQL server housing a database that end users access all day long. (A property management company I used to consult for had a program that accessed their tenant records all day long, both for viewing and for writing new information, and it always had heavy usage.)

Another thing I recommend is to get whatever RAID utility (the GUI version) comes with your controller and install it on the operating system so you can monitor the rebuild without having to load into a Controller BIOS.
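
If the utility reports a completion percentage, a rough finish time can be extrapolated from two readings. This is only a sketch: the function name and the sample readings are hypothetical, and it assumes the rebuild progresses at a steady rate, which it usually won't under changing user load.

```python
from datetime import datetime, timedelta

def estimate_rebuild_eta(t1, pct1, t2, pct2):
    """Linearly extrapolate when a rebuild would reach 100%.

    t1, t2     -- datetimes of two progress readings (t2 later than t1)
    pct1, pct2 -- rebuild completion percentages at those times
    Returns a datetime, or None if no progress was made between readings.
    """
    if pct2 <= pct1 or t2 <= t1:
        return None
    elapsed = (t2 - t1).total_seconds()
    # Remaining seconds at the observed average rate.
    remaining = (100.0 - pct2) * elapsed / (pct2 - pct1)
    return t2 + timedelta(seconds=remaining)

if __name__ == "__main__":
    # Hypothetical readings: 20% at noon, 30% two hours later.
    eta = estimate_rebuild_eta(
        datetime(2024, 1, 1, 12, 0), 20.0,
        datetime(2024, 1, 1, 14, 0), 30.0,
    )
    print("Estimated completion:", eta)
```

Taking readings during both quiet and busy periods gives a feel for how much the user load is stretching the rebuild.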

A very small and almost non-existent issue these days is NFS vs. iSCSI. I know you're using NFS, and it used to be that iSCSI would have better overall performance in the case of virtualization. With recent improvements to hypervisors and hard drives, as well as controllers, NFS is now almost identical in performance to iSCSI, so it sounds like you have a very nice SAN.

I'd be happy to answer anything else you need to know, so please feel free to comment.

Related Solutions

Linux – Severe write performance problem

Some good suggestions here from other posters about ruling out the software, and tweaking your RAID performance. It's worth mentioning that if your workload is write-heavy, then going for hardware with a battery-backed write cache is likely to be the right thing to do if you're considering replacing your kit.

Windows file server performance tuning

Disk Subsystem: Here's an article from Microsoft re: partition alignment in SQL Server 2008: http://msdn.microsoft.com/en-us/library/dd758814.aspx

The theory explained in the article is why I'm giving you the link, not 'cuz I think you'll be running SQL Server. The workload of a file server is less apt to be as touchy about partition alignment as SQL Server, but every little bit helps.

NTFS:

You can disable last access time stamping in NTFS with:

fsutil behavior set disablelastaccess 1

You can disable short filename creation (if you have no apps that need it) with:

fsutil behavior set disable8dot3 1

Think about the best NTFS cluster size for the kinds of files you're going to be putting on the box. In general, you want to have as large a cluster size as you can get away with, balancing that against wasted space for sub-cluster-sized files. You also want to try and match your cluster size to your RAID stripe size (and, as was said above, have your stripes aligned to your clusters).

There's a theory that most reads are sequential, so the stripe size (which is typically the minimum read of the RAID controller) should be a multiple of the cluster size. That depends on the specific workload of the server and you'd need to measure it to know for sure. I'd keep them the same.
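
To put numbers on the cluster-size trade-off, here's a rough sketch that totals the wasted ("slack") space for a sample of file sizes at a few candidate cluster sizes, and checks whether a stripe size is a whole multiple of a cluster size. The function names and sample sizes are made up for illustration, and it simplifies by assuming every file occupies whole clusters (NTFS can actually keep very small files inside the MFT record itself).

```python
def allocated_bytes(file_size, cluster_size):
    """Bytes consumed on disk if the file occupies whole clusters."""
    clusters = -(-file_size // cluster_size)  # integer ceiling division
    return clusters * cluster_size

def slack_report(file_sizes, cluster_sizes):
    """Map each cluster size to the total wasted bytes for the sample."""
    data_bytes = sum(file_sizes)
    return {
        cs: sum(allocated_bytes(size, cs) for size in file_sizes) - data_bytes
        for cs in cluster_sizes
    }

def stripe_is_multiple_of_cluster(stripe_size, cluster_size):
    """True if the stripe holds a whole number of clusters."""
    return stripe_size % cluster_size == 0

if __name__ == "__main__":
    sample = [1000, 5000, 70000]  # hypothetical file sizes in bytes
    for cs, wasted in slack_report(sample, [4096, 65536]).items():
        print(f"{cs:>6}-byte clusters waste {wasted} bytes")
    print(stripe_is_multiple_of_cluster(65536, 4096))
```

Feeding it real file sizes from the data you plan to host will tell you more than any rule of thumb.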

If you're going to have a large number of small files you may want to start with a larger reserve for the NTFS master file table (MFT) to prevent future MFT fragmentation. As well as talking about the fsutil command above, this document describes the "MFT zone" setting: http://technet.microsoft.com/en-us/library/cc785435(WS.10).aspx Basically, you want to reserve as much disk space for the MFT as you think you'll need, based on a predicted number of files you'll have on the volume, to try and prevent MFT fragmentation.
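
A back-of-the-envelope way to size that reserve is sketched below. It rests on assumptions you should check against the linked document for your Windows version: roughly 1 KB per MFT file record, and MFT zone settings 1 through 4 reserving 12.5%, 25%, 37.5% and 50% of the volume. The function names and the headroom factor are mine, purely for illustration.

```python
MFT_RECORD_BYTES = 1024  # assumption: about 1 KB per file record
# Assumption: zone setting -> fraction of the volume reserved for the MFT.
ZONE_FRACTIONS = {1: 0.125, 2: 0.25, 3: 0.375, 4: 0.5}

def estimated_mft_bytes(expected_files, headroom=1.5):
    """Rough MFT size: one record per file, padded by a headroom factor
    to allow for directories and files needing more than one record."""
    return int(expected_files * MFT_RECORD_BYTES * headroom)

def smallest_sufficient_zone(volume_bytes, expected_files, headroom=1.5):
    """Smallest zone setting whose reserve covers the estimate, or None."""
    need = estimated_mft_bytes(expected_files, headroom)
    for setting in sorted(ZONE_FRACTIONS):
        if volume_bytes * ZONE_FRACTIONS[setting] >= need:
            return setting
    return None

if __name__ == "__main__":
    one_tib = 2 ** 40
    # Hypothetical volume: 1 TiB expected to hold 200 million small files.
    print(smallest_sufficient_zone(one_tib, 200_000_000))
```

For most volumes the estimate will fit comfortably in the smallest zone; it only starts to matter when the file count is very high relative to the volume size.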

A general guide from Microsoft on NTFS performance optimization is available here: http://technet.microsoft.com/en-us/library/cc767961.aspx It's an old document, but it gives some decent background nonetheless. Don't necessarily try any of the "tech stuff" it says to do, but get concepts out of it.

Layout:

You'll have religious arguments with people re: separating the OS and data. For this particular application, I'd probably pile everything into one partition. Someone will come along and tell you that I'm wrong. You can decide yourself. I see no logical reason to "make work" down the road when the OS partition fills up. Since they're not separate RAID volumes, there's no performance benefit to separating the OS and data into partitions. (It would be a different story if they were different spindles...)

Shadow Copies:

Shadow copy snapshots can be stored in the same volume, or on another volume. I don't have a lot of background on the performance concerns associated with shadow copies, so I'm going to stop there before I say something dumb.
