Server hangs on file copy

raid5 · storage · windows-server-2012-r2

We are having an issue with one of our servers. When we copy larger files (50 MB and up), the copy operation (C:\ to C:\) starts normally but then begins to lag, dropping to about 100 KB/s and making the whole server hang (our application can no longer get results from SQL Server, so it hangs for users).

The Intel RST shows all green on SMART. Here are the system specs:

  • Server: HPE ML10
  • Storage: 3x HP 3TB in RAID5 configuration
  • OS: Windows Server 2012 R2
  • Server Roles: Domain Controller, Application Server (SQL Server and .NET app)
  • Storage Settings: Stripe size: 128 KB, Write-cache buffer flushing: Enabled, Cache mode: Off, Physical and logical sector size: 512 bytes

I'm not a server expert, so I'm not sure if I have those things set up correctly. What could be the problem here?

EDIT: I'm not an expert on these things (I'm a developer), so maybe I'm doing something simple wrong.

EDIT2: http://imgur.com/a/NNgDY Disk write performance is extremely poor, but there are no total hangs like when I copy with Windows Explorer. I guess the hanging Explorer jams the message pump and clogs the system. Could migrating to RAID1/10 fix the issue, in your opinion?

Best Answer

If I interpret "Cache-Mode: Off" correctly, it's completely understandable that write performance sucks. Check whether copying/reading from the RAID (to the network or to NUL) is the problem, or copying/writing to the RAID. If my guess is correct, only writing to the RAID is a pain.
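If it helps, here's a rough Python sketch to compare the two directions. The file path, size, and block size are just placeholder assumptions, so adjust them for your setup; this is only a quick check, not a proper benchmark:

```python
# Minimal sketch: time a large sequential write and read on the array
# separately, to see which direction collapses. Placeholder path/sizes.
import os
import time

TEST_FILE = r"C:\temp\raid_test.bin"    # assumed path on the RAID volume
SIZE_MB = 500
BLOCK = 1024 * 1024                     # 1 MiB per I/O

def timed(label, fn):
    start = time.perf_counter()
    fn()
    seconds = time.perf_counter() - start
    print(f"{label}: {SIZE_MB / seconds:.1f} MB/s")

def write_test():
    data = os.urandom(BLOCK)
    with open(TEST_FILE, "wb") as f:
        for _ in range(SIZE_MB):
            f.write(data)
        f.flush()
        os.fsync(f.fileno())            # force the data down to the disks

def read_test():
    with open(TEST_FILE, "rb") as f:
        while f.read(BLOCK):
            pass

timed("write", write_test)
timed("read", read_test)                # note: may be served from the OS cache
os.remove(TEST_FILE)
```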

RAID5 is distributed - each stripe consists of (in your case) three segments: data1, data2 and parity12. Now, when some data is written to the array it can't just be written to a data segment because the parity wouldn't match any more.

If data1 is written to/changed, the controller needs to either:

  1. read data2, recalculate parity12, write data1, write parity12 (for small arrays)
  2. read old data1, read parity12, back the old data1 out of parity12, recalculate parity12 with the new data1, write data1, write parity12 (for larger arrays)

So whenever there's a change, the controller's operations are amplified by a factor of three! If these cannot be cached, each write results in three disk operations and your application has to wait for them. With a cache, many of the reads and writes can be omitted and the performance hit is much smaller.
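To make that concrete, here's a toy Python model of a single 3-drive stripe with XOR parity. It's purely an illustration of the bookkeeping, not what the controller actually runs:

```python
# Toy model of one 3-drive RAID5 stripe (data1, data2, parity12) to show the
# write amplification. Purely illustrative.
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# current on-disk contents of one stripe
data1  = b"\x11" * 8
data2  = b"\x22" * 8
parity = xor(data1, data2)                  # parity12 = data1 XOR data2

new_data1 = b"\x33" * 8                     # the application changes data1 only

# Option 1 (small array): read the other data segment, recompute parity.
other      = data2                          # 1 disk read
new_parity = xor(new_data1, other)          # recompute parity12
data1, parity = new_data1, new_parity       # 2 disk writes

assert parity == xor(data1, data2)          # the stripe is consistent again
print("one logical write cost 1 read + 2 writes = 3 disk operations")
```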

The only exception to this write amplification is when you write a whole stripe at once: just take data1 and data2 from the buffer, calculate parity12 and write all three segments. That's an amplification factor of just 1.5. However, to be able to combine all incoming data into full stripes you need to queue the data. Guess what, you need cache again.
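The full-stripe case in the same toy model (repeated standalone) shows where the 1.5 factor comes from:

```python
# Full-stripe write: both data segments are queued in cache, parity is
# computed in memory, and no reads are needed at all. Illustration only.
def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

new_data1 = b"\x44" * 8
new_data2 = b"\x55" * 8                     # both segments arrive together
parity12  = xor(new_data1, new_data2)       # computed entirely in the cache

# write data1, data2 and parity12 -> 3 disk writes for 2 segments of payload
print("full-stripe write: 3 writes / 2 data segments = 1.5x amplification")
```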

In a nutshell: if you use RAID5 or RAID6 you absolutely require cache - it's not a luxury. Too little or even no cache at all will kill your performance. If it's a software or hosted RAID with configurable cache, set aside at least 512 MB, better 1 or 2 GB and it'll "fly". RAID5 with three drives will be no performance wonder but it can work fine.

Edit: the HP ML10 G9 has a chipset-integrated Intel RST SATA RAID controller - host RAID. Depending on which exact model and controller is used, cache should be configurable somewhere.