Adding 60TB Storage to an SLES 10 Server

dell-powervault, multipath, sles, xfs

I have to add some archive/staging storage to an SLES 10 server. The requirement is to present fairly large volumes (roughly 9-20 TB each, about 60 TB in total) that will be used to store archive data (literally: this is for a library) consisting of large image files (mostly 150 MB TIFFs) and large tarballs. The I/O will be overwhelmingly biased towards reads, certainly >95% and probably over 99%.

The storage has already been purchased: a Dell MD3000 SAS array daisy-chained with two MD1000s, fully populated with 2 TB 7,200 RPM SATA drives, 45 drives in total. The stack of arrays is connected via two dual-ported external SAS adapters, i.e. there are four paths to the stack.

My intention is to configure these as a set of 4 volumes sitting on 4 RAID groups with one hot spare per array. All groups will be RAID 6 with 7 or 14 drives, and each RAID group will be presented as a single LUN using all of the capacity in that group. On the SLES side these need to be formatted as XFS volumes.

I have limited experience with SLES (and Linux in general) and I'm looking for some recommendations about this, specifically:

  1. Are there any specific things to watch out for when configuring XFS volumes of this size under SLES 10? Are the default settings going to be OK given the I/O profile?
  2. What's the best way to initialize/partition/format these? For my initial test I used parted to set a disk label and the YaST Partition Manager (accepting all defaults) to create and format the XFS volume; a rough sketch of the equivalent commands follows this list.
  3. How do I set up multipathing? When I present an initial test volume it appears as four separate devices (/dev/sdl, /dev/sdm, /dev/sdn and /dev/sdo). What do I do to work with this as a single volume?
  4. In my initial testing I'm seeing transfer rates of around 30 MB/s when copying from an existing EMC Clariion SAN volume. This is a lot lower than I'd expect; even accounting for the RAID 6 write penalty I'd have expected something in the ballpark of 70-100 MB/s.
  5. How can I tell if everything is OK, and where should I look for errors/warnings etc.? The YaST Partition editor takes a very long time to launch, for example, and I'd like to understand why.
  6. Would you partition this differently and/or use a different filesystem, and if so why?
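
For reference, this is roughly what the initial test in point 2 amounted to on the command line. The device name is just one of the four paths the test LUN shows up as, the mount point is made up, and the exact parted syntax may differ slightly with the version shipped in SLES 10:

    parted -s /dev/sdl mklabel gpt              # GPT label, needed for a >2TB LUN
    parted -s /dev/sdl mkpart primary 0% 100%   # one partition spanning the LUN
    mkfs.xfs /dev/sdl1                          # all defaults, as YaST does
    mount /dev/sdl1 /mnt/archive-test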

The server is a Dell 2950. I haven't checked the detailed specs, but top shows utilization hovering in the low single digits at most.

Best Answer

At my previous job we had a similar problem. We were doing production work for planetariums, and each frame was 64 megapixels, so lots of large images. These would be processed for each theater in a very aggressive read workload across a cluster of computers.

The server in that case had a similar storage setup: multiple external direct-attached RAID arrays. Each of these was a RAID 6 volume exposed to the host and added to a VG (Volume Group) under LVM (Logical Volume Manager). Each show/production would then get its own LV (Logical Volume), formatted XFS, which we would grow along with the project as required.
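
As a rough sketch of that layout (the multipath device names, VG/LV names and sizes below are only placeholders, not what we actually used):

    # each RAID 6 LUN becomes an LVM physical volume
    pvcreate /dev/mapper/mpath0 /dev/mapper/mpath1 /dev/mapper/mpath2 /dev/mapper/mpath3

    # pool them all into one volume group
    vgcreate vg_archive /dev/mapper/mpath0 /dev/mapper/mpath1 \
                        /dev/mapper/mpath2 /dev/mapper/mpath3

    # one LV per show/production, sized for what it needs right now
    lvcreate -L 2048G -n lv_show01 vg_archive
    mkfs.xfs /dev/vg_archive/lv_show01
    mount /dev/vg_archive/lv_show01 /srv/show01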

If your datasets are fairly static, or grow in a predictable way like this, then this approach should work well for you. Be aware that it does have a downside, though: you end up having to micro-manage the LVs on your storage. Some admins prefer it that way, while others would rather avoid it. But it allows you to grow each LV and XFS filesystem as the dataset grows, keeping your XFS volumes as small as possible so that you don't get stuck with a filesystem check (xfs_repair) that takes years to complete, and it acts as damage control should a filesystem go south.
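
Growing one of those datasets is then a quick two-step job, something like the following (names match the placeholder sketch above; XFS can be grown while mounted, but note that it cannot be shrunk):

    lvextend -L +500G /dev/vg_archive/lv_show01   # add space to the LV
    xfs_growfs /srv/show01                        # grow XFS into it, online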

Disclaimer: If I were setting this up today I would use OpenSolaris and ZFS, mainly because it avoids the micro-management problem and is a superior filesystem/volume manager. So you may want to have a look at that as well.
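
Purely for comparison (the pool, dataset and disk names here are invented), the ZFS equivalent collapses the volume-management and filesystem layers into a couple of commands:

    # one double-parity raidz2 vdev per shelf of disks, all in a single pool
    zpool create archive raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0 c1t6d0
    zpool add    archive raidz2 c2t0d0 c2t1d0 c2t2d0 c2t3d0 c2t4d0 c2t5d0 c2t6d0

    # datasets share the pool's free space -- no per-LV sizing to manage
    zfs create archive/show01
    zfs create archive/show02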