Without making enemies on the SAN team, how can I reassure myself and the application developers that our SQL servers aren't suffering from poorly configured storage? Just use perfmon stats? Other benchmarks like sqlio?
In short, there probably isn't a way to be truly sure. What I would say (I am a SAN admin) is that if your applications are performing up to your expectations, don't worry about it. If you start to see performance issues that you believe could be related to SAN/disk IO performance, then it might be wise to inquire. I do not use much HP storage like you do, but in the IBM/NetApp world I can say from experience that there aren't many options which would allow you to configure it "poorly". Most enterprise storage these days takes a lot of the guesswork out of building RAID arrays, and doesn't really let you do it wrong. Unless they are mixing drive speeds and capacities within the same RAID groups, you can rest assured in most cases that your disk is performing fine.
If I load test on these SAN drives, does that really give me a reliable, repeatable measure of what I will see when we go live? (assuming that the SAN software might "dynamically configure" differently at different points in time.)
Load testing should be plenty reliable. Just keep in mind that when you load test one box on a shared SAN/disk array, its performance can (and will) be affected by other systems using the same storage.
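One way to see how much the shared array moves your numbers is to repeat the same load test at different times of day and look at the spread. The sketch below is only an illustration: the IOPS figures are made up, and `summarize_runs` is a hypothetical helper, not part of sqlio or perfmon.

```python
from statistics import mean, stdev

def summarize_runs(iops_per_run):
    """Return mean, sample standard deviation, and coefficient of variation."""
    avg = mean(iops_per_run)
    sd = stdev(iops_per_run)  # sample standard deviation; needs at least 2 runs
    return {"mean": avg, "stdev": sd, "cv": sd / avg}

# Hypothetical IOPS results from the same test repeated at different times of day.
runs = [5200, 5100, 3900, 5150, 4100]

summary = summarize_runs(runs)
print(f"mean={summary['mean']:.0f} IOPS, cv={summary['cv']:.1%}")
```

A small coefficient of variation suggests the test is repeatable; a large one suggests other hosts on the array are affecting your results, which is worth knowing before go-live.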
Does heavy IO in one part of the SAN (say the Exchange server) impact my SQL servers? (assuming they aren't giving dedicated disks to each server, which I've been told they are not)
It can. It is not all about the disks, or which disks the servers are on. All of the data is served up via a disk controller and then a SAN switch. The performance you will see depends greatly on how the disk controller is connected to its corresponding disk shelves and to the corresponding SAN. If the entire array connects to the backbone SAN on a single strand of 4 Gbps fiber, then clearly performance will be impacted. If the array is connected across two redundant SANs which are load balanced, using trunked links, then it would be impossible for Exchange alone to suck up too much bandwidth. Another thing to consider is how many IO/sec the array is capable of. As long as the array and the SAN it is connected to are scaled correctly, heavy IO in other parts of the SAN environment should not impact your SQL performance.
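To put rough numbers on the single-link case, here is a back-of-the-envelope sketch. It assumes a 4 Gbps Fibre Channel link carries about 400 MB/s per direction after encoding overhead, and it only looks at bandwidth; the controller and the disks behind it will have their own, often lower, limits. The function name and the I/O sizes are illustrative.

```python
def link_iops_ceiling(usable_mb_per_s, io_size_kb):
    """Upper bound on IOPS a link can carry at a given I/O size (bandwidth only)."""
    bytes_per_s = usable_mb_per_s * 1_000_000
    return bytes_per_s / (io_size_kb * 1024)

# Assumption: roughly 400 MB/s usable per direction on a 4 Gbps FC link.
USABLE_MB_PER_S = 400

for io_size_kb in (8, 64, 256):
    ceiling = link_iops_ceiling(USABLE_MB_PER_S, io_size_kb)
    print(f"{io_size_kb:>3} KB I/O: about {ceiling:,.0f} IOPS before the link saturates")
```

Every host behind that link shares the ceiling, which is why a large sequential workload elsewhere (a backup, for example) can squeeze your SQL servers even if the disks themselves are idle.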
Would requesting separate logical drives for different functions (data vs log vs tempdb) help here? Would the SAN see the different IO activity on these and optimally configure them differently?
That is probably a matter of preference, and also greatly depends on how your storage admins configure it. They could give you three LUNs in the same array or volume, in which case it's all the same anyway. If they gave you individual LUNs on different arrays, in different volumes (physically different disks), then it might be worth it for you to separate them.
We're in a bit of a space crunch right now. Application teams being told to trim data archives, etc. Would space concerns cause the SAN team to make different decisions on how they configure internal storage (RAID levels, etc) that could impact my server's performance?
I don't imagine your storage admin would change the RAID level in order to free up space. If he would, then he should probably be fired. Space concerns can lead to things being configured differently, but not normally in a performance-impacting way. They might just become a little tighter about how much space they give you. They might enable features such as data de-duplication (if the array supports it), which can hinder the performance of the array while the process runs, but not around the clock.
Backups serve two functions.
- First and foremost, they're there to allow you to recover your data if it becomes unavailable. In this sense, snapshots are not backups. If you lose data on the filer (volume deletion, storage corruption, firmware error, etc.), all snapshots for that data are gone as well.
- Secondly, and far more commonly, backups are used to correct for routine things like accidental deletions. In this use case, snapshots are backups. They're arguably one of the best ways to provide this kind of recovery, because they make the earlier versions of the data available directly to the users or their OS as a .snapshot hidden directory that they can directly read their file from.
No retention policy
That said, while we have snapshots and use them extensively, we still do nightly incrementals with NetBackup to tape or Data Domain. The reason is that snapshots cannot reliably uphold a retention policy. If you tell users that they will be able to restore at a daily granularity for a week and then at a weekly granularity for a month, you can't keep that promise with snapshots.
On a NetApp volume with snapshots, deleted data contained in a snapshot occupies "snap reserve" space. If the volume isn't full and you've configured it this way, you can also push past that snapshot reserve and have snapshots that occupy some of the unused data space. If the volume fills up, though, all the snapshots but the ones supported by data in the reserved space will get deleted. Deletion of snapshots is determined only by available snapshot space, and if it needs to delete snapshots that are required for your retention policy, it will.
Consider this situation:
- A full volume with regular snapshots and a 2 week retention requirement.
- Assume half of the reserve in use for snapshots based on the normal rate of change.
- Someone deletes a lot of data (more than the snapshot reserve), drastically increasing the rate of change, temporarily.
At this point, your snapshot reserve is completely used, along with as much of the free data space as you've allowed ONTAP to use for snapshots, but you haven't lost any snapshots yet. As soon as someone fills the volume back up with data, though, you'll lose all the snapshots contained in the data section, which will push your recovery point back to the time just after the large deletion.
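The scenario above can be walked through with a toy model. This is a deliberate simplification and not NetApp's actual autodelete logic: it assumes snapshots are dropped oldest first, that the blocks freed by the large deletion are held by the last snapshot taken before it, and all sizes and names (`write_data`, `day10`, the 80/20 GB split) are made up for illustration.

```python
def write_data(active_gb, snapshots, new_gb, data_capacity_gb, reserve_gb):
    """Grow active data, then drop oldest snapshots until everything fits.

    snapshots: list of (label, size_gb), oldest first.
    Returns (active_gb, surviving_snapshots, deleted_labels).
    """
    active_gb = min(active_gb + new_gb, data_capacity_gb)
    free_gb = data_capacity_gb - active_gb
    surviving = list(snapshots)
    deleted = []
    # Snapshots may use the reserve plus whatever data space is still free.
    while surviving and sum(size for _, size in surviving) > reserve_gb + free_gb:
        label, _ = surviving.pop(0)  # assumed policy: oldest snapshot goes first
        deleted.append(label)
    return active_gb, surviving, deleted

DATA_CAPACITY_GB = 80   # hypothetical data area
RESERVE_GB = 20         # hypothetical snapshot reserve

# Ten daily snapshots at 1 GB each: half of the reserve, the normal rate of change.
snapshots = [(f"day{d}", 1) for d in range(1, 10)]
# After day10, someone deletes 30 GB; those blocks now live on in day10.
snapshots.append(("day10", 1 + 30))
snapshots += [("day11", 1), ("day12", 1)]
active_gb = 80 - 30     # the volume was full before the deletion

# Someone fills the volume back up with 30 GB of new data.
active_gb, surviving, deleted = write_data(
    active_gb, snapshots, 30, DATA_CAPACITY_GB, RESERVE_GB
)
print("deleted:", deleted)
print("oldest recovery point:", surviving[0][0] if surviving else "none")
```

In this model nothing is lost while the 30 GB of free data space absorbs the oversized snapshot. Once the new data arrives, every snapshot up to and including the one holding the deleted blocks has to go, so the oldest surviving recovery point is from after the deletion, regardless of the two-week retention requirement.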
Summary
NetApp snapshots don't cover you against real data loss. An errantly deleted volume or data loss on the filer will require you to rebuild data.
They are an elegant way to handle routine restores, but they aren't reliable enough to replace a real backup solution. Most of the time, they'll make routine restores simple and painless, but when they're not available, you are exposed.
Best Answer
Disk fragmentation in VMs is just a fact of life for the reason you've described. In most workloads, it's not something that needs to be worried about. If you have workloads that do massive sequential IO, this might be a cause of poor performance, but not on random read/write workloads.
There's a reason VMs are sometimes referred to as I/O blenders :)