SQL Server disk design on an iSCSI SAN

Tags: dell-equallogic, iscsi, lun, sql-server, storage-area-network

It's standard practice to separate log and data files onto separate disks, away from the OS
(tempdb, backups and the swap file too).
Does this logic still make sense when your drives are all SAN-based and your LUNs are not carved from specific disks or RAID sets? They are just part of the x number of drives on the SAN, and the LUN is just a space allocation.

Best Answer

Log and data files have different access patterns that are in conflict with each other (at least in theory) when they share a drive.

Log Writes

Log access consists of a very large number of small sequential writes that must be guaranteed to complete, so they are written straight out to disk. Somewhat simplistically, DB logs are ring buffers containing a list of instructions to write data items out to particular locations on the disk.

Ideally, logs should be on a quiet (i.e. not shared with anything else) RAID-1 or RAID-10 volume. Logically, you can view the process as the main DBMS writing out log entries and one or more log reader threads that consume the logs and write the changes out to the data disks (in practice, the process is optimised so that the data writes are written out immediately where possible). If there is other traffic on the log disks, the heads are moved around by these other accesses and the sequential log writes become random log writes. These are much slower, so busy log disks can create a hotspot which acts as a bottleneck on the whole system.
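If you want to see whether your log writes are already suffering on a shared volume, the per-file latency figures from the standard I/O DMVs are a reasonable first check. This is only a sketch (SQL Server 2005 onwards); the thresholds are yours to decide, but persistently high average write latency on LOG files is the classic symptom of log writes competing with other traffic.

    -- Rough per-file I/O latency, cumulative since instance start
    -- (sample twice and diff for a live picture).
    SELECT  DB_NAME(vfs.database_id)                 AS database_name,
            mf.physical_name,
            mf.type_desc,                            -- ROWS vs LOG
            vfs.num_of_writes,
            vfs.io_stall_write_ms,
            CASE WHEN vfs.num_of_writes = 0 THEN 0
                 ELSE vfs.io_stall_write_ms / vfs.num_of_writes
            END                                      AS avg_write_latency_ms
    FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
    JOIN    sys.master_files                         AS mf
            ON  mf.database_id = vfs.database_id
            AND mf.file_id     = vfs.file_id
    ORDER BY avg_write_latency_ms DESC;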

Data Writes

(Updated) Log writes must be committed to disk (referred to as stable media) for a transaction to be valid and eligible to commit. One can logically view this as log entries being written and then used as instructions to write data pages out to the disk by an asynchronous process. In practice the disk page writes are actually prepared and buffered at the time the log entry is made, but they do not need to be written immediately for the transaction to be committed. The buffered pages are written out to stable media (disk) by the Lazy Writer process (thanks to Paul Randal for pointing this out); this TechNet article discusses it in a bit more detail.

This is a heavily random access pattern, so sharing the same physical disks with the logs can create an artificial bottleneck on system performance. The log entries must be written for the transaction to commit, and random seeks slowing down this process (random I/O is much slower than sequential log I/O) turn the log from a sequential into a random access device. This creates a serious performance bottleneck on a busy system and should be avoided. The same applies when sharing temporary areas with log volumes.
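You can watch this asynchronous flushing happen from the buffer manager counters. A minimal sketch, assuming the standard performance-counter DMV is available; despite the '/sec' names the values are cumulative since instance start, so sample twice and take the difference.

    -- Data pages are flushed by checkpoint and the lazy writer, not at
    -- commit time; these counters show that background write activity.
    SELECT  RTRIM(counter_name) AS counter_name,
            cntr_value
    FROM    sys.dm_os_performance_counters
    WHERE   object_name LIKE '%Buffer Manager%'
      AND   counter_name IN (N'Lazy writes/sec',
                             N'Checkpoint pages/sec',
                             N'Page writes/sec');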

The role of caching

SAN controllers tend to have large RAM caches, which can absorb the random access traffic to a certain extent. However, for transactional integrity it is desirable to have disk writes from a DBMS guaranteed to complete. When a controller is set to use write-back caching, the dirty blocks are cached and the I/O call is reported as complete to the host.

This can smooth out a lot of contention problems, as the cache can absorb a lot of I/O that would otherwise go out to the physical disks. It can also optimise the parity reads and writes for RAID-5, which lessens the performance penalty of RAID-5 volumes.

These are the characteristics that drive the 'let the SAN deal with it' school of thought, although this view has some limitations:

  • Write-back caching still has failure modes that can lose data, and the controller has fibbed to the DBMS, telling it blocks have been written out to disk when in fact they haven't. For this reason, you may not want to use write-back caching for a transactional application, particularly one holding mission-critical or financial data where data integrity problems could have serious consequences for the business.

  • SQL Server (in particular) uses I/O in a mode where a flag (called FUA, or Force Unit Access) forces physical writes to the disk before the call returns. Microsoft has a certification program and many SAN vendors produce hardware that honours these semantics (requirements summarised here). In this case no amount of cache will optimise disk writes, which means that log traffic will thrash if it is sitting on a busy shared volume.

  • If the application generates a lot of disk traffic, its working set may overrun the cache, which will also cause write contention issues.

  • If the SAN is shared with other applications (particularly on the same disk volume), traffic from other applications can generate log bottlenecks.

  • Some applications (e.g. data warehouses) generate large transient load spikes that make them quite anti-social on SANs.

Even on a large SAN, separate log volumes are still recommended practice. You may get away with not worrying about layout on a lightly used application. On really large applications, you may even get a benefit from multiple SAN controllers. Oracle publish a series of data warehouse layout case studies where some of the larger configurations involve multiple controllers.
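If you do get a dedicated log LUN out of the SAN team, moving an existing database's log onto it is straightforward. A hypothetical sketch follows: the database name, logical file name and the L: drive letter are all placeholders, and the .ldf file has to be moved at the OS level while the database is offline.

    -- Point the catalog at the new location (placeholder names throughout).
    ALTER DATABASE MyDB SET OFFLINE WITH ROLLBACK IMMEDIATE;

    ALTER DATABASE MyDB
        MODIFY FILE (NAME = MyDB_log,
                     FILENAME = N'L:\SQLLogs\MyDB_log.ldf');

    -- Copy the physical .ldf file to L:\SQLLogs\ at the OS level, then:
    ALTER DATABASE MyDB SET ONLINE;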

Put responsibility for performance where it belongs

On anything with large volumes, or where performance could be an issue, make the SAN team accountable for the performance of the application. If they are going to ignore your recommendations for configuration, then make sure that management is aware of this and that responsibility for system performance lies in the appropriate place. In particular, establish acceptable guidelines for key DB performance statistics like I/O waits or page latch waits, or acceptable application I/O SLAs.
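As a sketch of what such a baseline might cover, the aggregate wait statistics give you numbers both teams can be held to; the wait types below are the standard I/O-related ones, but the acceptable values are whatever you negotiate.

    -- Cumulative I/O-related waits since instance start; diff across a
    -- representative period before drawing conclusions.
    SELECT  wait_type,
            waiting_tasks_count,
            wait_time_ms,
            signal_wait_time_ms
    FROM    sys.dm_os_wait_stats
    WHERE   wait_type IN (N'WRITELOG',
                          N'PAGEIOLATCH_SH',
                          N'PAGEIOLATCH_EX',
                          N'IO_COMPLETION')
    ORDER BY wait_time_ms DESC;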

Note that having responsibility for performance split across multiple teams creates an incentive to finger-point and pass the buck to the other team. This is a known management anti-pattern and a formula for issues that drag out for months or years without ever being resolved. Ideally, there should be a single architect with authority to specify application, database and SAN configuration changes.

Also, benchmark the system under load. If you can arrange it, secondhand servers and direct-attach arrays can be purchased quite cheaply on eBay. If you set up a box like this with one or two disk arrays, you can frig with the physical disk configuration and measure the effect on performance.
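However you benchmark it, run the same workload against each configuration and snapshot the file-level I/O statistics before and after, so the comparison is apples to apples. A rough sketch of the delta technique (the temporary table name is arbitrary):

    -- Snapshot, run the identical workload, then diff.
    SELECT  *
    INTO    #io_before
    FROM    sys.dm_io_virtual_file_stats(NULL, NULL);

    -- ...run the benchmark workload against this configuration...

    SELECT  a.database_id,
            a.file_id,
            a.num_of_writes     - b.num_of_writes     AS writes_during_test,
            a.io_stall_write_ms - b.io_stall_write_ms AS write_stall_ms
    FROM    sys.dm_io_virtual_file_stats(NULL, NULL)  AS a
    JOIN    #io_before                                AS b
            ON  b.database_id = a.database_id
            AND b.file_id     = a.file_id;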

As an example, I have done a comparison between an application running on a large SAN (an IBM Shark) and a two-socket box with a direct-attach U320 array. In this case, £3,000 worth of hardware purchased off eBay outperformed a £1M high-end SAN by a factor of two - on a host with roughly equivalent CPU and memory configuration.

From this particular incident, it might be argued that having something like this lying around is a very good way to keep SAN administrators honest.