How to safely use storage thin provisioning

storage thin-provisioning

I have storage that allows me to thin provision my volumes presented to the clients. Is this safe? What are the best practices?

Best Answer

Generically, whether you're talking about SCSI LUNs (SAN) or network file systems (NAS), thin provisioned storage is when you tell the storage client that it has more space than you've actually allocated to it. This has no risks on its own, but if you don't have enough actual storage to allow every single container to grow to the full promised size, that's called overprovisioning and it entails risk.
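To make the distinction concrete, here is a minimal sketch of the arithmetic. The function name and the volume sizes are made up for illustration; the only idea taken from the text above is comparing the total promised size against the storage you actually have.

```python
def overprovision_ratio(promised_tb, physical_tb):
    """Return total promised capacity divided by physical capacity.

    A result above 1.0 means the pool is overprovisioned: not every
    volume can grow to its full promised size at the same time.
    """
    if physical_tb <= 0:
        raise ValueError("physical capacity must be positive")
    return sum(promised_tb) / physical_tb


# Hypothetical pool: 10 TB of real disk backing four thin volumes.
volumes_tb = [4.0, 4.0, 6.0, 6.0]
ratio = overprovision_ratio(volumes_tb, 10.0)

print(f"promised {ratio:.0%} of physical capacity")
print("overprovisioned" if ratio > 1.0 else "thin, but fully backed")
```

With these numbers the pool has promised twice what it physically holds, so it carries the risk described below. The same four volumes on 20 TB of disk would still be thin provisioned, but not overprovisioned.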

Advantages

The advantages of overprovisioning and thin provisioning are compelling. Many consumers of storage (servers, file share users, etc.) request far more storage than they initially need, and keep requesting more as they grow so that they always have a safe margin. A centrally provisioned safe margin for growth is far more efficient than hundreds of small ones. Without thin provisioning and overprovisioning, utilization of the underlying storage can be very low; with them, you can run at a much higher rate of utilization.

Risks

All the risks of this scenario are linked with overprovisioning. The more you overprovision, the higher your risk. The danger is that the clients' actual usage grows until it completely fills the physical storage, which will generally cause all the storage containers to fail in one way or another: filesystems will go read-only or offline, and LUNs will go offline.

Best practice

In order to get the benefits of higher utilization that come with overprovisioning while mitigating the risk, you need to constantly monitor the storage and be able to take action when required.

  • Use software to monitor and alert on pool utilization. If the storage box doesn't ship with anything that will do this, write it yourself. Most storage supports CLI commands whose output can be parsed by a script that you schedule to run frequently. The frequency should be high enough that none of your pools is capable of filling up between polling events.
  • Establish a baseline threshold. All new pools of storage with overprovisioned clients should get this applied by default. This threshold should be the most conservative one in your environment.
  • For smaller pools, use a lower threshold. If your alert fires with 30% of a 100TB pool still free, you have a lot more time to add disk than if it fires with 30% of a 10TB pool free, assuming they are both capable of ingesting writes at the same speed.
  • Adjust the threshold up if you're less overprovisioned. If you have a pool that's only 106% overprovisioned, hitting 70% utilization isn't nearly as risky as it is for a pool that's at 200% overprovisioning.
  • Adjust your thresholds based on how much time you need to add space to a pool. In my shop, we keep online storage in each box held back for growth in any pool, and more storage on a shelf ready to be installed into any storage box. We do this for enough types of storage that we can handle growth in any pool.
  • Wherever possible and applicable, thin out your storage. Deduplication decreases your utilization, and if you are using LUNs, zero page reclaim helps, as do clients that are able to tell the storage to deallocate (unmap) blocks when they delete data.
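The threshold and polling advice above can be sketched roughly as follows. This is an illustration under stated assumptions, not a real monitoring tool: the `Pool` fields, the function names, and every number are hypothetical, and in practice the values would come from parsing your storage's CLI output. The 0.70 default simply mirrors the 70% utilization figure used above.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    capacity_tb: float             # physical capacity of the pool
    used_tb: float                 # space physically written so far
    max_ingest_tb_per_hour: float  # assumed worst-case write rate
    threshold: float = 0.70        # baseline alert threshold (utilization)


def utilization(pool):
    """Fraction of physical capacity currently in use."""
    return pool.used_tb / pool.capacity_tb


def hours_until_full(pool):
    """Worst-case time until the pool fills at its maximum ingest rate."""
    if pool.max_ingest_tb_per_hour <= 0:
        raise ValueError("ingest rate must be positive")
    free_tb = max(pool.capacity_tb - pool.used_tb, 0.0)
    return free_tb / pool.max_ingest_tb_per_hour


def needs_attention(pool, response_hours, poll_hours):
    """Alert on the utilization threshold, or when the pool could fill
    before you can add space (response time plus one polling interval)."""
    over_threshold = utilization(pool) >= pool.threshold
    too_little_time = hours_until_full(pool) <= response_hours + poll_hours
    return over_threshold or too_little_time


# Two hypothetical pools at the same 60% utilization and ingest rate.
big = Pool("big", capacity_tb=100.0, used_tb=60.0, max_ingest_tb_per_hour=0.5)
small = Pool("small", capacity_tb=10.0, used_tb=6.0, max_ingest_tb_per_hour=0.5)

for pool in (big, small):
    flag = needs_attention(pool, response_hours=24.0, poll_hours=0.25)
    print(pool.name, f"{hours_until_full(pool):.1f}h to full", "ALERT" if flag else "ok")
```

Both pools sit below the baseline threshold, but at the same write rate the small one could fill well inside a day, so it is flagged while the big one is not. That is the reason for giving smaller pools a lower threshold, and for tying thresholds to how long it takes you to add space.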