SMART long test – what is the performance impact on a busy server?

Tags: hard-drive, raid, raid1, smart, smartctl

I have a busy server with a RAID 1 setup. The application (written in PHP) reads from and writes to its MariaDB database very heavily.

A cron job runs a smartctl short test every day and checks the output of smartctl -H and mdadm -D.

I would like to run a long test sometimes, but I am concerned about its performance impact. I read that it can take several hours to complete. If it causes server performance to degrade while it runs, my users will be affected for 5+ hours.

So, a few questions here:

1) Do long SMART tests usually degrade performance enough for users to notice?

2) Since I have RAID 1 and do short tests, are long tests still necessary?

3) Is there a way to stop a long test if I find it is causing trouble on server performance?

Best Answer

  1. It depends. (muhahaha) On what? On how heavily your application uses the disk and how much caching it can take advantage of. There's no magic here: while the drive is being tested, it cannot serve requests at the same maximum speed and low latency as when it is not. However, if your application can tolerate more added latency than the test introduces, then it is a wash in terms of application impact.
  2. Probably not. Big enterprise storage companies (EMC, IBM, NetApp, etc.) replace drives based on their M(ean) T(ime) B(etween) F(ailures) rather than waiting for the drive's firmware to report a problem. The short tests test everything that is likely to fail first. The long tests do the same tests as the short tests, except they do not have a time limit. Just assume that all drives will fail, and that the likelihood is higher after the warranty expires.
  3. Yes, PROVIDED that the drive supports aborting (or "suspending") offline collection.

From the 'smartctl(8)' manual page:

       -X, --abort
              Aborts  non-captive  SMART  Self  Tests.  Note that this command
              will abort the Offline Immediate Test routine only if your  disk
              has the "Abort Offline collection upon new command" capability.

(I think you can also abort if you see "Suspend Offline collection upon new command"; I think the man page needs to be updated.) You can check for that capability via:

smartctl -x <device>
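
To tie the steps above together, here is a rough sketch of how the check and the abort might look in a script. The helper name `offline_interruptible` and the device path `/dev/sda` are my own placeholders, and the pattern it matches is an assumption about how smartctl words the capability in its output (the text wraps onto a second line, so only the first part is matched). Check it against what your drive actually prints before relying on it.

```shell
#!/bin/sh
# Hypothetical helper (not part of smartmontools): reads capability output
# from smartctl on stdin and succeeds if the drive advertises that it aborts
# or suspends offline collection when a new command arrives.
# ASSUMPTION: the capability text begins with the wording matched below.
offline_interruptible() {
    grep -Eq '(Abort|Suspend) Offline collection upon new'
}

# Example usage (run as root; /dev/sda is a placeholder device):
#   smartctl -c /dev/sda | offline_interruptible && echo "test can be interrupted"
#   smartctl -t long /dev/sda       # start the long (extended) self-test
#   smartctl -l selftest /dev/sda   # look at the self-test log for progress/results
#   smartctl -X /dev/sda            # abort the running self-test
```

The function only inspects text, so it is safe to try on a saved copy of the smartctl output first. The `-t long`, `-l selftest` and `-X` options are the ones documented in smartctl(8).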