Find what is wearing out the SSDs

cisco · redhat · ssd

We have 8 Cisco servers, each with 12 spinning disks for data and 2 SSDs for the OS. The 2 SSDs are in Linux software RAID 1. The SSDs all have their wear indicator in single digits, and some of those that have reached a value of 1 have failed. I'm in the process of swapping them all out for spares (a long and tiresome process), but I've noticed the wear indicator is dropping by 1 or 2% per week (I didn't take exact measurements).

There is a single application running on these servers, and the vendor has given me some vague ideas, but I really need to find the directories it is writing to. That way I can clearly demonstrate the problem and push the vendor for a fix. I've searched a bit but haven't found much: iotop, for example, shows full disk throughput, including the 12 spinning disks. The OS is Red Hat 7.9.
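
For a per-device view (rather than iotop's aggregate), iostat from the sysstat package can break the writes down by disk; a minimal sketch, assuming the SSD mirror is /dev/md0 built from /dev/sda and /dev/sdb (the device names are placeholders, adjust to the actual members):

    # Extended per-device statistics every 5 seconds; the wkB/s column shows
    # the write rate hitting each disk, so the SSD mirror members can be
    # compared against the spinning data disks.
    iostat -x 5 sda sdb md0

    # Raw kernel counters for one SSD; field 7 of /sys/block/<dev>/stat is
    # sectors (512 bytes) written since boot.
    cat /sys/block/sda/stat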

In answer to some of the questions:

  • disks are "480GB 2.5 inch Enterprise Value 6Gb SATA SSD"
  • product ID is "UCS-SD480GBKS4-EB"
  • disks were supplied standard with the servers in 2018
  • The wearing out appears to have accelerated recently (I am now logging the wear indicator, so I will have a better answer on that in a few days; see the logging sketch after this list).
  • I have replaced most disks with identical disks purchased perhaps a couple of years later.
  • iotop is showing a constant 8 MB/s of writes.
  • The system is running Hadoop across the 8 servers. The Hadoop file system is on the spinning disks, so it shouldn't touch the SSDs.
  • I have reduced the disk I/O considerably at the vendor's suggestion, although it still seems high (8 MB/s).
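
One way to log the wear over time is a small smartctl loop run from cron; a minimal sketch, assuming the SSDs are /dev/sda and /dev/sdb and that smartmontools is installed (the exact SMART attribute that reports wear varies by vendor, so check the smartctl -A output first):

    #!/bin/bash
    # Append a timestamped snapshot of the SMART wear attribute for both SSDs.
    # The attribute name differs between vendors (Media_Wearout_Indicator,
    # Wear_Leveling_Count, Percent_Lifetime_Remain, ...), so match broadly
    # and keep the raw line. Run from cron, e.g. hourly.
    for dev in /dev/sda /dev/sdb; do
        attr=$(smartctl -A "$dev" | grep -iE 'wear|lifetime|media_wearout')
        echo "$(date -Is) $dev $attr" >> /var/log/ssd-wear.log
    done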

Best Answer

You can use ProcMon for Linux to trace file system calls.

https://github.com/Sysinternals/ProcMon-for-Linux
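
Should Procmon not run on the stock RHEL 7.9 kernel, a rough equivalent with standard tools is to find the writing process first and then trace the files it opens; a minimal sketch (the PID 1234 is a placeholder for whatever step 1 turns up):

    # 1. Find which process is generating the writes (pidstat is part of sysstat).
    pidstat -d 5

    # 2. Trace file-related system calls of the suspect process to see which
    #    paths it opens for writing (replace 1234 with the PID from step 1).
    strace -f -e trace=open,openat,write -p 1234 2>&1 | grep -E 'open(at)?\(.*(O_WRONLY|O_RDWR)'

    # 3. Or list the files the process currently holds open for writing
    #    (FD column ending in 'w' or 'u').
    lsof -p 1234 | awk '$4 ~ /^[0-9]+[uw]/'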