Ubuntu – Do I need to replace the NVME SSD

hard drive, nvme, raid, ssd, Ubuntu

I have a simple server setup:

2 NVMe SSDs (both SAMSUNG MZVLB1T0HALR-00000, 1 TB each) combined into a RAID 0 array.

OS: Ubuntu 19.04
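
For reference, the array layout can be checked at any time with mdadm; the commands below assume the arrays are /dev/md0, /dev/md1 and /dev/md2, as the error messages further down suggest:

cat /proc/mdstat                 # quick summary of all md arrays and their members
sudo mdadm --detail /dev/md2     # detailed state of one array, here md2 as an example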

Today my system stopped responding, and a reboot didn't help.
I connected via KVM and noticed these error messages on the boot screen:

md/raid0:md0: too few disks (1 of 2) - aborting!
md: pers->run() failed ...
mdadm: failed to start array /dev/md/0: Invalid argument
md/raid1:md1: active with 1 out of 2 mirrors
md1: detected capacity change from 0 to 536281088
md/raid0:md2: too few disks (1 of 2) - aborting!
md: pers->run() failed ...
mdadm: failed to start array /dev/md/2: Invalid argument
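
This also explains the mixed messages: a RAID 0 array cannot start with a member missing, so md0 and md2 abort, while the RAID 1 array md1 comes up degraded with a single mirror. From a rescue system the surviving member's metadata can be inspected with something like the following (the partition name is only an example):

sudo mdadm --examine /dev/nvme0n1p3    # dump the md superblock of the remaining RAID member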

Then I booted into the rescue system and tried to check the disks for errors, but I couldn't find the 2nd disk: there was only /dev/nvme0 and no /dev/nvme1.
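
To double-check which NVMe controllers and namespaces the rescue system actually sees, something like this is enough (assuming nvme-cli is installed there):

sudo nvme list                 # list all detected NVMe namespaces with model and size
lsblk -o NAME,SIZE,MODEL       # cross-check against the block devices the kernel exposes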

I wrote to technical support (my server is at Hetzner) and asked them to check the disks for me. They shut the server down for a minute, then turned it back on and were able to see the 2nd disk in the rescue system.

They checked both drives for errors, and the 1st one showed some SMART errors:

sudo nvme smart-log /dev/nvme0
Smart Log for NVME device:nvme0 namespace-id:ffffffff
critical_warning                    : 0
temperature                         : 33 C
available_spare                     : 100%
available_spare_threshold           : 10%
percentage_used                     : 21%
data_units_read                     : 279,672,974
data_units_written                  : 366,481,283
host_read_commands                  : 2,479,016,466
host_write_commands                 : 2,637,293,356
controller_busy_time                : 19,928
power_cycles                        : 10
power_on_hours                      : 5,153
unsafe_shutdowns                    : 4
media_errors                        : 21
num_err_log_entries                 : 26
Warning Temperature Time            : 0
Critical Composite Temperature Time : 0
Temperature Sensor 1                : 33 C
Temperature Sensor 2                : 39 C
Thermal Management T1 Trans Count   : 0
Thermal Management T2 Trans Count   : 0
Thermal Management T1 Total Time    : 0
Thermal Management T2 Total Time    : 0

They told me the disk looked failed and needed to be replaced. Of course, all data would be lost.
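
The fields that apparently worried them are media_errors and num_err_log_entries; for monitoring, they can be pulled out on their own, for example:

sudo nvme smart-log /dev/nvme0 | grep -E 'media_errors|num_err_log_entries|percentage_used'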

I tried simply rebooting the system once again (since they had managed to bring the 2nd disk back online), and the system booted normally!

Then I tried to read the error log with the nvme error-log command, but it shows only "SUCCESS" entries:

sudo nvme error-log /dev/nvme0
Error Log Entries for device:nvme0 entries:64
.................
 Entry[ 0]
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS: The command completed successfully)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
cs           : 0
.................
 Entry[ 1]
.................
error_count  : 0
sqid         : 0
cmdid        : 0
status_field : 0(SUCCESS: The command completed successfully)
parm_err_loc : 0
lba          : 0
nsid         : 0
vs           : 0
cs           : 0
...and so on
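
All 64 entries look like this. A quick filter (assuming the standard nvme-cli text output) confirms that no entry has a non-zero error_count:

sudo nvme error-log /dev/nvme0 | awk '/error_count/ && $NF+0 > 0'    # prints nothing if every slot has error_count 0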

The system seems to be working normally now. I don't know what that was, but for some reason one of the disks suddenly stopped responding and wouldn't come back until a full power cycle with a pause was done.
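
If it happens again, the kernel log should show whether the NVMe controller reset or dropped off the bus; assuming the systemd journal is persistent, the previous boot can be checked with:

sudo journalctl -k -b -1 | grep -i nvme    # kernel messages from the previous boot
dmesg | grep -i nvme                       # kernel messages from the current boot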

Now I'm wondering: is there a way to read the actual error log? And how can I test the disks to make sure the drive really needs to be replaced?

Best Answer

Things should be fine if the system works as expected and no other errors are reported.

According to the NVM Express Management Interface specification: "A Response Message Status value other than Success indicates that an error occurred [...]"

In other words, the drive's firmware reports unused error-log slots with a Success status: an error log full of "SUCCESS" entries with error_count 0 simply means those slots have never recorded an error.
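
As for actually testing the drive: if the drive and the installed nvme-cli version support it, a built-in self-test can be started and read back, and smartmontools gives a second opinion. These are sketches only; the option names depend on the nvme-cli version, and not every model implements the self-test feature:

sudo nvme device-self-test /dev/nvme0 -s 2    # start an extended device self-test (NVMe 1.3+ feature)
sudo nvme self-test-log /dev/nvme0            # read the self-test results once it finishes
sudo smartctl -a /dev/nvme0                   # SMART/health summary via smartmontools

If media_errors keeps growing, or the drive disappears from the bus again, replacing it is the safe call.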
