dmesg – Normal Order of Messages When USB HDD is Failing

dmesgdrive-failurehdd

I have a USB hard drive attached to a Debian GNU/Linux server. I'm trying to format it (NTFS), with this command:

# mkntfs -v /dev/sdd1

which takes a few hours, because it checks the disk too. While checking, dmesg -T shows the following:

[Wed Jan 12 15:22:53 2022] sd 9:0:0:0: [sdd] Attached SCSI disk
[Wed Jan 12 18:03:26 2022] usb 1-4: USB disconnect, device number 5
[Wed Jan 12 18:03:26 2022] blk_update_request: I/O error, dev sdd, sector 621745808 op 0x1:(WRITE) flags 0x104000 phys_seg 240 prio class 0
[Wed Jan 12 18:03:26 2022] Buffer I/O error on dev sdd1, logical block 621743760, lost async page write
[Wed Jan 12 18:03:26 2022] Buffer I/O error on dev sdd1, logical block 621743761, lost async page write
   (...and so on for a few lines, then)
[Wed Jan 12 18:03:26 2022] blk_update_request: I/O error, dev sdd, sector 621746048 op 0x1:(WRITE) flags 0x104000 phys_seg 240 prio class 0
[Wed Jan 12 18:03:26 2022] blk_update_request: I/O error, dev sdd, sector 621746288 op 0x1:(WRITE) flags 0x100000 phys_seg 8 prio class 0
[Wed Jan 12 18:03:26 2022] blk_update_request: I/O error, dev sdd, sector 621746296 op 0x1:(WRITE) flags 0x800 phys_seg 16 prio class 0
   (...and so on for a few lines, then)
[Wed Jan 12 18:03:31 2022] buffer_io_error: 9015384 callbacks suppressed
   (...other errors...)

Looking at the sheer amount of error messages I'd say the HDD is almost dead, but attaching it to a Windows PC it seems to work. Moreover the first error (usb 1-4: USB disconnect, device number 5) that comes in dmesg before the other errors, makes me suspect the problem is not the HDD itself, but something else that causes the HDD to disconnect in the first place, and subsequent errors could be the obvious consequence of the disconnect.

However I'm not very experienced in dmesg output, so it's very possible that I'm reading it wrong.

EDIT: as requested by NiKiZe, here is the output of smartctl -a /dev/sdd:

# smartctl -a /dev/sdd
smartctl 6.6 2017-11-05 r4594 [x86_64-linux-5.10.0-3-amd64] (local build)
Copyright (C) 2002-17, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Western Digital Blue Mobile
Device Model:     WDC WD10SPCX-00KHST0
Serial Number:    WD-WXF1A95F0J3X
LU WWN Device Id: 5 0014ee 65b7e0332
Firmware Version: 01.01A01
User Capacity:    1.000.204.886.016 bytes [1,00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 1.5 Gb/s)
Local Time is:    Thu Jan 13 11:04:19 2022 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART Status not supported: Incomplete response, ATA output registers missing
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (16080) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 184) minutes.
Conveyance self-test routine
recommended polling time:    (   5) minutes.
SCT capabilities:          (0x7035) SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   190   184   021    Pre-fail  Always       -       1500
  4 Start_Stop_Count        0x0032   081   081   000    Old_age   Always       -       19048
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   073   073   000    Old_age   Always       -       20415
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       188
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       43
193 Load_Cycle_Count        0x0032   187   187   000    Old_age   Always       -       41054
194 Temperature_Celsius     0x0022   119   095   000    Old_age   Always       -       28
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

Assuming this HDD is really failing and that those error messages in dmesg are there for real bad sectors, why does dmesg show a disconnect before the bad sectors messages rather than after?

Best Answer

The disconnect happens first, then you get a mix of errors caused by the disconnect, where various layers that generate disk writes notice that something is wrong.

Most I/O happens in the form of a "dirty" memory page that has a disk backing, that the kernel eventually gets around to flushing to disk, long after the write has been acknowledged to the application, and the only driver layers that are still interested in the request are the memory subsystem (that wants to know if the page can be freed) and the file system (that needs to generate more writes to update metadata after the data has been written) -- anything above that has been dismantled already.

Some requests do not go through the memory subsystem (e.g. the SMART status query), some requests like reads are likely only attached to the memory subsystem as the file system has looked up the block mapping and no longer needs to be involved.

So, the error messages you get as a result of the disconnection have no particular order -- they happen as requests are generated, submitted and rejected due to the device being offline. If there is a causal relationship between some of the requests (like a journal entry that reflects the new state after the aborted data transfer -- after all, the journal could be on a different device), then this defines a bit of an order.

I'd probably try to use the disk on a proper SATA port, and with power from a good supply -- the voltage converters in USB-to-SATA enclosures are often a bit dodgy.