Risk of not repairing “Structure needs cleaning” XFS errors

corruption, xfs

I have an XFS file system with file system errors affecting some non-critical files. I wish to repair it; the business wishes to continue to run with those errors. What are the known risks of not repairing an XFS file system that has "Structure needs cleaning" errors?

The business wishes to avoid the possibly lengthy maintenance window that will be needed. I have always taken it on faith that file system corruption must not be tolerated. The business is going to ask me for reasons to fix it other than my own FUD.

What kind of answers are needed

I already have an opinion; I need more than that.

Answers should be backed by evidence. Anecdotes are OK, but only if they are documented first-hand; we don't need "someone told me" answers. Expert opinions are OK, such as an answer from the XFS FAQ or from a developer familiar with XFS internals.

No lay opinions, please. I'm looking for evidence, reliable anecdote, and XFS expert opinion.

Negative answers (e.g. "under similar circumstances, I ran for a year and experienced no serious problems") are OK.

File system details

The file system is 5.4T, with 3.9T (72%) used.

There are 46.6M files.

Error details

There are 55 corrupt directories that cause applications such as ls and find to report "Structure needs cleaning", as mentioned in this XFS FAQ entry:

Q: I see applications returning error 990 or "Structure needs cleaning", what is wrong?

The error 990 stands for EFSCORRUPTED which usually means XFS has detected a filesystem metadata problem and has shut the filesystem down to prevent further damage. Also, since about June 2006, we converted from EFSCORRUPTED/990 over to using EUCLEAN, "Structure needs cleaning."
The cause can be pretty much anything, unfortunately – filesystem, virtual memory manager, volume manager, device driver, or hardware.
There should be a detailed console message when this initially happens. The messages have important information giving hints to developers as to the earliest point that a problem was detected. It is there to protect your data.
You can use xfs_repair to remedy the problem (with the file system unmounted).

XFS errors logged to syslog all look like this:

XFS (sdb): Metadata corruption detected at xfs_inode_buf_verify+0x6d/0xe0 [xfs], block 0x50
XFS (sdb): Unmount and run xfs_repair
XFS (sdb): First 64 bytes of corrupted metadata buffer:
ffff88073fa79000: 49 4e 41 ff 02 01 00 00 00 00 01 f6 00 00 01 f7  INA.............
ffff88073fa79010: 00 00 00 04 00 00 00 00 00 00 00 00 00 00 00 ed  ................
ffff88073fa79020: 59 1b af d2 09 62 5c 17 4f e8 f8 73 00 00 00 00  Y....b\.O..s....
ffff88073fa79030: 57 e0 73 b2 27 23 63 cd 00 00 00 00 00 00 00 2f  W.s.'#c......../
XFS (sdb): metadata I/O error: block 0x50 ("xfs_trans_read_buf_map") error 117 numblks 16
XFS (sdb): xfs_imap_to_bp: xfs_trans_read_buf() returned error 117.

These errors are repeated many times but only for two blocks.
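
One way to check that claim is to tally the corruption reports by block number. This is a minimal sketch, assuming the kernel messages have been saved to a file in the format shown above; the function name count_corrupt_blocks and the file name xfs-errors.log are made up for illustration.

```shell
# Tally "Metadata corruption detected" reports per block number.
# Assumes log lines in the format shown above.
# Reads the files given as arguments, or stdin if none are given.
count_corrupt_blocks() {
    grep 'Metadata corruption detected' "$@" \
        | grep -o 'block 0x[0-9a-fA-F]*' \
        | sort \
        | uniq -c
}

# Example (hypothetical file name):
#   count_corrupt_blocks xfs-errors.log
```

Each output line is a count followed by the block it refers to, so a short list here means the damage is confined to a few metadata blocks.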

Best Answer

The filesystem really should be taken offline and checked/repaired, for at least two very good reasons:

  • metadata errors on directories will basically lock them out of your control: you cannot ls them, or create/remove files inside them.
  • a metadata error can trigger the XFS fail-safe mechanism, a filesystem shutdown. If that happens, your customer will take unscheduled downtime, possibly at the worst possible moment. It is much better to schedule the downtime for quiet hours (e.g. during the night).

Some suggestions:

  • before running the full-scale xfs_repair, you can dump all filesystem metadata using xfs_metadump and run a "dummy" xfs_repair on the dump. This lets you observe what xfs_repair will do to your filesystem.
  • be sure to have valid and recent backups before any repair attempt.
  • if you really, really, really cannot bring the filesystem down, and the files contained in the problematic directories are of little or no importance, you can try to remove the directories themselves. This will effectively "disconnect" the problematic metadata area. Be sure to understand that this is only a (bad) workaround; moreover, if the removal fails, XFS will probably shut down the entire filesystem, forcing you to take the unplanned downtime.
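
The metadump rehearsal from the first suggestion could look roughly like the sketch below. The function name rehearse_repair, the RUN switch and the file names fs.metadump and fs.img are made up for illustration; the working directory must be on a different filesystem. Note that the xfs_metadump man page says it should only be used on unmounted, read-only mounted, or frozen filesystems, so a dump taken from a live read-write mount may itself be inconsistent. By default the script only prints the commands; set RUN=1 to execute them.

```shell
# Rehearse xfs_repair on a metadata-only copy of the filesystem.
# Usage: rehearse_repair DEVICE WORKDIR      (e.g. /dev/sdb /mnt/scratch)
# Commands are only printed unless RUN=1 is set in the environment.
run() {
    echo "+ $*"
    if [ "${RUN:-0}" = 1 ]; then "$@"; fi
}

rehearse_repair() {
    dev=$1
    work=$2
    # Dump metadata only (no file contents); -g shows progress.
    run xfs_metadump -g "$dev" "$work/fs.metadump"
    # Restore the dump into a sparse filesystem image.
    run xfs_mdrestore "$work/fs.metadump" "$work/fs.img"
    # -n: report problems without changing anything; -f: target is a regular file.
    run xfs_repair -n -f "$work/fs.img"
    # Real repair, but applied to the image copy only, never to the device.
    run xfs_repair -f "$work/fs.img"
}
```

The output of the two xfs_repair passes shows which inodes and directories would be touched. The run time on the image is only a rough guide to the length of the real maintenance window, since the image contains no file data and sits on different storage.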