Reason for EXT4 file system corruption of Hyper-V guest

Tags: coreos, corruption, ext4, filesystems, hyper-v

We have had our second corruption of an ext4 partition in a relatively short time, even though ext4 is supposedly very reliable. As this is a virtual machine and the host providing the resources reported no disk errors, power loss, or anything similar, I want to rule out hardware errors for now.

So I am wondering whether we have an unusual setup (a CoreOS guest under a Hyper-V host), an unusual workload (Docker containers running Nginx, Gitlab, Redmine, MediaWiki, and MariaDB), or simply a bad configuration. Any input or suggestions would be welcome.

The original error message (in the second instance) was:

Jun 05 02:00:50 localhost kernel: EXT4-fs error (device sda9): ext4_lookup:1595: inode #8347255: comm git: deleted inode referenced: 106338109
Jun 05 02:00:50 localhost kernel: Aborting journal on device sda9-8.
Jun 05 02:00:50 localhost kernel: EXT4-fs (sda9): Remounting filesystem read-only

At this point, an e2fsck run found lots of errors (I didn't think to keep the log) and placed about 357 MB in lost+found on a 2 TB partition holding about 512 GB of data. The OS still boots after this, so the lost parts seem to lie in user data or Docker containers.
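
For reference, a quick way to gauge what ended up in lost+found (a sketch, assuming the affected filesystem is mounted at /, so the directory is /lost+found):

$ sudo ls /lost+found | wc -l    # number of recovered inodes
$ sudo du -sh /lost+found        # total size of the recovered data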

Here are a few more details about the affected system:

$ uname -srm
Linux 4.19.123-coreos x86_64
$ sudo tune2fs -l /dev/sda9
tune2fs 1.45.5 (07-Jan-2020)
Filesystem volume name:   ROOT
Last mounted on:          /sysroot
Filesystem UUID:          04ab23af-a14f-48c8-af59-6ca97b3263bc
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype needs_recovery extent 64bit flex_bg inline_data sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags:         signed_directory_hash
Default mount options:    user_xattr acl
Filesystem state:         clean
Errors behavior:          Remount read-only
Filesystem OS type:       Linux
Inode count:              533138816
Block count:              536263675
Reserved block count:     21455406
Free blocks:              391577109
Free inodes:              532851311
First block:              0
Block size:               4096
Fragment size:            4096
Group descriptor size:    64
Reserved GDT blocks:      15
Blocks per group:         32768
Fragments per group:      32768
Inodes per group:         32576
Inode blocks per group:   1018
Flex block group size:    16
Filesystem created:       Tue Sep 11 00:02:46 2018
Last mount time:          Fri Jun  5 15:40:01 2020
Last write time:          Fri Jun  5 15:40:01 2020
Mount count:              3
Maximum mount count:      -1
Last checked:             Fri Jun  5 08:14:10 2020
Check interval:           0 (<none>)
Lifetime writes:          79 GB
Reserved blocks uid:      0 (user root)
Reserved blocks gid:      0 (group root)
First inode:              11
Inode size:               128
Journal inode:            8
Default directory hash:   half_md4
Directory Hash Seed:      595db5c2-beda-4f32-836f-ee025416b0f1
Journal backup:           inode blocks

Update:

And a few more details about the host setup:

  • using Hyper-V Server 2016
  • the disk is based on a virtual disk file (as opposed to a physical disk)
  • the disk is set up to be dynamic (i.e. growing)
  • there are several snapshots/restore points on the VM; I am not sure whether this switches the disk image from dynamic to differencing

Best Answer

Working out what data the orphaned inodes contain is a tricky enough problem. Working out why the storage system did such a thing is considerably more difficult.

First, do incident response. Check whether any of these workloads is suffering unplanned downtime. Evaluate your recovery options: any DR environment on separate storage, backups, and other copies of the data.

Consider making a backup of the VHD before changing anything. That allows you to undo your actions, and perhaps lets support examine the broken volume.

Identify what data is affected.

  • Run file on those lost inodes to guess their format, then open and examine their contents (see the sketch after this list).

  • Run integrity checks on the application data.
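
A minimal sketch of both steps, using the workloads from the question; the container name (mariadb) and the repository path are hypothetical:

$ sudo file /lost+found/* | less                                 # guess the format of each recovered inode
$ docker exec -it mariadb mysqlcheck --all-databases -u root -p  # check all MariaDB tables
$ git -C /srv/git/some-repo.git fsck --full                      # verify a Git repository's object store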

Check everything in the storage and compute systems.

  • Storage array volume status: online, free capacity
  • Health of individual physical disks
  • Search the guest logs for every message relating to EXT4 (see the sketch after this list)
  • Run the Windows Best Practices Analyzer. In the comments, we found a recommendation not to use dynamic VHDs.
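
For the log search, a minimal sketch assuming a systemd-based guest such as CoreOS:

$ journalctl -k | grep -i 'EXT4-fs'          # kernel messages from the current boot
$ journalctl -k -b -1 | grep -i 'EXT4-fs'    # the same for the previous boot
$ sudo dmesg | grep -i ext4                  # fallback if the journal is not persistent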

There may not be an obvious cause. Even so, consider moving to different hardware to rule out a hardware problem: if you have a DR system on different hardware, cut over to that; or try replacing smaller components, like disks in the array; or migrate the VM to a different compute host.
