Folks, please help – I am a newbie with a major headache at hand (a perfect-storm situation).
I have three 1 TB HDDs in my Ubuntu 11.04 box configured as software RAID 5. The data had been copied weekly onto a separate external hard drive, until that drive failed completely and was thrown away. A few days back we had a power outage, and after rebooting, my box wouldn't mount the RAID. In my infinite wisdom, I entered the
mdadm --create -f...
command instead of
mdadm --assemble
and didn't notice the travesty I had committed until afterwards. It started the array degraded and proceeded to build and sync it, which took ~10 hours. After I was back, I saw that the array was successfully up and running, but the RAID was not.
I mean, the individual drives are partitioned (partition type fd), but the md0 device is not. Realizing in horror what I had done, I am trying to find some solutions. I just pray that --create didn't overwrite the entire contents of the hard drives.
Could someone PLEASE help me out with this – the data that's on the drives is very important and unique: ~10 years of photos, docs, etc.
Is it possible that specifying the participating hard drives in the wrong order made mdadm overwrite them? When I do
mdadm --examine --scan
I get something like:
ARRAY /dev/md/0 metadata=1.2 UUID=f1b4084a:720b5712:6d03b9e9:43afe51b name=<hostname>:0
Interestingly enough, the name used to be 'raid' and not the host name with :0 appended.
Here are the 'sanitized' config entries:
DEVICE /dev/sdf1 /dev/sde1 /dev/sdd1
CREATE owner=root group=disk mode=0660 auto=yes
HOMEHOST <system>
MAILADDR root
ARRAY /dev/md0 metadata=1.2 name=tanserv:0 UUID=f1b4084a:720b5712:6d03b9e9:43afe51b
Here is the output from mdstat
cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md0 : active raid5 sdd1[0] sdf1[3] sde1[1]
1953517568 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
unused devices: <none>
fdisk shows the following:
fdisk -l
Disk /dev/sda: 80.0 GB, 80026361856 bytes
255 heads, 63 sectors/track, 9729 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000bf62e
Device Boot Start End Blocks Id System
/dev/sda1 * 1 9443 75846656 83 Linux
/dev/sda2 9443 9730 2301953 5 Extended
/dev/sda5 9443 9730 2301952 82 Linux swap / Solaris
Disk /dev/sdb: 750.2 GB, 750156374016 bytes
255 heads, 63 sectors/track, 91201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000de8dd
Device Boot Start End Blocks Id System
/dev/sdb1 1 91201 732572001 8e Linux LVM
Disk /dev/sdc: 500.1 GB, 500107862016 bytes
255 heads, 63 sectors/track, 60801 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00056a17
Device Boot Start End Blocks Id System
/dev/sdc1 1 60801 488384001 8e Linux LVM
Disk /dev/sdd: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x000ca948
Device Boot Start End Blocks Id System
/dev/sdd1 1 121601 976760001 fd Linux raid autodetect
Disk /dev/dm-0: 1250.3 GB, 1250254913536 bytes
255 heads, 63 sectors/track, 152001 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x00000000
Disk /dev/dm-0 doesn't contain a valid partition table
Disk /dev/sde: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0x93a66687
Device Boot Start End Blocks Id System
/dev/sde1 1 121601 976760001 fd Linux raid autodetect
Disk /dev/sdf: 1000.2 GB, 1000204886016 bytes
255 heads, 63 sectors/track, 121601 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disk identifier: 0xe6edc059
Device Boot Start End Blocks Id System
/dev/sdf1 1 121601 976760001 fd Linux raid autodetect
Disk /dev/md0: 2000.4 GB, 2000401989632 bytes
2 heads, 4 sectors/track, 488379392 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 524288 bytes / 1048576 bytes
Disk identifier: 0x00000000
Disk /dev/md0 doesn't contain a valid partition table
Per suggestions, I cleaned up the superblocks and re-created the array with the --assume-clean option, but with no luck at all.
Is there any tool that will help me revive at least some of the data? Can someone tell me what mdadm --create does during the sync that destroys the data, so I can write a tool to undo whatever was done?
After re-creating the RAID, I ran fsck.ext4 /dev/md0, and here is the output:
root@tanserv:/etc/mdadm# fsck.ext4 /dev/md0
e2fsck 1.41.14 (22-Dec-2010)
fsck.ext4: Superblock invalid, trying backup blocks…
fsck.ext4: Bad magic number in super-block while trying to open /dev/md0
The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193 <device>
Per Shane's suggestion, I tried:
root@tanserv:/home/mushegh# mkfs.ext4 -n /dev/md0
mke2fs 1.41.14 (22-Dec-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=128 blocks, Stripe width=256 blocks
122101760 inodes, 488379392 blocks
24418969 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=0
14905 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872, 71663616, 78675968,
102400000, 214990848
and ran fsck.ext4 with every backup block, but all returned the following:
root@tanserv:/home/mushegh# fsck.ext4 -b 214990848 /dev/md0
e2fsck 1.41.14 (22-Dec-2010)
fsck.ext4: Invalid argument while trying to open /dev/md0
The superblock could not be read or does not describe a correct ext2
filesystem. If the device is valid and it really contains an ext2
filesystem (and not swap or ufs or something else), then the superblock
is corrupt, and you might try running e2fsck with an alternate superblock:
e2fsck -b 8193 <device>
Any suggestions?
Regards!
Best Answer
Ok - something was bugging me about your issue, so I fired up a VM to dive into the behavior that should be expected. I'll get to what was bugging me in a minute; first let me say this:
Back up these drives before attempting anything!!
You may have already done damage beyond what the resync did; can you clarify what you meant when you said you 'did clean up the superblocks'?
If you ran a mdadm --misc --zero-superblock, then you should be fine. Anyway, scavenge up some new disks and grab exact current images of them before doing anything at all that might do any more writing to these disks.
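A minimal sketch of the imaging step, demonstrated here on a scratch file standing in for a drive (with the real disks, if= would be /dev/sdd and so on, and the image should land on a separate, healthy disk):

```shell
# Imaging demo on a scratch file standing in for a real drive; with real
# hardware you'd use if=/dev/sdd (etc.) and write the image elsewhere.
cd "$(mktemp -d)"
dd if=/dev/urandom of=fake_drive bs=1M count=2 2>/dev/null     # stand-in "drive"
dd if=fake_drive of=fake_drive.img bs=1M conv=noerror,sync 2>/dev/null
cmp -s fake_drive fake_drive.img && echo "image verified" || echo "MISMATCH"
```

conv=noerror,sync keeps dd going past read errors on a dying disk, padding the bad spots, which is what you want when grabbing a last-resort image.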
That being said.. it looks like data stored on these things is shockingly resilient to wayward resyncs. Read on, there is hope, and this may be the day that I hit the answer length limit.
The Best Case Scenario
I threw together a VM to recreate your scenario. The drives are just 100 MB so I wouldn't be waiting forever on each resync, but this should be a pretty accurate representation otherwise.
Built the array as generically and default as possible - 512k chunks, left-symmetric layout, disks in letter order.. nothing special.
So far, so good; let's make a filesystem, and put some data on it.
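For reference, the setup could have looked roughly like this (my reconstruction; the file names datafile and randomdata come from the description below, but the exact commands and sizes are guesses):

```shell
# Reconstruction of the test data (file names from the answer; commands guessed).
cd "$(mktemp -d)"
echo "data" > datafile                                        # the "data" marker file
dd if=/dev/urandom of=randomdata bs=1M count=5 2>/dev/null    # 5 MB of random noise
sha1sum datafile randomdata                                   # hashes recorded for later
```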
Ok. We've got a filesystem and some data ("data" in datafile, and 5 MB worth of random data with that SHA1 hash in randomdata) on it; let's see what happens when we do a re-create.
The resync finished very quickly with these tiny disks, but it did occur. So here's what was bugging me from earlier: your
fdisk -l output. Having no partition table on the md device is not a problem at all; it's expected. Your filesystem resides directly on the fake block device with no partition table.
Yeah, no partition table. But...
Perfectly valid filesystem, after a resync. So that's good; let's check on our data files:
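The check itself is presumably just a hash comparison; a sketch of the idea on a throwaway file (my wording, not the answer's exact commands):

```shell
# Sketch: record a SHA1 before the risky operation, verify it afterwards.
cd "$(mktemp -d)"
dd if=/dev/urandom of=randomdata bs=1M count=5 2>/dev/null
sha1sum randomdata > randomdata.sha1    # taken before the re-create
sha1sum -c randomdata.sha1              # re-run afterwards; prints "randomdata: OK"
```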
Solid - no data corruption at all! But this is with the exact same settings, so nothing was mapped differently between the two RAID groups. Let's drop this thing down before we try to break it.
Taking a Step Back
Before we try to break this, let's talk about why it's hard to break. RAID 5 works by using a parity block that protects an area the same size as the block on every other disk in the array. The parity isn't just on one specific disk, it's rotated around the disks evenly to better spread read load out across the disks in normal operation.
The parity is calculated with an XOR operation across the corresponding data blocks on the other disks in the stripe, and those parity blocks are spread out evenly among the disks.
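A toy illustration of the XOR property, with single bytes standing in for whole chunks (the values are arbitrary):

```shell
# Toy RAID 5 parity: one byte per "disk" stands in for a whole chunk.
d1=$(( 0xA5 )); d2=$(( 0x3C )); d3=$(( 0x0F ))
parity=$(( d1 ^ d2 ^ d3 ))
# Rebuilding a "lost" disk uses the exact same XOR over the survivors:
rebuilt=$(( d1 ^ d3 ^ parity ))
printf 'parity=0x%02X rebuilt=0x%02X\n' "$parity" "$rebuilt"   # prints parity=0x96 rebuilt=0x3C
```

Any one "disk" can be rebuilt from the others plus the parity with the same operation, which is what makes the resync behavior described below possible.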
A resync is typically done when replacing a dead or missing disk; it's also done on mdadm create to assure that the data on the disks aligns with what the RAID's geometry is supposed to look like. In that case, the last disk in the array spec is the one that is 'synced to' - all of the existing data on the other disks is used for the sync.
So, all of the data on the 'new' disk is wiped out and rebuilt; either building fresh data blocks out of parity blocks for what should have been there, or else building fresh parity blocks.
What's cool is that the procedure for both of those things is the exact same: an XOR operation across the data from the rest of the disks. The resync process in this case may have in its layout that a certain block should be a parity block, and think it's building a new parity block, when in fact it's re-creating an old data block. So even if it thinks it's building a fresh parity block, it may really just be rebuilding one of the old data blocks.
So, it's possible for data to stay consistent even if the array's built wrong.
Throwing a Monkey in the Works
(not a wrench; the whole monkey)
Test 1:
Let's make the array in the wrong order!
sdc, then sdd, then sdb..
Ok, that's all well and good. Do we have a filesystem?
Nope! Why is that? Because while the data's all there, it's in the wrong order; what was once 512KB of A, then 512KB of B, A, B, and so forth, has now been shuffled to B, A, B, A. The disk now looks like gibberish to the filesystem checker; it won't run. The output of mdadm --misc -D /dev/md1 gives us more detail: it lists the drives in the new, wrong order, rather than in the order the array was originally built with.
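A toy model of that shuffling, with plain files standing in for disks and two 512-byte chunks standing in for stripe contents:

```shell
# The same two 512-byte "chunks" laid out A,B,A,B vs. B,A,B,A give
# different images of "the array" - toy model, plain files only.
cd "$(mktemp -d)"
printf 'A%.0s' $(seq 512) > chunk_a    # 512 bytes of 'A'
printf 'B%.0s' $(seq 512) > chunk_b    # 512 bytes of 'B'
cat chunk_a chunk_b chunk_a chunk_b > right_order
cat chunk_b chunk_a chunk_b chunk_a > wrong_order
cmp -s right_order wrong_order && echo "identical" || echo "shuffled"   # prints shuffled
```

Same bytes, different order - so the filesystem's structures are no longer where the checker expects them.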
So, that's all well and good. We overwrote a whole bunch of data blocks with new parity blocks this time out. Re-create, with the right order now:
Neat, there's still a filesystem there! Still got data?
Success!
Test 2
Ok, let's change the chunk size and see if that gets us some brokenness.
Yeah, yeah, it's hosed when set up like this. But, can we recover?
Success, again!
Test 3
This is the one that I thought would kill data for sure - let's do a different layout algorithm!
Scary and bad - it thinks it found something and wants to do some fixing! Ctrl+C!
Ok, crisis averted. Let's see if the data's still intact after resyncing with the wrong layout:
Success!
Test 4
Let's also just prove that that superblock zeroing isn't harmful real quick:
Yeah, no big deal.
Test 5
Let's just throw everything we've got at it. All 4 previous tests, combined.
Onward!
The verdict?
Wow.
So, it looks like none of these actions corrupted data in any way. I was quite surprised by this result, frankly; I expected moderate odds of data loss on the chunk size change, and some definite loss on the layout change. I learned something today.
So .. How do I get my data??
As much information as you have about the old system would be extremely helpful. Do you know the filesystem type? Do you have any old copies of your /proc/mdstat with information on drive order, algorithm, chunk size, and metadata version? Do you have mdadm's email alerts set up? If so, find an old one; if not, check /var/spool/mail/root. Check your ~/.bash_history to see if your original build is in there.
So, the list of things that you should do:
1. Image these disks with dd before doing anything!!
2. fsck the current, active md - you may have just happened to build in the same order as before. If you know the filesystem type, that's helpful; use that specific fsck tool. If any of the tools offer to fix anything, don't let them unless you're sure that they've actually found the valid filesystem! If an fsck offers to fix something for you, don't hesitate to leave a comment to ask whether it's actually helping or just about to nuke data.
3. Try building the array with different parameters. If you have an old /proc/mdstat, then you can just mimic what it shows; if not, then you're kinda in the dark - trying all of the different drive orders is reasonable, but checking every possible chunk size with every possible order is futile. For each, fsck
it to see if you get anything promising.

So, that's that. Sorry for the novel, feel free to leave a comment if you have any questions, and good luck!
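One aside on the drive-order search above: with three disks there are only six orders, and they can be enumerated mechanically. This sketch only prints the commands to try - the device names, chunk size, and flags are assumptions, not known-good values, and each resulting array would still need a read-only fsck -n check:

```shell
# Print (don't run!) a re-create command for each of the six drive orders.
# Device names and chunk size here are assumptions from the question above.
for order in "sdd1 sde1 sdf1" "sdd1 sdf1 sde1" "sde1 sdd1 sdf1" \
             "sde1 sdf1 sdd1" "sdf1 sdd1 sde1" "sdf1 sde1 sdd1"; do
  devs=""; for d in $order; do devs="$devs /dev/$d"; done
  echo "mdadm --create /dev/md0 --assume-clean --level=5 --chunk=512 --raid-devices=3$devs"
done
```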
footnote: under 22 thousand characters; 8k+ shy of the length limit