Join large overlapping files

compression, data-recovery

I am trying to recover a (MySQL) database from a crashed disk. There are a number of recent dumps, which are corrupted bz2 files. Since the database does not change often, the dumps should be nearly identical. bzip2recover recovered about 70-80% of the chunks from the files, so most if not all of the data could be recovered by finding the overlaps in the files and joining them together. For example:

dump1: |-----------------|xxxxxxxxxxxxxxxx|------------------|
dump2: |-------------|----------------|xxxxxxxxxxxxxxxxxxxxxx|
dump3: |xxxxxxxxxxxxxxxxxxxxxx|---------------|xxxxxxxxxxxxxx|

Here I can detect that the first chunk in dump1 is continued by the second one in dump2, which is continued by the second one in dump3, which is continued by the third one in dump1. By joining these four chunks, I have recovered the data.

The problem is that there are thousands of files (I have ten dumps of ~400 chunks of ~1 MB each). Is there a tool that could automate this process, or at least parts of it (for example, a Linux command that checks for the longest overlap between the end of one file and the start of another)?

Best Answer

I needed this exact same thing. I came up with this surprisingly fast Python code (it joined two 2 GB files with an 800 MB overlap in 30 seconds). Adjust overlap_size as necessary for your chunks; it should be as large as possible, but smaller than the real overlap size.

#!/usr/bin/env python3

import sys

overlap_size = 100000000  # 100 MB; must be smaller than the real overlap

# Read both chunks as raw bytes.
a = open(sys.argv[1], 'rb').read()
b = open(sys.argv[2], 'rb').read()

# Look for the last overlap_size bytes of the first chunk inside the second.
end = a[-overlap_size:]
offset = b.find(end)
if offset == -1:
    sys.exit("No overlap found between %s and %s" % (sys.argv[1], sys.argv[2]))

# Write the first chunk up to the overlap, then the second chunk from the
# point where the overlap starts.
with open(sys.argv[3], 'wb') as c:
    c.write(a[:-overlap_size])
    c.write(b[offset:])

Usage:

./join.py chunkA chunkB outputAB
./join.py outputAB chunkC outputABC
./join.py outputABC chunkD outputABCD
...etc
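
For the part of the question about checking the longest overlap between the end of one file and the start of another, here is a minimal companion sketch (not part of the original answer; the script name overlap.py and the 64 KB probe size are assumptions). It takes a probe from the start of the second file, finds its last occurrence in the first, and verifies that the whole tail matches:

#!/usr/bin/env python3
# Hypothetical helper: report how many trailing bytes of file A
# also appear at the start of file B (0 if no overlap is found).

import sys

probe_size = 65536  # leading bytes of B to search for in A (assumption)

def overlap_length(a, b):
    probe = b[:probe_size]
    pos = a.rfind(probe)           # last occurrence of B's start inside A
    if pos == -1:
        return 0
    overlap = len(a) - pos         # candidate overlap length
    if a[pos:] == b[:overlap]:     # verify the whole tail really matches
        return overlap
    return 0

if __name__ == '__main__':
    a = open(sys.argv[1], 'rb').read()
    b = open(sys.argv[2], 'rb').read()
    print(overlap_length(a, b))

Running ./overlap.py chunkA chunkB on candidate pairs prints the overlap size, which makes it easy to script the search for which chunk continues which before feeding the matches to join.py.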