Linux – How to compare two directories to find missing files, when the directories don’t have the same structure


I've been sent an HDD of new and updated files from an organisation that we are working with, but we already have most of the files sitting on our servers, and would like to update our local versions to match theirs.

Normally, this would be a job for something like rsync, but our problem is that the directory structure they provide is very poorly organised and we've had to rearrange their files in the past to work best with our systems.

So, my question is:

How can I find out which files in the set they have provided are new
or different to the versions that we have, when the directory
structures are different?

Once that question is answered, we can update the changed files, and work out where to put the new files on our system, probably somewhat manually.

Best Answer

Ok, here is my first attempt at something. It seems to work moderately well for what I need, but I am open to better suggestions:

First, get md5sums of all the files in both our filesystem and the new data:

find /location/of/data -type f -exec md5sum {} ';' > our.md5sums
find /media/newdisk -type f -exec md5sum {} ';' > their.md5sums
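
If md5sum is not available, or you would rather do everything in one language, roughly the same listing can be produced from Python. This is only a sketch and not part of the commands above: the function names `md5_of_file` and `write_md5_listing` are made up for illustration, and the output imitates md5sum's `hash  path` line format on the assumption that the listing will be read by a script that splits each line on the first space.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_md5_listing(root, listing_path):
    """Write one 'hash  path' line per regular file under root.

    Symbolic links are skipped, similar to 'find -type f'.
    Returns the number of files listed.
    """
    count = 0
    # surrogateescape lets file names that are not valid UTF-8 pass through
    with open(listing_path, "w", encoding="utf-8",
              errors="surrogateescape") as out:
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in sorted(filenames):
                full = os.path.join(dirpath, name)
                if os.path.isfile(full) and not os.path.islink(full):
                    out.write(md5_of_file(full) + "  " + full + "\n")
                    count += 1
    return count
```

Calling `write_md5_listing("/location/of/data", "our.md5sums")` and `write_md5_listing("/media/newdisk", "their.md5sums")` would then stand in for the two find commands. Reading in chunks keeps memory use flat even for very large files.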

Then I wrote a short Python script called md5diff.py:

#!/usr/bin/env python3
import sys

print("Comparing", sys.argv[1], "to", sys.argv[2])

# Map each hash in source B to the path of a file with that hash
# (if several files in B share a hash, the last one listed wins)
known = {}
with open(sys.argv[2]) as source_b:
    for line in source_b:
        p = line.partition(' ')
        known[p[0]] = p[2].strip()

# Now go through source A and report where the file is in source B
with open(sys.argv[1]) as source_a:
    for line in source_a:
        p = line.partition(' ')
        if p[0] in known:
            print(line.strip(), "(", sys.argv[2], ":", known[p[0]], ")")
        else:
            print(line.strip(), "NOT IN", sys.argv[2])

So now I can use

./md5diff.py their.md5sums our.md5sums

And if I add in a | grep "NOT IN" it will only list the files on their media that we don't already have (or that are different from what we have). From there I can start to manually attack the known differences.
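
The "NOT IN" list mixes files that are entirely new with files that are updated versions of ones we already hold. One possible refinement is to split that list by base name. This is only a sketch: it assumes that an updated file keeps its file name even when its directory differs, which may not hold for every data set, and the function names `read_listing` and `classify` are hypothetical. It reads the same two md5sum listings as before.

```python
import os
from collections import defaultdict


def read_listing(path):
    """Parse md5sum output into a list of (digest, file_path) pairs."""
    entries = []
    with open(path) as handle:
        for line in handle:
            digest, _sep, rest = line.rstrip("\n").partition(" ")
            # After the first space md5sum writes one mode character
            # (' ' or '*') and then the file name.
            if digest and len(rest) > 1:
                entries.append((digest, rest[1:]))
    return entries


def classify(their_listing, our_listing):
    """Split their files into (same, changed, new).

    same:    content already exists somewhere in our listing
    changed: content is unknown, but we hold a file with that base name
             (assumption: same base name means same logical file)
    new:     neither the content nor the base name is known to us
    """
    ours = read_listing(our_listing)
    our_digests = {digest for digest, _path in ours}
    our_by_name = defaultdict(list)
    for _digest, path in ours:
        our_by_name[os.path.basename(path)].append(path)

    same, changed, new = [], [], []
    for digest, path in read_listing(their_listing):
        base = os.path.basename(path)
        if digest in our_digests:
            same.append(path)
        elif base in our_by_name:
            # Report every local candidate, since names may repeat
            changed.append((path, our_by_name[base]))
        else:
            new.append(path)
    return same, changed, new
```

Something like `same, changed, new = classify("their.md5sums", "our.md5sums")` would then give the "changed" entries together with the local paths they probably replace, leaving only the "new" list to be placed by hand.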