Getting a simple list of changed files from unison

amazon-s3 unison

I have a filesystem that is changed on two servers and also needs to be replicated to Amazon S3.

Until recently, syncing the filesystem between the two servers using Unison, and then copying to S3 with s3sync.rb has been a fine solution.

Now that the filesystem is nearly 50GB, s3sync.rb has become the bottleneck, as it needs to check each file for freshness (we use the --no-md5 flag).

So I now have a script that takes a list of files and updates those, and only those, using s3cmd.rb.

I'd expected that I could use the unison.log file to get a canonical list of files to pass, but the format of it varies depending on the operation that occurred to a file (new file, copy from local alternative, rename etc).

Can Unison generate a log or list of changed files other than the one it writes to unison.log?

At the moment this is how I'm extracting the list of files from unison.log (I'm deliberately ignoring deletes):

# Ignore deletes and get the list of new & changed files
grep -v '\[END\] Deleting ' /tmp/unison.log | grep '\[END\]' | sed -re 's/\[END\] (Copying|Updating file) //' > /tmp/changed-files.log

# Files that unison lists as shortcuts are harder as it doesn't always prefix them with their full path
# so before adding them to the log, find the files in the relevant directory
grep 'Shortcut: copying ' /tmp/unison.log | sed -re 's/Shortcut: copying (.*) from local file.*/\1/' | while IFS= read -r file
do
  echo "Having to look for $file in source directory"
  find /ebs/src -wholename "*$file" >> /tmp/changed-files.log
done

Best Answer

One idea would be to use the stdout generated by Unison while it is running. Stdout contains some junk that Unison uses to create a dynamic effect in the terminal while it is "looking for changes." This junk can be removed fairly easily by deleting every line that contains a carriage return (CR) character (in vim this would be something like :%s/^.*^M.*$\n//g where ^M is entered by pressing Ctrl+V then Ctrl+M). The result looks something like

         <---- new dir    bar/foo/newdir   
deleted  ---->            bar/user/oldfile1  
deleted  ---->            bar/user/oldfile2  
         <---- new file   foobar/test/quiz.txt
         <---- changed    foobar/test/quiz.txt

This is much more easily parsed than Unison's default log.
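
As a rough sketch of that parsing, assuming the captured output looks exactly like the sample above, the CR filtering and path extraction could be combined in a small shell function. The function name unison_changed_paths is made up for illustration, the status words ("new file", "changed") are taken only from the sample, and it assumes paths do not end in spaces. Deletions and new directories are skipped on purpose, since the question ignores deletes and only wants files.

```shell
# Sketch only: reads captured Unison output on stdin and prints the unique
# paths of new or changed files. Assumes the line layout shown in the sample.
unison_changed_paths() {
  cr=$(printf '\r')
  # Drop every line containing a carriage return (the progress "junk"),
  # then keep the path from lines whose status is "new file" or "changed",
  # whichever side of the arrow the status appears on.
  grep -v "$cr" |
    sed -nE \
      -e 's/^ *<---- +(new file|changed) +(.*[^ ]) *$/\2/p' \
      -e 's/^(new file|changed) +----> +(.*[^ ]) *$/\2/p' |
    sort -u
}
```

With the output saved to a file (the path /tmp/unison-stdout.txt is hypothetical), this would be run as: unison_changed_paths < /tmp/unison-stdout.txt > /tmp/changed-files.log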


A better idea though might be to forget parsing Unison's output altogether and instead use inotifywait. You can set up inotifywait to watch a certain directory and report any files that are changed, moved, created, etc.

inotifywait --event modify,attrib,move,create,delete  \
            --daemon                                  \
            --outfile /path/to/output.log             \
            --recursive                               \
            --quiet                                   \
            --format %w%f                             \
            '/watch/directory/'                   

This will run inotifywait as a daemon and produce a very nice, continuously updating list (output.log) of the absolute paths of all files in /watch/directory/ on which one of the specified events occurred. You will probably need to change the given events and/or utilize the --exclude option to get exactly the list of files you want to sync with S3.
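
Because the same path is logged once per event, and deleted files or directories can also appear, the log probably needs a little clean-up before it is handed to the upload script. A minimal sketch, assuming the log contains one absolute path per line as produced by the --format %w%f option above (the function name changed_files_from_log is made up for illustration):

```shell
# Sketch only: print each logged path once, keeping only paths that are
# still regular files (this drops deleted files and directories).
changed_files_from_log() {
  sort -u "$1" | while IFS= read -r path; do
    if [ -f "$path" ]; then
      printf '%s\n' "$path"
    fi
  done
}
```

It could be used as: changed_files_from_log /path/to/output.log > /tmp/changed-files.log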
