How to find duplicate files against a reference directory structure in Linux

Tags: cleanup, deduplication, files, linux

There are a couple of duplicate file finders for Linux listed e.g. here. I have already tried fdupes and fslint. However, from what I have seen, these find all duplicates across the selected directory structures/search paths, and thus also duplicates that exist within only a single one of the search paths (if you select multiple).

What I need, however, is to search for duplicates against a reference path: I want to define one path as the reference, and search the other path for files that also exist in the reference path, so that I can remove them.

I need this to reconcile two large directory structures that have gotten out of sync, where one is more up to date than the other (this would be my reference). Most of the files should be duplicated between the two, but I suspect there are still some files that exist only in the other path, so I don't want to simply delete it wholesale.

Are there perhaps some fdupes options to achieve this that I have overlooked?

I have tried writing a Python script to clean up the list that fdupes outputs, but without success.
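
For illustration, here is a minimal sketch of what that post-processing could look like as a shell pipeline instead (the paths are placeholders; it assumes fdupes' usual output of duplicate groups separated by blank lines, and that no file names contain newlines):

#!/bin/sh
# Sketch: print files under OTHER whose content also exists somewhere under REF,
# based on fdupes' blank-line-separated duplicate groups.
# REF and OTHER are placeholder paths; adjust them to the real directories.
REF=/path/to/reference
OTHER=/path/to/other

fdupes -r "$REF" "$OTHER" | awk -v ref="$REF/" -v other="$OTHER/" '
    # flush() handles one duplicate group: if the group contains at least one
    # file under the reference path, print its members under the other path.
    function flush(   i, hasref) {
        hasref = 0
        for (i = 1; i <= n; i++)
            if (index(group[i], ref) == 1) hasref = 1
        if (hasref)
            for (i = 1; i <= n; i++)
                if (index(group[i], other) == 1) print group[i]
        n = 0
    }
    /^$/ { flush(); next }    # a blank line ends a duplicate group
    { group[++n] = $0 }
    END { flush() }
'

The output is only a list of candidate files under the non-reference path; review it before feeding it to rm.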

Best Answer

rmlint can do this:

rmlint --types=duplicates --must-match-tagged --keep-all-tagged <path1> // <path2>

This will find files in path1 that have duplicates (identical content) in path2. It will create a shell script which, when run, removes those duplicates from path1, leaving only the files that are unique to path1.
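
A usage sketch with placeholder directories (by default rmlint writes the removal script as rmlint.sh into the current directory):

# path2 (after //) is tagged as the reference and will be kept untouched.
rmlint --types=duplicates --must-match-tagged --keep-all-tagged /path/to/other // /path/to/reference

# Inspect the generated script before letting it delete anything,
# then run it to remove the duplicates under /path/to/other.
less rmlint.sh
sh rmlint.sh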