Linux tools to find duplicate files

Tags: diff, files, linux

I have a large and growing set of text files, which are all quite small (less than 100 bytes). I want to diff each possible pair of files and note which are duplicates. I could write a Python script to do this, but I'm wondering if there's an existing Linux command-line tool (or perhaps a simple combination of tools) that would do this?

Update (in response to mfinni comment): The files are all in a single directory, so they all have different filenames. (But they all have a filename extension in common, making it easy to select them all with a wildcard.)

Best Answer

There's fdupes. But I usually use a combination of tools: find . -type f -exec md5sum '{}' \; | sort | uniq -d -w 32 (an MD5 hash is 32 hex characters, so comparing only the first 32 characters of each line groups files by content rather than by filename).
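As a quick sanity check, here is a minimal sketch of that pipeline, assuming GNU coreutils (md5sum, sort, uniq). It builds a scratch directory with two identical files and one distinct file, then uses uniq -D (print all repeated lines) instead of -d so every member of a duplicate group is listed, not just the first:

```shell
# Scratch directory with two duplicates (a.txt, b.txt) and one unique file.
dir=$(mktemp -d)
printf 'hello\n' > "$dir/a.txt"
printf 'hello\n' > "$dir/b.txt"
printf 'world\n' > "$dir/c.txt"

# Hash every file, sort so identical hashes end up adjacent, then keep all
# lines whose first 32 characters (the MD5 hash) repeat.
find "$dir" -type f -exec md5sum '{}' \; | sort | uniq -D -w 32
```

This should print the md5sum lines for a.txt and b.txt but not c.txt. With GNU uniq you can also use --all-repeated=separate to put a blank line between duplicate groups, which helps when there are many sets of duplicates.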
