Deduplication – Cheap and Fast Methods with Hardlinks

deduplication ext4

I've got shared hosting with a few thousand WordPress installs and I've wanted for ages to have a nice way of removing all the duplicate files in a sensible and secure way.
I'm looking for better disk cache hit ratios and simpler backups.

I'm just using standard Ext4, not something like ZFS which has it built in (at a cost).

I'm familiar with tools like rdfind, which is almost perfect.
It can scan all the files, find the duplicates and hard link them together.
I could run it from a weekly cron job at off-peak times, making the cost virtually zero.
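To make the scan-and-link idea concrete, here is a minimal Python sketch of that kind of pass. It is not rdfind's actual algorithm, and the names `file_digest` and `hardlink_duplicates` are made up for illustration. It also ignores ownership and permissions, which matters on shared hosting because every name of a hard-linked file shares one owner and mode.

```python
import hashlib
import os
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def hardlink_duplicates(root):
    """Hard link identical regular files under root; return how many were replaced.

    Illustrative only: hashes every file (a real tool would compare sizes
    first) and does not check owners or modes before linking.
    """
    groups = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            groups[(os.path.getsize(path), file_digest(path))].append(path)

    linked = 0
    for paths in groups.values():
        paths.sort()
        keeper = paths[0]
        for dup in paths[1:]:
            if os.path.samefile(keeper, dup):
                continue  # already the same inode
            tmp = dup + ".dedup-tmp"
            os.link(keeper, tmp)   # new name for the keeper's inode
            os.replace(tmp, dup)   # atomically swap it in for the duplicate
            linked += 1
    return linked
```

Linking to a temporary name and renaming it over the duplicate means the path never disappears, even if the job is interrupted halfway.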

The problem is that I want any account that changes a file to break the hard link and get its own copy of the file again. That way one site updating WordPress or a plugin wouldn't affect any other site. It would also remove potential security issues, since no account would be able to tamper with another account's files.
Essentially copy-on-write for hard links.
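The concern with plain hard links can be shown in a few lines of Python. This is only a sketch with made-up file names (`site_a.php`, `site_b.php`). An in-place write through one name is visible through every other name, whereas writing a new file and renaming it over the old name does break the link. Whether a given updater writes in place or renames is not something this sketch assumes either way.

```python
import os
import tempfile


def demo_hardlink_sharing():
    """Show how writes behave on hard-linked files.

    Returns (content of site_a after an in-place write via site_b,
             content of site_a after site_b is replaced by rename,
             final link count of site_a).
    """
    with tempfile.TemporaryDirectory() as d:
        site_a = os.path.join(d, "site_a.php")  # hypothetical file names
        site_b = os.path.join(d, "site_b.php")

        with open(site_a, "w") as f:
            f.write("original")
        os.link(site_a, site_b)  # both names now point at one inode

        # In-place write: truncates and rewrites the shared inode.
        with open(site_b, "w") as f:
            f.write("tampered")
        with open(site_a) as f:
            after_inplace = f.read()

        # Write-new-then-rename: site_b gets a fresh inode, link is broken.
        tmp = site_b + ".new"
        with open(tmp, "w") as f:
            f.write("private")
        os.replace(tmp, site_b)
        with open(site_a) as f:
            after_rename = f.read()

        return after_inplace, after_rename, os.stat(site_a).st_nlink


print(demo_hardlink_sharing())
```

The first element of the result is the problem case: the other account's file changed without that account doing anything.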

Is anything like this possible? I've tried doing some searches but I haven't been able to find anything.

Best Answer

It looks like the best solution for efficient 'offline' deduplication is Btrfs reflinks.

Reflinked files share data on disk but stay 'destructible': if something changes one of them (e.g. a WordPress update), only that file gets new data and the other copies are untouched, so the security and ease of use of the platform are maintained.
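As a rough sketch of what a reflink looks like from code, the Python below asks the kernel to clone a file with the Linux `FICLONE` ioctl and falls back to an ordinary copy where the filesystem does not support it (plain ext4, for example). The helper name `reflink_or_copy` is invented for this example, and the ioctl number is the Linux value from `<linux/fs.h>`. Treat this as an illustration rather than a deduplication tool.

```python
import fcntl
import os
import shutil

# Linux ioctl request number for FICLONE (from <linux/fs.h>).
FICLONE = 0x40049409


def reflink_or_copy(src, dst):
    """Make dst a reflink clone of src if possible, else a normal copy.

    Returns True if the clone ioctl succeeded, False if it fell back to
    copying. Either way dst is a separate inode, so writing to it never
    changes src.
    """
    with open(src, "rb") as s, open(dst, "wb") as d:
        try:
            # Ask the filesystem to share src's extents with dst.
            fcntl.ioctl(d.fileno(), FICLONE, s.fileno())
            return True
        except OSError:
            pass  # unsupported filesystem or cross-device: fall back below
    shutil.copyfile(src, dst)
    return False
```

Unlike a hard link, the clone has its own inode, so it can keep its own owner and permissions per account while still sharing unchanged data blocks.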

Thanks @bitinerant for pointing that option out. I'll be doing further experiments to see if it's worth migrating for my particular scenario. The fact that I can migrate ext4 to Btrfs makes it a lot more feasible than ZFS or similar.
