A smaller cluster size means that a file will be spread across more clusters (obviously). That means potentially more fragmentation and possibly more lookups to find the clusters. It's the usual speed vs. size trade-off. Since hard disks are cheap, I would go for larger cluster sizes, but you probably won't see that much difference either way ...
I'm not a distributed file system ninja, but after consolidating as many drives as I can into as few machines as I can, I would try using iSCSI to connect the bulk of the machines to one main machine. There I could consolidate everything into, hopefully, fault-tolerant storage. Preferably fault tolerant both within a machine (if a drive goes out) and among machines (if a whole machine is powered off).
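Roughly, that means each storage box exports its disks as iSCSI targets and the main machine logs in to all of them. A minimal sketch on Linux with the LIO target (`targetcli`) and `open-iscsi`; the device paths, IQNs and IP addresses below are made-up examples, and ACL setup is omitted:

```shell
# On each storage box: export a whole disk as an iSCSI target.
targetcli /backstores/block create name=disk0 dev=/dev/sdb
targetcli /iscsi create iqn.2012-01.local.box1:disk0
targetcli /iscsi/iqn.2012-01.local.box1:disk0/tpg1/luns create /backstores/block/disk0

# On the main machine: discover each box's targets and log in.
iscsiadm -m discovery -t sendtargets -p 192.168.1.11
iscsiadm -m node -T iqn.2012-01.local.box1:disk0 -p 192.168.1.11 --login
# The remote disk now shows up locally as another /dev/sdX block device.
```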
Personally I like ZFS. In this case, the built-in compression, dedupe and fault tolerance would be helpful. However, I'm sure there are many other ways to compress the data while keeping it fault tolerant.
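For the ZFS route, compression and dedupe are just pool/dataset properties. A sketch, assuming a hypothetical pool named "tank" built from three of the consolidated disks (device names are placeholders):

```shell
# Build a single-parity raidz pool from three disks, then turn on the space savers.
zpool create tank raidz /dev/sdb /dev/sdc /dev/sdd
zfs set compression=on tank    # cheap and almost always worth it
zfs set dedup=on tank          # needs lots of RAM to hold the dedup table
zfs get compressratio tank     # check how much you're actually saving
```

Note that dedupe in particular is RAM-hungry; compression alone is often the safer default.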
Wish I had a real turnkey distributed file system solution to recommend. I know this is really kludgy, but I hope it points you in the right direction.
Edit: I am still new to ZFS and setting up iSCSI, but I recalled seeing a video from Sun in Germany where they were showing the fault tolerance of ZFS. They connected three USB hubs to a computer and put four flash drives in each hub. Then, to prevent any one hub from taking the storage pool down, they made RAIDz volumes, each consisting of one flash drive from each hub. Then they striped the four ZFS RAIDz volumes together into one pool. That way only four flash drives were used for parity. Next, of course, they unplugged one hub, which degraded every RAIDz volume, but all the data was still available. In this configuration up to four drives could be lost, but only if no two of them were in the same RAIDz volume.
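The layout in the demo can be sketched as a single `zpool create`; the device names here are invented stand-ins for the twelve flash drives (hub a/b/c, stick 1-4), with one stick from each hub per raidz vdev:

```shell
# Four 3-drive raidz vdevs, striped together into one pool.
# Each vdev spans all three hubs, so losing a hub costs each vdev only one drive.
zpool create demo \
    raidz hub_a1 hub_b1 hub_c1 \
    raidz hub_a2 hub_b2 hub_c2 \
    raidz hub_a3 hub_b3 hub_c3 \
    raidz hub_a4 hub_b4 hub_c4

zpool status demo   # after pulling a hub: every vdev DEGRADED, pool still readable
```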
If this configuration was used with the raw drives of each box, then that would preserve more drives for data rather than parity. I heard FreeNAS can (or was going to be able to) share drives in a "raw" manner via iSCSI, so I presume Linux can do the same. As I said, I'm still learning, but this alternate method would be less wasteful from a parity standpoint than my previous suggestion. Of course, it relies on using ZFS, which I don't know would be acceptable. I know it is usually best to stick with what you know if you are going to have to build/maintain/repair something, unless this is a learning experience.
Hope this is better.
Edit: Did some digging and found the video I spoke about. The part where they explain spreading the USB flash drives over the hubs starts at 2m10s. The video demos their storage server "Thumper" (X4500) and how to spread the disks across controllers, so that if you have a hard disk controller failure your data will still be good. (Personally I think this is just a video of geeks having fun. I wish I had a Thumper box myself, but my wife wouldn't like me running a pallet jack through the house. :D That is one big box.)
Edit: I remembered coming across a distributed file system called OpenAFS. I haven't tried it; I've only read a bit about it. Perhaps others know how it handles in the real world.
Best Answer
ZFS-on-Linux has a feature called "on-line deduplication".
UPD.: I've re-read your question once again, and now it looks like Aufs could be of help to you. It's a very popular solution for hosting environments. And actually I can mention Btrfs myself now as well: the pattern is that you have some template subvolume, which you snapshot every time you need another instance. It's COW, so only changed file blocks need more space. But keep in mind that Btrfs is, ergh… well, not too stable anyway. I'd use it in production only if it's absolutely okay for the data on it to be lost.
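The Btrfs template pattern is just subvolume snapshots; a minimal sketch (the `/srv/...` paths are hypothetical):

```shell
# Create the template subvolume once and populate it with the shared files.
btrfs subvolume create /srv/template
# ... copy the common data into /srv/template ...

# Each new instance is a writable snapshot: near-instant, and it shares
# all blocks with the template until a file is actually modified (COW).
btrfs subvolume snapshot /srv/template /srv/instance1
btrfs subvolume snapshot /srv/template /srv/instance2
```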