Linux – Storing and backing up 10 million files on Linux

Tags: ext3, ext4, filesystems, linux, storage

I run a website where about 10 million files (book covers) are stored in 3 levels of subdirectories, each level named with a single hex digit [0-f]:

0/0/0/
0/0/1/
...
f/f/f/

This works out to around 2,400 files per directory, which is very fast when we need to retrieve a single file. It is also the layout recommended by many questions here.

However, when I need to back up these files, it takes many days just to walk the 4k directories holding the 10m files.

So I'm wondering whether I could store these files in a container (or in 4k containers), each of which would act exactly like a filesystem (some kind of mounted ext3/4 container?). I guess this would be almost as efficient as accessing a file directly in the filesystem, and it would have the great advantage of copying to another server very efficiently.
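
Something like this, I imagine (the paths and sizes below are just placeholders):

    # create a container file, format it as ext4, and mount it over loopback
    truncate -s 50G /srv/containers/0.img
    mkfs.ext4 -F /srv/containers/0.img       # -F: allow mkfs on a regular file
    mkdir -p /srv/covers/0
    mount -o loop /srv/containers/0.img /srv/covers/0
    # backing up is then one sequential copy of the image file
    rsync -av /srv/containers/0.img backupserver:/backup/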

Any suggestions on how to do this best? Or any viable alternative (NoSQL, …)?

Best Answer

Options for quickly accessing and backing up millions of files

Borrow from people with similar problems

This sounds very much like an easier version of the problem that faces USENET news servers and caching web proxies: hundreds of millions of small files that are randomly accessed. You might want to take a hint from them (except that they typically never have to take backups).

http://devel.squid-cache.org/coss/coss-notes.txt

http://citeseer.ist.psu.edu/viewdoc/download;jsessionid=4074B50D266E72C69D6D35FEDCBBA83D?doi=10.1.1.31.4000&rep=rep1&type=pdf

Obviously the cyclical nature of the cyclic news filesystem is irrelevant to you, but the lower-level concept is very much applicable: multiple disk files/devices holding packed images, plus a fast index that maps the key the user supplies to the location of the data.
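
As a toy sketch of that packed-image idea (the paths here are invented, and a real implementation would want a proper binary index rather than a text file):

    # append every file to one blob and record name/offset/size in an index
    : > covers.pack
    : > covers.idx
    find /var/covers -type f | while read -r f; do
        offset=$(stat -c %s covers.pack)     # current end of the blob
        size=$(stat -c %s "$f")
        cat "$f" >> covers.pack              # append the file's bytes
        printf '%s %s %s\n' "$f" "$offset" "$size" >> covers.idx
    done
    # later, fetch one file by seeking straight to its bytes (GNU dd):
    # dd if=covers.pack iflag=skip_bytes,count_bytes skip=$offset count=$size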

Dedicated filesystems

These are, of course, concepts similar to creating a filesystem in a file and mounting it over loopback, except that here you write your own filesystem code. But since you said your system is read-mostly, you could actually dedicate a disk partition (or an LVM logical volume, for flexibility in sizing) to this one purpose. When you want to back up, mount the filesystem read-only and then make a copy of the partition bits.
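
A sketch of that backup step, assuming the files live on a dedicated volume /dev/vg0/covers mounted at /srv/covers (names invented):

    mount -o remount,ro /srv/covers      # quiesce writes
    dd if=/dev/vg0/covers bs=64M status=progress | gzip > /backup/covers.img.gz
    mount -o remount,rw /srv/covers      # resume normal service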

LVM

I mentioned LVM above as being useful for dynamically sizing a partition so that you don't need to back up lots of empty space. But LVM has other features that are very much applicable here, specifically the "snapshot" functionality, which lets you freeze a filesystem at a moment in time. An accidental rm -rf or the like will not disturb the snapshot. Depending on exactly what you are trying to do, that might be sufficient for your backup needs.
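
For example (volume group, names, and sizes invented), a snapshot-based backup might look like:

    lvcreate --snapshot --size 10G --name covers-snap /dev/vg0/covers
    mkdir -p /mnt/covers-snap
    mount -o ro /dev/vg0/covers-snap /mnt/covers-snap
    rsync -a /mnt/covers-snap/ backupserver:/backup/covers/   # copy the frozen view
    umount /mnt/covers-snap
    lvremove -f /dev/vg0/covers-snap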

RAID-1

I'm sure you are familiar with RAID already and probably already use it for reliability, but RAID-1 can be used for backups as well, at least if you are using software RAID (you can use it with hardware RAID, but that actually gives you lower reliability, because reading the disks may require the same model/revision of controller). The concept is that you create a RAID-1 group with one more disk than you actually need for your normal reliability requirements (e.g. a third disk if you use software RAID-1 with two disks, or perhaps a large disk plus a hardware RAID-5 of smaller disks with a software RAID-1 on top of the hardware RAID-5).

When it comes time to take a backup, install a disk, ask mdadm to add it to the RAID group, wait until it reports completion, optionally ask for a verification scrub, and then remove the disk. Depending on performance characteristics, you can have the disk installed most of the time and remove it only to exchange it with an alternate disk, or you can have the disk installed only during backups.
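
A sketch of that cycle with mdadm, assuming a two-disk array /dev/md0 and a third backup disk /dev/sdd1 (device names invented):

    mdadm --manage /dev/md0 --add /dev/sdd1       # attach the backup disk
    mdadm --grow /dev/md0 --raid-devices=3        # mirror onto it
    cat /proc/mdstat                              # repeat until the resync completes
    echo check > /sys/block/md0/md/sync_action    # optional verification scrub
    mdadm --manage /dev/md0 --fail /dev/sdd1      # then detach the disk...
    mdadm --manage /dev/md0 --remove /dev/sdd1    # ...and shelve it off-site
    mdadm --grow /dev/md0 --raid-devices=2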