Tips on efficiently storing 25 TB+ of log files (millions of them) in a filesystem

compression · filesystems · log-files · storage

Say you are confronted with 25 TB worth of uncompressed log files, and you have at your disposal an array of 20 commodity boxes with a collective free storage capacity of 25 TB.

How would you store them?

a) Which distributed file system to use?

b) Which compression/decompression format or algorithm?

c) Log file sizes range from 1 MB to at most 7 MB; all text, with a lot of whitespace

d) Usage is:
a) people want the latest log files more than older ones, so what caching system should be used?
b) people will only read log files, not delete them
c) people want a listing of log files against a date range

e) The operating system running on the commodity boxes is Linux.

f) As for backup, we have a storage array that takes care of that, so the ability to restore data from the array exists.
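To give a feel for (b) and (c), here is a quick sketch of how whitespace-heavy text tends to compress (the sample log line and file name are made up; a real log file would be used instead):

```shell
# Build a made-up, whitespace-heavy log file to stand in for a real one.
yes 'GET /index.html HTTP/1.0     200        1534' | head -n 50000 > sample.log

# Compress with plain gzip and compare sizes (-9 = maximum compression,
# -c writes to stdout so the original is kept for comparison).
gzip -9c sample.log > sample.log.gz
echo "original:   $(wc -c < sample.log) bytes"
echo "compressed: $(wc -c < sample.log.gz) bytes"
```

On repetitive text like this, gzip alone shrinks the file dramatically; real log data will compress less, but whitespace-heavy text is still a favorable case.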

I don't want users to access the file system directly. What should I do? How do I put a REST-based API in front of this?
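As a sketch of the date-range listing in (d), GNU find can already answer the query, assuming each file's modification time matches the day it was written (the directory and file names below are made up for the demo):

```shell
# Made-up demo tree: two log files with back-dated mtimes.
mkdir -p logstore
touch -d '2009-06-15' logstore/web-0615.log
touch -d '2009-08-01' logstore/web-0801.log

# List only the logs from June 2009 (GNU find's -newermt compares mtimes
# against a date string; the ! -newermt bounds the upper end of the range).
find logstore -name '*.log' -newermt '2009-06-01' ! -newermt '2009-07-01'
```

A thin HTTP service that runs a query like this server-side could answer date-range listings without ever letting users touch the filesystem directly.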

Please spare your 2 cents: what would you do?

Ankur

Best Answer

I'm not a distributed file system ninja, but after consolidating as many drives as I can into as few machines as I can, I would try using iSCSI to connect the bulk of the machines to one main machine. There I could consolidate everything into what is hopefully fault-tolerant storage: preferably fault tolerant within a machine (if a drive goes out) and among machines (if a whole machine is powered off).
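For concreteness, attaching one box's disks to the main machine with open-iscsi might look roughly like this (the address is made up, and this assumes each box has already been set up to export its drives as iSCSI targets):

```shell
# Hypothetical sketch: 192.168.1.11 stands in for one of the 20 boxes,
# assumed to already export its disks as iSCSI targets.
iscsiadm -m discovery -t sendtargets -p 192.168.1.11   # find its targets
iscsiadm -m node -p 192.168.1.11 --login               # attach them
# each attached disk then shows up as a local /dev/sdX on the main machine
```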

Personally, I like ZFS. In this case, the built-in compression, dedup, and fault tolerance would be helpful. However, I'm sure there are many other ways to compress the data while keeping it fault tolerant.
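As a sketch of what I mean (the pool and dataset names are made up, and this assumes a pool called "tank" already exists):

```shell
# Hypothetical: enable ZFS's built-in compression and dedup on a dataset
# dedicated to the logs.
zfs create tank/logs
zfs set compression=gzip tank/logs   # whitespace-heavy text should compress well
zfs set dedup=on tank/logs           # caveat: dedup needs a lot of RAM for its table
zfs get compressratio tank/logs      # check the achieved ratio later
```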

I wish I had a real turnkey distributed-filesystem solution to recommend. I know this is really kludgey, but I hope it points you in the right direction.

Edit: I am still new to ZFS and setting up iSCSI, but I recalled seeing a video from Sun in Germany where they were showing off the fault tolerance of ZFS. They connected three USB hubs to a computer and put four flash drives in each hub. Then, to prevent any one hub from taking the storage pool down, they made RAIDZ volumes each consisting of one flash drive from each hub, and striped the four RAIDZ volumes together into one pool. That way only four flash drives were used for parity. Next, of course, they unplugged one hub, which degraded every RAIDZ volume, but all the data was still available. In this configuration up to four drives could be lost, but only if no two of them were in the same volume.
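In zpool terms, the layout from the video would look something like this (the device names are made up; cXdY stands for drive Y on hub/controller X):

```shell
# Hypothetical: four RAIDZ vdevs, each taking one drive from each of the
# three hubs; listing several raidz groups in one pool stripes across them.
zpool create tank \
  raidz c0d0 c1d0 c2d0 \
  raidz c0d1 c1d1 c2d1 \
  raidz c0d2 c1d2 c2d2 \
  raidz c0d3 c1d3 c2d3
zpool status tank   # should show the four raidz1 vdevs
```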

If this configuration were used with the raw drives of each box, that would preserve more drives for data rather than parity. I've heard FreeNAS can (or was going to be able to) share drives in a "raw" manner via iSCSI, so I presume Linux can do the same. As I said, I'm still learning, but this alternate method would be less wasteful, from a drive-parity standpoint, than my previous suggestion. Of course, it relies on using ZFS, which I don't know would be acceptable. I know it is usually best to stick with what you know if you are going to have to build/maintain/repair something, unless this is a learning experience.

Hope this is better.

Edit: Did some digging and found the video I spoke about. The part where they explain spreading the USB flash drives over the hubs starts at 2m10s. The video demos their storage server "Thumper" (X4500) and shows how to spread the disks across controllers so that if a disk controller fails, your data will still be good. (Personally, I think this is just a video of geeks having fun. I wish I had a Thumper box myself, but my wife wouldn't like me running a pallet jack through the house. :D That is one big box.)

Edit: I remembered coming across a distributed file system called OpenAFS. I haven't tried it; I've only read a bit about it. Perhaps others know how it handles in the real world.