Linux filesystem or CDN for millions of files with replication

Tags: amazon-s3, distributed-filesystems, filesystems, linux

Please suggest a solution for the following scenario:

  • several million files, all in a single directory ("img/8898f6152a0ecd7997a68631768fb72e9ac2efe1_1.jpg")
  • ~80 KB average file size
  • 90% random read access
  • backup (replication) to other servers (every 5 minutes or immediately)
  • image metadata is stored in a database

Once the number of files grew past 2 million, we ran into slow random access times.
The file system is ext3 with the noatime and dir_index options, and we never need to run commands like 'ls' or 'find'.

Solutions I consider possible:

  1. stay with ext3 and simply convert the flat directory into a tree structure like "img/889/8f6/152/a0ecd7997a68631768fb72e9ac2efe1_1.jpg" (see the sketch just after this list)
  2. migrate to another file system (ReiserFS, XFS, ext4, etc.)
  3. set up storage on a distributed filesystem (please give examples)
  4. or maybe something else…
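
To make option 1 concrete, here is a minimal sketch of the path mapping I have in mind, assuming the first 3×3 characters of the hash-like name become the directory levels (the function name shard_path and its parameters are only illustrative):

    import os

    def shard_path(base_dir, filename, levels=3, width=3):
        # Map "8898f6152a0ecd7997a68631768fb72e9ac2efe1_1.jpg" to
        # "img/889/8f6/152/a0ecd7997a68631768fb72e9ac2efe1_1.jpg":
        # the first levels*width characters become nested directories,
        # the remainder stays as the file name.
        name = os.path.basename(filename)
        dirs = [name[i * width:(i + 1) * width] for i in range(levels)]
        leaf = name[levels * width:]
        return os.path.join(base_dir, *dirs, leaf)

    print(shard_path("img", "8898f6152a0ecd7997a68631768fb72e9ac2efe1_1.jpg"))
    # -> img/889/8f6/152/a0ecd7997a68631768fb72e9ac2efe1_1.jpg

Since the names are hash-like, the files spread evenly, so even one or two 3-character levels already bring each directory down to a few thousand entries for several million files.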

If we choose option 1 or 2, how do we handle replication? rsync cannot cope with that much data on an ext3 file system.
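
If we do shard the tree, one idea is to run rsync per top-level shard, a few in parallel, instead of one pass over the whole tree; a rough sketch, where the paths, host name, and shard list are placeholders:

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    SRC_ROOT = "/var/www/img/"          # placeholder: local image root
    DEST = "backup-host:/var/www/img/"  # placeholder: rsync destination

    def sync_shard(shard):
        # --archive preserves metadata, --delete mirrors deletions;
        # both are standard rsync options.
        subprocess.run(
            ["rsync", "--archive", "--delete",
             SRC_ROOT + shard + "/", DEST + shard + "/"],
            check=True)

    shards = ["889", "88a", "88b"]  # in practice, list the real top-level dirs
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(sync_shard, shards))  # list() surfaces any rsync failure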

The best solution for us would be Amazon S3, but it is too expensive for our traffic… Maybe you can recommend some alternatives (a cheap CDN or an open-source project)?

Best Answer

Millions of files in one directory is bad design and will be slow. Subdivide them into directories with a smaller number of entries.
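
A one-off migration of the existing flat directory can be scripted; a minimal sketch, assuming three levels of three characters taken from the existing file names (the path and constants are placeholders to adjust):

    import os

    SRC = "/var/www/img"   # placeholder: the current flat directory
    WIDTH, LEVELS = 3, 3   # yields paths like SRC/889/8f6/152/...

    with os.scandir(SRC) as entries:   # scandir streams entries, no huge listing
        for entry in entries:
            if not entry.is_file():
                continue               # skip the shard dirs created below
            name = entry.name
            subdirs = [name[i * WIDTH:(i + 1) * WIDTH] for i in range(LEVELS)]
            dest_dir = os.path.join(SRC, *subdirs)
            os.makedirs(dest_dir, exist_ok=True)
            # rename() is a cheap metadata operation on the same filesystem
            os.rename(entry.path, os.path.join(dest_dir, name[LEVELS * WIDTH:]))

Because files are only ever moved out of the flat directory, the script can safely be re-run if it is interrupted.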

Take a look at https://unix.stackexchange.com/questions/3733/number-of-files-per-directory

Use RAID and/or SSDs. This will not in itself solve the slow access times, but if you introduce multiple directories and reduce the number of files per directory, say by an order of magnitude or two, it will help prevent hotspots.

Consider XFS, especially when using multiple drives and multiple directories; it may give you nice gains (see e.g. this thread for options to use; it has some tips for XFS on md RAID).
