Scalability – Best Way to Design File Storage for Image Processing System

Tags: image-processing, scalability

Would appreciate any recommendations regarding the design of a high-load image processing service.

There are lots of images (tens of hundreds of relatively large files, ~10–20 MB each) that need to be processed using command-line utilities when requested.

Generally: request->image processing->result. Obviously, the system must be able to scale to accommodate considerable load.

My concern, and I believe the core of the system design, is the file storage:

  • Using a single network file storage (a file share, Samba, or something like that) seems ineffective, as that one and only network storage becomes a bottleneck.

  • Having multiple instances of the same file storage on each node (and somehow replicating files between them) is quite expensive.

  • Storing images in a DB? To me that sounds virtually unusable for large files and command-line processing.

I have considered more complex designs, such as one large storage with frequently used files cached locally, but that seems a bit too complicated and not reliable.

Are there any best practices or patterns for a problem like this?

Best Answer

You basically have two "directions" in which you can scale file system storage: vertically and horizontally.

Vertical scaling is basically just making a drive faster. The obvious move here is from a single spinning disk to an SSD, possibly in a RAID. This lets you create a file system with higher bandwidth and throughput. Back before SSDs were common, you might have considered using a number of actual hard drives in a RAID to get more bandwidth out of what shows up to the user as a single logical drive. This can still be worth considering if you need more storage than you're willing to buy SSDs for.

Horizontal scaling in this case is a matter of having a number of separate "file systems" that look like a single file system. Fortunately you're dealing with files that are individually small enough to plan on just storing each on an individual file system (unlike a database that might need a single file of a hundred terabytes or more). This means you can scale horizontally fairly easily by (for example) having a number of servers that each store a specified subset of the files you care about, and use (for example) a hash of the file name to determine which server will store each file.
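For illustration, here is a minimal sketch of that name-hashing idea. The server list and the plain modulo scheme are assumptions for the example; a production system might prefer consistent hashing so that adding or removing a server doesn't reshuffle most of the files.

```python
# Minimal sketch: map a file name to the server that stores it.
# FILE_SERVERS and the modulo scheme are illustrative assumptions.
import hashlib

FILE_SERVERS = ["fs-01", "fs-02", "fs-03", "fs-04"]  # hypothetical hostnames

def server_for(filename: str) -> str:
    """Pick the file server responsible for a given file name."""
    digest = hashlib.sha1(filename.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(FILE_SERVERS)
    return FILE_SERVERS[index]

print(server_for("photos/2024/IMG_0001.tif"))  # e.g. "fs-03"
```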

So, to use this you'd basically have a number of file servers, and a list of files to process. If you're going to have multiple clients processing the files, split the list between them. Each client goes through the names in its part of the list, hashes the name to find the server for that file, and retrieves/processes the file.
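A rough sketch of what such a client worker might look like, assuming the file servers expose the images over HTTP and the processing step is an external command-line tool (the URLs, temp paths, and the ImageMagick `convert` invocation are placeholders, not part of the answer):

```python
# Sketch of a client worker: hash each name in its slice of the list,
# fetch the file from the owning server, then run a command-line tool on it.
import hashlib
import subprocess
import urllib.request

FILE_SERVERS = ["fs-01", "fs-02", "fs-03", "fs-04"]  # hypothetical hostnames

def server_for(filename: str) -> str:
    digest = hashlib.sha1(filename.encode("utf-8")).digest()
    return FILE_SERVERS[int.from_bytes(digest[:4], "big") % len(FILE_SERVERS)]

def process(filenames: list[str]) -> None:
    for name in filenames:  # this client's share of the full list
        server = server_for(name)
        local_path = "/tmp/" + name.replace("/", "_")
        # Fetch the image from the server that owns it (assumed HTTP layout)...
        urllib.request.urlretrieve(f"http://{server}/files/{name}", local_path)
        # ...then run the command-line utility on the local copy.
        subprocess.run(["convert", local_path, "-resize", "50%",
                        local_path + ".out.jpg"], check=True)
```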

Depending on the network and what you're doing with the result of the processing, there's another possibility to consider: start with the same set of servers and the same input list. Hash the names, and send each name directly to the server where that file is stored. Instead of retrieving the file and sending it over the network, the server retrieves and processes the file itself.

The latter can have a significant advantage if the result of processing should be stored alongside the original file, and/or if you have to deal with a relatively slow network, since it only requires transmitting the file names over the network, which we normally expect to be much smaller than the file contents.
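A sketch of that "push the work to the data" variant, assuming each file server exposes a small HTTP endpoint that accepts a batch of names and processes them locally (the endpoint and payload shape are hypothetical):

```python
# Sketch: group the file names by the server that stores them and send each
# server only its own batch, so only names (not image bytes) cross the network.
import hashlib
import json
import urllib.request
from collections import defaultdict

FILE_SERVERS = ["fs-01", "fs-02", "fs-03", "fs-04"]  # hypothetical hostnames

def server_for(filename: str) -> str:
    digest = hashlib.sha1(filename.encode("utf-8")).digest()
    return FILE_SERVERS[int.from_bytes(digest[:4], "big") % len(FILE_SERVERS)]

def dispatch(filenames: list[str]) -> None:
    batches: dict[str, list[str]] = defaultdict(list)
    for name in filenames:
        batches[server_for(name)].append(name)
    for server, names in batches.items():
        body = json.dumps({"files": names}).encode("utf-8")
        req = urllib.request.Request(
            f"http://{server}/process",   # hypothetical processing endpoint
            data=body,
            headers={"Content-Type": "application/json"},
        )
        urllib.request.urlopen(req)       # each server processes its files locally
```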
