Web-development – Design of high performance file processing web application

big data, design, high performance, web-development

I'm trying to design a web app that can scale, but I can't wrap my head around a few concepts. I want to design it right, but I'm not an experienced programmer; I have more of a systems engineering background.

The basic architecture looks like this:

Web Server -> File processing server -> NoSQL DB -> Search server

The main scenario is as follows:

  • The user uploads a file via the site
  • The file is sent to a server for processing (Python script)
  • The processing results are sent to the NoSQL DB
  • The results are processed by the search server and returned to the user

We can scale the web frontends via load balancing, with something like nginx + Apache.
Database scaling is taken care of by Cassandra or MongoDB. Search scaling is taken care of by Elasticsearch or Sphinx clustering.

Now I want to be able to add multiple file processing servers in case an uploaded file is too big for one. That means I need to split the file into chunks and process them simultaneously on multiple nodes. In addition, if a node goes down while working, it shouldn't affect anything else and no data may be lost. So I need some other component that allocates tasks to my file processing servers, balances the load, and controls the execution of tasks.

How do I design a custom application for this kind of thing? Should I use message queuing?
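To make the message-queuing idea concrete, here is a minimal single-machine sketch, not a production design. It assumes an in-process `queue.Queue` standing in for a real message broker and threads standing in for file processing servers. The names `process_chunk`, `worker`, and `run_all`, and the retry limit, are hypothetical and chosen only for illustration. A chunk whose processing fails is put back on the queue, so another worker can pick it up.

```python
import queue
import threading

def process_chunk(chunk):
    """Hypothetical processing step: count the lines in a chunk."""
    return chunk.count(b"\n")

def worker(tasks, results, lock, process, max_retries):
    """Pull chunks off the queue until a None sentinel arrives."""
    while True:
        item = tasks.get()
        if item is None:              # sentinel: no more work
            tasks.task_done()
            return
        chunk_id, chunk, attempts = item
        try:
            value = process(chunk)
            with lock:
                results[chunk_id] = value
        except Exception:
            # Simulated node failure: requeue so another worker can retry.
            if attempts + 1 < max_retries:
                tasks.put((chunk_id, chunk, attempts + 1))
        finally:
            tasks.task_done()

def run_all(chunks, process=process_chunk, n_workers=4, max_retries=3):
    """Queue every chunk, run the workers, return results in chunk order."""
    tasks = queue.Queue()
    results = {}
    lock = threading.Lock()
    for chunk_id, chunk in enumerate(chunks):
        tasks.put((chunk_id, chunk, 0))
    threads = [
        threading.Thread(target=worker,
                         args=(tasks, results, lock, process, max_retries))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    tasks.join()                      # wait for all chunks, including retries
    for _ in threads:
        tasks.put(None)               # then tell each worker to stop
    for t in threads:
        t.join()
    # A chunk that failed max_retries times shows up as None.
    return [results.get(i) for i in range(len(chunks))]
```

In a real deployment the queue would live in a separate broker process so that queued tasks survive a worker crash, which this in-memory sketch does not provide.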

Best Answer

Computing power is cheap nowadays. Moreover, you don't yet know where the bottleneck will be.

To me, this smells like premature optimization, where you worry about performance before you even have the load. Perhaps you should start by making it work, then worry about scaling it. My 2 cents.

The question is also whether you want quick processing time for a single file or high overall throughput. If the processing is really resource- or time-intensive, it makes sense to split the file, distribute the pieces, and merge the results. However, this comes at a cost: splitting, sending, scheduling, merging the outputs, and handling partial failures. These tasks consume resources too and add a lot of complexity. Distributed computation only suits tasks that can be divided cleanly. Processing one whole file per server is sometimes more efficient than doing all of this.
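As a rough illustration of the split, distribute, and merge steps, here is a minimal sketch that assumes the processing is a word count, which divides cleanly. The thread pool stands in for separate servers, and all function names are hypothetical. Even in this toy case the split has to respect line boundaries and the partial results need an explicit merge step, which is the extra work described above.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def split_on_lines(data, n_parts):
    """Split bytes into at most n_parts pieces, cutting only at newlines."""
    lines = data.splitlines(keepends=True)
    if not lines:
        return []
    size = -(-len(lines) // n_parts)  # ceiling division
    return [b"".join(lines[i:i + size]) for i in range(0, len(lines), size)]

def count_words(chunk):
    """Per-chunk work: count whitespace-separated words."""
    return Counter(chunk.split())

def merge(partials):
    """Combine the per-chunk counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

def distributed_count(data, n_parts=4):
    """Split, process the chunks concurrently, then merge."""
    chunks = split_on_lines(data, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partials = list(pool.map(count_words, chunks))
    return merge(partials)
```

For small inputs, calling `count_words(data)` directly gives the same answer with none of the coordination, which is why measuring first is worthwhile.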
