Distributed Computing – Using Remote Heterogeneous Machines

Tags: c, distributed-computing, sockets

The way I am doing it now is using boost::asio TCP sockets, handling everything manually with a main server that orchestrates the processes across the available machines. But the number of machines is increasing, and when specific machines need to communicate with each other, the traffic has to go through the server. The number of machines is just too much to be handled by one server, so I am thinking about Open MPI.

However, I have three problems:

  1. The machines are heterogeneous.
  2. The number of machines available varies: it can be, for example, 50 at a time, or only 10, and sometimes it reaches 300, which is too much for my server to handle. There is always a ton of data to process, and I can't seem to utilize more than ~70 connections at the same time.
  3. Most of the machines are remote, and some of them share the same
    network.

If I want to scale things with my current design, I would end up with ugly trees with multiple layers. I don't have the money to hire network experts/programmers for a non-profit project, but I am fairly good at grasping new concepts, so how would you go about this?

And given the above problems, is OpenMPI for me?

In other words, I am looking for a better design than mine, and I have no problem implementing it from scratch, no matter how long it takes. I will be supporting Linux only at the start, because I figured that native code has a measurable advantage. I would like to hear your ideas.

Best Answer

There are a number of important missing bits of information:

  • Why is OpenMPI relevant?
  • Why is heterogeneous relevant?
  • What is the work that the server is orchestrating?
  • You mention one thread per connection is a problem.

If you're using Boost.Asio and you're having the problem of one thread per client, then you're likely doing something wrong. Make sure that you're doing things asynchronously and not blocking any one thread for too long. This is probably the easiest thing for you to fix at the moment.

OpenMPI is not what you need. MPI was originally designed for distributed-memory architectures. It is possible to use it over the internet, but it is probably not the best choice. The fact that you mention heterogeneity and MPI makes me think that you've done a CS course in HPC. That is not relevant in this case.

When deciding on a protocol and architecture, the type of work the clients are doing is important, as is how you share state between the clients and the server, and how long each message takes to process.

If you're optimizing for throughput, then per-message overhead is less important, so verbose serialization formats like XML and JSON are OK. You can even tear down connections between message-processing jobs, meaning that the server does not have to maintain a thread.

If you're trying to keep latency low, then per-message overhead is important, so maintain a connection and use terse serialization formats.

A message broker is an architectural pattern that you could consider. It is responsible for distributing your messages between your applications.

You're better off using internet-friendly protocols like HTTP, so you don't have to worry about proxies or NAT traversal. The internet can be considered a massive distributed system with millions of heterogeneous clients, and REST is an architectural style inspired by that, so it might be appropriate.

I'm going to assume that:

  • the server produces work items for clients to do and aggregates responses;
  • each client receives a message, does some work, and sends a response back.

My default choice of technology would be a web server to implement this. It would have two URLs: one where clients can GET new work, the other where clients POST responses.

When you initially start the client, it should periodically poll for new work (you could use WebSockets or long-poll HTTP requests). When the server returns a work item, exactly one client should consume it. Each work item should be stamped with a unique ID, so that you can ensure there are no duplicates and can correlate responses with the initial job.

When the client has completed the task, it should POST the response and restart the process of polling for new work.

Web servers don't work well for long-running requests, so you would want to hand the work off to something else, perhaps via a database, IPC, or a message queue.

The server then does not have to worry about the capabilities of each client (which makes the heterogeneity irrelevant): the clients will just consume work at whatever rate they can.
