What tools do distributed programmers lack

distributedlanguage-agnostic

I have a dream to improve the world of distributed programming 🙂

In particular, I'm feeling a lack of necessary tools for debugging, monitoring, understanding and visualizing the behavior of distributed systems (heck, I had to write my own logger and visualizers to satisfy my requirements), and I'm writing a couple of such tools in my free time.

Community, what tools do you lack with this regard? Please describe one per answer, with a rough idea of what the tool would be supposed to do. Others can point out the existence of such tools, or someone might get inspired and write them.

Best Answer

OK, let me start.

A distributed logger with a high-precision global time axis - allowing to register events from different machines in a distributed system with high precision and independent on the clock offset and drift; with sufficient scalability to handle the load of several hundred machines and several thousand logging processes. Such a logger allows to find transport-level latency bottlenecks in a distributed system by seeing, for example, how many milliseconds it actually takes for a message to travel from the publisher to the subscriber through a message queue, etc.

Syslog is not ok because it's not scalable enough - 50000 logging events per second will be too much for it, and timestamp precision will suffer greatly under such load.

Facebook's Scribe is not ok because it doesn't provide a global time axis.

Actually, both syslog and scribe register events under arrival timestamps, not under occurence timestamps.

Honestly, I don't lack such a tool - I've written one for myself, I'm greatly pleased with it and I'm going to open-source it. But others might.

P.S. I've open-sourced it: http://code.google.com/p/greg