After a few days of intense trial and error, I'm glad to say that I've understood where the bottleneck was, and I'll post it here so that other people can benefit from my findings.
The problem lay in the pub/sub connections I was using with socket.io, and in particular in the RedisStore that socket.io uses to handle inter-process communication between socket instances.
After realizing that I could easily implement my own version of pub/sub on top of Redis, I decided to give it a try: I removed the RedisStore from socket.io and left it with the default memory store. (I don't need to broadcast to all connected clients, only between two users who may be connected to different processes.)
Initially I declared only two global Redis connections per process to handle the pub/sub for every connected client. The application used fewer resources, but I was still seeing constant CPU usage growth, so not much had changed. Then I tried creating two new Redis connections for each client, handling pub/sub only for that client's session, and closing the connections once the user disconnected. After one day of usage in production, the CPUs were still at 0-5%... bingo! No process restarts, no bugs, and the performance I was expecting to have. Now I can say that node.js rocks, and I'm happy to have chosen it for building this app.
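A minimal sketch of that per-client lifecycle, under my assumptions (the `UserChannel` name and channel-naming scheme are mine; in production the injected factory would be something like `require('redis').createClient`):

```javascript
// Hypothetical sketch of the per-client pub/sub lifecycle described above.
// `createClient` is injected so it can be a real Redis client factory in
// production or a stub in tests.
class UserChannel {
  constructor(userId, createClient) {
    this.userId = userId;
    // Two dedicated connections: a Redis connection in subscriber mode
    // cannot issue regular commands, so publishing needs its own.
    this.sub = createClient();
    this.pub = createClient();
  }

  // Subscribe to this user's private channel when the socket connects.
  start(onMessage) {
    this.sub.subscribe('user:' + this.userId);
    this.sub.on('message', (channel, message) => onMessage(message));
  }

  // Deliver a message to another user, possibly on another process.
  sendTo(otherUserId, message) {
    this.pub.publish('user:' + otherUserId, message);
  }

  // Close both connections on disconnect so nothing accumulates.
  stop() {
    this.sub.quit();
    this.pub.quit();
  }
}
```

On socket.io's side this would hang off the connection events: create a `UserChannel` on 'connection' and call `stop()` on 'disconnect'.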
Fortunately, Redis was designed to handle many concurrent connections (unlike MongoDB), and by default the limit is 10k clients. With two connections per user, that leaves room for around 5k concurrent users on a single Redis instance, which is enough for me for the moment, but I've read it can be pushed up to 64k concurrent connections, so this architecture should be solid enough, I believe.
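For reference, the knob behind that 10k default is `maxclients` in redis.conf; a sketch of raising it (note the effective ceiling is also bounded by the process's file-descriptor limit, e.g. `ulimit -n`):

```
# redis.conf: the default is 10000; raise it if you need more
# concurrent client connections.
maxclients 64000
```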
At this point I was thinking of implementing some sort of connection pool for Redis to optimize things a little further, but I'm not sure whether that would cause pub/sub events to build up on the connections again, unless each of them is destroyed and recreated every time to clean it.
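One possible shape for that, as a sketch only: instead of destroying and recreating each connection, the pool's release step could wipe subscriptions and listeners before reuse, so events cannot build up. (The `RedisPool` name is mine; `unsubscribe()` with no arguments and `quit()` are assumed client methods, and `removeAllListeners` is standard EventEmitter.)

```javascript
// Hypothetical sketch: a tiny pool whose release() cleans each
// connection before reuse, so pub/sub state cannot leak between users.
class RedisPool {
  constructor(createClient, maxIdle) {
    this.createClient = createClient;
    this.maxIdle = maxIdle;
    this.idle = [];
  }

  // Hand out an idle connection if one exists, else open a new one.
  acquire() {
    return this.idle.pop() || this.createClient();
  }

  release(client) {
    // Leave every channel and drop message handlers before pooling;
    // this is the "clean it" step instead of destroy-and-recreate.
    client.unsubscribe();
    client.removeAllListeners('message');
    if (this.idle.length < this.maxIdle) {
      this.idle.push(client);
    } else {
      client.quit();
    }
  }
}
```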
Anyway, thanks for your answers. I'd be curious to know what you think, and whether you have any other suggestions.
Cheers.
Back when that article was written, MongoDB's default install was configured for unsafe writes. This performed great in benchmarks, but not so great at saving your data correctly.
https://aphyr.com/posts/284-call-me-maybe-mongodb
Up until recently, clients for MongoDB didn't bother to check whether or not their writes succeeded, by default: they just sent them and assumed everything went fine. This goes about as well as you'd expect.
https://blog.rainforestqa.com/2012-11-05-mongodb-gotchas-and-how-to-avoid-them/
MongoDB allows very fast writes and updates by default. The tradeoff is that you are not explicitly notified of failures. By default most drivers do asynchronous, ‘unsafe’ writes - this means that the driver does not return an error directly, similar to INSERT DELAYED with MySQL. If you want to know if something succeeded, you have to manually check for errors using getLastError.
For cases where you want an error thrown if something goes wrong, it’s simple in most drivers to enable “safe” queries which are synchronous. This makes MongoDB act in a familiar way to those migrating from a more traditional database.
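For illustration only: with current drivers the same thing is usually expressed as a write concern, which can be set directly in the connection string (`w` and `journal` are standard MongoDB URI options; the host and database names here are placeholders):

```
mongodb://localhost:27017/mydb?w=1&journal=true
```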
Chances are you're seeing the results of safer default settings.
Best Answer
"Generally speaking", there is no number you should keep your CPU under. If you have a quad-core system and your application is performing well at a load average of 12, then more power to you. You really need to test and determine these things yourself. Much more is reflected in CPU load averages than just what the CPU is doing.