Keeping TCP connections alive to track which clients are online

Tags: connection, keep-alive, tcp

I am developing an application where a server needs to stay in touch with a large number of simple IoT devices. Almost no information needs to be exchanged between the server and each device, but the devices need to stay online and reachable by the server around the clock. On rare occasions the server needs to get in touch with one of the devices and exchange some messages; it is crucial, however, that the device is reachable within a very short time.

This means that I need those client devices to be somehow continuously connected. Now, I wonder: is it feasible to simply connect those devices via TCP and keep those connections alive, so they are always ready to exchange messages?

I have tried to read around and I keep finding the same answer: it depends on your implementation, since the message exchange and processing is likely to be the bottleneck rather than keeping the TCP connections alive. That is not really my case, since I only need to exchange a very limited amount of information at very long intervals.

So is it reasonable to just keep those clients connected? Or should I devise a more efficient method? For example, how much bandwidth is required just to keep a TCP connection alive without any data exchange? And does this consume a significant amount of memory or CPU?

I implemented a simple C++ program that sends UDP keep-alives to my server every few seconds: according to my benchmarks, this can scale up to several million online devices without any problem, even on a reasonably limited server. Will TCP perform worse than that?

Best Answer

As far as I understand TCP, the phrase "keeping TCP connections alive" is misleading, as there is no TCP-protocol-specific timeout mechanism for ESTABLISHED connections. I mean: once established, they can last forever, until a RST, a FIN, or a timeout waiting for an ACK (following some transmission that needs to be acknowledged, in this last case) happens.

In my experience, 100% of the "suddenly broken due to idle timeout" sort of issues depend on some intermediary router/firewall along the routing path between the two communicating hosts. I mean: since the firewall is typically a stateful firewall, it keeps track of the connections it is firewalling/managing. As such, every connection it needs to track consumes some degree of system resources (of the firewall, I mean). Also, the firewall knows perfectly well which of the managed connections are working and which, vice versa, are idle, due to the very nature of the firewall itself (it's a stateful firewall!). As such, lots (all?) of the firewall implementations have a timeout defined, and if a managed connection stays idle longer than that timeout, the firewall sends a RST to both ends of the TCP connection and frees its own resources.

Based on your question, I bet that the TCP connection will be opened by your IoT device (acting as a client) towards your controlling server (the TCP server). Hence... LOTS, if not ALL, of the ADSL home routers that NAT your IoT devices' traffic will surely act as described.

This, at least, based on my own experience.

But as I'm not Jon Postel, please don't blame me if I'm wrong :-)

As a side note: you wrote "...LOTS of simple IoT devices...". Please keep in mind that a TCP connection is identified by the 4-tuple (client IP, client port, server IP, server port), and a TCP port is a 16-bit value. So, by TCP's intrinsic design, you cannot exceed 64K concurrent connections from any single client IP address to the same server address and port; at that scale you will also run into other per-server limits (file descriptors, memory, NAT devices sharing one public IP). How these problems can be solved is out of scope in the context of this question.

Finally, let me add that I really see no problem in implementing a sort of heartbeat protocol between your IoT devices and the managing server/application. It can be implemented to be very network-friendly, with negligible impact in terms of bandwidth and with lots of advantages in terms of manageability/control.