How to Notify Users of Changes During Collaborative Editing

Tags: architecture, design, microservices, scalability, websockets

"Designing Google Docs" is a popular system design interview question, and there are plenty of articles on it on the web. However, those mostly touch on merging changes, but don't explain how to swiftly notify a user of a change made by another user in the same document. Supposedly, for collaborative editing getting notified of changes as fast as possible is one of the main requirements. So my question is, what design would allow to quickly notify all connected users of a document that another user just made a change? This design should meet the following requirements:

  • Deliver change notifications within seconds
  • Be scalable (up to 100 users can edit the document at the same time; no limit on the number of documents being concurrently edited in the system)
  • Be fault-tolerant

Change merging strategies are out of the scope of this question.


That was the question, now here are some of my thoughts and what I came up with:

The user client needs to hold an open WebSocket or long-polling connection to quickly receive new changes. If this connection is handled by an arbitrary node, we have several ways of checking for new changes in the document:

  • Polling the database or a cache for new changes — puts unnecessary load on the system, is slow, and doesn't scale
  • Running all changes through a single message queue and processing it on every node to notify the users connected to that node — doesn't scale, as the time between enqueuing a message and processing it grows with the number of documents being edited concurrently

Neither of these two ideas seems to work, so what if user connections for a particular document are handled by a node designated to that document? Say, users working with documents D1, D2, D3 are routed to node N1, and users working with documents D4, D5 are routed to node N2. In this case, a change made in a document by one of the collaborators can be sent to all the other collaborators as soon as it is validated, merged, and persisted to the database, because all collaborators are connected to the same node. Now, this design requires a separate registry service, which would:

  • Be aware which documents are being handled by which node
  • Designate a node to handle a document if this document isn’t currently being handled by any node
  • Do healthchecks against nodes to quickly reassign documents from a failing node to a healthy one

So, when a user wants to interact with a document, the user client makes a request to establish a connection, the node for the requested document is looked up in the registry, and the connection is established to this node.
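The registry described above could be sketched roughly as follows (all names hypothetical): on the first request for a document it assigns the least-loaded node, after that it keeps returning the same node, and on a failed health check it reassigns that node's documents to healthy ones.

```python
class DocumentRegistry:
    """Hypothetical sketch of a document-to-node registry."""

    def __init__(self, nodes):
        self.nodes = list(nodes)           # currently healthy nodes
        self.assignments = {}              # doc_id -> node
        self.load = {n: 0 for n in nodes}  # documents handled per node

    def node_for(self, doc_id):
        """Return the node handling doc_id, designating one if needed."""
        node = self.assignments.get(doc_id)
        if node is None:
            # Designate the least-loaded healthy node for this document.
            node = min(self.nodes, key=lambda n: self.load[n])
            self.assignments[doc_id] = node
            self.load[node] += 1
        return node

    def fail_node(self, node):
        """Called when a health check fails: reassign the node's documents."""
        self.nodes.remove(node)
        del self.load[node]
        for doc_id, assigned in list(self.assignments.items()):
            if assigned == node:
                del self.assignments[doc_id]
                self.node_for(doc_id)  # pick a new healthy node
```

A real registry would of course need to be replicated itself (e.g. backed by a consensus store) so it doesn't become a single point of failure.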

Fault tolerance:

  • The client that makes a change gets notified once the change is persisted and acknowledged. If the client connection is broken, all unacknowledged changes get resent on reconnect.
  • If the client connection is broken, the client gets all the latest changes on reconnect.

Thus, if a node unexpectedly shuts down, the registry service assigns its documents to a new node, all affected users switch their connections to that node, receive all the latest persisted changes, and resend their unacknowledged changes to the new node.
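The client side of this fault-tolerance scheme might look roughly like the following sketch (names hypothetical): every change stays in a pending buffer until the server acknowledges it as persisted, and on reconnect the client both catches up from its last seen revision and replays the buffer.

```python
import itertools

class DocumentClient:
    """Sketch of a client-side change buffer with resend-on-reconnect."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.pending = {}        # change_id -> change, awaiting server ack
        self.last_seen_rev = 0   # last server revision applied locally

    def submit(self, change):
        """Buffer a change; it stays pending until the server acks it."""
        change_id = next(self._ids)
        self.pending[change_id] = change
        return change_id  # the change would also be sent over the connection

    def on_ack(self, change_id):
        """Server persisted the change: drop it from the pending buffer."""
        self.pending.pop(change_id, None)

    def on_reconnect(self):
        """After a dropped connection (or a node switch): return the
        revision to catch up from and the changes to resend."""
        return self.last_seen_rev, list(self.pending.values())
```

Because a resend may duplicate a change the old node persisted just before dying, the server would also need to deduplicate by change ID.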

I think this design can work and is scalable, since it's possible to introduce as many nodes as needed, spreading documents as thinly as needed across them. However, I'm skeptical of the fact that instead of nodes that can handle incoming connections for any document, I now have special-purpose nodes that can't be killed without the hassle of switching all connected users over to new nodes.

How do you find this design? What design would you advise?

Best Answer

To answer the question of how Google Docs works:

Here is a blog post by Neil Fraser, who designed the syncing system of Google Docs: https://neil.fraser.name/writing/sync/

The link points to a detailed write-up of a syncing system that can work in a lossy environment like the Internet.

The sync architecture shown there looks complex, but it is really just made up of many parts.

The whole process is a giant diff-match-patch cycle between the clients and the server farm (since a single server can't handle all the load), controlled and timed by the client.

More details can be found in the linked document.

Answer

The Adaptive Timing part (section 7) of the linked document explains that the system pulls changes to the client within a time range (in the example given, at most every 1 s and at least every 10 s).

The main driver of request volume is this adaptive timing: even at very long timescales, with a sufficient number of documents in circulation, every connected client needs to check whether any changes happened to its documents in the meantime. Everything else is just bootstrapping this process (on the client) or handling the resulting packets (server side).

As always in system design: find the main driver of request count, design your system to handle it, and ignore the rest in the design calculations.

The rest is just interface design plus server-side request handling and storage.
