Does anyone know how the rainbow unicorns (Netflix, Amazon, Google, etc.) handle large files / data exchange between their services?
Unfortunately, I do not know how they deal with such problems.
The problem is this: does it make sense for all our microservices to accept this unique ID as part of their API for the purpose of interacting with documents, or not?
It violates the Single Responsibility Principle, which should be inherent in your microservices architecture. One microservice - logically one, physically many instances representing it - should deal with one topic.
In the case of your document store, you have one point where all queries for documents go (of course, you could split this logical unit into multiple document stores for different kinds of documents).
If your "application" needs to work on a document, it asks the respective microservice and processes its result(s).
If another service needs an actual document or parts of it, it has to ask the document service.
One of the key contention points we are facing is how to communicate large quantities of data between our different services.
This is an architectural problem:
Decrease the need to transfer large amounts of data
Ideally, each service has all of its data and needs no transfer to simply serve requests.
As an extension of this idea - if you do need to transfer data, think of redundancy (*in a positive way*): Does it make sense to keep the data redundantly in the many places where it is needed? Think about how possible inconsistencies might harm your processes. No transfer is faster than none at all.
Decrease the size of the data itself
Think about how you could compress your data, from actual compression algorithms up to smart data structures. The less that goes over the wire, the faster you are.
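As a sketch of the compression option: Python's standard library can shrink a repetitive payload before it goes over the wire (the payload here is made up, and a real service would signal this with a `Content-Encoding: gzip` header):

```python
import gzip
import json

# Hypothetical payload: a large, repetitive JSON document.
payload = json.dumps(
    {"rows": [{"id": i, "status": "ok"} for i in range(1000)]}
).encode("utf-8")

# Compress before sending; the receiver decompresses before parsing.
compressed = gzip.compress(payload)

# Repetitive data compresses very well, so far fewer bytes cross the wire.
assert gzip.decompress(compressed) == payload
print(f"{len(payload)} bytes -> {len(compressed)} bytes")
```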
Disclaimers
Our company runs applications on a Micro Service architecture that
includes thousands of services. I am working on a backend application
"X" that talks to 50+ services. Frontend services call my service "X"
to execute requests on other services.
First of all, thousands of random services don't make an architecture a microservices architecture. You still need a certain sense of the "whole" and a bit of arrangement among services: guidelines or rules of thumb.
Contextualize the backend within the 'whole'
I assume this backend is neither a gateway nor a proxy. It has its own business and a well-defined domain. So, with regard to other services, 'X' is a facade that eases access to this domain.
As a facade, hiding implementation details (such as integrations) is among its responsibilities. No implementation detail should reach other services, and this includes integration errors. Whatever happened in 'X' is nobody else's business.
That said, it doesn't mean we cannot tell the user that something went wrong. We can, but we abstract away the details. We won't give the impression that something remote is failing. Quite the opposite: something in 'X' failed, and that's it.
Since we are speaking about thousands of possible integrations (50+ at the moment), the number of possible and distinct errors is significant. If we map every single one to a custom message, the end user will be overwhelmed by so much (and uncontextualized) information. If we map all the errors to a small set of custom errors, we bias the information, making it hard for us to track down the problem and solve it.
In my opinion, error messages should give the user the sense that there's something we can do to amend the problem.
Nevertheless, if end-users still want to know what's going on under the hood, there are better ways. For example, logs.
Accountability
Other services do not return user-friendly messages. It is not possible for me to request changes by other teams as there are several. There are no agreed error codes as such.
Other services return a string error message. Currently, it is passed back to the UI. Sometimes the error messages are pointer references (bad code :/)
As a developer, your responsibility is to expose these arguments to the stakeholders. It's a matter of accountability. In my opinion, there's a lack of technical leadership, and that's a real problem when it comes to distributed systems.
There's no technical vision. If there were, services would be implemented following rules of thumb aimed at making the system scalable and easing the integrations among services. Right now it looks like services appear in the wild.
If I were asked to do what you have been requested to do (and I have been, sometimes), I would argue whether turning the current anarchy into user-friendly messages is beyond the scope of X.
At the very least, raise your hand, expose your concerns and your alternatives, and let whoever has the accountability decide.
Make your solutions valuable for the company
Check for the error message string and have a mapping in my service to a user-friendly message. But things can break if the callee service changes their error message. Fall back to a default error message when a custom error mapping is not found.
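For reference, the proposal quoted above amounts to something like this (the error strings and messages are made up):

```python
# Hypothetical mapping from upstream error strings to user-friendly text.
ERROR_MAP = {
    "connection timed out": "The request took too long. Please try again.",
    "document not found": "We could not find that document.",
}

DEFAULT_MESSAGE = "Something went wrong. Please try again later."

def to_user_message(upstream_error: str) -> str:
    # Falls back to the default when the callee's string is unknown --
    # which is exactly what happens whenever a team rewords an error.
    return ERROR_MAP.get(upstream_error, DEFAULT_MESSAGE)
```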
You are right. That's a weak solution: it's brittle and inefficient in the mid-to-long run. I also think it causes coupling, since changes in those strings would force you to refactor the mappings. Not much of an improvement.
Any more ideas on a scalable and sustainable solution?
Reporting. Handle the errors, give them a code/ticket/ID, and report them. Then allow the front end to visualize the report, for instance by sharing a link to the reporting service.
Error. <A user-friendly and very generic default error message>. Follow the link for further information.
This way, you can integrate as many services as you need, and you free yourself from the overhead of handling and translating random strings into new random, but user-friendly, strings.
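A minimal sketch of this idea, with an in-memory dictionary standing in for the reporting service's storage and a made-up URL scheme:

```python
import uuid

# In production this would be the reporting service's own storage.
REPORTS = {}

def report_error(upstream_service: str, raw_error: str) -> dict:
    """Store the raw details under an ID and return only a generic message."""
    error_id = str(uuid.uuid4())
    REPORTS[error_id] = {"service": upstream_service, "raw": raw_error}
    return {
        "message": "Something went wrong. Follow the link for further information.",
        "details": f"https://reports.example.com/errors/{error_id}",
    }

# The raw upstream string never reaches the UI; only the link does.
response = report_error("billing-service", "NullPointerException in handler")
```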
The reporting service is reusable by the rest of the services, so that if you have correlation IDs, it should be possible to give users a panoramic view of the errors and their causes. In distributed architectures, traceability is quite important.
Later, the reporting service can be enhanced with as many mappings as you need to give readable and useful instructions about what to do if error X happens. If strings change here, it doesn't matter at all: what we store is the final state of the report.
The reporting service will open the door to a possible normalization of the errors within the organization since the service will expose a public API (hence a contract).
Best Answer
There are some important aspects you should consider first.
Streaming
Let's imagine the 100 MB file is received by the service A which transfers it to service B, which, in turn, uses service C to do the actual parsing of the proprietary format.
The wrong approach would be for services A and B to start sending the file to the underlying service only after they have completely received it from the client.
Instead, as soon as they start receiving the file, they should stream it to the underlying service.
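A sketch of the chunked forwarding, using an in-memory buffer to stand in for the incoming request body. A real service A would hand a generator like this to its HTTP client (for example, `requests.post(url, data=stream_chunks(body))`, which produces a chunked transfer), so that B starts receiving before the client has finished uploading:

```python
import io

CHUNK_SIZE = 64 * 1024  # forward in 64 KiB pieces (an arbitrary choice)

def stream_chunks(source):
    """Yield the upload chunk by chunk instead of buffering it completely."""
    while True:
        chunk = source.read(CHUNK_SIZE)
        if not chunk:
            break
        yield chunk

# Demonstration: a fake request body held in memory for the example.
incoming = io.BytesIO(b"x" * (3 * CHUNK_SIZE + 10))
forwarded = sum(len(c) for c in stream_chunks(incoming))
```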
This means that you're not waiting the time it takes to transfer 100 MB three times, but only once, plus the latency...
Latency
Latency, on the other hand, cannot be avoided. Every intermediary service would still have to open the HTTP/HTTPS connection to the underlying service, before starting to transfer the file.
If your micro-services are located in the same data center, chances are the latency is a matter of a few milliseconds. If the services are hosted in different data centers, the latency may grow. With a high number of intermediaries, this can become a problem, and it will affect even small requests.
Possible DoS
When using the streaming technique, you should check that you don't open yourself up to a possible DoS attack. The risk is that the intermediaries keep the HTTP connection open for as long as the client is sending the file. The DoS attack would then consist of sending lots of files at a very low speed in order to exhaust the connections the services are able to handle.
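One common mitigation is to enforce a minimum transfer rate and an overall deadline per upload. A sketch, where the thresholds are illustrative and the clock is injectable so the guard can be tested without real waiting:

```python
import time

MIN_BYTES_PER_SEC = 10 * 1024   # abort uploads slower than 10 KiB/s
MAX_TOTAL_SECONDS = 300         # and anything running longer than 5 minutes

def guarded_chunks(chunks, clock=time.monotonic):
    """Pass chunks through, aborting clients that send too slowly for too long."""
    start = clock()
    received = 0
    for chunk in chunks:
        received += len(chunk)
        elapsed = clock() - start
        if elapsed > MAX_TOTAL_SECONDS:
            raise TimeoutError("upload exceeded the overall deadline")
        # Only judge throughput after a grace period of one second.
        if elapsed > 1.0 and received / elapsed < MIN_BYTES_PER_SEC:
            raise TimeoutError("upload too slow; closing the connection")
        yield chunk
```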