Microservices – File Storage Microservice Design

Tags: architecture, data, microservices

Problem overview:
I'm creating a Spring application purely for learning purposes. I would like to create a microservice just for files, which would:

  • at the beginning, expose only two basic endpoints: upload and download a file (used by other microservices, not by a user directly)
  • have its own database with a file table, at least at the beginning
  • store file content in cloud storage (e.g. Amazon S3)
  • store meta-information about the file in the file table. By meta I mean:
    • URL of the file in the cloud
    • Name
    • Size
    • Type (a directory or a file)
    • Extension
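The metadata row described above could be sketched as a plain Java class; all names here (FileMeta, extensionOf) are hypothetical illustrations, not part of the question:

```java
import java.util.Locale;

// Sketch of one row of the file table described above; field names mirror the columns.
public class FileMeta {
    public enum Type { FILE, DIRECTORY }

    public final String url;       // URL of the file in the cloud
    public final String name;
    public final long size;
    public final Type type;
    public final String extension;

    public FileMeta(String url, String name, long size, Type type, String extension) {
        this.url = url;
        this.name = name;
        this.size = size;
        this.type = type;
        this.extension = extension;
    }

    // Derive the extension column from an uploaded file name.
    public static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toUpperCase(Locale.ROOT);
    }
}
```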

Example flow:

  • microservice A sends an image, user_attachment.jpg, to the files microservice's upload endpoint
  • the server does validation, security checks, etc.
  • it sends the file to cloud storage, e.g. Amazon S3
  • it saves meta-info about the file in a local table called file, e.g. it runs: INSERT INTO file (url, name, size, type, extension) VALUES ('http://aws.amazon.com/testapp/pictures/2018-09/picture.jpg', 'picture', 34233, 'FILE', 'JPG')
  • it sends the meta-info back to microservice A, which initiated the upload operation

Uploading to the cloud would happen on a different thread, asynchronously, and perhaps even with the help of a message broker.
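The flow above could be sketched as a plain Java service, with hypothetical CloudStore and MetaRepository interfaces standing in for S3 and the database (all names here are assumptions for illustration, not part of the question):

```java
import java.util.HashMap;
import java.util.Map;

interface CloudStore {                        // stands in for S3
    String put(String key, byte[] content);   // returns the object URL
}

interface MetaRepository {                    // stands in for the file table
    void insert(Map<String, Object> row);
}

public class FileService {
    private final CloudStore cloud;
    private final MetaRepository repo;

    public FileService(CloudStore cloud, MetaRepository repo) {
        this.cloud = cloud;
        this.repo = repo;
    }

    // validate -> store in the cloud -> save meta-info -> return it to the caller
    public Map<String, Object> upload(String name, byte[] content) {
        if (name == null || name.isEmpty() || content.length == 0)
            throw new IllegalArgumentException("invalid upload");   // validation step

        String url = cloud.put(name, content);                      // send to cloud storage

        Map<String, Object> meta = new HashMap<>();
        meta.put("url", url);
        meta.put("name", name);
        meta.put("size", (long) content.length);
        meta.put("type", "FILE");
        meta.put("extension", name.contains(".")
                ? name.substring(name.lastIndexOf('.') + 1).toUpperCase()
                : "");
        repo.insert(meta);                                          // save the meta-info row
        return meta;                                                // sent back to microservice A
    }
}
```

In a real Spring application the upload method would sit behind a @PostMapping controller and the cloud call would be dispatched asynchronously, as the question suggests.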


Questions:

  1. Is it a clean, design-pattern-compliant approach?
  2. What is good/bad about that approach?
  3. How could I do this in a better way? (I don't want to store file content on the machine where the application is running [locally].)

Misc:
This example is simple, but I believe designs in commercial/big projects have similar assumptions and foundations. So let's assume that the file microservice has multiple tables, and that both their structure and the microservice's logic are advanced.

What might a file-management microservice in a commercial application look like, according to good design-pattern rules?

Best Answer

A goal of a service in SOA is to perform some logic which, otherwise, would have to be duplicated among other services. That logic is, in turn, the reason for the service to exist.

In your case, you add no value through the service. Any other service could easily access S3 directly, given that S3 is itself a service (it is not your service, but that is irrelevant here).

Not only is there no added value from the intermediary service, but its cost is rather prohibitive. You may imagine you can draft one in a few hours, which is true. However, to have something that works reliably and handles all the edge cases, you have to deal with caching, redundancy, interrupted uploads and recovery, large files, timeouts, brute force and DDoS, special characters in parameters, permissions and IAM, and dozens of other pretty things.

Therefore, you're about to spend two or three weeks on development and possibly one week on debugging (YMMV; I don't know your experience with HTTP, SOA and the technology you want to use to create your service, but even if you're highly experienced, you shouldn't expect to spend less than two weeks on this, which is a lot).

I just wanted to know how it's done in commercial/big projects.

Companies pay on average $700 per day for a highly experienced developer, so given my estimate of two weeks, you're at $7,000 for this task. Commercial projects are about money; when there is an opportunity to save an amount even as small as $7,000, they'll take it.

In addition to this direct cost, there is the cost of hosting and code maintenance. This service will have to be maintained for years, adding much more to the bill than the original price. Again, all this money is wasted without a clear understanding of what such a service could save the company. It doesn't save bandwidth or other resources, and it doesn't reduce the amount you'll pay to Amazon, so...

This is not all. The cost of maintenance of all the projects which depend on the intermediary service will also increase. If a service:

  • Has to be patched and the patch requires an interface change,
  • Has to be moved to another location, with a change in its URL,
  • Is down, requiring to have a circuit breaker to ensure the client service doesn't go down in turn,
  • Is deprecated, requiring to migrate the client to something else,

then immediate and unforeseen maintenance is required, and it is usually costly as well. Now, it is much more likely that those four things happen to your service than to Amazon's S3. Not because you are a bad developer, no, but because Amazon's scale is slightly different from the scale of your service, which means they have much more workforce to pay to ensure that clients can rely on S3.
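The circuit breaker mentioned above can be reduced to a very small idea: after N consecutive failures, stop calling the dependency and fail fast with a fallback. A minimal sketch (the class name and the threshold policy are my own simplifications; production code would also reopen the circuit after a timeout):

```java
import java.util.function.Supplier;

// Minimal circuit breaker: trips open after `threshold` consecutive failures.
public class CircuitBreaker {
    private final int threshold;
    private int failures = 0;

    public CircuitBreaker(int threshold) { this.threshold = threshold; }

    public boolean isOpen() { return failures >= threshold; }

    public <T> T call(Supplier<T> dependency, T fallback) {
        if (isOpen()) return fallback;        // open: skip the failing service entirely
        try {
            T result = dependency.get();
            failures = 0;                     // a success resets the counter
            return result;
        } catch (RuntimeException e) {
            failures++;                       // count the failure; fall back for now
            return fallback;
        }
    }
}
```

This is exactly the kind of client-side machinery the intermediary service forces every consumer to carry.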

Finally, many developers have prior experience with Amazon AWS (and possibly S3). This means that when you hire a new guy, he can easily understand how a service is storing files if it uses S3 directly. If you add a level of indirection, this benefit is lost. And, more importantly, every time someone has a problem with the storage, he would need to ask himself if the problem comes from the client service, from your intermediary or from S3. This adds to the debugging time.

So:

Is it a clean, design-pattern-compliant approach?

No. Add services when they add value. Don't make things more complex than they need to be (KISS) and, specifically, don't add layers which bring no benefits.

What is good/bad about that approach?

What is good: the fact that you provide an interface which is much simpler compared to S3. For developers unfamiliar with AWS, S3 can be quite complex.

What is bad: everything discussed above.

How could I do this in a better way? (I don't want to store file content on the machine where the application is running [locally].)

By calling S3 directly.
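Concretely, with the AWS SDK for Java 2.x on the classpath and credentials configured in the environment, microservice A can upload its file in a few lines; the bucket name and key below are assumptions taken from the question's example URL:

```java
import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

import java.nio.file.Paths;

public class DirectS3Upload {
    public static void main(String[] args) {
        // Region and credentials come from the environment (profile, env vars, IAM role).
        S3Client s3 = S3Client.create();

        // Upload the file straight to S3 -- no intermediary service involved.
        s3.putObject(
            PutObjectRequest.builder()
                .bucket("testapp")                       // hypothetical bucket
                .key("pictures/2018-09/picture.jpg")     // hypothetical key
                .build(),
            RequestBody.fromFile(Paths.get("user_attachment.jpg")));

        s3.close();
    }
}
```

Any metadata the caller cares about (name, size, extension) is either known at upload time or retrievable from S3 itself, so the file table buys you nothing here.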
