The goal of a service in SOA is to perform some logic which would otherwise have to be duplicated among other services. That logic is, in turn, the service's reason to exist.
In your case, you add no value through the service. Any other service could easily access S3 directly, given that S3 is itself a service (it is not your service, but that is irrelevant here).
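For instance, a client service that wants to store a file can talk to S3 in a handful of lines using the AWS SDK; the sketch below uses the JavaScript/TypeScript SDK v3, and the bucket and key names are purely illustrative:

```typescript
// Minimal sketch: any service can talk to S3 directly through the AWS SDK
// (JavaScript/TypeScript v3 shown here). Bucket and key names are illustrative.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";

const s3 = new S3Client({ region: "us-east-1" });

async function storeReport(tenantId: string, body: Buffer): Promise<void> {
  // Credentials come from the environment or an IAM role, not from an
  // intermediary service of your own.
  await s3.send(new PutObjectCommand({
    Bucket: "acme-reports",                  // hypothetical bucket
    Key: `reports/${tenantId}/latest.pdf`,   // hypothetical key layout
    Body: body,
    ContentType: "application/pdf",
  }));
}
```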
Not only is there no added value in the intermediary service, but its cost is rather prohibitive. You may imagine that you can draft one in a few hours, which is true. However, in order to have something that works reliably and handles all the edge cases, you have to deal with caching, redundancy, interrupted uploads and recovery, large files, timeouts, brute force and DDoS, special characters in parameters, permissions and IAM, and dozens of other pretty things.
Therefore, you're about to spend two or three weeks of development and possibly one week of debugging (YMMV; I don't know your experience in HTTP, SOA and the technology you want to use to create your service, but even if you're highly experienced, you shouldn't expect to spend less than two weeks on that, which is a lot).
> I just wanted to know how it's done in commercial/big projects.
Companies pay on average $700/day for a highly experienced developer, so given my estimate of two weeks, you're at $7,000 for the task. Commercial projects are about money; when there is an opportunity to save an amount even as small as $7,000, they'll take it.
In addition to this direct cost, there is a cost in terms of hosting and code maintenance. This service will have to be maintained for years, adding to the bill much more than the original price. Again, all this money is wasted without a clear understanding of what such a service would save the company. It saves neither bandwidth nor other resources, and it doesn't reduce the amount you will pay to Amazon, so...
This is not all. The cost of maintenance of all the projects which depend on the intermediary service will also increase. If a service:
- Has to be patched and the patch requires an interface change,
- Has to be moved to another location, with a change in its URL,
- Is down, requiring a circuit breaker (sketched below) to ensure the client service doesn't go down in turn,
- Is deprecated, requiring to migrate the client to something else,
then immediate, unforeseen maintenance is required, and it is usually costly as well. Now, it is much more likely that those four things happen to your service than to Amazon's S3. Not because you are a bad developer, no. But because Amazon's scale is slightly different from the scale of your service, which means they have a much larger workforce to pay to ensure that clients can rely on S3.
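As a rough illustration of the third point, here is the kind of circuit breaker each client service would have to carry around calls to your intermediary; the thresholds and timings below are arbitrary, illustrative values:

```typescript
// Rough sketch of a circuit breaker a client service would need around the
// intermediary. Thresholds and timings are arbitrary illustrative values.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly maxFailures = 5,       // trip after 5 consecutive failures
    private readonly resetAfterMs = 30_000  // allow a retry after 30 seconds
  ) {}

  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.failures >= this.maxFailures &&
        Date.now() - this.openedAt < this.resetAfterMs) {
      throw new Error("circuit open: intermediary storage service unavailable");
    }
    try {
      const result = await fn();
      this.failures = 0;                    // a success closes the circuit
      return result;
    } catch (err) {
      this.failures++;
      if (this.failures === this.maxFailures) this.openedAt = Date.now();
      throw err;
    }
  }
}
```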
Finally, many developers have prior experience with Amazon AWS (and possibly S3). This means that when you hire a new developer, they can easily understand how a service stores files if it uses S3 directly. If you add a level of indirection, this benefit is lost. And, more importantly, every time someone has a problem with the storage, they need to ask themselves whether the problem comes from the client service, from your intermediary, or from S3. This adds to the debugging time.
So:
> Is it a clean/good-design-patterns-compliant approach?
No. Add services when they add value. Don't make things more complex than they need to be (KISS) and, specifically, don't add layers which bring no benefits.
> What is good/bad about that approach?
What is good: you provide an interface which is much simpler than S3's. For developers unfamiliar with AWS, S3 can be quite complex.
What is bad: everything described above.
> How could I do that in a better way? (don't want to store files (content) on the machine where the application is running [locally])
By calling S3 directly.
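If the worry behind the question is keeping file content off the machine where the application runs, one common option is to hand callers a short-lived presigned URL so they upload straight to S3. A minimal sketch using the SDK's request presigner (the bucket name and expiry below are illustrative):

```typescript
// Sketch: give the caller a short-lived presigned URL so the file content
// never passes through your own machine. Names are illustrative.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "us-east-1" });

async function uploadUrlFor(key: string): Promise<string> {
  const command = new PutObjectCommand({ Bucket: "acme-uploads", Key: key });
  // URL is valid for 15 minutes; the client PUTs the file to it directly.
  return getSignedUrl(s3, command, { expiresIn: 900 });
}
```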
Bottom Line Up Front: You will likely have to start with a compromise.
Microservices and multi-tenancy are hard. You have to consider the trade-offs in cost to build, run, and maintain your solutions; the answers are going to conflict with what makes the system more robust and secure. The challenge is to figure out where your project needs to start, and what compromises you have to accept for the moment.
There are a couple of axioms to keep in mind:
- Complexity and cost are directly related. The more complex something is, the more expensive it will be to build it and maintain it.
- Isolated systems are generally safer, but they are also more complex. When two tenants' data never touch, one tenant can't affect the other.
- We are not all Facebook. Most companies have to worry about cost more than about isolation and the complexity that comes with it.
When you start breaking down the different topics, you are going to find that what is more correct for one answer is less correct for another. For example, your first topic and your second topic have different answers.
Maintainability
One thing is easier to maintain than several things. That goes even more for your database.
Having one large shared database cluster makes the following easier to manage:
- Backup/Restore
- Load balancing a cluster
At least it will, up to a point. The problem you may run into is that one of your application's tenants has vastly greater demands than another. If your database is a shared resource between the tenants, you will eventually reach the situation where your heaviest users impact your service to the other tenants. That may not be something you have to worry about on day one.
Impact of Disasters
If your database goes down, you will need to restore the database server and then restore the latest backup.
- All tenants served by the database server that went down are affected.
  - One database for all tenants means all your customers are affected.
  - Separate databases for each tenant means only that tenant is affected.
- Some databases are designed to scale out.
  - Sharding spreads the data across multiple nodes in a cluster.
  - Replication adds redundancy to your data spread across those nodes.
  - These are designed to allow a single node to be lost and replaced without any loss of data or service.
It's worth looking into databases that are designed to scale out. Examples would be Apache Cassandra, MongoDB, RavenDB, etc. Most NoSQL databases are designed around this concept. The upshot is that you have one "logical" database, but multiple processing nodes allow you to expand capacity as you need. It might be a worthwhile compromise to simplify your data design while having the robustness and safety you need.
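As a rough illustration of the scale-out idea, here is how a keyspace with a replication factor of 3 could be created from Node.js with the cassandra-driver package, so that every row lives on three nodes and a single node can be lost without losing data; the contact points, data center, and keyspace name below are made up:

```typescript
// Sketch of the scale-out idea with Cassandra's Node.js driver: one logical
// keyspace whose data is replicated across 3 nodes. Names are illustrative.
import { Client } from "cassandra-driver";

const cassandra = new Client({
  contactPoints: ["10.0.0.11", "10.0.0.12", "10.0.0.13"],
  localDataCenter: "dc1",
});

async function createKeyspace(): Promise<void> {
  await cassandra.connect();
  // Replication factor 3: every row is stored on three nodes of the cluster,
  // so a single node can fail and be replaced without loss of data or service.
  await cassandra.execute(`
    CREATE KEYSPACE IF NOT EXISTS app
    WITH replication = { 'class': 'NetworkTopologyStrategy', 'dc1': 3 }
  `);
}
```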
Feasibility of multi-tenant database approach
That's something you'll have to evaluate. The approaches you are weighing against each other are:
- One database for everything
- One database per tenant
- One database per micro-service
- One database per micro-service per tenant (the utmost in isolation)
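To make the comparison a little more concrete, here is one hypothetical way to keep the choice invisible to the rest of the code: the application asks a small factory for a connection, and only the factory knows whether that means one shared cluster or one database per tenant. All names and connection strings below are invented:

```typescript
// Hypothetical sketch: the application asks a factory for "the database",
// and only the factory decides whether that means a shared cluster or a
// per-tenant database. All names and connection strings are made up.
interface DbConnection {
  query(sql: string, params?: unknown[]): Promise<unknown[]>;
}

// Stand-in for a real driver; in practice this would open a pool per URL.
function connectTo(url: string): DbConnection {
  return { query: async () => { console.log(`querying ${url}`); return []; } };
}

type ConnectionFactory = (tenantId: string) => DbConnection;

// "One database for everything": every tenant maps to the shared cluster.
const sharedFactory: ConnectionFactory = () =>
  connectTo("postgres://shared-cluster/app");

// "One database per tenant": route by tenant id.
const perTenantFactory: ConnectionFactory = (tenantId) =>
  connectTo(`postgres://db-${tenantId}/app`);
```

Because the rest of the code only ever sees a `DbConnection`, moving from one alternative to another is largely a configuration decision rather than a rewrite.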
To perform a useful analysis of alternatives, you need to define:
- Key performance areas/Requirements -- know what is important for your app
- Cost of the solution
- T-shirt size estimates of what it would take to implement each approach
Create the chart, see how each approach hits those check marks, and then make a decision. Remember the axiom about complexity and cost being directly related? The decision you have to make right now may not be what the pundits say is the most correct thing. You have to live within budget constraints. As your application brings in more revenue, your budget will increase, which will allow you to update your system in ways you can't consider right now.
Security
Security is a complicated topic with so many facets that, again, you have to make decisions based on the real legal requirements in your country, or on what your clients demand. Below are just a few security-related concepts:
- Non-repudiation (i.e. a user cannot deny the actions they performed)
- Auditing (i.e. you can reconstruct the actions a user performed to find bad actors)
- Data protection (i.e. a user cannot see information they are not allowed to see)
- Infrastructure security (i.e. network access, file access, etc. are properly protected)
- Data encryption (i.e. a user cannot discover someone else's data by sniffing network packets)
There is even more than that. Many security aspects will be constant across your alternatives (like encryption, infrastructure security, etc.). However, data protection is easier to guarantee if your database does not contain data from multiple tenants. That may not matter if users can't access the database directly.
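One simple, low-cost mitigation when tenants do share a database is to make the data access layer apply the tenant filter in exactly one place, so callers cannot forget it. A sketch, with invented table and column names:

```typescript
// Sketch: a repository that always scopes queries by tenant, so shared-database
// code cannot accidentally return another tenant's rows. Names are invented.
interface Db {
  query(sql: string, params?: unknown[]): Promise<unknown[]>;
}

class TenantScopedInvoices {
  constructor(private readonly db: Db, private readonly tenantId: string) {}

  async listInvoices(): Promise<unknown[]> {
    // The tenant filter is applied here, once, rather than in every caller.
    return this.db.query(
      "SELECT * FROM invoices WHERE tenant_id = $1",
      [this.tenantId]
    );
  }
}
```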
When dealing with security concerns, it's best to understand what you are actually required to handle:
- Are there legal requirements you need to comply with? (The UK and several other countries have very strict user privacy laws, while other countries do not.)
- Are there standards your clients demand?
- Are there simple and low cost things you can do to improve security?
Even when you consider user privacy laws, the security demands of a bank or health care system are going to be much greater than those needed for a social networking app.
Summary
Your team (manager included) needs to define the following:
- Requirements -- what your multi-tenant application really needs, including the security requirements
- Constraints -- budget, schedule, tools (some shops will define tools that cannot be used, and others may define tools that must be used)
- Key Performance Areas -- includes performance criteria, management support, etc.
Without those, you won't be able to settle on something that fits the unique demands of your application. The most correct thing is going to be a bit different for each application, because the unique requirements and constraints you have to work with influence what that actually is.
Best Answer
I think option 2 is not a bad one, but it may not be needed. Microservices are for letting you deal with the needs of multiple applications.
A big factor here is whether there is any difference between the two schemas, and whether there ever will be in the future.
Usually, I think using interfaces for repositories is unnecessary; however, it might be worth the effort in this instance. Repository factories will be important for you.
My issue with option 1 is that it is too specific. You should be able to go easily from the setup you described to two separate instances, each pointing to its own DB. The application should NOT CARE WHERE IT IS GETTING ITS DATA FROM.
While the schema does not differ between your two databases, you can have one repository deal with both, without the application knowing the difference.
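As an illustration only (the entity, region keys, and connection strings below are assumptions, not taken from the question), the factory could look something like this:

```typescript
// Illustrative sketch: one repository implementation, two configurations.
// Entity shape, region keys, and connection strings are all assumptions.
interface Customer { id: string; name: string; }

interface CustomerRepository {
  findById(id: string): Promise<Customer | undefined>;
}

class SqlCustomerRepository implements CustomerRepository {
  constructor(private readonly connectionString: string) {}

  async findById(id: string): Promise<Customer | undefined> {
    // Same schema, same queries; only the connection differs per region.
    // (Query execution omitted; plug in your actual data access here.)
    console.log(`SELECT ... FROM customers WHERE id = '${id}' on ${this.connectionString}`);
    return undefined;
  }
}

// The factory is the only place that knows which region maps to which DB.
function customerRepositoryFor(region: "US" | "UK"): CustomerRepository {
  const connectionString = region === "US"
    ? "Server=us-db;Database=app"
    : "Server=uk-db;Database=app";
  return new SqlCustomerRepository(connectionString);
}
```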
If the DB schemas ever become disparate between the US and the UK, you would then split the functionality into two completely different repositories. This would be easy, since all you would have to do is change your factory.