Database Architecture – Is Functionality in DB a Roadblock to Scalability?

application-design, architecture, database

I may not have given this question the right title, but here it is:

We are developing a financial portal for wealth management. We are expecting over 10,000 clients to use the application. The portal calculates various performance analytics based on technical analysis of the stock market.

We developed a lot of the functionality through stored procedures, user-defined functions, triggers, etc. in the database. We thought we could gain a huge performance boost by doing things directly in the database rather than in C# code - and we actually did get a huge performance boost.

When I tried to brag about the achievement to our CTO, he questioned my decision to implement the functionality in the database rather than in code. According to him, such applications suffer scalability problems. In his words: "These days things are kept in memory/cache. Clustered data is hard to manage over time. Facebook, Google have nothing in database. It is the era of thin servers and thick clients. DB is used only to store plain data and functionality should be completely decoupled from the database."

Can you guys please give me some suggestions as to whether what he says is right? How should I go about architecting such an application?

Best Answer

In short, I would agree with your CTO. You've probably gained some performance at the expense of scalability (if those terms are confusing, I'll clarify below). My two biggest worries would be maintainability and lack of options to scale horizontally (assuming you are going to need that).

Proximity to data: Let's take a step back. There are some good reasons for pushing code into a DB. I would argue that the biggest one is proximity to the data - for example, if you are expecting a calculation to return a handful of values, but these are aggregations of millions of records, sending the millions of records (on demand) over the network to be aggregated elsewhere is hugely wasteful, and could easily kill your system. Having said this, you could achieve this proximity to data in other ways, essentially using caches or analysis DBs where some of the aggregation is done upfront.
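To make that concrete, here is a rough sketch (the Trades table and its columns are invented for illustration) of keeping an aggregation next to the data with plain parametrised SQL issued from C#, rather than dragging the raw rows over the network - and without needing a stored procedure:

```csharp
// Hypothetical example - the Trades table and its columns are invented.
using System;
using Microsoft.Data.SqlClient;

public static class PortfolioStats
{
    // Aggregate in the database and return a single value to the application,
    // instead of streaming millions of trade rows over the network.
    public static decimal TotalProfitLoss(string connectionString, int clientId, DateTime from, DateTime to)
    {
        const string sql = @"
            SELECT COALESCE(SUM(ProfitLoss), 0)
            FROM Trades
            WHERE ClientId = @clientId
              AND TradeDate >= @from
              AND TradeDate <  @to;";

        using var connection = new SqlConnection(connectionString);
        using var command = new SqlCommand(sql, connection);
        command.Parameters.AddWithValue("@clientId", clientId);
        command.Parameters.AddWithValue("@from", from);
        command.Parameters.AddWithValue("@to", to);

        connection.Open();
        // Only the aggregated value crosses the wire.
        return Convert.ToDecimal(command.ExecuteScalar());
    }
}
```

Note that this is ordinary parametrised SQL, so the execution plan still gets cached, which leads into the next point.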

Performance of code in the DB: Secondary performance effects, such as caching of execution plans, are harder to argue for. Sometimes cached execution plans can be a very negative thing, if the wrong execution plan was cached. Depending on your RDBMS, cached plans may help, but in most cases you won't gain much over parametrised SQL (those plans typically get cached, too). I would also argue that most compiled or JIT'ed languages typically perform better than their SQL equivalents (such as T-SQL or PL/SQL) for basic operations and non-relational programming (string manipulation, loops, etc.), so you wouldn't be losing anything there if you used something like Java or C# to do the number crunching. Fine-grained optimisation is also pretty difficult - on the DB, you're often stuck with a generic B-tree (index) as your only data structure. To be fair, a full analysis, including things like longer-running transactions, lock escalation, etc., could fill books.
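As a small illustration of the data-structure point, here is a sketch (the Quote type and the calculation are invented for illustration) of number crunching that leans on a hash table - a one-liner in C#, with no natural T-SQL equivalent:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Invented record type for illustration.
public record Quote(string Ticker, DateTime Date, decimal Close);

public static class Analytics
{
    // Compute each quote's return relative to the previous day's close, using a
    // hash table (Dictionary) keyed by (ticker, date) for O(1) lookups per row.
    // Assumes one quote per ticker per calendar day; weekends/holidays ignored for brevity.
    public static IEnumerable<(Quote Quote, decimal Return)> DailyReturns(IReadOnlyList<Quote> quotes)
    {
        var closeByKey = quotes.ToDictionary(q => (q.Ticker, q.Date), q => q.Close);

        foreach (var quote in quotes)
        {
            if (closeByKey.TryGetValue((quote.Ticker, quote.Date.AddDays(-1)), out var previousClose)
                && previousClose != 0)
            {
                yield return (quote, (quote.Close - previousClose) / previousClose);
            }
        }
    }
}
```

The Dictionary gives constant-time lookups per row; the closest T-SQL shape would be a self-join or APPLY against an index, which is a rather different cost model.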

Maintainability: SQL is a wonderful language for what it was designed to do. I'm not sure it's a great fit for application logic. Most of the tooling and practices that make our lives bearable (TDD, refactoring, etc) are difficult to apply to database programming.
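To make the tooling point concrete: once the number crunching is ordinary C# (like the DailyReturns sketch above), it can be covered by a plain unit test - xUnit here, purely as an example - with no database, no test-data scripts, and refactoring tools that actually understand the code:

```csharp
using System;
using Xunit;

public class DailyReturnsTests
{
    [Fact]
    public void Return_is_relative_change_from_previous_close()
    {
        // Two consecutive days for one (invented) ticker.
        var quotes = new[]
        {
            new Quote("ABC", new DateTime(2012, 1, 2), 100m),
            new Quote("ABC", new DateTime(2012, 1, 3), 110m),
        };

        var result = Assert.Single(Analytics.DailyReturns(quotes));

        // (110 - 100) / 100 = 10% return for the second day.
        Assert.Equal(0.1m, result.Return);
    }
}
```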

Performance versus scalability: To clarify these terms, I mean this: performance is how quick you'd expect a single request to go through your system (and back to the user), for the moment assuming low load. This will often be limited by things like the number of physical layers it goes through, how well optimised those layers are, etc. Scalability is how performance changes with an increasing number of users / load. You may have medium / low performance (say, 5 seconds+ for a request), but awesome scalability (able to support millions of users). In your case, you will probably experience good performance, but your scalability will be bounded by how big a server you can physically build. At some point, you will hit that limit, and be forced to turn to things like sharding, which may not be feasible depending on the nature of the application.

Premature Optimisation: Ultimately, I think you've made the mistake of optimising prematurely. As others have pointed out, you don't really have measurements showing how the other approaches would work. Well, we can't always build full-scale prototypes to prove or disprove a theory... But in general, I'd always be hesitant to choose an approach which trades maintainability (probably the most important quality of an application) for performance.

EDIT: On a positive note, vertical scaling can stretch quite far in some cases. As far as I know, SO ran on a single server for quite some time. I'm not sure how it matches up to your 10 000 users (I guess it would depend on the nature of what they are doing in your system), but it gives you an idea of what can be done (actually, there are far more impressive examples, this just happens to be a popular one people can easily understand).

EDIT 2: To clarify and comment on a few things raised elsewhere:

  • Re: Atomic consistency - ACID consistency may well be a requirement of the system. The above doesn't really argue against that; realise that ACID consistency doesn't require you to run all your business logic inside the DB. By moving code which does not need to be there into the DB, you're constraining it to run in the physical environment of the rest of the DB - it's competing for the same hardware resources as the actual data-management portion of your DB. As for scaling only the code out to other DB servers (but not the actual data) - sure, this may be possible, but what exactly are you gaining, apart from additional licensing costs in most cases? Keep things that don't need to be on the DB off the DB (there's a small transaction sketch after this list).
  • Re: SQL / C# performance - since this seems to be a topic of interest, let's add a bit to the discussion. You can certainly run native / Java / C# code inside DBs, but as far as I know, that's not what was being discussed here - we're comparing implementing typical application code in something like T-SQL versus something like C#. There are a number of problems which have been difficult to solve with relational code in the past - e.g. consider the "maximum concurrent logins" problem, where you have records indicating a login or logout and the time, and you need to work out the maximum number of users logged in at any one time. The simplest possible solution is to iterate through the records, incrementing / decrementing a counter as you encounter logins / logouts, and keeping track of the maximum of that value (there's a sketch of exactly this after the list). It turns out that unless your DB supports a certain sliding-window aggregation (which SQL Server 2008 didn't; 2012 may, I don't know), the best you can do is a CURSOR (the purely relational solutions are all on different orders of complexity, and attempting to solve it using a while loop results in worse performance). In this case, yes, the C# solution is actually faster than what you can achieve in T-SQL, period. That may seem far-fetched, but this problem can easily manifest itself in financial systems, if you are working with rows representing relative changes and need to calculate windowed aggregations on those. Stored proc invocations also tend to be more expensive - invoke a trivial SP a million times and see how that compares to calling a C# function. I hinted at a few other examples above - I haven't yet seen anyone implement a proper hash table in T-SQL (one which actually gives some benefits), while it is pretty easy to do in C#. Again, there are things that DBs are awesome at, and things that they're not so awesome at. Just like I wouldn't want to be doing JOINs, SUMs and GROUP BYs in C#, I don't want to be writing anything particularly CPU-intensive in T-SQL.
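On the consistency point above, a rough sketch of what I mean by keeping the logic out of the DB without giving up atomicity. Everything here (RebalancingService, the Holdings table, the allocation rule) is invented for illustration; the point is just that the calculation runs in C# while the two writes still commit or roll back together:

```csharp
using System;
using System.Transactions;
using Microsoft.Data.SqlClient;

public static class RebalancingService
{
    // Hypothetical example: the allocation maths runs in application code, yet the
    // two updates remain atomic on the database side.
    public static void Rebalance(string connectionString, int portfolioId, decimal targetEquityWeight)
    {
        // Business logic in C# (names and rules invented for illustration).
        decimal adjustment = ComputeAdjustment(portfolioId, targetEquityWeight);

        using var scope = new TransactionScope();
        using var connection = new SqlConnection(connectionString);
        connection.Open();

        using (var debit = new SqlCommand(
            "UPDATE Holdings SET Amount = Amount - @adj WHERE PortfolioId = @id AND AssetClass = 'Bonds';",
            connection))
        {
            debit.Parameters.AddWithValue("@adj", adjustment);
            debit.Parameters.AddWithValue("@id", portfolioId);
            debit.ExecuteNonQuery();
        }

        using (var credit = new SqlCommand(
            "UPDATE Holdings SET Amount = Amount + @adj WHERE PortfolioId = @id AND AssetClass = 'Equity';",
            connection))
        {
            credit.Parameters.AddWithValue("@adj", adjustment);
            credit.Parameters.AddWithValue("@id", portfolioId);
            credit.ExecuteNonQuery();
        }

        scope.Complete(); // both updates commit together, or neither does
    }

    private static decimal ComputeAdjustment(int portfolioId, decimal targetEquityWeight)
    {
        // Placeholder for the actual calculation.
        return 1000m * targetEquityWeight;
    }
}
```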
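And on the "maximum concurrent logins" point, the counter sweep described above is a few lines of ordinary C#. The SessionEvent type is made up (+1 for a login, -1 for a logout), but the shape of the solution is the whole point:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Invented event type: Delta is +1 for a login, -1 for a logout.
public record SessionEvent(DateTime Timestamp, int Delta);

public static class ConcurrencyStats
{
    // Single pass over the events in time order: keep a running count of open
    // sessions and remember the highest value seen.
    public static int MaxConcurrentLogins(IEnumerable<SessionEvent> events)
    {
        int current = 0;
        int max = 0;

        foreach (var evt in events.OrderBy(e => e.Timestamp))
        {
            current += evt.Delta;
            if (current > max)
            {
                max = current;
            }
        }

        return max;
    }
}
```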