Database Scalability – How Splitting a Database Table Works on Replicated SQL Servers

Tags: data-replication, database, delphi, sql-server

I would like to know more about the general concepts behind dividing the data of a database across different servers. For example, suppose I have a SQL Server database which has a massive table. Assume one single server cannot handle the amount of data in this table. I would like to break that table down and split its contents among different servers, each with an identical (replicated) database.

Now, suppose I have a single SQL (ADO) connection to any one of these servers, and choose to select records from this particular table. Since the data is in separate databases on separate servers, I need to gather the records from all the various servers and combine them into one.

I'm sure there are standard ways of handling this, and I'm willing to go another route than a direct SQL Server connection (I plan to wrap it in my own HTTP Server API anyway), but I'm still going to use SQL Server as the actual engine. What is the most standard practice, and where can I learn more about it?

Best Answer

I think ultimately what you are looking for is an explanation of the different ways that SQL Server can be used to handle extremely large tables and data sets that are spread or replicated across database server instances. There are multiple approaches to this problem and none of them is a silver bullet, but understanding the basics of each is a good way to make the right architecture choice.

Sharding

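Database sharding is a way to take an enormous data set and partition it horizontally into a number of smaller pieces (shards), each holding a subset of the rows. Strictly speaking, shards are separate databases that usually live on separate servers; splitting a table across filegroups inside a single database is table partitioning, which applies the same idea on one server. Those filegroups can be stored on different disks or, better yet, on a RAID-based Storage Area Network.

The bottom line here is that physical IO performance is significantly better when you are accessing records from only a few small shards or filegroups than when you are reading from one enormous filegroup in an unsharded dataset.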

SQL Server does not have a cheap or easy way to do this out of the box, nor do I know of any easy or cheap way to do this with other major database providers. Once you start talking about shards, you are playing a different ball game altogether.
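To make the sharding idea a bit more concrete, here is a minimal sketch of what the application side (for example, the HTTP API layer the question mentions) might do. Everything in it is an assumption for illustration: in-memory SQLite databases stand in for separate SQL Server instances, and the `orders` table, the `customer` shard key, and the function names are hypothetical. It shows the two basic operations: routing a query to one shard by key, and gathering rows from every shard and combining them.

```python
import sqlite3
import zlib

def make_shards(count):
    """Create `count` independent databases; each stands in for one server."""
    shards = []
    for _ in range(count):
        conn = sqlite3.connect(":memory:")
        conn.execute(
            "CREATE TABLE orders ("
            "order_id INTEGER PRIMARY KEY, customer TEXT, total REAL)"
        )
        shards.append(conn)
    return shards

def shard_for(customer, count):
    """Map a shard key to a shard index with a stable hash (CRC32)."""
    return zlib.crc32(customer.encode("utf-8")) % count

def insert_order(shards, order_id, customer, total):
    """Write the row only to the shard that owns this customer."""
    conn = shards[shard_for(customer, len(shards))]
    conn.execute(
        "INSERT INTO orders (order_id, customer, total) VALUES (?, ?, ?)",
        (order_id, customer, total),
    )
    conn.commit()

def orders_for_customer(shards, customer):
    """Single-shard read: the key tells us which server to ask."""
    conn = shards[shard_for(customer, len(shards))]
    return conn.execute(
        "SELECT order_id, customer, total FROM orders "
        "WHERE customer = ? ORDER BY order_id",
        (customer,),
    ).fetchall()

def all_orders(shards):
    """Scatter-gather read: query every shard, then merge in the application."""
    rows = []
    for conn in shards:
        rows.extend(
            conn.execute("SELECT order_id, customer, total FROM orders").fetchall()
        )
    return sorted(rows)  # tuples sort by order_id first
```

Queries that include the shard key touch one server; queries that do not have to fan out to all of them, which is the main design cost to keep in mind when choosing the key.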

Distributed Partitioned Views

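A distributed partitioned view presents member tables that live on different servers as if they were a single table, by combining them with UNION ALL. When the data set is well partitioned, a query fetches only the records it needs, from only the servers that hold them, based on your partitioning strategy. There is good information on MSDN about this with SQL Server.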
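The shape of a partitioned view can be sketched without a SQL Server farm. The example below is only an analogy, under stated assumptions: it uses SQLite from Python, both member tables live in one in-memory database, and the table, column, and function names are hypothetical. In SQL Server the member tables would sit on linked servers and be referenced by four-part names, and the optimizer can use the CHECK constraints to skip member tables that cannot match; SQLite does not do that pruning, so this shows the structure only.

```python
import sqlite3

# Hypothetical mapping from partitioning value to member table.
MEMBER_TABLES = {2023: "orders_2023", 2024: "orders_2024"}

def build_partitioned_view():
    """Create two member tables and a UNION ALL view over them."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE orders_2023 (
            order_id   INTEGER PRIMARY KEY,
            order_year INTEGER NOT NULL CHECK (order_year = 2023),
            total      REAL
        );
        CREATE TABLE orders_2024 (
            order_id   INTEGER PRIMARY KEY,
            order_year INTEGER NOT NULL CHECK (order_year = 2024),
            total      REAL
        );
        CREATE VIEW all_orders AS
            SELECT order_id, order_year, total FROM orders_2023
            UNION ALL
            SELECT order_id, order_year, total FROM orders_2024;
        """
    )
    return conn

def add_order(conn, order_id, order_year, total):
    """Insert into the member table that owns this year's rows."""
    table = MEMBER_TABLES[order_year]  # KeyError for an unknown partition
    conn.execute(
        f"INSERT INTO {table} (order_id, order_year, total) VALUES (?, ?, ?)",
        (order_id, order_year, total),
    )
    conn.commit()

def orders_in_year(conn, order_year):
    """Callers query the view and never name a member table."""
    return conn.execute(
        "SELECT order_id, order_year, total FROM all_orders "
        "WHERE order_year = ? ORDER BY order_id",
        (order_year,),
    ).fetchall()
```

The CHECK constraints are what define the partitioning: each member table rejects rows that belong to another partition, so the view never returns the same logical row from two places.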

Active Clustering

The concept of clustering is to take multiple SQL Server instances as nodes and have them work together on a single database, synchronized so that partitions on different filegroups are read and written in a way that does not corrupt or damage the data, and in a way that scales massively.

Peer to Peer Replication

In this setup every database server has its own instance, and each should be configured to affect only its own little area of the schema. Data is frequently out of date between them and needs to be synced through some outside process, and schema changes can be extremely difficult to implement. This is not the ideal way to handle this in your application.

I can't afford distributed partitions and massively scalable active clustering!

So you don't have upwards of a million to drop on some licenses and hardware for massively scalable solutions? There is always the Cloud. Azure and other SQL Server PaaS (Platform as a Service) solution providers can provide you the scalability that you need without the large upfront investment. This is generally the easiest route to go.
