Database Design for Scalability – Choosing the Right Schema and Data Model

database-design, scalability

We want to store genomic variant data, but there are some problems, the most important being the data's immense size and variability.

  1. Variant data can be huge. For example, a single individual's variant data could feasibly some day require a million rows of data in a table, or around a gigabyte of raw storage on disk. Multiply this over several thousand individuals, and you could potentially end up with terabytes' worth of information that you need to make sense of.

  2. Each client and/or system we integrate with will expose or want to see the data slightly differently depending on their needs and use cases. This can potentially lead to hundreds of fields that we might need to store, all of which might need to be in different configurations based on the client's needs. So this variant data model will need to keep this in mind in order to remain easy to use, expandable and, most importantly, scalable in the long term.

What do you think is better for such a problem? We were thinking of having some columns in each table that point to an external database, or even a file, where we save the huge BLOB data.
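Roughly what we had in mind, as a minimal sketch using Python and SQLite just to illustrate the idea (the table name, column names and file paths are made up):

    import sqlite3

    # Keep the small, frequently queried fields in the table and store only a
    # pointer (file path or object key) to the bulky raw data, not the BLOB itself.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE variant (
            individual_id INTEGER NOT NULL,
            chromosome    TEXT    NOT NULL,
            position      INTEGER NOT NULL,
            ref_allele    TEXT,
            alt_allele    TEXT,
            raw_data_uri  TEXT    -- e.g. a path or object-store key to the raw file
        )
    """)
    conn.execute(
        "INSERT INTO variant VALUES (?, ?, ?, ?, ?, ?)",
        (123, "chr1", 1014543, "A", "G", "/data/raw/ind123/chr1.vcf.gz"),
    )

Would something like this scale, or is there a better pattern for it?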

Best Answer

I don't know enough about your system but you need to look at the following:

1-How do you obtain the data, and in what format? Answering this will give you options for how to store it and load it initially, if you end up using a database.

2-How do you process this raw data? Answering this will help you figure out the size of the 'active' set, which will also help in deciding how to store and load the data. You may find that you don't need the entire input record and only need a few of its fields. If most of the fields are not used, you can keep them in separate archived storage (see the sketch after these three questions).

3-How do you query this data (online/batch, and what criteria are most likely to be used)? Answering this will be the key factor in deciding how to store the data, what parts to keep online and what parts to keep offline. Oracle, for example, allows you to run SQL on text files without loading the files first. This could be a huge time saver, but of course it depends on your scenario.
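To make point 2 concrete, here is a hedged sketch in Python/SQLite of splitting each incoming record into a small 'active' table holding only the queried fields, with everything else pushed to archive storage. The field names (gene, impact and so on) are invented placeholders, not something you stated:

    import sqlite3

    conn = sqlite3.connect(":memory:")

    # 'Active' table: only the few columns that online queries actually need.
    conn.execute("""
        CREATE TABLE variant_active (
            variant_id    INTEGER PRIMARY KEY,
            individual_id INTEGER NOT NULL,
            gene          TEXT,
            impact        TEXT
        )
    """)

    # Everything else from the raw record goes to archive storage; a second table
    # is used here, but it could just as well be compressed files referenced by id.
    conn.execute("""
        CREATE TABLE variant_archive (
            variant_id INTEGER PRIMARY KEY,
            raw_record TEXT   -- the full original record, rarely touched
        )
    """)

    def load_record(record):
        """Split one incoming raw record into its active and archived parts."""
        conn.execute(
            "INSERT INTO variant_active VALUES (?, ?, ?, ?)",
            (record["id"], record["individual_id"], record["gene"], record["impact"]),
        )
        conn.execute(
            "INSERT INTO variant_archive VALUES (?, ?)",
            (record["id"], repr(record)),
        )

    load_record({"id": 1, "individual_id": 123, "gene": "BRCA2", "impact": "HIGH"})

The point is simply that the online row stays narrow; whether the archive lives in another table, another database or flat files depends on your answers to the questions above.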

As per your point:

single individual's variant data could feasibly some day require a million rows of data in a table

I really don't understand how this is possible. If it is accurate, I am not sure how it will be used. Maybe you need to separate the concept of mere data storage from the question of which parts of the data will actually be used. If you understand more about how the data will be used, you may be able to cut down the number of rows by aggregation or a similar technique.
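As a rough illustration of what I mean by aggregation, assuming for the sake of example that users query at the individual/gene level rather than per variant (Python/SQLite sketch, invented column names):

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE variant (individual_id INTEGER, gene TEXT, impact TEXT)")
    conn.executemany(
        "INSERT INTO variant VALUES (?, ?, ?)",
        [(1, "BRCA1", "HIGH"), (1, "BRCA1", "LOW"), (1, "TP53", "LOW")],
    )

    # Instead of keeping one row per variant online, keep one row per
    # individual/gene with counts, and push the detail rows to cheaper storage.
    conn.execute("""
        CREATE TABLE variant_summary AS
        SELECT individual_id,
               gene,
               COUNT(*) AS variant_count,
               SUM(CASE WHEN impact = 'HIGH' THEN 1 ELSE 0 END) AS high_impact_count
        FROM variant
        GROUP BY individual_id, gene
    """)
    print(conn.execute("SELECT * FROM variant_summary").fetchall())
    # e.g. [(1, 'BRCA1', 2, 1), (1, 'TP53', 1, 0)]

Whether this is the right level of aggregation depends entirely on how the data will be used, which is the point above.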

In short, much analysis is required before a solution can be found. The guiding principles are:

1-Know your data well

2-Cut down on row size by keeping only the needed columns and linking to off-line storage when possible

3-Cut down on total row numbers by aggregation when possible

4-Use table partitioning and avoid excessive indexing

5-Know how the users need to use this data

6-Consider loading data as it arrives

7-You are probably going to need a star schema (fact and dimensions) to speed up queries, but we can't tell from just the information provided
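To give a rough idea of point 7, a star schema here might look something like the sketch below. This is only a guess at possible dimensions (individual and gene are assumptions on my part, not something you stated), again in Python/SQLite:

    import sqlite3

    conn = sqlite3.connect(":memory:")

    conn.executescript("""
        -- Dimension tables: small and descriptive.
        CREATE TABLE dim_individual (
            individual_key INTEGER PRIMARY KEY,
            external_id    TEXT,
            cohort         TEXT
        );
        CREATE TABLE dim_gene (
            gene_key   INTEGER PRIMARY KEY,
            symbol     TEXT,
            chromosome TEXT
        );

        -- Fact table: one (ideally aggregated) row per individual/gene, holding
        -- only the numeric measures that queries slice and dice.
        CREATE TABLE fact_variant (
            individual_key    INTEGER REFERENCES dim_individual(individual_key),
            gene_key          INTEGER REFERENCES dim_gene(gene_key),
            variant_count     INTEGER,
            high_impact_count INTEGER
        );
    """)

    # A typical query joins the small dimensions to the large fact table.
    query = """
        SELECT i.cohort, g.symbol, SUM(f.variant_count)
        FROM fact_variant f
        JOIN dim_individual i ON i.individual_key = f.individual_key
        JOIN dim_gene       g ON g.gene_key       = f.gene_key
        GROUP BY i.cohort, g.symbol
    """
    print(conn.execute(query).fetchall())   # empty until the fact table is loaded

In a real warehouse the fact table is also where partitioning (principle 4) would be applied, for example by individual or by chromosome, depending on the query criteria from point 3; SQLite is used here only to keep the sketch self-contained.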
