Let's say I want to store songs in my database. Instead of having just one Song table, I have ten tables. The Song table also has a foreign key to the Artist table. When an artist is added to the database, we check how many songs are in each of the Song tables, and we assign the table with the fewest songs to the artist. All of the artist's songs will be stored in that Song table.
I don't want to store 1M objects in one table, but rather divide them into ten pieces of around 100k objects each, and store these pieces in ten different but structurally identical tables. Now, provided that the song table reference in each artist object never changes, will my overall system be faster and have better performance?
I realize one big issue would be finding individual songs, but please answer this question under the context that songs can only be retrieved from the database by providing 2 parameters:
- artist_id
- song_id
If I have the artist_id, I can use it to get my artist object, which contains a reference to the song table holding the song with the given song_id. So I don't have to query ten different tables to find a song, as long as I have the artist_id, which will always be the case.
Is this going to be completely useless? Or will it have a positive impact on my system's performance?
Note: I realize songs should never be stored this way, since you would want to query songs without knowing the artist, but this is only for an example, albeit a poor one. Also, please ignore the fact that these would be a mess to code and manage when answering. I only want to know about the performance impacts.
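For concreteness, the scheme described in the question could be sketched like this (a minimal sqlite3 sketch; all table and column names are hypothetical, not from the question):

```python
import sqlite3

NUM_SONG_TABLES = 10  # the fixed ten-way split from the question

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE artist (artist_id INTEGER PRIMARY KEY,"
             " name TEXT, song_table INTEGER)")
for i in range(NUM_SONG_TABLES):
    conn.execute(f"CREATE TABLE song_{i} (song_id INTEGER PRIMARY KEY,"
                 " artist_id INTEGER, title TEXT)")

def add_artist(name):
    # Assign the new artist to the song table with the fewest rows.
    counts = [(conn.execute(f"SELECT COUNT(*) FROM song_{i}").fetchone()[0], i)
              for i in range(NUM_SONG_TABLES)]
    _, table = min(counts)
    cur = conn.execute("INSERT INTO artist (name, song_table) VALUES (?, ?)",
                       (name, table))
    return cur.lastrowid

def add_song(artist_id, song_id, title):
    # Look up which table this artist's songs live in, then insert there.
    (table,) = conn.execute("SELECT song_table FROM artist WHERE artist_id = ?",
                            (artist_id,)).fetchone()
    conn.execute(f"INSERT INTO song_{table} (song_id, artist_id, title)"
                 " VALUES (?, ?, ?)", (song_id, artist_id, title))

def get_song(artist_id, song_id):
    # One extra query resolves the artist's table, then a direct lookup.
    (table,) = conn.execute("SELECT song_table FROM artist WHERE artist_id = ?",
                            (artist_id,)).fetchone()
    return conn.execute(f"SELECT title FROM song_{table} WHERE song_id = ?",
                        (song_id,)).fetchone()
```

Note that every song lookup now costs an extra query just to resolve the artist's table reference.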
Best Answer
Dividing one logical table into multiple tables within the same database has zero benefits. This will complicate queries and may in fact hurt performance, because finding elements is more difficult. Instead of a simple query, you would have to repeat the query for each table and then take the UNION of the results.
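To make that concrete: without a table reference in hand, a single logical lookup fans out over every physical table (a sqlite3 sketch with hypothetical table names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for i in range(10):
    conn.execute(f"CREATE TABLE song_{i} (song_id INTEGER, title TEXT)")
conn.execute("INSERT INTO song_3 VALUES (42, 'Feeling Good')")

# One logical query becomes ten physical ones glued together:
query = " UNION ALL ".join(
    f"SELECT song_id FROM song_{i} WHERE title = ?" for i in range(10)
)
rows = conn.execute(query, ("Feeling Good",) * 10).fetchall()  # → [(42,)]
```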
In a well-administered database, having millions or billions of elements within a table is no problem at all. You will need suitable indices to get bearable performance for your queries, but you should be doing that anyway.
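For example, with a single table, one index on the foreign key keeps per-artist lookups fast regardless of row count (a sqlite3 sketch; names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE song (song_id INTEGER PRIMARY KEY,"
             " artist_id INTEGER, title TEXT)")
conn.execute("CREATE INDEX idx_song_artist ON song (artist_id)")
conn.executemany("INSERT INTO song (artist_id, title) VALUES (?, ?)",
                 [(i % 100, f"track {i}") for i in range(10_000)])

# The planner uses the index instead of scanning all 10,000 rows:
plan = conn.execute("EXPLAIN QUERY PLAN"
                    " SELECT title FROM song WHERE artist_id = ?",
                    (7,)).fetchall()
```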
Sometimes, a “table” is indeed split up so that it can be distributed across multiple databases, or multiple nodes of a distributed database. This is called sharding and is useful if a single database is not sufficient to provide the required read/write performance due to hardware limitations. However, there are some drawbacks.
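When sharding is done in the application instead, rows are typically routed by a deterministic hash of the key (a minimal sketch; production systems often use consistent hashing so that adding nodes does not reshuffle every row):

```python
import hashlib

NUM_SHARDS = 4  # hypothetical node count

def shard_for(song_id: int) -> int:
    # Python's built-in hash() is randomized per process, so use a
    # deterministic digest to pick the node that stores this key.
    digest = hashlib.sha256(str(song_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS
```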
Many databases have built-in support for sharding. A SQL database might transparently divide a table by its primary key – without having to modify any queries (but see the database's manual for caveats, e.g. whether this will relax some ACID guarantees). This clear separation between the logical table structure (exposed via SQL) and the physical table structure (e.g. the storage engine and index data structures) is the major feature of SQL databases!
Where possible, using a read replica database can be preferable over sharding. All writes go to the primary database, but the load of reads can be distributed across replicas. Transactional updates are still possible, although reads from replicas might be out of date.
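Read/write splitting can be sketched as a thin routing layer (a hypothetical wrapper; `primary` and `replicas` stand in for DB-API connections):

```python
import random

class RoutingConnection:
    """Route writes to the primary and reads to a random replica."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def execute(self, sql, params=()):
        # Writes must hit the primary; reads may hit any replica,
        # accepting that a replica can lag slightly behind.
        is_read = sql.lstrip().upper().startswith("SELECT")
        conn = random.choice(self.replicas) if is_read else self.primary
        return conn.execute(sql, params)
```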
So databases have many techniques for improving performance, such as sharding across multiple nodes. But since these can often be applied transparently, you should not modify your table structure in anticipation of them. Quite likely you do not need any scaling techniques at all: a single well-designed database (proper ER modelling, suitable indices, no excessive normalization, …) can deliver sufficient performance.