MySQL – Storing Hundreds of Millions of Records

couchdb, mongodb, MySQL, postgresql

My company will be coming into possession of a dataset consisting of approximately 200-300 million records. The source material is CSV and is about 150GB uncompressed. We would need to perform an initial load of the data and then update approximately 1% of the records on a daily basis. We would love to be able to keep the history of each record as well.
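To make the requirements concrete, here is roughly what I have in mind for the initial load and the per-record history. The table and column names below are just placeholders; the real schema depends on the CSV layout:

```sql
-- Illustrative only: the real columns depend on the CSV layout.
CREATE TABLE records (
    record_id  BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    payload    TEXT,              -- placeholder for the actual fields
    updated_at DATETIME NOT NULL
);

-- Append-only history, written whenever a record changes during the daily update.
CREATE TABLE records_history (
    record_id  BIGINT UNSIGNED NOT NULL,
    payload    TEXT,
    valid_from DATETIME NOT NULL,
    valid_to   DATETIME NULL,
    KEY (record_id, valid_from)
);

-- Initial bulk load straight from the uncompressed CSV.
LOAD DATA INFILE '/data/source.csv'
INTO TABLE records
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(record_id, payload, updated_at);
```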

We currently use MySQL and it appears that some people are using MySQL and PostgreSQL for databases of this size, but I don't see much hard information about their experiences.

We could definitely get away without normalizing the data, and I can envision distributing the information over a lot of servers. How about MongoDB or some other non-traditional datastore?

Does anyone have any thoughts on the feasibility of this sort of endeavor? I appreciate any help you might be able to give.

Best Answer

My experience with datasets of that size is limited to MSSQL, but it can most definitely handle data of that volume.

My first concern is the size of the data. 300 million records at 150GB works out to roughly 500 bytes per row (closer to 750 bytes if it's nearer 200 million records) - that's a fairly wide row for a single table. If you can normalise to 3rd normal form then this could help dramatically (assuming there's data that can be normalised). If you're not going to normalise (and will just have a single, massive table), then a storage engine in the ISAM family is going to be faster than a fully transactional engine for this kind of read-heavy workload, so MySQL with MyISAM is the obvious choice over MSSQL (sorry, I don't have any experience with PostgreSQL or MongoDB).
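To illustrate what I mean by normalising (the table and column names here are made up, not taken from your data), the idea is to pull any repeated descriptive columns out of the big table so that each of the ~300 million rows carries only a small key:

```sql
-- Hypothetical: a wide, repeated 'supplier' block pulled out of the main table.
CREATE TABLE supplier (
    supplier_id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
    name        VARCHAR(200) NOT NULL,
    address     VARCHAR(255),
    UNIQUE KEY (name)
) ENGINE=MyISAM;

-- The big table keeps a small key instead of repeating the descriptive
-- columns on every one of the ~300 million rows.
CREATE TABLE record (
    record_id   BIGINT UNSIGNED NOT NULL PRIMARY KEY,
    supplier_id INT UNSIGNED NOT NULL,
    quantity    INT,
    recorded_at DATETIME NOT NULL,
    KEY (supplier_id)
) ENGINE=MyISAM;
```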

That said, MSSQL can handle a table of that size no worries. It can partition the data so that different parts live on different disks, so you could keep your frequently updated 1% on a fast disk and keep the rest on a slower disk if budget is a concern. If your DBMS of choice supports partitioning then it could be a wise way to go.
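As a standalone sketch of the same idea in MySQL (the partition boundaries and paths are placeholders, and per-partition DATA DIRECTORY placement depends on the storage engine, version, and configuration), it would look something like this:

```sql
-- Rough sketch: range-partition on the update date so the recent,
-- frequently updated slice can live on faster storage.
CREATE TABLE record (
    record_id  BIGINT UNSIGNED NOT NULL,
    updated_at DATETIME NOT NULL,
    payload    TEXT,
    PRIMARY KEY (record_id, updated_at)
)
PARTITION BY RANGE (YEAR(updated_at)) (
    PARTITION p_archive VALUES LESS THAN (2012)
        DATA DIRECTORY = '/slow_disk/mysql',
    PARTITION p_current VALUES LESS THAN MAXVALUE
        DATA DIRECTORY = '/fast_disk/mysql'
);
```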

Just for reference, I once managed a database that had about 200 million rows in a single table (though the table was only 20GB in size), and with some smart indexing, query times were still measured in milliseconds. It was normalised to 3rd normal form, so there were a lot of LEFT OUTER JOINs (LOJs) to retrieve the associated data as well.
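As a made-up example of the kind of indexing and LOJ access pattern I mean (reusing the hypothetical record/supplier tables from the earlier sketch, not my actual schema), a composite index matched to the common lookup keeps access to the big table down to an index seek:

```sql
-- Hypothetical composite index matched to the common lookup, so the big
-- table is reached via an index seek rather than a full scan.
CREATE INDEX ix_record_supplier_date
    ON record (supplier_id, recorded_at);

-- Typical lookup: seek on the index above, then LEFT OUTER JOIN to the
-- normalised lookup table for the descriptive columns.
SELECT r.record_id,
       r.quantity,
       r.recorded_at,
       s.name AS supplier_name
FROM record AS r
LEFT OUTER JOIN supplier AS s
       ON s.supplier_id = r.supplier_id
WHERE r.supplier_id = 12345
  AND r.recorded_at >= '2012-01-01'
ORDER BY r.recorded_at
LIMIT 100;
```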