Speed of MySQL-style index access vs. binary search on a huge file

binary-tree, mysql, performance

I am dead set on moving some huge data sets I'm working with over to a MySQL database, but right now I don't have the time. In the meantime I'm curious about the relative performance of the two approaches: a database index versus a binary search on a sorted file.

Obviously, binary searching a huge sorted file is a nasty performance hit. The chances of finding the data you need in the CPU cache, or even in memory, are pretty bad if you assume a statistically normal distribution of record requests across the file. Unless the whole file is in memory (and some of the files I am working with are 20 GB, so that is impossible on the Win32 system I am using), it is almost certain that a large number of record requests will degrade into the performance-killing operation of an actual hard disk read.
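
To make that concrete, here is a minimal sketch (Python, with an assumed layout of fixed 128-byte records sorted by an 8-byte big-endian key at the front) of the kind of binary search I mean. Every probe is a seek plus a small read, and for a 20 GB file that is roughly log2(20 GB / 128 B) ≈ 27 probes per lookup, most of which land on pages that are very unlikely to be cached:

```python
import os

RECORD_SIZE = 128   # assumed fixed record width, in bytes
KEY_SIZE = 8        # assumed: key is the first 8 bytes, big-endian unsigned

def lookup(path, target_key):
    """Binary search a key-sorted file of fixed-width records; return the record or None."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path) // RECORD_SIZE - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            f.seek(mid * RECORD_SIZE)        # each probe is a random seek...
            record = f.read(RECORD_SIZE)     # ...followed by a small read
            key = int.from_bytes(record[:KEY_SIZE], "big")
            if key == target_key:
                return record
            elif key < target_key:
                lo = mid + 1
            else:
                hi = mid - 1
    return None
```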

Since I have never done any database index programming beyond some really simple B-tree indexing, I wonder how good the indexes created by modern databases like MySQL are at avoiding hard disk hits. Obviously one big advantage they have is that if the index keys are much smaller than the data records they represent, you can fit far more of the index pages, if not all of them, into memory, which avoids a lot of disk hits. But I was wondering whether the code behind those indexes can optimize in other ways, especially when it comes to predicting index page accesses, so that most index page accesses don't turn into disk accesses.
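
For a sense of scale, here is the back-of-envelope I have in mind (all numbers assumed for illustration: 100-byte records, 8-byte keys, roughly 6-byte row pointers):

```python
TABLE_BYTES  = 20 * 1024**3   # the 20 GB data file
RECORD_BYTES = 100            # assumed average record size
INDEX_ENTRY  = 8 + 6          # assumed: 8-byte key + ~6-byte row pointer

rows        = TABLE_BYTES // RECORD_BYTES
index_bytes = rows * INDEX_ENTRY
print(f"{rows:,} rows, index ~{index_bytes / 1024**3:.1f} GB vs 20 GB of data")
```

With those assumed numbers the index comes out around 3 GB, roughly 7x smaller than the data, and the interior B-tree pages (a small fraction of that 3 GB) are the part that typically stays cached, which is where most of the disk-hit avoidance would come from.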

If anyone has had any deep experience with the kind of code underlying MySQL indexing or similar entities, or has done some deep performance testing, I'd like to hear about it.

— roschler

Best Answer

Indexes tend to be much smaller than the table. If the whole index fits in memory, a random lookup averages 1 disk seek (into the table for your actual data). If not, it generally takes 2 disk seeks (one for the index, one into the table). Keep in mind that a disk seek averages about 1/200th of a second (5 ms). If you're going to do a million lookups, this is going to be a problem.
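
A quick back-of-envelope on what that costs, using the 5 ms seek figure above and a million lookups:

```python
SEEK_SECONDS = 1 / 200        # ~5 ms average seek, as above
LOOKUPS = 1_000_000

for seeks_per_lookup in (1, 2):   # index fully in memory vs. index also on disk
    total = LOOKUPS * seeks_per_lookup * SEEK_SECONDS
    print(f"{seeks_per_lookup} seek(s)/lookup: {total:,.0f} s (~{total / 3600:.1f} h)")
```

That works out to roughly 1.4 hours with the index in memory and about 2.8 hours without, just in seek time.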

In general, the right approach for a dataset of that size is to use sorting and re-sorting heavily. That lets data stream to and from disk, which is an operation disk drives are very good at. If you can figure out an algorithm that avoids random lookups, you'll generally get order-of-magnitude improvements.
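
A minimal sketch of that idea, assuming the same fixed-width, key-prefixed record layout as in your scenario: sort the keys you need, then make one sequential pass over the key-sorted data file and advance both sides in step (essentially a merge join), instead of doing N random lookups:

```python
RECORD_SIZE = 128   # assumed fixed record width
KEY_SIZE = 8        # assumed: key in the first 8 bytes, big-endian unsigned

def batch_lookup(path, wanted_keys):
    """One sequential pass over a key-sorted file, matching a sorted set of requested keys."""
    wanted = sorted(set(wanted_keys))   # sorting the requests is the cheap, streamable part
    results, i = {}, 0
    with open(path, "rb") as f:
        while i < len(wanted):
            record = f.read(RECORD_SIZE)      # sequential read: no seeking
            if not record:
                break
            key = int.from_bytes(record[:KEY_SIZE], "big")
            while i < len(wanted) and wanted[i] < key:
                i += 1                        # requested key not present in the file
            if i < len(wanted) and wanted[i] == key:
                results[key] = record
                i += 1
    return results
```

Streaming 20 GB sequentially at, say, 50 MB/s (an assumed but plausible spinning-disk rate) is around 400 seconds, versus the hours of seeking estimated above for a million random lookups.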

As for MySQL, be aware that relational databases are not magic pixie dust. They are an abstraction layer on top of things like B-trees and sorting algorithms. That abstraction can save you a lot of development time, but it will never be faster than raw data manipulation if you know what you are doing: in principle, the same program written against the raw data will outperform the database, frequently by very large margins.

With a smart database (e.g. PostgreSQL or Oracle), there is a chance the database will be faster than you could be, because the optimizer knows a lot about how to structure access to the data and may produce a better query plan than you would have come up with. MySQL's optimizer, however, is not nearly as intelligent, so it is unlikely to save you there. (It does give you SQL, relational constructs, transactions, and so on; just not a good query planner for complex queries over a lot of data.)