Mysql – Querying a Large Dataset Fast: MySQL MEMORY (HEAP) engine, MyISAM, or something else (e.g. MemCached)

databasememcachedMySQLperformance

Currently working on a project that is centered around a medical vocabulary known as SNOMED. At the heart of snomed is a (relational) dataset that is 350,000 terms/records in length. We want to be able to quickly query this dataset for the data entry portion where we would like to have some shape or form of auto-completion/suggestion.

Its currently in a MySQL MyISAM DB just for dev purposes but we want to start playing with some in memory options. It's currently 30MB including the indexes. The MEMORY MySQL Engine and MemCached were the obvious ones, so my question is which of these would you suggest or is there something better out there? We're working in Python primarily at the app level if that makes a difference.

Side Note: The dataset has a separate relation that maps these concepts to common alias names and abbreviations that is about 1.3 million records… eventually we'll want to use a subset of this as well here.

Best Answer

It's not clear exactly what your requirements are, but with such a small database, queries are likely to be fast on any engine as all the data will fit in memory.

Things like MyISAM and InnoDB require some tuning to work. Memcached is only useful if you need to scale to very high-load read-based workloads, i.e. those which are too high for a single server to serve out of a conventional in-memory database (Memcached's main feature is ability to scale across a pool of servers). But memcached cannot do conventional queries such as range scans, only lookups on a specific key, so it's not a general-purpose db.

I'd recommend using innodb with a buffer pool bigger than your data set, and choose your primary keys correctly. A primary key range scan should perform very well and should be able to be used for prefix matching (e.g. completion use-cases).

You will be able to scale to, say, 1000 queries per second with that on normal hardware. If you need more, either add replication slaves or consider something like memcached.

Consider asking on StackOverflow?

Related Topic