Python – Is it conceivable to have millions of lists of data in memory in Python

data structuresMySQLoptimizationpythonspeed

I have over the last 30 days been developing a Python application that utilizes a MySQL database of information (specifically about Norwegian addresses) to perform address validation and correction. The database contains approximately 2.1 million rows (43 columns) of data and occupies 640MB of disk space.

I'm thinking about speed optimizations, and I've got to assume that when validating 10,000+ addresses, each validation running up to 20 queries to the database, networking is a speed bottleneck.

I haven't done any measuring or timing yet, and I'm sure there are simpler ways of speed optimizing the application at the moment, but I just want to get the experts' opinions on how realistic it is to load this amount of data into a row-of-rows structure in Python. Also, would it even be any faster? Surely MySQL is optimized for looking up records among vast amounts of data, so how much help would it even be to remove the networking step? Can you imagine any other viable methods of removing the networking step?

The location of the MySQL server will vary, as the application might well be run from a laptop at home or at the office, where the server would be local.

Best Answer

I would avoid to load all the 2.1 million rows into memory (at least, it would not be my first choice to do so). Even if you can do this with one query, and even if you get around the memory issues, it will still take a while to get all that data to the client. Lookups in that data may be probably faster than by a lot of SQL queries over the network, but if that really pays in comparison to the initial loading is questionable (of course, you will have to measure to be sure).

If I got you right, you don't need the all that data to perform the validation - you need only data for ~10.000 addresses, and a number of records from the database of comparable size. What I would probably try to is to reduce the number of queries to get exactly that data. Perhaps you don't need 20 queries per address and can reduce it to 10? This would be a start. Perhaps you can get the data for multiple addresses in one query instead of many, using the IN or OR operators in the WHERE clauses of your statements? That would probably help, too, since for small amount of data, often only the number of queries is the key factor, not if you get 10, 100 or 1000 bytes result set.

If you want more specific advice, you have to provide more details about the data model and the queries you are using.

EDIT: assumed your address database is frozen for several months, that would qualify to use a local copy of the whole database as a cache. Something like SqlLite should be the logical choice here: a very fast, server-less, in-process (and optionally in-memory) database, where the whole data can be stored in one file. You can still use SQL for all queries you need, and the transition from MySQL tables to SqlLite tables should be straightforward (as long as you did not use any too-MySQL-specific things in your current database so far).

Consider, if you are trying to keep the local copy of your db tables completely in Python row-of-rows, depending on the complexity of the data model, you would probably end up building something like your own "poor-mans" in-memory database on your own. So why reinvent the wheel?

Information about the SqlLite Python interface: http://docs.python.org/2/library/sqlite3.html. Another advantage: if the address data is updated on the server, this approach will more or less easily allow to create the database copy once on the server and send the data in one file over the network to the client.

Related Topic