I would avoid loading all 2.1 million rows into memory (at least, it would not be my first choice). Even if you can do this with one query, and even if you get around the memory issues, it will still take a while to get all that data to the client. Lookups in that data will probably be faster than a lot of SQL queries over the network, but whether that really pays off compared to the initial loading is questionable (of course, you will have to measure to be sure).
If I understand you correctly, you don't need all of that data to perform the validation - you only need the data for ~10,000 addresses, plus a number of database records of comparable size. What I would try first is to reduce the number of queries needed to get exactly that data. Perhaps you don't need 20 queries per address and can reduce it to 10? That would be a start. Perhaps you can get the data for multiple addresses in one query instead of many, using the IN or OR operators in the WHERE clauses of your statements? That would probably help too, since for small amounts of data the number of queries is often the key factor, not whether the result set is 10, 100 or 1000 bytes.
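To make the batching idea concrete, here is a minimal sketch using Python's DB-API. The table and column names (`addresses`, `postcode`, `city`) are hypothetical, and an in-memory SQLite database stands in for your MySQL server; the point is the single `IN (...)` query replacing one query per address:

```python
import sqlite3

# Stand-in for the real server: a tiny table keyed by postcode (names hypothetical).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE addresses (postcode TEXT, city TEXT)")
conn.executemany("INSERT INTO addresses VALUES (?, ?)",
                 [("10115", "Berlin"), ("20095", "Hamburg"), ("80331", "Munich")])

def fetch_batch(conn, postcodes):
    """One round trip for many postcodes instead of one query each."""
    placeholders = ",".join("?" * len(postcodes))
    sql = f"SELECT postcode, city FROM addresses WHERE postcode IN ({placeholders})"
    return dict(conn.execute(sql, postcodes).fetchall())

result = fetch_batch(conn, ["10115", "80331"])
print(result)
```

Note that most databases cap the number of bound parameters per statement, so in practice you would chunk a 10,000-address batch into slices of a few hundred.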
If you want more specific advice, you have to provide more details about the data model and the queries you are using.
EDIT: assuming your address database is frozen for several months, that would justify using a local copy of the whole database as a cache. Something like SQLite should be the logical choice here: a very fast, serverless, in-process (and optionally in-memory) database, where all the data can be stored in one file. You can still use SQL for all the queries you need, and the transition from MySQL tables to SQLite tables should be straightforward (as long as you haven't used anything too MySQL-specific in your current database so far).
Consider: if you try to keep the local copy of your DB tables entirely in Python lists of rows, then depending on the complexity of the data model, you would probably end up building your own "poor man's" in-memory database. So why reinvent the wheel?
Information about the SQLite Python interface: http://docs.python.org/2/library/sqlite3.html. Another advantage: if the address data is updated on the server, this approach makes it fairly easy to create the database copy once on the server and send the data in one file over the network to the client.
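A minimal sketch of the cache idea, assuming a hypothetical `addresses` table; here `:memory:` is used so the snippet runs anywhere, but in practice you would pass a file path so the whole cache can be shipped as one file:

```python
import sqlite3

# One-time build of the local cache; use a file path like "address_cache.db"
# in practice so the database can be copied over the network as a single file.
cache = sqlite3.connect(":memory:")
cache.execute("CREATE TABLE addresses (street TEXT, postcode TEXT, city TEXT)")

# These rows would come from a bulk query against the master MySQL database.
rows = [("Main St 1", "10115", "Berlin"), ("Elm St 2", "20095", "Hamburg")]
cache.executemany("INSERT INTO addresses VALUES (?, ?, ?)", rows)
cache.commit()

# Validation then runs ordinary SQL against the local copy - no network round trips.
hit = cache.execute(
    "SELECT city FROM addresses WHERE postcode = ?", ("10115",)
).fetchone()
print(hit)  # ('Berlin',)
```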
Size is not so much of an issue, the ability to query and maintain the data however is.
If, for example, Greenhaven Press decides they want to change their name to Greenhaven Press International, you'll have to find the record, deserialize it, change it, serialize it, and pump it back into the database.
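That read-modify-write cycle looks roughly like this (field names hypothetical); with a relational schema the same change would be a single `UPDATE ... SET publisher = ... WHERE ...`:

```python
import json

# The value as it sits in the database: one opaque serialized blob.
row = '{"publisher": "Greenhaven Press", "title": "Some Book"}'

record = json.loads(row)                                # deserialize
record["publisher"] = "Greenhaven Press International"  # change
row = json.dumps(record)                                # serialize
# ...then write the whole blob back with an UPDATE on the row.
```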
Consider this: does storing these objects as serialized data offer you a clear added value over storing it in a relational form? If the answer is no, then it might not be worth the hassle.
UPDATE
As far as the update to your question goes: I'm inclined to say no, it makes little or no difference. Whether you update one field or all of them in that JSON string is irrelevant, because the whole process is identical.
Don't forget that your requirements might change: the fact that you're using JSON on the client side now doesn't mean you'll need JSON in the future. Storing your data in relational form guarantees technology independence while preserving relationships, data constraints and queryable metadata: this is where the true value of a relational DB lies. Discarding those advantages will neither give you a performance gain nor make your application more scalable or flexible.
Best Answer
Define a class that contains fields with the appropriate types for the columns in your table and with the same names as they will be represented in your JSON document. The class should contain little or no logic. Such a class is usually called a Data Transfer Object.
You should have another object that loads the data from your database into your DTOs. This is called a Data Mapper (you can use a library for this, e.g. Hibernate, or do it yourself; ironically, your code will often be simpler if you do it yourself).
Another object can convert your DTOs to JSON. Again, you can use a library for this (Jackson is a popular choice, and doesn't tend to introduce complexity the way Hibernate does) or do it yourself (perhaps using a library such as Apache Commons Lang to escape strings, which is annoyingly tricky to get right).
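The answer names Java libraries, but the shape of the DTO / Data Mapper / converter pattern is language-neutral; here is a compact sketch in Python, with all table, column and class names hypothetical:

```python
import json
import sqlite3
from dataclasses import dataclass, asdict

@dataclass
class BookDto:
    """Data Transfer Object: fields mirror the table columns / JSON keys, no logic."""
    title: str
    publisher: str

def load_books(conn):
    """Data Mapper: moves rows out of the database into DTOs."""
    rows = conn.execute("SELECT title, publisher FROM books")
    return [BookDto(title, publisher) for title, publisher in rows]

def to_json(dtos):
    """Converter: turns DTOs into a JSON document."""
    return json.dumps([asdict(d) for d in dtos])

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, publisher TEXT)")
conn.execute("INSERT INTO books VALUES ('Opposing Viewpoints', 'Greenhaven Press')")
document = to_json(load_books(conn))
print(document)
```

Each of the three responsibilities lives in its own small piece, which is exactly why swapping the JSON step for some other output format later touches only the converter.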