Database – Scanning a Billion Rows in an Ultra-Fast Database

Tags: architecture, database, geospatial

Background

A local database contains nearly 1.3 billion unique rows. Each row is indirectly associated with a specific latitude and longitude (location). Each row has a date stamp.

Use Case

The problem is as follows:

  1. The user sets a starting/ending date, and a range of values (e.g., 100 to 105).
  2. The system gathers all the rows that match the given dates, grouped by location.
  3. The system determines the locations that, during those dates, have a statistical likelihood of falling into the given range of values.
  4. The system displays all matching locations to the user.

This is a problem of speed and scale.
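
As a rough illustration of the query shape (a minimal sketch: the table name measurement follows the structure described below, the dates and range are placeholders for user input, and the simple average is a stand-in for the real statistical test):

  SELECT location_id,
         avg(value1) AS avg_value
  FROM   measurement                          -- hypothetical table name
  WHERE  taken BETWEEN DATE '1950-01-01'      -- user-supplied start date
                   AND DATE '1960-12-31'      -- user-supplied end date
  GROUP  BY location_id
  HAVING avg(value1) BETWEEN 100 AND 105;     -- stand-in for the statistical test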

Question

What is the least expensive solution architecture you can imagine that would allow such a system to retrieve results for users in under five seconds?

Current System

The environment is currently:

  • PostgreSQL 8.4 (upgrade is possible; switching databases is not an option)
  • R and PL/R
  • XFS
  • WD VelociRaptor
  • 8 GB RAM (Corsair G.Skill; 1.3 GHz)
  • Quad core GenuineIntel 7 (2.8 GHz)
  • Ubuntu 10.10

Hardware upgrades are acceptable.

Update – Database Structure

The billions of rows are in a table resembling:

id | taken | location_id | category | value1 | value2 | value3
  • id – Primary key
  • taken – Date assigned to the row
  • location_id – Reference to the latitude/longitude
  • category – A description of the data
  • value1 .. 3 – The other values the user can query
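
For reference, a minimal DDL sketch of that structure (the table name measurement and the column types are assumptions; the real schema may differ):

  CREATE TABLE measurement (                  -- hypothetical table name
      id          bigserial PRIMARY KEY,
      taken       date     NOT NULL,          -- date assigned to the row
      location_id integer  NOT NULL,          -- reference to the latitude/longitude
      category    smallint NOT NULL,          -- one of the seven categories
      value1      numeric,
      value2      numeric,
      value3      numeric
  );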

The taken column typically holds consecutive dates for each location_id; some locations have data from 1800 to 2010 (about 77,000 dates), and most dates appear many times because the locations share the same date range.

There are seven categories and the tables are already split by category (using child tables). Each category contains ~190 million rows. In the near future, the number of rows per category will exceed a billion.
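
In PostgreSQL 8.4 that kind of split is usually done with table inheritance plus CHECK constraints so the planner can skip the other categories; a sketch, assuming the measurement parent table above and integer category codes:

  -- One child table per category; the CHECK constraint lets the planner
  -- prune the other six partitions when a query names a single category.
  CREATE TABLE measurement_cat1 (
      CHECK (category = 1)
  ) INHERITS (measurement);

  CREATE TABLE measurement_cat2 (
      CHECK (category = 2)
  ) INHERITS (measurement);
  -- ... and so on for the remaining categories.

  -- Pruning requires constraint_exclusion = 'partition' (the 8.4 default) or 'on'.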

There are approximately 20,000 locations and 70,000 cities. The locations are correlated to cities by latitude and longitude. Assigning each location to a particular city means finding the city's boundaries, which is not a trivial task.
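
If exact city boundaries are too hard to obtain, one cruder alternative (an approximation, not a boundary test) is to assign each location to its nearest city by great-circle distance using the contrib earthdistance module; a sketch, assuming location and city tables that each carry latitude and longitude columns:

  -- Requires the contrib modules cube and earthdistance to be installed.
  -- One-time pre-calculation: the nearest city for each of the ~20,000 locations.
  SELECT l.id AS location_id,
         (SELECT c.id
          FROM   city c
          ORDER  BY earth_distance(ll_to_earth(l.latitude, l.longitude),
                                   ll_to_earth(c.latitude, c.longitude))
          LIMIT  1) AS nearest_city_id
  FROM   location l;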

Ideas

Some ideas I have include:

  • Find a cloud service to host the database.
  • Create an SSD RAID stripe.
  • Create a table that amalgamates all the locations by city (pre-calculation; see the sketch after this list).
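
For the pre-calculation idea, a hedged sketch of a rollup that could be refreshed in batch (the table names city_daily_summary and location_city are hypothetical, and the per-city daily average is a placeholder for whatever statistic the range test really needs):

  -- Pre-aggregate per city, date and category so the interactive query
  -- scans millions of summary rows instead of billions of raw rows.
  CREATE TABLE city_daily_summary AS
  SELECT lc.city_id,
         m.taken,
         m.category,
         avg(m.value1) AS avg_value1,
         count(*)      AS sample_count
  FROM   measurement m
  JOIN   location_city lc ON lc.location_id = m.location_id
  GROUP  BY lc.city_id, m.taken, m.category;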

Thank you!

Best Answer

Since you cannot switch databases, the most important thing is to be absolutely certain where the bottleneck currently lies for a set of representative requests.
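
A sketch of how to check that, assuming a representative request against the hypothetical measurement table above:

  -- Run a representative request and look at where the time goes:
  -- sequential scans suggest missing indexes, long I/O waits suggest
  -- caching/memory, and time inside the aggregate suggests CPU.
  EXPLAIN ANALYZE
  SELECT location_id, avg(value1)
  FROM   measurement
  WHERE  taken BETWEEN DATE '1950-01-01' AND DATE '1960-12-31'
  GROUP  BY location_id;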

If you do full table scans, you need appropriate indexes.
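
For the query shape described above, a plausible composite index per child table (an assumption about which columns are selective; confirm with EXPLAIN ANALYZE):

  -- Lets the date-range filter and per-location grouping be served
  -- without a full scan of the 190-million-row child table.
  CREATE INDEX measurement_cat1_taken_loc_idx
      ON measurement_cat1 (taken, location_id);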

If you wait on I/O, you need more memory for caching (Jeff Atwood recently mentioned that 24 GB was reachable on desktop systems).

If you wait on CPU, you need to see whether your calculations can be optimized.

This requires a pointy DBA hat and an operating-system hat, but it is worth it to ensure you are barking up the right tree.
