pQd's estimate of 7PB seems reasonable, and that's a lot of data for an RDBMS. I'm not sure I've ever heard of someone doing 7PB with any shared disk system, let alone MySQL.
Querying this volume of data with any shared disk system is going to be unusably slow. The fastest SAN hardware maxes out at 20GB/sec even when tuned for large streaming queries. If you can afford SAN hardware of this spec, you can afford to use something better suited to the job than MySQL.
In fact, I'm struggling to conceive of a scenario where you could have a budget for a disk subsystem of this spec but not for a better DBMS platform. Even using 600GB disks (the largest 15K 'enterprise' drive currently on the market) you're up for something like 12,000 physical disk drives to store 7PB. SATA disks would be cheaper (and with 2TB disks you would need around 1/3 of the number), but quite a bit slower.
A SAN of this spec from a major vendor like EMC or Hitachi would run to many millions of dollars. Last time I worked with SAN equipment from a major vendor, the transfer cost of space on an IBM DS8000 was over £10k/TB, not including any capital allowance for the controllers.
You really need a shared-nothing system like Teradata or Netezza for this much data. Sharding a MySQL database might work, but I'd recommend a purpose-built VLDB platform. A shared-nothing system also lets you use much cheaper direct-attach disk on the nodes - take a look at Sun's X4500 ('Thumper') platform for one possibility.
You also need to think about your performance requirements:
- What's an acceptable run time for a query?
- How often will you query your dataset?
- Can the majority of the queries be resolved using an index (i.e. are they going to look at a small fraction - say: less than 1% - of the data), or do they need to do a full table scan?
- How quickly is data going to be loaded into the database?
- Do your queries need up-to-date data or could you live with a periodically refreshed reporting table?
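To make the index question concrete, here is a hedged sketch of the two query shapes (the table and column names are hypothetical, not from the question):

```sql
-- Index-resolvable: touches only the small slice of rows for one customer
-- (assumes an index on (customer_id, event_date) exists)
SELECT SUM(amount)
FROM events
WHERE customer_id = 42
  AND event_date >= '2010-01-01';

-- Full scan: every row must be read to compute the aggregate
SELECT event_type, COUNT(*)
FROM events
GROUP BY event_type;
```

Prefix either statement with EXPLAIN: an access type of 'ref' or 'range' means an index is doing the work, while 'ALL' means a full table scan - and a full scan over 7PB is the scenario you have to budget for.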
In short, the strongest argument against MySQL is that you would be doing backflips to get decent query performance over 7PB of data, if it is possible at all. This volume of data really puts you into shared-nothing territory to make something that will query it reasonably quickly, and you will probably need a platform that was designed for shared-nothing operation from the outset. The disks alone are going to dwarf the cost of any reasonable DBMS platform.
Note: If you do split your operational and reporting databases you don't necessarily have to use the same DBMS platform for both. Getting fast inserts and sub-second reports from the same 7PB table is going to be a technical challenge at the least.
Given that your comments suggest you can live with some latency in reporting, you might consider separate capture and reporting systems, and you may not need to keep all 7PB of data in your operational capture system. Consider an operational platform such as Oracle (MySQL may manage this with InnoDB) for data capture (again, the cost of the disks alone will dwarf the cost of the DBMS unless you have a lot of users) and a VLDB platform like Teradata, Sybase IQ, RedBrick, Netezza (note: proprietary hardware) or Greenplum for reporting.
I am not sure your assumption that an ORDER BY clause would require an index on f1 is actually correct.
I created such a table and ran:

```sql
EXPLAIN SELECT DISTINCT T.f1 AS result
FROM rowtest T
WHERE f2 = 10
ORDER BY result
LIMIT 0, 30;
```
And I got back this:
```
id | select_type | table | type | possible_keys | key    | key_len | ref   | rows | Extra
1  | SIMPLE      | T     | ref  | idx_f2        | idx_f2 | 4       | const | 3    | Using where; Using temporary; Using filesort
```
Now, the fact that the server will use a temporary table and a filesort does not hint at a particularly fast or efficient way of doing this. However, there is nothing in there that says you need an index on f1. Ignore the fact that in my case there will only be 3 rows in the result set (I couldn't afford to create a table with 320 million rows).
Now: if I add an index to the table on column f1, the result of the explain doesn't change at all, which means whether you do or don't have an index doesn't matter.
The reason for this lies in the fact that the server first retrieves all the rows that satisfy the where condition (using the index on f2), and then orders them using a temporary file. During the retrieval of the rows the index on f1 is of no help, and during the ordering stage it is not present.
Considering that your result set is never larger than 30 rows, the ordering in a temporary file will not take up any time at all. Try it for yourself.
EDIT Forget that last sentence, that was nonsense. I just realised that the LIMIT clause is applied AFTER the sorting takes place. So: Yes, the sorting will take some time. If your query really only returns one numerical column, it should be quite fast, though. And one truth remains: An index on f1 doesn't make any difference. Plus: AFAIK, once the rows have all been retrieved, the table is not locked for any other access. And because that doesn't change, there is no impact on other users whether you use the ORDER BY clause or not.
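For what it's worth, a composite index covering both the WHERE column and the ORDER BY column should let MySQL skip the filesort entirely - a sketch, untested against a table of 320 million rows:

```sql
-- With an index on (f2, f1), the rows matching f2 = 10 are already stored
-- in f1 order, so EXPLAIN should no longer report 'Using filesort':
ALTER TABLE rowtest ADD INDEX idx_f2_f1 (f2, f1);

SELECT DISTINCT T.f1 AS result
FROM rowtest T
WHERE f2 = 10
ORDER BY result
LIMIT 0, 30;
```

A single-column index on f1 alone still makes no difference, for the reason given above: the server filters on f2 first, and the f1 index plays no part in either stage.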
Best Answer
First of all, concurrent writes are definitely not an option for the MyISAM storage engine: each write locks the whole table (concurrent reads are possible in some cases). If InnoDB does not suit you, try TokuDB. It will be slower than MyISAM because of the transactional nature of the TokuDB (and, of course, InnoDB) engine: the same data has to be written at least twice, to the journal and to the data files. Also, if your server crashes some day, you may be waiting for hours while your 40GB MyISAM table repairs.
If you still want to load data into MyISAM tables, and want to do it fast, I recommend using

LOAD DATA INFILE

instead of INSERTs. It is the fastest way to load large volumes of data into a table. And yes, indexes will slow down insert performance severely as the table grows.

A word about partitions: INSERT statements in MySQL do not support partition pruning, so all your partitions may be scanned on each statement for unique-index matching. Also, all partitions will be locked until the insert ends.
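As a hedged sketch of a fast MyISAM bulk load (the table name, file path, and delimiters are placeholders, not from the question):

```sql
-- Defer non-unique index maintenance while loading (MyISAM)
ALTER TABLE big_table DISABLE KEYS;

-- Bulk-load a CSV file in one pass
LOAD DATA INFILE '/tmp/import.csv'
  INTO TABLE big_table
  FIELDS TERMINATED BY ',' ENCLOSED BY '"'
  LINES TERMINATED BY '\n';

-- Rebuild the deferred indexes in a single batch
ALTER TABLE big_table ENABLE KEYS;
```

Disabling keys before the load and rebuilding them afterwards is usually much faster than maintaining the indexes row by row, since the rebuild happens in one sorted pass.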