MySQL retireive rows in sorted order on a very large table

myisamMySQL

I have a MyISAM table T with the following schema:

f1 (integer unsigned not null) f2 (integer unsigned not null)

This table has an index on f2 and it currently contains 320 million rows, and is expected to grow at the rate of about 200,000 rows once a week. I perform the following query on this table:

SELECT DISTINCT T.f1 FROM T WHERE f2=@Var LIMIT ?,30

@Var is a variable passed to the stored procedure which executes this query, and the LIMIT variable changes according to the page number being shown (starting at 0, etc)

The speed of retrieval is very good (considering that the table is very large) but the rows appear in the order in which they were written to the table (i.e. not in f1 order). I would like to be able to include the clause "ORDER BY f1 DESC" in the above query, however, doing this without an INDEX would be suicidal! (sometimes there may be over a million rows which satisfy the query and ordering them without an index would probably kill the server)

My question is…what index(es) should be present to cater for the query I am running and also for the ordering of the rows in the result? If the query and sorting cannot be satisfied using indexes, I was thinking of performing an ALTER TABLE T ORDER BY f1 DESC after an update (and while the users can still query the data). In that case, on my development machine, the alter statement took around 50 minutes which is not too bad. Obviously on the LIVE machine I would need to have as much disk free space as the size of the original table…Are there any other considerations that I need to take?

Thanks in advance, Tim

Best Answer

I am not sure your assumption that an ORDER BY clause would require an index on f1 is actually correct. I created such a table and ran

explain SELECT DISTINCT T.f1 as result FROM rowtest T WHERE f2=10 order by result LIMIT 0,30 

And I got back this:

id | select_type | table | type | possible_keys | key    | key_len | ref   | rows | Extra
1  | SIMPLE      | T     | ref  | idx_f2        | idx_f2 | 4       | const | 3    | Using where; Using temporary; Using filesort

Now the fact that the server will be using a temporary table and filesort is not hinting at a particularly fast or efficient way of doing this. However, there is nothing in there that says you need an index on f1. Ignore the fact that in my case there will only be 3 rows in the result set (I couldn't afford to create a table with 320 million rows).

Now: if I add an index to the table on column f1, the result of the explain doesn't change at all, which means whether you do or don't have an index doesn't matter.

The reason for this lies in the fact that the server first retrieves all the rows that satisfy the where condition (using the index on f2), and then orders them using a temporary file. During the retrieval of the rows the index on f1 is of no help, and during the ordering stage it is not present.

Considering that your result set is never larger than 30 rows, the ordering in a temporary file will not take up any time at all. Try it for yourself.

EDIT Forget that last sentence, that was nonsense. I just realised that the LIMIT clause is applied AFTER the sorting takes place. So: Yes, the sorting will take some time. If your query really only returns one numerical column, it should be quite fast, though. And one truth remains: An index on f1 doesn't make any difference. Plus: AFAIK, once the rows have all been retrieved, the table is not locked for any other access. And because that doesn't change, there is no impact on other users whether you use the ORDER BY clause or not.