Sorting algorithms that work on large amounts of data

algorithms, sorting

I am looking for sorting algorithms that can work on a large amount of data, i.e. that can work even when the whole data set cannot be held in main memory at once.

The only candidate I have found so far is merge sort: you can implement the algorithm so that it scans the data set at each merge pass without holding all of the data in main memory at once. The variation of merge sort I have in mind is described in this article, in the section "Use with tape drives".

I think this is a good solution (with complexity O(n log n)), but I am curious to know whether there are other (possibly faster) sorting algorithms that can work on large data sets that do not fit in main memory.
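To make the idea concrete, the merge step only ever needs to look at one record per input file at a time. Here is a minimal sketch in Python of a two-way streaming merge of two already-sorted text files; the file names are illustrative, records are assumed to be newline-terminated lines compared lexicographically, and this is not the article's exact tape scheme:

```python
def merge_two_sorted_files(path_a, path_b, out_path):
    # Stream one record at a time from each input, so neither file
    # needs to fit in memory. Assumes every record ends with a newline.
    with open(path_a, encoding="utf-8") as a, \
         open(path_b, encoding="utf-8") as b, \
         open(out_path, "w", encoding="utf-8") as out:
        line_a = a.readline()
        line_b = b.readline()
        # Repeatedly write the smaller of the two current records.
        while line_a and line_b:
            if line_a <= line_b:
                out.write(line_a)
                line_a = a.readline()
            else:
                out.write(line_b)
                line_b = b.readline()
        # Copy whatever remains in the file that is not yet exhausted.
        while line_a:
            out.write(line_a)
            line_a = a.readline()
        while line_b:
            out.write(line_b)
            line_b = b.readline()
```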

EDIT

Here are some more details, as requested in the answers:

  • The data needs to be sorted periodically, e.g. once a month. I do not need to insert a few records and have the data sorted incrementally.
  • My example text file is about 1 GB UTF-8 text, but I wanted to solve the problem in general, even if the file were, say, 20 GB.
  • It is not in a database and, due to other constraints, it cannot be.
  • The data is dumped by others as a text file; I have my own code to read this text file.
  • The format of the data is plain text: newline characters are the record separators.

One possible improvement I had in mind was to split the file into files that are small enough to be sorted in memory, and finally merge all these files using the algorithm I have described above.
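A rough sketch of that split-then-merge approach in Python, assuming newline-separated UTF-8 records compared lexicographically; the chunk size and temporary-file handling are illustrative, not tuned:

```python
import heapq
import os
import tempfile

def external_sort(input_path, output_path, chunk_size=1_000_000):
    run_paths = []
    # Phase 1: cut the input into runs small enough to sort in memory,
    # sort each run, and write it to its own temporary file.
    with open(input_path, encoding="utf-8") as src:
        while True:
            lines = [line for _, line in zip(range(chunk_size), src)]
            if not lines:
                break
            lines.sort()
            fd, path = tempfile.mkstemp(suffix=".run")
            with os.fdopen(fd, "w", encoding="utf-8") as run:
                run.writelines(lines)
            run_paths.append(path)

    # Phase 2: k-way merge of the sorted runs, reading them lazily
    # so only one record per run is in memory at a time.
    runs = [open(p, encoding="utf-8") for p in run_paths]
    try:
        with open(output_path, "w", encoding="utf-8") as out:
            out.writelines(heapq.merge(*runs))
    finally:
        for f in runs:
            f.close()
        for p in run_paths:
            os.remove(p)
```

The merge phase leans on `heapq.merge`, which keeps a small heap over the current head record of each run, so memory use stays proportional to the number of runs rather than the size of the data.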

Best Answer

The canonical reference on sorting and searching is Knuth, Vol. 3. Start there.

The book was originally written back when computers were a lot smaller and slower than they are now, which made out-of-memory sorting techniques more important than they are perceived to be today.