If you look at your code for swapping, you have:
// If current element is lower than pivot
// then swap it with the element at store_index
// and move the store_index to the right.
But ~50% of the time, the string you just swapped needs to be moved back later, which is why faster quicksort partitioning schemes work from both ends at the same time.
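To illustrate, here is a minimal sketch of a both-ends (Hoare-style) partition, assuming lines is an array of char * and start/end are inclusive indices as in your code. Note that with this scheme the recursive calls become quick_sort(lines, start, p) and quick_sort(lines, p + 1, end):

#include <string.h>

/* Hoare-style partition: scan from both ends toward the middle and
 * swap out-of-place pairs, so a string is never swapped only to be
 * swapped back later in the same pass. Returns the split point p. */
static int hoare_partition(char *lines[], int start, int end)
{
    char *pivot = lines[start + (end - start) / 2];
    int i = start - 1;
    int j = end + 1;

    for (;;) {
        do { i++; } while (strcmp(lines[i], pivot) < 0);
        do { j--; } while (strcmp(lines[j], pivot) > 0);
        if (i >= j)
            return j;
        char *tmp = lines[i];
        lines[i] = lines[j];
        lines[j] = tmp;
    }
}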
Next, if you check whether the sub-range contains more than one element before making each recursive call, you avoid wasting time calling a function only to quickly exit it. This happens roughly 10,000,000 times in your final test, which adds a noticeable amount of time.
Use:

if (pivot_index - 1 > start)
    quick_sort(lines, start, pivot_index - 1);
if (pivot_index + 1 < end)
    quick_sort(lines, pivot_index + 1, end);
You still want an outer function to do an initial if (start < end) check, but that only needs to happen once, so the outer function can simply call an unchecked version of your code that omits the comparison.
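For example, a sketch of that split, assuming your partition() returns the pivot's final index:

int partition(char *lines[], int start, int end);  /* assumed: your partition */

static void quick_sort_unchecked(char *lines[], int start, int end);

/* Public entry point: performs the start < end check exactly once. */
void quick_sort(char *lines[], int start, int end)
{
    if (start < end)
        quick_sort_unchecked(lines, start, end);
}

/* Recursive worker: callers guarantee start < end, and the two
 * index checks preserve that guarantee for the recursive calls. */
static void quick_sort_unchecked(char *lines[], int start, int end)
{
    int pivot_index = partition(lines, start, end);
    if (pivot_index - 1 > start)
        quick_sort_unchecked(lines, start, pivot_index - 1);
    if (pivot_index + 1 < end)
        quick_sort_unchecked(lines, pivot_index + 1, end);
}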
Also, picking a random pivot tends to avoid O(N^2) worst-case behaviour, but it's probably not a big deal with your random data set.
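If you do want a random pivot, one common sketch (assuming a Lomuto-style partition() that uses lines[end] as its pivot) is to swap a randomly chosen element into the pivot slot first:

#include <stdlib.h>

int partition(char *lines[], int start, int end);  /* assumed: pivots on lines[end] */

/* Swap a random element into the pivot slot, then partition as usual.
 * (rand() % n has slight modulo bias, which is harmless here.) */
static int random_partition(char *lines[], int start, int end)
{
    int r = start + rand() % (end - start + 1);
    char *tmp = lines[r];
    lines[r] = lines[end];
    lines[end] = tmp;
    return partition(lines, start, end);
}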
Finally, the hidden problem is that quicksort compares strings in ever smaller buckets whose contents are ever closer together.
(Edit: So AAAAA, AAAAB, AAAAC, AAAAD, then AAAAA, AAAAB. strcmp needs to step through a lot of A's before reaching the useful parts of the strings.)
With merge sort, by contrast, you look at the smallest buckets first, while their contents are still fairly random. Merge sort's final passes do compare a lot of strings that are close to each other, but it's less of an issue by then. One way to make quicksort faster for strings is to compare the leading characters of the bucket's boundary strings and, if they're the same, skip them in the inner comparisons; but you have to be careful that all strings are long enough that you're not skipping past the null terminator.
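A sketch of that trick: for a bucket bounded below and above by strings lo and hi, every string lexicographically between them shares their common prefix, and since those prefix characters are all non-NUL, offsetting the pointers cannot skip past a null terminator (common_prefix and strcmp_skip are illustrative names):

#include <string.h>

/* Length of the common prefix of the bucket's lower and upper bounds;
 * every string lexicographically between them shares that prefix. */
static size_t common_prefix(const char *lo, const char *hi)
{
    size_t n = 0;
    while (lo[n] != '\0' && lo[n] == hi[n])
        n++;
    return n;
}

/* Compare two strings in the bucket, skipping the shared prefix.
 * Safe because characters 0..prefix-1 are known to be non-NUL. */
static int strcmp_skip(const char *a, const char *b, size_t prefix)
{
    return strcmp(a + prefix, b + prefix);
}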
Has anyone tested dual pivot quicksort performance with expensive-to-swap elements? It seems that in this case, it should massively underperform compared to standard quicksort.
...
During some research I also came upon dual-pivot quicksort, which is the current quicksort implementation in the Java standard library. It is generally claimed to be always at least as good as standard quicksort, and empirical testing seemed to support that (which is why it is the current implementation).
Be careful interpreting these claims. Often the comparison is against a "strawman" single-pivot quicksort implementation (usually called "classic" or, as you say, "standard" quicksort) which uses a random pivot. This is sometimes also disguised by using unusual notation, e.g. reporting 2 N ln N comparisons for "single-pivot" quicksort. That translates to 1.386 N log (base 2) N comparisons, which is characteristic of selecting a single random element as the pivot. Random pivot selection not only performs poorly (even the worst-performing single-pivot quicksort implementation of qsort in widespread use is closer to 1.15 N log (base 2) N comparisons), but it also leads to difficult-to-maintain code (suppose you want to replicate a bug: what implementation of random number generation was used, and what was its state at the time the bug happened?).
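For reference, the base conversion behind that notational disguise:

\[
2N \ln N = (2 \ln 2)\, N \log_2 N \approx 1.386\, N \log_2 N
\]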
However, it seems that no STL implementation uses dual-pivot quicksort for the quicksort phase of introsort, which made me wonder why. After more research I found this paper. It says that while dual-pivot quicksort performs on average 5% fewer comparisons, it performs significantly more swaps (approximately 80% more). Obviously, since Java has only primitives and reference types, swapping is always cheap. (Even so, it uses this sort only for primitives, because it is not stable.)
Again, be careful with these comparisons. "5% fewer" under what specific conditions? Under the best possible conditions (zero-cost, "perfect" pivots: the median for single-pivot, tertiles for dual-pivot) and uniformly-distributed random input, dual-pivot quicksort will use more than 5% more comparisons than single-pivot quicksort (5/3 N log (base 3) N ≈ 1.052 N log (base 2) N vs. 1 N log (base 2) N). Swapping also depends on the implementation. Single-pivot quicksort (conditions as specified above) is expected to use 0.25 N log (base 2) N swaps if implemented efficiently. A dual-pivot implementation could theoretically achieve 1/3 N log (base 3) N ≈ 0.21 N log (base 2) N swaps (16% fewer), but that requires a great deal of bookkeeping; more typical would be 0.28 N log (base 2) N (12% more). Note that there are many low-cost, highly effective ways to approximate the median (i.e. a pseudomedian) for single-pivot quicksort. Not so for tertiles.
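The base-3 figures convert the same way:

\[
\tfrac{5}{3} N \log_3 N = \frac{5}{3 \log_2 3}\, N \log_2 N \approx 1.052\, N \log_2 N,
\qquad
\tfrac{1}{3} N \log_3 N \approx 0.210\, N \log_2 N
\]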
One probably wouldn't want to use Musser introsort (recursion depth limit) with a multi-pivot scheme. Recursion depth isn't well defined in such a case (consider that dual-pivot can behave as single-pivot if the two pivots happen to have close values, so would you compare to some multiple of the base 2 or the base 3 logarithm of array size?). Valois introsort (randomly shuffling elements) has other issues (see above re. replicating bugs).
It also seems that at least part of the advantage of dual-pivot quicksort is its improved cache behaviour (because it divides the input into smaller subarrays that fit into cache sooner).
That has been conjectured, but not definitively demonstrated. It may be a red herring; quicksort is mostly cache-oblivious as accesses tend to be sequential.
So I wanted to see whether someone already tested standard quicksort vs dual pivot quicksort when elements are expensive to swap and has the numbers (and possibly source) lying around, or whether I will have to test this myself.
There's code (in C) including a testing framework at https://github.com/brucelilly/quickselect and multi-pivot issues are discussed in some detail in https://github.com/brucelilly/quickselect/blob/master/lib/libmedian/doc/pub/generic/paper.pdf. The code includes two dual-pivot implementations and many single-pivot implementations. A highly-tuned dual-pivot implementation is indeed a bit more than 1.052 N log N comparisons asymptotically, and several single-pivot implementations are lower, closer to 1 N log N comparisons (the paper includes several performance graphs). I haven't attempted to minimize the swaps in the dual-pivot code; the necessary bookkeeping is really onerous, and since comparisons outnumber swaps, swaps would have to be really expensive to be a factor, in which case one would probably use indirection (rearranging pointers to the data).
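For what it's worth, the indirection idea at the end is the standard pointer-sorting technique; this sketch is not from the linked code, and struct record with its payload size is made up for illustration:

#include <stdlib.h>
#include <string.h>

/* When elements are expensive to move, sort pointers to them instead:
 * each "swap" then moves one pointer regardless of element size. */
struct record { char payload[256]; };

static int cmp_record_ptr(const void *a, const void *b)
{
    const struct record *ra = *(const struct record *const *)a;
    const struct record *rb = *(const struct record *const *)b;
    return strcmp(ra->payload, rb->payload);
}

/* Fill out[] with pointers into recs[] and sort the pointers. */
void sort_indirect(struct record recs[], size_t n, struct record *out[])
{
    for (size_t i = 0; i < n; i++)
        out[i] = &recs[i];
    qsort(out, n, sizeof out[0], cmp_record_ptr);
}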
Best Answer
When you find a statement like "M number of comparisons" about sorting algorithms in the literature, the author typically means the number of comparisons between the sorted elements, not comparisons of something like loop indexes. So if you are asked this in a university assignment, I would guess that this is the number that is meant (but to make that clear, when you give your answer, add a note describing exactly what you counted).
Note that this assumes the predominant operations in a sorting algorithm are comparisons and "swaps", and that there are no more loop iterations than comparisons between the sorted elements. Given the latter, it is irrelevant for calculating Big-O whether you count the "loop index comparisons" or not.
Furthermore, note that a term like "exact number of comparisons" (as opposed to "O(number of comparisons)") only makes sense for a specific implementation of an algorithm, not for an algorithm in general. So when you compare the number of comparisons of your implementation of bubble sort to a different implementation from a different author, you have to make sure you are not comparing apples and oranges. That is why "Big-O" comparisons make more sense when comparing algorithms at a theoretical level.
However, from a more practical point of view, counting the exact "number of comparisons" (or swaps) will indeed make sense when it comes to speed comparisons of specific sorting implementations on real-world data.
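If you do end up counting for a specific implementation, the usual sketch is to count inside the comparison function itself, so that only element-vs-element comparisons are tallied (counting_cmp is an illustrative name):

#include <stdio.h>
#include <stdlib.h>

static unsigned long comparison_count;  /* element comparisons only */

/* Each call is exactly one comparison between sorted elements;
 * qsort's internal loop-index comparisons are not counted. */
static int counting_cmp(const void *a, const void *b)
{
    int x = *(const int *)a;
    int y = *(const int *)b;
    comparison_count++;
    return (x > y) - (x < y);
}

int main(void)
{
    int data[] = { 5, 3, 8, 1, 9, 2, 7 };
    size_t n = sizeof data / sizeof data[0];

    qsort(data, n, sizeof data[0], counting_cmp);
    printf("%zu elements sorted in %lu comparisons\n", n, comparison_count);
    return 0;
}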