Algorithm to merge two sorted arrays with minimum number of comparisons

algorithms, big-o, comparison

Given are two sorted arrays a and b of type T, with sizes n and m. I am looking for an algorithm that merges the two arrays into a new array (of maximum size n+m).

If you have a cheap comparison operation, this is pretty simple. Just take from the array with the lowest first element until one or both arrays are completely traversed, then add the remaining elements. Something like this https://stackoverflow.com/questions/5958169/how-to-merge-two-sorted-arrays-into-a-sorted-array
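A minimal sketch of this simple merge (in Python for brevity; the question itself is language-agnostic). Each comparison emits exactly one element, and once one input runs out the remaining tail is copied for free, which is why the total is at most n+m-1 comparisons:

```python
def simple_merge(a, b):
    """Textbook merge of two sorted lists: one comparison per emitted
    element until one input runs out, hence at most n + m - 1 comparisons."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:              # the one comparison per step
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])                 # at most one of these is non-empty;
    out.extend(b[j:])                 # the leftover tail costs no comparisons
    return out
```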

However, the situation changes when comparing two elements is much more expensive than copying an element from the source array to the target array. For example, you might have an array of large arbitrary-precision integers, or strings, where a comparison can be quite expensive. Just assume that creating arrays and copying elements is free, and the only thing that costs is comparing elements.

In this case, you want to merge the two arrays with a minimum number of element comparisons. Here are some examples where you should be able to do much better than the simple merge algorithm:

a = [1,2,3,4, ... 1000]
b = [1001,1002,1003,1004, ... 2000]

Or

a = [1,2,3,4, ... 1000]
b = [0,100,200, ... 1000]

There are some cases where the simple merge algorithm will be optimal, like

a = [1,3,5,7,9,....,999]
b = [2,4,6,8,10,....,1000]

So the algorithm should ideally degrade gracefully and perform at most n+m-1 comparisons when the arrays are interleaved, or at least not be significantly worse.

One approach that should do pretty well for lists with a large size difference is to use binary search to insert the elements of the smaller array into the larger array. But that won't degrade gracefully when both lists are the same size and interleaved.
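A sketch of that idea, using Python's `bisect` module for the binary search (`merge_by_insertion` is a hypothetical helper name). Each of the m elements of the smaller list is placed with about log2(n) comparisons, so the total is roughly m·log2(n) instead of n+m-1, which is a big win when m is much smaller than n:

```python
import bisect

def merge_by_insertion(small, large):
    """Merge by binary-searching each element of the smaller sorted list
    into (a copy of) the larger one. Comparison cost is roughly
    len(small) * log2(len(large)); element copies are treated as free."""
    out = list(large)
    for x in small:
        bisect.insort(out, x)   # binary search + insert, keeps `out` sorted
    return out
```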

The only thing available for the elements is a (total) ordering function, so any scheme that makes comparisons cheaper is not possible.

Any ideas?

I have come up with this bit of Scala. I believe it is optimal regarding the number of comparisons, but it is beyond my ability to prove it. At least it is much simpler than anything I have found in the literature.

Since the original posting, I have written a blog post about how this works.

Best Answer

The merge step of the normal merge sort algorithm applies n + m - 1 comparisons in the worst case, where one list is of size n and the other of size m. Using this algorithm is the simplest approach to combining two sorted lists.

If the comparisons are too expensive, you can do two things: either minimize the number of comparisons, or minimize the cost of each comparison.

Let's focus on minimizing the cost of a comparison. Only you can decide whether the data you are comparing can be quantized. Quantizing means deriving a short key from each element, a kind of hash that preserves the ordering. E.g. if your data is compared by name, you can take the first two characters of each part of the name: "Klaehn, Ruediger" reduces to "Kl.Ru", and comparing it to "Packer, The", which reduces to "Pa.Th", preserves the ordering. You can now apply a cheaper comparison algorithm to the reduced values. But if you find another "Kl.Ru", the keys are equal, and you have to switch to the more expensive comparison of the full elements.

If you can extract this quantized value from your data faster than comparing two elements, do that first and compare the quantized or hashed values. Keep in mind that this value needs to be computed only once, so you can compute it when creating the data element.
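A hedged sketch of this quantization idea: `make_key` is a hypothetical helper that assumes names in "Last, First" form, and the two-character prefix key is only order-preserving for typical name data. A production version must guarantee that key(x) < key(y) implies x < y before it may skip the full comparison:

```python
def make_key(name):
    """Derive the short, order-preserving key from the answer's example:
    the first two characters of last and first name ("Klaehn, Ruediger"
    -> "Kl.Ru"). Computed once, when the data element is created."""
    last, first = (part.strip() for part in name.split(","))
    return last[:2] + "." + first[:2]

def compare(x, y):
    """Compare (key, full_value) pairs: cheap key comparison first, falling
    back to the expensive full comparison only when the keys tie."""
    if x[0] < y[0]:
        return -1
    if x[0] > y[0]:
        return 1
    # Keys are equal ("another Kl.Ru"): resolve with the expensive
    # comparison on the full values.
    return (x[1] > y[1]) - (x[1] < y[1])
```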

Now for the other option I mentioned: minimizing the number of comparisons.

I had a look into the classic book TAOCP, Volume 3: Sorting and Searching (section 5.3.2, pp. 197-207), which spends a full ten pages on this topic. I found two references to algorithms that are faster than n+m-1 comparisons.

First there is the Hwang-Lin merge algorithm, and second an improvement by Glenn K. Manacher. Both are cited by TAOCP, as well as an algorithm by Christen, which approaches the lower bound on the number of comparisons under special conditions on the lengths n and m of the lists.

The algorithm of Manacher was presented in the Journal of the ACM, Vol. 26, No. 3, pp. 434-440: "Significant Improvements to the Hwang-Lin Merging Algorithm". The two lists can be of different lengths m and n, but they are assumed to be ordered by the number of elements they contain, with m <= n.

The Hwang-Lin algorithm breaks the lists to be merged apart into smaller sublists and compares the first element of each sublist to decide whether the elements inside a sublist need to be compared at all. If the first list is much smaller than the second, the chance is high that consecutive elements of the longer list can be transferred into the resulting list without comparison: if the first element of the shorter list is greater than the first element of some sublist of the split longer list, all elements in front of that sublist can be copied without comparison.

"Average case analysis of the merging algorithm of Hwang and Lin" (Vega, Frieze, Santha) gives pseudocode for the HL algorithm in Section 2, which is a lot better than my description. You can also see why there are fewer comparisons: the algorithm uses a binary search to find the index at which to insert the next element from the shorter list.

If the lists are not interleaved as in your last example, in most cases you will be left with a remaining smaller and a remaining larger list. This is when the HL algorithm starts to perform better.
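The block-skipping idea can also be sketched with an exponential ("galloping") search, as in Timsort's galloping mode; this is a sketch in the spirit of Hwang-Lin, not the exact HL or Manacher algorithm (`galloping_merge` and `first_true` are hypothetical names). On inputs like the disjoint examples in the question it needs only O(log n) comparisons in total, while on fully interleaved inputs it pays about two comparisons per element, a constant factor worse than the simple merge's n+m-1:

```python
def galloping_merge(a, b):
    """Merge two sorted lists, returning (merged, comparison_count).

    Instead of one comparison per output element, it finds the end of each
    run of consecutive elements from the same input with an exponential
    probe followed by a binary search, so a run of length r costs only
    O(log r) comparisons."""
    comparisons = 0

    def less(x, y):
        nonlocal comparisons          # count every element comparison
        comparisons += 1
        return x < y

    def first_true(pred, lo, hi):
        # Smallest k in [lo, hi] with pred(k) true; pred must be monotone
        # (all False, then all True). Exponential probe, then binary search.
        if lo >= hi or pred(lo):
            return lo
        step, prev = 1, lo
        while lo + step < hi and not pred(lo + step):
            prev, step = lo + step, step * 2
        left, right = prev + 1, min(lo + step, hi)
        while left < right:
            mid = (left + right) // 2
            if pred(mid):
                right = mid
            else:
                left = mid + 1
        return left

    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        # Copy the run of a-elements that are <= b[j] (ties from a: stable).
        cut = first_true(lambda k: less(b[j], a[k]), i, len(a))
        out.extend(a[i:cut]); i = cut
        if i == len(a):
            break
        # Copy the run of b-elements that are < a[i].
        cut = first_true(lambda k: not less(b[k], a[i]), j, len(b))
        out.extend(b[j:cut]); j = cut
    out.extend(a[i:])
    out.extend(b[j:])
    return out, comparisons
```

Note that Timsort only switches to galloping mode after one input has won several comparisons in a row, which is how it avoids the constant-factor penalty on interleaved data.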