C++ – Fast Algorithm to Search a Sorted Array of Floats

I have an array of floats, sorted from smallest to largest, and need to be able to pick out the nearest float greater than or less than a passed input value. This input value is not necessarily present as a value in the array.

A naive approach would be to do a simple linear search through the array. That might look like this:

#include <cassert>
#include <vector>

void FindClosestFloatsInArray( float input, const std::vector<float> &array,
                               float *min_out, float *max_out )
{
    assert( input >= array[0] && input < array[ array.size()-1 ] );
    for( size_t i = 1; i < array.size(); i++ )
    {
        if ( array[i] >= input )
        {
            *min_out = array[i-1];
            *max_out = array[i];
            return;  // the first bracketing pair is the answer
        }
    }
}

But obviously as the array gets larger, this will become slower and slower.

Does anyone have an idea about an algorithm that would let me find this data more optimally? I've already switched to a binary search, which has improved matters somewhat, but it's still a lot slower than I'd like, and since I'm not actually looking for a specific value that exists in the array, it can never terminate early.

More information: The floating-point values in the array are not necessarily distributed evenly (for example, the array could consist of the values "1.f, 2.f, 3.f, 4.f, 100.f, 1200.f, 1203.f, 1400.f").

I'm doing this operation hundreds of thousands of times, but I can do any amount of pre-processing on the array of floats, if it'll improve the lookup time. I absolutely can change to use something other than a vector to store them, if that'll help.

Best Answer

The code in the question (a linear search), as you rightly point out, is going to get slow for large float arrays. Technically it's O(n) where n is the number of float values in your array.

In general, the best that you can do to find a value in an ordered array is a recursive tree search of some kind (e.g. binary search), in which case you can achieve an O(log n) lookup time in the number of elements in your array. O(log n) is much better than O(n) for large values of n.

My suggested approach would therefore be a simple binary search of the array, i.e.:

  1. Set min/max integer indexes to cover your whole float array
  2. Test the value in the middle of the range, at index mid = (min+max)/2, against the search value x
  3. If x is lower than this value, set max to mid; otherwise set min to mid
  4. Repeat (2-3) until min and max are adjacent indices, which then bracket x

This is an O(log n) algorithm which should be fast enough for nearly all situations. Intuitively, it works by halving the range to be searched at each step until the input is pinned between two adjacent elements.
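A minimal sketch of those steps in C++ (the function name mirrors the one in the question, but this is an illustration, not the questioner's code):

#include <cassert>
#include <vector>

// Bracketing binary search: narrows [lo, hi] until the indices are adjacent,
// at which point array[lo] <= input < array[hi].
void FindClosestFloatsBinary( float input, const std::vector<float> &array,
                              float *min_out, float *max_out )
{
    assert( array.size() >= 2 );
    assert( input >= array.front() && input < array.back() );

    size_t lo = 0;
    size_t hi = array.size() - 1;
    while ( hi - lo > 1 )
    {
        size_t mid = lo + ( hi - lo ) / 2;  // same as (lo+hi)/2, minus the overflow risk
        if ( input < array[mid] )
            hi = mid;
        else
            lo = mid;
    }
    *min_out = array[lo];
    *max_out = array[hi];
}

Note that the standard library already implements this search: std::lower_bound on a sorted std::vector<float> returns an iterator to the first element >= input in O(log n), so you may not need to hand-roll the loop at all.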

It's really hard to beat the simple binary search, so if you have already implemented it correctly then you may be pretty close to optimal already. However, if you know the distribution of the data and/or have a limited range of lookup values (x), there are still some more advanced tricks you can try:

  • Bucketing - create buckets (e.g. for each interval between two integers), each of which contains a smaller sorted list of the float values between the two bounding integers, plus the two values immediately below and immediately above each range. You can then start your search at (trunc(x)+0.5). This should give you a good speedup if you choose appropriately sized buckets, since it effectively increases the branching factor of the tree. If integers don't work for you, you can try buckets of some other fixed-point precision (e.g. multiples of 1/16). See the first sketch after this list.
  • Bit-mapping - if the range of possible lookup values is small enough, you could try creating a big lookup table indexed by the bit pattern of x. This would be O(1), but you may need a lot of memory, which will be very unfriendly to your cache, so use with caution. It is especially nasty because you are looking up float values: covering all of the less significant bits could easily require several GB.
  • Rounding and hashing - hash tables are probably not the best data structure for this problem, but if you can survive losing a little accuracy they can work: simply round off the lowest bits of your lookup values and use a hash map to look up the correct answer directly. You will have to experiment to find the right trade-off between table size and precision, and also ensure that all possible rounded keys are populated, so this can be a bit tricky. See the second sketch after this list.
  • Tree-balancing - your ideal tree has a 50% chance of going left or right at each node. So if you build a tree based on the distribution of lookup values (x), you can optimise it to produce answers in the minimal number of tests. This is likely to be a good solution if a lot of values in your float array are very close together, since it lets you avoid searching those branches too often.
  • Crit-bit trees - these are still trees (so still O(log n)), though they can be faster than comparison-based searches in some cases; you would, however, need to convert your floats into some fixed-point format in order to make the bitwise comparisons work.
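Here is a minimal sketch of the bucketing idea, assuming unit-width buckets and a value range known at preprocessing time (the BucketedFloats name and index layout are my own illustration, not part of the original answer):

#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Bucketed index with unit-width buckets: bucket_start[b] holds the index of
// the first value >= base + b, so everything that can bracket an input in
// [base+b, base+b+1) lives in the slice [bucket_start[b], bucket_start[b+1]].
struct BucketedFloats
{
    std::vector<float> values;        // the sorted input array
    std::vector<size_t> bucket_start; // one entry per unit interval, plus a sentinel
    float base;                       // lower bound of bucket 0

    explicit BucketedFloats( std::vector<float> sorted )
        : values( std::move( sorted ) )
    {
        base = std::floor( values.front() );
        size_t num_buckets =
            static_cast<size_t>( std::floor( values.back() ) - base ) + 1;
        bucket_start.resize( num_buckets + 1 );
        size_t v = 0;
        for( size_t b = 0; b <= num_buckets; b++ )
        {
            while( v < values.size() && values[v] < base + static_cast<float>( b ) )
                v++;
            bucket_start[b] = v;
        }
    }

    void FindClosest( float input, float *min_out, float *max_out ) const
    {
        assert( input >= values.front() && input < values.back() );
        size_t b = static_cast<size_t>( std::floor( input ) - base );
        // Binary-search only the small slice covering input's unit interval.
        auto it = std::lower_bound( values.begin() + bucket_start[b],
                                    values.begin() + bucket_start[b + 1],
                                    input );               // first value >= input
        size_t idx = static_cast<size_t>( it - values.begin() );
        *max_out = values[idx];
        *min_out = values[idx > 0 ? idx - 1 : 0];  // guard for input == values.front()
    }
};

The bucket index is computed in O(1), and the final std::lower_bound only ever touches the handful of values that fall inside one unit interval, which is exactly the "increased branching factor" effect described above.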
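And a sketch of the rounding-and-hashing idea. The step of 1/4 and the RoundedLookup name are assumptions for illustration; the essential points are that every rounded key in the data's range must be populated during preprocessing, and that any two inputs rounding to the same key get the same (slightly inaccurate) answer:

#include <algorithm>
#include <cmath>
#include <unordered_map>
#include <utility>
#include <vector>

// Maps every multiple of kStep in the data's range to a bracketing pair, so a
// lookup is a single hash probe. The returned pair brackets the rounded key
// rather than the exact input -- that is the accuracy being traded away.
class RoundedLookup
{
public:
    explicit RoundedLookup( const std::vector<float> &sorted )
    {
        long first_key = static_cast<long>( std::floor( sorted.front() / kStep ) );
        long last_key  = static_cast<long>( std::floor( sorted.back()  / kStep ) );
        for( long k = first_key; k <= last_key; k++ )
        {
            float q = static_cast<float>( k ) * kStep;
            // Bracket the key itself with an ordinary binary search.
            auto it = std::lower_bound( sorted.begin(), sorted.end(), q );
            float above = ( it == sorted.end() )   ? sorted.back()  : *it;
            float below = ( it == sorted.begin() ) ? sorted.front() : *( it - 1 );
            table_[k] = std::make_pair( below, above );
        }
    }

    // Precondition: input lies within the range of the preprocessed array
    // (at() throws for keys outside it).
    void FindClosest( float input, float *min_out, float *max_out ) const
    {
        long k = static_cast<long>( std::floor( input / kStep ) );
        const std::pair<float, float> &entry = table_.at( k );
        *min_out = entry.first;
        *max_out = entry.second;
    }

private:
    static constexpr float kStep = 0.25f;  // assumed precision: 1/4
    std::unordered_map<long, std::pair<float, float>> table_;
};

A smaller step gives better accuracy but a bigger table; and since the keys here are contiguous integers, a plain array indexed by (k - first_key) would drop the hashing overhead entirely, echoing the point above that hash tables are probably not the ideal structure for this problem.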

However, unless you are in a very special situation, I'd probably recommend sticking with the simple binary search. Reasons:

  • it's much easier to implement
  • it's very fast for most common cases
  • the extra overhead of the more complex approaches (e.g. higher memory usage / cache pressure) often outweighs the minor theoretical gains
  • it will be more robust to future changes in the data distribution