Algorithm to find top 10 search terms

algorithm, data structures

I'm currently preparing for an interview, and it reminded me of a question I was once asked in a previous interview that went something like this:

"You have been asked to design some software to continuously display the top 10 search terms on Google. You are given access to a feed that provides an endless real-time stream of search terms currently being searched on Google. Describe what algorithm and data structures you would use to implement this. You are to design two variations:

(i) Display the top 10 search terms of all time (i.e. since you started reading the feed).

(ii) Display only the top 10 search terms for the past month, updated hourly.

You can use an approximation to obtain the top 10 list, but you must justify your choices."

I bombed in this interview and still have really no idea how to implement this.

The first part asks for the 10 most frequent items in a continuously growing sub-sequence of an infinite list. I looked into selection algorithms, but couldn't find any online versions to solve this problem.

The second part uses a finite list, but due to the large amount of data being processed, you can't really store the whole month of search terms in memory and calculate a histogram every hour.

The problem is made more difficult by the fact that the top 10 list is being continuously updated, so somehow you need to be calculating your top 10 over a sliding window.

Any ideas?

Best Answer

Frequency Estimation Overview

There are some well-known algorithms that can provide frequency estimates for such a stream using a fixed amount of storage. One is Frequent, by Misra and Gries (1982). From a list of n items, it finds all items that occur more than n / k times, using k - 1 counters. This is a generalization of Boyer and Moore's Majority algorithm (Fischer-Salzberg, 1982), where k is 2. Manku and Motwani's LossyCounting (2002) and Metwally's SpaceSaving (2005) algorithms have similar space requirements, but can provide more accurate estimates under certain conditions.

The important thing to remember is that these algorithms can only provide frequency estimates. Specifically, the Misra-Gries estimate can under-count the actual frequency of an item by up to n / k occurrences.

Suppose that you had an algorithm that could positively identify an item only if it occurs more than 50% of the time. Feed this algorithm a stream of N distinct items, and then add another N - 1 copies of one item, x, for a total of 2N - 1 items. If the algorithm tells you that x exceeds 50% of the total, it must have been in the first stream; if it doesn't, x wasn't in the initial stream. In order for the algorithm to make this determination, it must store the initial stream (or some summary proportional to its length)! So, we can prove to ourselves that the space required by such an "exact" algorithm would be Ω(N).

Instead, the frequency algorithms described here provide an estimate, identifying any item that exceeds the threshold, along with some items that fall below it by a certain margin. For example, the Majority algorithm, using a single counter, will always give a result; if any item exceeds 50% of the stream, it will be found. But it might also give you an item that occurs only once. You wouldn't know without making a second pass over the data (using, again, a single counter, but looking only for that item).
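As a rough illustration of the Majority algorithm and its verifying second pass, here is a minimal Python sketch. The function names `majority_candidate` and `is_majority` are made up for this example, and the second pass assumes the data can be replayed from a list.

```python
from typing import Hashable, Iterable, Optional


def majority_candidate(stream: Iterable[Hashable]) -> Optional[Hashable]:
    """First pass: return the only item that could be a strict majority."""
    candidate, count = None, 0
    for item in stream:
        if count == 0:
            # No current candidate: adopt this item.
            candidate, count = item, 1
        elif item == candidate:
            count += 1
        else:
            # A different item cancels one vote for the candidate.
            count -= 1
    return candidate


def is_majority(items: list, candidate: Hashable) -> bool:
    """Second pass: count the candidate exactly and check it exceeds 50%."""
    occurrences = sum(1 for item in items if item == candidate)
    return 2 * occurrences > len(items)
```

With `["a", "b", "c"]` the first pass still returns a candidate (`"c"`) even though no item has a majority, which is why the second pass is needed.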

The Frequent Algorithm

Here's a simple description of Misra-Gries' Frequent algorithm. Demaine (2002) and others have optimized the algorithm, but this gives you the gist.

Specify the threshold fraction, 1 / k; any item that occurs more than n / k times will be found. Create an empty map (like a red-black tree); the keys will be search terms, and the values will be a counter for that term.

Look at each item in the stream.
If the term exists in the map, increment the associated counter.
Otherwise, if the map has fewer than k - 1 entries, add the term to the map with a counter of one.
However, if the map has k - 1 entries already, decrement the counter in every entry. If any counter reaches zero during this process, remove it from the map.

Note that you can process an infinite amount of data with a fixed amount of storage (just the fixed-size map). The amount of storage required depends only on the threshold of interest, and the size of the stream does not matter.
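The steps above can be sketched in Python roughly as follows. This is only an illustration: it uses a plain `dict` instead of a red-black tree, and the function name `frequent` is an assumption of this example, not something from the original papers.

```python
from typing import Dict, Hashable, Iterable


def frequent(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Return candidate counters for items occurring more than n / k times."""
    if k < 2:
        raise ValueError("k must be at least 2")
    counters: Dict[Hashable, int] = {}
    for term in stream:
        if term in counters:
            counters[term] += 1
        elif len(counters) < k - 1:
            counters[term] = 1
        else:
            # Map is full: decrement every counter and drop those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The returned counters are lower bounds on the true counts, and the map may contain terms that do not actually exceed the threshold, so the result should be read as a candidate list.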

Counting Searches

In this context, perhaps you buffer one hour of searches, and perform this process on that hour's data. If you can take a second pass over this hour's search log, you can get an exact count of occurrences of the top "candidates" identified in the first pass. Or, maybe it's okay to make a single pass, and report all the candidates, knowing that any item that should be there is included, and any extras are just noise that will disappear in the next hour.

Any candidates that really do exceed the threshold of interest get stored as a summary. Keep a month's worth of these summaries, throwing away the oldest each hour, and you would have a good approximation of the most common search terms.
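One possible way to hold a month of hourly summaries is a bounded queue that drops the oldest entry automatically. The sketch below assumes a fixed 30-day month and that each hour's candidates arrive as (term, count) pairs; the class name `SlidingTopTerms` and its methods are hypothetical.

```python
from collections import Counter, deque
from typing import Deque, Iterable, List, Tuple

HOURS_PER_MONTH = 30 * 24  # assumption: a "month" is a fixed 30-day window


class SlidingTopTerms:
    """Keeps one small summary per hour and merges them on demand."""

    def __init__(self, window_hours: int = HOURS_PER_MONTH) -> None:
        # deque(maxlen=...) silently discards the oldest summary when full.
        self.summaries: Deque[Counter] = deque(maxlen=window_hours)

    def add_hour(self, candidate_counts: Iterable[Tuple[str, int]]) -> None:
        """Store the (term, count) pairs that survived this hour's filtering."""
        self.summaries.append(Counter(dict(candidate_counts)))

    def top(self, n: int = 10) -> List[Tuple[str, int]]:
        """Merge all summaries in the window and return the n largest totals."""
        total: Counter = Counter()
        for summary in self.summaries:
            total.update(summary)
        return total.most_common(n)
```

Because each summary only holds that hour's candidates, the merged totals are approximate: a term that narrowly misses the threshold in many hours contributes nothing.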

Gray Codes

An issue that you will come across is, of course, memory, and pretty quickly: you'll have problems by 20 elements in your set, since ²⁰C₃ = 1140. If you want to iterate over the set, it's best to use a modified Gray code algorithm so you aren't holding all of the combinations in memory. These generate the next combination from the previous one and avoid repetitions. There are many of these for different uses: do we want to maximize the differences between successive combinations, minimize them, et cetera?
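To make the idea of deriving each combination from the previous one concrete, here is a small Python sketch. Note that it steps through plain lexicographic order, not a Gray code, so successive combinations may differ in more than one element; `next_combination` is a name chosen for this example.

```python
from math import comb
from typing import List, Optional


def next_combination(current: List[int], n: int) -> Optional[List[int]]:
    """Return the lexicographic successor of a sorted k-subset of range(n)."""
    k = len(current)
    nxt = list(current)
    # Find the rightmost position that can still be increased.
    i = k - 1
    while i >= 0 and nxt[i] == n - k + i:
        i -= 1
    if i < 0:
        return None  # current was the last combination
    nxt[i] += 1
    # Reset everything to the right to the smallest valid values.
    for j in range(i + 1, k):
        nxt[j] = nxt[j - 1] + 1
    return nxt
```

Starting from `[0, 1, 2]` with `n = 20` and calling this repeatedly visits all `comb(20, 3)` combinations while holding only one of them in memory at a time.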

Some of the original papers describing gray codes:

Here are some other papers covering the topic:

An Efficient Implementation of the Eades, Hickey, Read Adjacent Interchange Combination Generation Algorithm (PDF, with code in Pascal)
Combination Generators
Survey of Combinatorial Gray Codes (PostScript)
An Algorithm for Gray Codes

Chase's Twiddle (algorithm)

Phillip J Chase, `Algorithm 382: Combinations of M out of N Objects' (1970)

The algorithm in C...

Index of Combinations in Lexicographical Order (Buckles Algorithm 515)

You can also reference a combination by its index (in lexicographical order). Realizing that the index should be some amount of change from right to left based on the index we can construct something that should recover a combination.

So, we have a set {1,2,3,4,5,6} and we want three elements. Take {1,2,3}: the differences between consecutive elements are all one, so the elements are in order and minimal. {1,2,4} has one change and is lexicographically number 2. So the number of 'changes' in the last place accounts for one change in the lexicographical ordering. The second place, as in {1,3,4}, also has one change, but it accounts for more change in the ordering since it's in the second place (proportional to the number of elements in the original set).

The method I've described is a deconstruction, as it seems, from the set to the index; we need to do the reverse, which is much trickier. This is how Buckles solves the problem. I wrote some C to compute them, with minor changes: I used the index of the sets rather than a number range to represent the set, so we are always working from 0...n. Note:

Since combinations are unordered, {1,3,2} = {1,2,3} --we order them to be lexicographical.
This method has an implicit 0 to start the set for the first difference.
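The "deconstruction" direction (from a combination to its lexicographic index) can be sketched in Python as below. This is not Buckles's algorithm, which goes the other way; it is only meant to illustrate how a change in an earlier position accounts for more of the index. It uses 0-based elements and a 0-based index, and `lex_rank` is a hypothetical name.

```python
from itertools import combinations
from math import comb
from typing import Sequence


def lex_rank(combo: Sequence[int], n: int) -> int:
    """0-based lexicographic index of a sorted k-subset of range(n)."""
    k = len(combo)
    rank = 0
    previous = -1
    for position, value in enumerate(combo):
        # Count the combinations that share the prefix so far but have
        # a smaller value at this position.
        for skipped in range(previous + 1, value):
            rank += comb(n - 1 - skipped, k - 1 - position)
        previous = value
    return rank
```

For subsets of size 3 from `range(6)`, `(0, 1, 3)` gets index 1 while `(0, 2, 3)` gets index 4: one step in the second place skips four combinations, whereas one step in the last place skips only one.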

Index of Combinations in Lexicographical Order (McCaffrey)

There is another way: its concept is easier to grasp and program, but it's without the optimizations of Buckles. Fortunately, it also does not produce duplicate combinations:

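The set $x_k \ldots x_1 \in \mathbb{N}$ that maximizes $i = C(x_1, k) + C(x_2, k-1) + \dots + C(x_k, 1)$, where $C(n, r) = \binom{n}{r}$.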

For example: 27 = C(6,4) + C(5,3) + C(2,2) + C(1,1). So, the 27th lexicographical combination of four things is {1,2,5,6}; those are the indexes of whatever set you want to look at. The example below (OCaml) requires a choose function, left to the reader:

(* this will find the [x] combination of a [set] list when taking [k] elements *)
let combination_maccaffery set k x =
    (* maximize: starting from a, return the largest c <= a      *)
    (* such that choose c b <= x                                 *)
    let rec maximize a b x =
        if (choose a b ) <= x then a else maximize (a-1) b x
    in
    let rec iterate n x i = match i with
        | 0 -> []
        | i ->
            let max = maximize n i x in
            max :: iterate n (x - (choose max i)) (i-1)
    in
    if x < 0 then failwith "errors" else
    let idxs =  iterate (List.length set) x k in
    List.map (List.nth set) (List.sort (-) idxs)
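For readers who prefer Python, the greedy decomposition used above might look like the following sketch. It relies on `math.comb` (Python 3.8+) in place of the `choose` function, returns the raw indexes rather than mapping them onto a set, and takes an upper bound `n` as a parameter; the name `combinadic` is an assumption of this example.

```python
from math import comb
from typing import List


def combinadic(x: int, k: int, n: int) -> List[int]:
    """Greedy decomposition x = C(c_1, k) + C(c_2, k-1) + ... + C(c_k, 1)."""
    if x < 0:
        raise ValueError("x must be non-negative")
    parts = []
    for size in range(k, 0, -1):
        # Largest candidate <= n whose binomial coefficient still fits into x.
        candidate = n
        while comb(candidate, size) > x:
            candidate -= 1
        parts.append(candidate)
        x -= comb(candidate, size)
    return parts
```

`combinadic(27, 4, 7)` reproduces the decomposition from the example, `[6, 5, 2, 1]`. For `0 <= x < comb(n, k)` every returned index is below `n` and the indexes are strictly decreasing.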

A small and simple combinations iterator

The following two algorithms are provided for didactic purposes. They implement an iterator and (a more general) folder over all combinations. They are as fast as possible, having the complexity O(ⁿC_k). The memory consumption is bounded by k.

We will start with the iterator, which will call a user-provided function for each combination:

let iter_combs n k f =
  let rec iter v s j =
    if j = k then f v
    else for i = s to n - 1 do iter (i::v) (i+1) (j+1) done in
  iter [] 0 0

A more general version will call the user-provided function along with the state variable, starting from the initial state. Since we need to pass the state from one call to the next, we won't use the for-loop, but will use recursion instead:

let fold_combs n k f x =
  let rec loop i s c x =
    if i < n then
      loop (i+1) s c @@
      let c = i::c and s = s + 1 and i = i + 1 in
      if s < k then loop i s c x else f c x
    else x in
  loop 0 0 [] x

.net – the best algorithm for overriding GetHashCode

I usually go with something like the implementation given in Josh Bloch's fabulous Effective Java. It's fast and creates a pretty good hash which is unlikely to cause collisions. Pick two different prime numbers, e.g. 17 and 23, and do:

public override int GetHashCode()
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = 17;
        // Suitable nullity checks etc, of course :)
        hash = hash * 23 + field1.GetHashCode();
        hash = hash * 23 + field2.GetHashCode();
        hash = hash * 23 + field3.GetHashCode();
        return hash;
    }
}

As noted in comments, you may find it's better to pick a large prime to multiply by instead. Apparently 486187739 is good... and although most examples I've seen with small numbers tend to use primes, there are at least similar algorithms where non-prime numbers are often used. In the not-quite-FNV example later, for example, I've used numbers which apparently work well - but the initial value isn't a prime. (The multiplication constant is prime though. I don't know quite how important that is.)

This is better than the common practice of XORing hashcodes for two main reasons. Suppose we have a type with two int fields:

XorHash(x, x) == XorHash(y, y) == 0 for all x, y
XorHash(x, y) == XorHash(y, x) for all x, y
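The two weaknesses listed above can be checked with a small Python simulation of 32-bit integer arithmetic. This is only an illustration: `to_int32`, `xor_hash` and `multiply_add_hash` are made-up helpers, and the field values are used directly as their own hash codes, as is the case for int in .NET.

```python
def to_int32(value: int) -> int:
    """Wrap an integer to a signed 32-bit value, like C# unchecked int math."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value


def xor_hash(a: int, b: int) -> int:
    """Combine two field hashes by XOR alone."""
    return to_int32(a ^ b)


def multiply_add_hash(a: int, b: int) -> int:
    """Combine two field hashes with the 17 / 23 multiply-and-add scheme."""
    h = 17
    h = to_int32(h * 23 + a)
    h = to_int32(h * 23 + b)
    return h
```

Here `xor_hash(5, 5)` and `xor_hash(9, 9)` are both 0, and `xor_hash(1, 2)` equals `xor_hash(2, 1)`, while `multiply_add_hash` gives different values in both cases.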

By the way, the earlier algorithm is the one currently used by the C# compiler for anonymous types.

This page gives quite a few options. I think for most cases the above is "good enough" and it's incredibly easy to remember and get right. The FNV alternative is similarly simple, but uses different constants and XOR instead of ADD as a combining operation. It looks something like the code below, but the normal FNV algorithm operates on individual bytes, so this would require modifying to perform one iteration per byte, instead of per 32-bit hash value. FNV is also designed for variable lengths of data, whereas the way we're using it here is always for the same number of field values. Comments on this answer suggest that the code here doesn't actually work as well (in the sample case tested) as the addition approach above.

// Note: Not quite FNV!
public override int GetHashCode()
{
    unchecked // Overflow is fine, just wrap
    {
        int hash = (int) 2166136261;
        // Suitable nullity checks etc, of course :)
        hash = (hash * 16777619) ^ field1.GetHashCode();
        hash = (hash * 16777619) ^ field2.GetHashCode();
        hash = (hash * 16777619) ^ field3.GetHashCode();
        return hash;
    }
}
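For comparison, a byte-at-a-time variant could look like this Python sketch. It follows the FNV-1 order (multiply, then XOR) used in the code above and feeds each 32-bit field hash in as four little-endian bytes; the byte order and the helper name `hash_fields` are assumptions made for this example.

```python
FNV_OFFSET_BASIS_32 = 2166136261
FNV_PRIME_32 = 16777619


def fnv1_32(data: bytes) -> int:
    """FNV-1: multiply by the prime, then XOR in the next byte."""
    h = FNV_OFFSET_BASIS_32
    for byte in data:
        h = (h * FNV_PRIME_32) & 0xFFFFFFFF  # keep the state at 32 bits
        h ^= byte
    return h


def hash_fields(*field_hashes: int) -> int:
    """Feed each 32-bit field hash through FNV-1 one byte at a time."""
    data = b"".join(
        (f & 0xFFFFFFFF).to_bytes(4, "little") for f in field_hashes
    )
    return fnv1_32(data)
```

The result is an unsigned 32-bit value; it would need wrapping to a signed int before being returned from something like GetHashCode.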

One thing to be aware of is that, ideally, you should prevent your equality-sensitive (and thus hashcode-sensitive) state from changing after adding the object to a collection that depends on the hash code.

As per the documentation:

You can override GetHashCode for immutable reference types. In general, for mutable reference types, you should override GetHashCode only if:

You can compute the hash code from fields that are not mutable; or

You can ensure that the hash code of a mutable object does not change while the object is contained in a collection that relies on its hash code.

The link to the FNV article is broken but here is a copy in the Internet Archive: Eternally Confuzzled - The Art of Hashing