Finding common prefixes for a set of strings

algorithms, computer science, dynamic programming, trie

I am trying to find common prefixes for a sorted set of strings. For example, if the following strings are given:

AB123456
AB123457
ABCDEFGH
ABCDEFGX1
ABCDEFGY
XXXX

then my function should return three prefixes and their suffixes:

AB12345  6,7
ABCDEFG  H,X1,Y
XXXX     (no suffixes)

Some background information: I'm trying to compress a large number of sorted strings. A traditional implementation of prefix compression would just store the difference between each string and the previous string. This does not allow fast random inserts or lookups, because all the previous strings have to be decompressed first. That's why I want to find common prefixes. Each string will store only its difference from this common prefix. Then I get fast random access at the cost of a few additional bytes (compared to the traditional implementation).
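To make the trade-off concrete, here is a minimal Python sketch of the traditional scheme described above (often called front coding). The function names `front_encode` and `front_decode_at` are made up for illustration, and the (shared length, suffix) tuple layout is an assumption, not a fixed format.

```python
def front_encode(sorted_strings):
    """Store each string as (chars shared with the previous string, remaining suffix)."""
    encoded = []
    prev = ""
    for s in sorted_strings:
        k = 0
        while k < min(len(prev), len(s)) and prev[k] == s[k]:
            k += 1
        encoded.append((k, s[k:]))
        prev = s
    return encoded


def front_decode_at(encoded, index):
    """Random access: every earlier entry has to be replayed first."""
    current = ""
    for shared, suffix in encoded[: index + 1]:
        current = current[:shared] + suffix
    return current


words = ["AB123456", "AB123457", "ABCDEFGH", "ABCDEFGX1", "ABCDEFGY", "XXXX"]
encoded = front_encode(words)
```

Reading entry 4 with `front_decode_at(encoded, 4)` walks through entries 0 to 4, which is exactly the sequential dependency the question wants to avoid.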

I don't yet have a really good idea of how to implement this. In my dreams I imagine a window that slides over the input stream, trying to find the best result. This smells like dynamic programming, a topic that I haven't been in touch with since university (long, long ago).

If, however, computing the "best" result turns out to be extremely expensive, I am willing to settle for the second-best result. Performance is important.

EDIT:
After reading the first few answers, I understand that my question may not have been precise enough. Let me rephrase it a bit:

I am looking for the lowest cost (= minimum space utilization) in a graph. The graph starts with a sorted set of unique strings. (They are not compressed and therefore require the maximum space.) I now want to find common prefixes of the strings, so that the utilized space can be reduced. There should be just one level of prefixes, i.e. no prefix hierarchy (as it would be assumed in a trie).
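One way to read this restated problem is as a dynamic program over the sorted list. The sketch below assumes that groups are runs of consecutive strings and that a group's cost is the prefix length plus the lengths of all its suffixes, plus an optional per-group overhead. The cost model, the function names, and the `group_overhead` parameter are all assumptions made for illustration. It runs in O(n^2) group evaluations, so for large inputs it would likely need a bounded window.

```python
def common_prefix_len(a, b):
    n = min(len(a), len(b))
    i = 0
    while i < n and a[i] == b[i]:
        i += 1
    return i


def best_prefix_groups(strings, group_overhead=0):
    """Split sorted strings into consecutive groups with minimum total stored size."""
    n = len(strings)
    best = [0] + [float("inf")] * n  # best[j]: minimum cost for strings[:j]
    cut = [0] * (n + 1)              # cut[j]: start index of the last group in that optimum
    for j in range(1, n + 1):
        lcp = len(strings[j - 1])    # common prefix length of the group strings[i:j]
        total_len = 0
        for i in range(j - 1, -1, -1):
            total_len += len(strings[i])
            if i < j - 1:
                # In sorted order, a group's common prefix is the minimum
                # of the common prefixes of its adjacent pairs.
                lcp = min(lcp, common_prefix_len(strings[i], strings[i + 1]))
            cost = group_overhead + lcp + total_len - (j - i) * lcp
            if best[i] + cost < best[j]:
                best[j] = best[i] + cost
                cut[j] = i
    groups = []
    j = n
    while j > 0:
        i = cut[j]
        p = common_prefix_len(strings[i], strings[j - 1])
        groups.append((strings[i][:p], [s[p:] for s in strings[i:j]]))
        j = i
    groups.reverse()
    return best[n], groups


words = ["AB123456", "AB123457", "ABCDEFGH", "ABCDEFGX1", "ABCDEFGY", "XXXX"]
total_cost, groups = best_prefix_groups(words)
```

For the six example strings this yields the three groups from the question (AB12345 with 6 and 7, ABCDEFG with H, X1 and Y, and XXXX with an empty suffix). A string that ends up alone in its group is stored entirely as its own prefix.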

Best Answer

You want to store strings in compressed form (to save space, I guess), but you want fast lookup, is this right? If I were you, I would go for the speed and use a trie (for the first few characters). Lookup time in a trie is proportional to the length of the key, regardless of how many strings are stored, and it will automatically condense common prefixes.

A lot depends on the statistics of the strings, like how many there are, and their typical length.

ADDED: For the strings you gave, the trie would look like this:

- A B - 1 2 3 4 5 - 6 .
  |     |           |
  |     |           7 .
  |     |           
  |     C D E F G - H .
  |                 |
  |                 X 1 .
  |                 |
  |                 Y .
  |
  X X X X .

Each node of the trie contains a little "dictionary" of "words", initially 1 letter long, and each "word" points to a sub-node. If that sub-node contains only one "word" in its own "dictionary", then that "word" can be absorbed into its parent's "word", and that's how you build up the prefixes.
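The absorption step can be sketched in a few lines of Python. This is only an illustration under assumptions: nodes are plain dicts, the empty-string key `END` is a made-up convention marking "a stored string ends here", and `build_trie` and `compress` are hypothetical helper names.

```python
END = ""  # assumed marker key: a stored string ends at this node


def build_trie(words):
    """Plain trie: every edge is labelled with a single character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def compress(node):
    """Absorb each sub-node that has exactly one entry into its parent's label."""
    out = {}
    for label, child in node.items():
        while label != END and len(child) == 1 and END not in child:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        out[label] = compress(child)
    return out


words = ["AB123456", "AB123457", "ABCDEFGH", "ABCDEFGX1", "ABCDEFGY", "XXXX"]
compressed = compress(build_trie(words))
```

On the example strings the compressed labels are AB, 12345, CDEFG, 6, 7, H, X1, Y and XXXX, matching the diagram above. Note that this is still a hierarchy of prefixes, not the single level of prefixes the question asks for.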

Related Solutions

Merge sort versus quick sort performance

If you look at your code for swapping, you have:

// If current element is lower than pivot
// then swap it with the element at store_index
// and move the store_index to the right.

But about 50% of the time, the string you just swapped has to be moved back again, which is why faster quicksort implementations partition from both ends at the same time.

Next, if you check that a sub-range contains more than one element before making each recursive call, you avoid wasting time calling a function only to exit it immediately. This happens 10,000,000 times in your final test, which adds a noticeable amount of time.

Use:

if (pivot_index - 1 > start) quick_sort(lines, start, pivot_index - 1);

if (pivot_index + 1 < end) quick_sort(lines, pivot_index + 1, end);

You still want an outer function to do an initial if (start < end), but that only needs to happen once, so that function can just call an unsafe version of your code without that outer comparison.
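The structure described above might look like the following Python sketch. The original code is not shown, so this assumes the store_index partitioning scheme from the quoted comments with the last element as pivot; `partition` and `_quick_sort_unchecked` are hypothetical names.

```python
def partition(lines, start, end):
    """Partition lines[start..end] around lines[end]; return the pivot's final index."""
    pivot = lines[end]
    store_index = start
    for i in range(start, end):
        if lines[i] < pivot:
            lines[i], lines[store_index] = lines[store_index], lines[i]
            store_index += 1
    lines[store_index], lines[end] = lines[end], lines[store_index]
    return store_index


def _quick_sort_unchecked(lines, start, end):
    """Requires start < end; callers are responsible for checking that."""
    pivot_index = partition(lines, start, end)
    # Recurse only into sub-ranges that hold at least two elements.
    if pivot_index - 1 > start:
        _quick_sort_unchecked(lines, start, pivot_index - 1)
    if pivot_index + 1 < end:
        _quick_sort_unchecked(lines, pivot_index + 1, end)


def quick_sort(lines):
    """Outer function: the start < end check happens exactly once, here."""
    if len(lines) > 1:
        _quick_sort_unchecked(lines, 0, len(lines) - 1)


data = ["pear", "apple", "fig", "apple", "kiwi"]
quick_sort(data)
```

The recursion here still hits the quadratic worst case on already-sorted input, which the next remark about pivot choice addresses.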

Also, picking a random pivot tends to avoid N^2 worst case results, but it's probably not a big deal with your random data set.

Finally, the hidden problem is that quick sort compares strings in ever smaller buckets whose contents are ever closer together,

(Edit: So, AAAAA, AAAAB, AAAAC, AAAAD, then AAAAA, AAAAB. So, strcmp needs to step through a lot of A's before reaching the useful parts of the strings.)

but with merge sort you look at the smallest buckets first, while they are still very random. Merge sort's final passes do compare a lot of strings that are close to each other, but it's less of an issue by then. One way to make quick sort faster for strings is to compare the first characters of the outer strings and, if they're the same, ignore them when doing the inner comparisons, but you have to be careful that all strings have enough characters so that you're not skipping past the null terminator.
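A related, well-known way to avoid re-scanning shared leading characters is multikey (three-way radix) quicksort. It is a different technique from the exact suggestion above, shown here only as a rough Python sketch. It builds new lists rather than sorting in place, and the helper name `char_at` is made up.

```python
def multikey_quicksort(strings, depth=0):
    """Sort strings that are already known to share their first `depth` characters."""
    if len(strings) <= 1:
        return strings

    def char_at(s):
        # An empty string stands in for "past the end" and sorts before any
        # character, so the end of a string is never skipped over.
        return s[depth] if depth < len(s) else ""

    pivot = char_at(strings[len(strings) // 2])
    less = [s for s in strings if char_at(s) < pivot]
    equal = [s for s in strings if char_at(s) == pivot]
    greater = [s for s in strings if char_at(s) > pivot]
    if pivot != "":
        # These strings agree at position `depth`, so later comparisons
        # can start one character further in.
        equal = multikey_quicksort(equal, depth + 1)
    return multikey_quicksort(less, depth) + equal + multikey_quicksort(greater, depth)


unsorted = ["AAAAC", "AAAAA", "AAAB", "AAAAD", "AAAAB", "AAAA", "B"]
result = multikey_quicksort(unsorted)
```

Each character position is examined once per partitioning level instead of strcmp walking the common run of A's on every comparison.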

Unicode – Efficient Trie Implementation for Unicode Strings

What are you using this trie for? What is the total number of words that you plan to hold, and what is the sparseness of their constituent characters? And most important, is a trie even appropriate (versus a simple map of prefix to list of words)?

Your idea of an intermediate table and replacing pointers with indexes will work, provided that you have a relatively small set of short words and a sparse character set. Otherwise you risk running out of space in your intermediate table. And unless you're looking at an extremely small set of words, you won't really save that much space: 2 bytes for a short versus 4 bytes for a reference on a 32-bit machine. If you're running on a 64-bit JVM, the savings will be more.

Your idea about breaking the characters into 4-bit chunks probably won't save you much, unless all of your expected characters are in an extremely limited range (maybe OK for words limited to uppercase US-ASCII, not likely with a general Unicode corpus).

If you have a sparse character set, then a HashMap<Character,Map<...>> might be your best implementation. Yes, each entry will be much larger, but if you don't have many entries you'll get an overall win. (As a side note: I always thought it was funny that the Wikipedia article on tries showed, and maybe still shows, an example based on a hashed data structure, completely ignoring the space/time tradeoffs of that choice.)
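As a rough illustration of the hashed-children idea, here is a Python analogue that uses a dict per node in place of the Java HashMap. The class and function names are invented for this sketch. One difference to keep in mind: Python iterates strings by code point, whereas a Java Character key is a UTF-16 code unit.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode; only characters in use take space
        self.is_word = False


def insert(root, word):
    node = root
    for ch in word:
        if ch not in node.children:
            node.children[ch] = TrieNode()
        node = node.children[ch]
    node.is_word = True


def contains(root, word):
    node = root
    for ch in word:
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_word


root = TrieNode()
for w in ["naïve", "日本", "日本語"]:
    insert(root, w)
```

With a sparse character set each node holds only the handful of characters that actually occur, which is where the overall win over a fixed-size child array would come from.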

Finally, you might want to avoid a trie altogether. If you're looking at a corpus of normal words in a human language (10,000 words in active use, with words 4-8 characters long), you'll probably be MUCH better off with a HashMap<String,List<String>>, where the key is the entire prefix.
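A minimal Python sketch of such a prefix map follows, with a dict of lists standing in for HashMap<String,List<String>>. The function name `build_prefix_map` and the optional `max_prefix_len` cap are assumptions added for illustration.

```python
from collections import defaultdict


def build_prefix_map(words, max_prefix_len=None):
    """Map every prefix of every word to the list of words that start with it."""
    prefix_map = defaultdict(list)
    for word in words:
        limit = len(word) if max_prefix_len is None else min(len(word), max_prefix_len)
        for k in range(1, limit + 1):
            prefix_map[word[:k]].append(word)
    return prefix_map


prefix_map = build_prefix_map(["car", "cat", "dog"])
```

A lookup is then a single hash probe on the whole prefix, for example `prefix_map.get("ca", [])`, at the price of storing each word once per prefix length.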
