Finding common prefixes for a set of strings

algorithmscomputer sciencedynamic programmingtrie

I am trying to find common prefixes for a sorted set of strings. i.e. if the following strings are given:

AB123456
AB123457
ABCDEFGH
ABCDEFGX1
ABCDEFGY
XXXX

then my function should return three prefixes and their suffixes:

AB12345  6,7
ABCDEFG  H,X1,Y
XXXX     (no suffixes)

Some background information: I'm trying to compress a large amount of sorted strings. A traditional implementation of prefix compression would just store the difference of each string to the previous string. This does not allow fast random inserts or lookups, because all the previous strings have to be decompressed first. That's why I want to find common prefixes. Each string will just store the difference to this common prefix. Then I get fast random access for the cost of a few additional bytes (compared to the traditional implementation).

I'm not yet having a really good idea how to implement this. In my dreams I imagine a window which slides over the input stream, trying to find the best result. This smells like dynamic programming, a topic that I haven't been in touch with since university (long long ago).

If, however, computing the "best" result turns out to be extremely performance intensive I am willing to use the second-best result. Performance is important.

EDIT:
After reading the first few answers, I understand that my question maybe was not precise enough. Maybe I can rephrase it a bit:

I am looking for the lowest cost (= minimum space utilization) in a graph. The graph starts with a sorted set of unique strings. (They are not compressed and therefore require the maximum space.) I now want to find common prefixes of the strings, so that the utilized space can be reduced. There should be just one level of prefixes, i.e. no prefix hierarchy (as it would be assumed in a trie).

Best Answer

You want to store strings in compressed form (to save space, I guess), but you want fast lookup, is this right? If I were you, I would go for the speed and use a trie (for the first few characters). It has O(log n) lookup, and it will automatically condense common prefixes.

A lot depends on the statistics of the strings, like how many there are, and their typical length.

ADDED: For the strings you gave, the trie would look like this:

- A B - 1 2 3 4 5 - 6 .
  |     |           |
  |     |           7 .
  |     |           
  |     C D E F G - H .
  |                 |
  |                 X 1 .
  |                 |
  |                 Y .
  |
  X X X X .

Each node of the trie contains a little "dictionary" of "words", initially 1 letter long, and each "word" points to a sub-node. If that sub-node contains only one "word" in its own "dictionary", then that "word" can be absorbed into its parent's "word", and that's how you build up the prefixes.

Related Topic