Algorithms – Finding the Shortest Common Superstring

algorithmsstrings

Given some string fragments, I would like to find the shortest possible single string ("output string") that contains all the fragments. Fragments can overlap each other in the output string.

Example:

For the string fragments:

BCDA
AGF
ABC

The following output string contains all fragments, and was made by naive appending:

BCDAAGFABC

However this output string is better (shorter), as it employs overlaps:

ABCDAGF
^
ABC
 ^
 BCDA
    ^ 
    AGF

I'm looking for algorithms for this problem. It's not absolutely important to find the strictly shortest output string, but the shorter the better. I'm looking for an algorithm better than the obvious naive one that would try appending all permutations of the input fragments and removing overlaps (which would appear to be NP-Complete).

I've started work on a solution and it's proving quite interesting; I'd like to see what other people might come up with. I'll add my work-in-progress to this question in a while.

Best Answer

What you're asking about is the Shortest Common Superstring problem, for which there is no algorithm that works for all cases. But it is a common problem (in compression and DNA sequencing) and several approximation algorithms are well-known.

"Greedy" algorithms are generally accepted to be the most effective (as in, they have the least-bad worst-case).

Have a read of the paper Approximation Algorithms for the Shortest Common Superstring Problem by Jonathan Turner for much more information.

Related Topic