Merge directed acyclic graphs minimizing number of nodes

algorithmsdata structuresgraph

I have some DAGs (directed acylic graphs) and I want to merge them in order to minimize the number of nodes (we could say that every node has a cost, while edges are free).

These four different DAGs (directed from left to right)…

a-b-c
a-d-c
a-c
c-a

…should become:

 /---\
a--b--c-a
 \-d-/

This is not a real DAWG (directed acyclic word graph): I don't want to store information like "is 'adc' included?". My structure could only answer to this question: "it would be possible that 'adc' was one of the words?".

Is there an algorithm for this purpose?

Update (12/15/14) – Levenshtein distance

I tried something different: I used Levenshtein distance to find the minimum number of edits required in order to transform a string into another (node=character and chain=sequence of nodes/characters=word). My algorithm ignores deletion, and insert characters instead of replacing them. Here's the interesting part (Python code):

current = words[0]
for word in words[1:]:
    edit = editops(current, word)
    customEdit = [('insert', s, d) for op, s, d in edit if op != 'delete']
    current = apply_edit(customEdit, current, word)

Sometimes there are unneeded characters, so I remove them at the end of the process.
If I change the words order I get different results, so I run my code many times shuffling words in order to find a shorter string (shuffling seems to provide better result with fewer iterations than permutations, even if words are sorted by length).

If every character is a node, and every word is a DAG, I can easily get a good approximation of the DAG I'm looking for.

The main problem with this approach is that I don't know how the best result will look like so I don't know when to stop (I can't check every result of my permutation: it will take too much time!).

Here's my code (tested with Python 2; python-Levenshtein is needed). The output looks like this:

ldoarmilpesouimtr (17)
iapmdsoeluiortem (16)
diolposarmieutm (15)
^C
Original size: 22
Compressed: 15

diolposarmieutm (lorem)
diolposarmieutm (ipsum)
diolposarmieutm (dolor)
diolposarmieutm (sit)
diolposarmieutm (amet)

Is it a good way to solve this problem? What it could be improved? Do you know if it's possible to know the minimum number of nodes/characters needed in order to stop the algorithm when I get the optimum?

Best Answer

I think your structure is the line graph of the minimal DAWG. I have generated these before, three years ago, by building the minimal DAWG, from that its line graph, then minimising the line graph. I searched the literature and Google extensively and never found this last step. I concluded that the DAWG was more useful, generally, but the line DAWG was better for display to those unaccustomed to DAWGs. There is another name for a DAWG used in natural language processing, a Word Something, the something elludes me right now.

Your adc example suggests a DAWG where everyone node has a # edge to the sink.