I will try to explain my objective with an example which will be easier to understand.
Suppose I have a sentence like "A B C D E F G H".(Each word seperated using single space).
I have a Database like:
B C D E
A B
F G
C D E F G
..
I need to find the maximum number of units from the database that match the sentence. The outcome should be such that "C D E F G" should be disocvered first instead of "B C D E" and after finding "C D E F G" only the remaining part of the sentence should be matched against the database. Any change/suggestion is welcomed.
The maximum match units can be anywhere in the sentence.The problem here is that I could not come up with an algorithm which can accomplish this task.
Best Answer
I would also suggest to pre-process your database into a trie and let me extend the corresponding comment given by Christophe:
Take a look at the Aho-Corasick algorithm for the construction of such a trie. The general idea is to have a tree-based data structure that you can follow.
In your case, let's consider you also had the database entry
A B C E
. You start examining the trie withA
(because that's the first character in your query string). There is such a node, because your valid patterns includeA B
andA B C E
. You then look at the next character of your query,B
, and find a first complete matchA B
. Since you need the longest, you'd want to continue of course. The node you reached viaA B
is also the same node that continues on to formA B C E
. So you continue on withC
and then when you want to continue withD
you have no valid pattern with a prefix ofA B C D
left.First off: you could just repeat this process starting at
B
and so on for every letter of your query. That'll give you the correct answer via the final maximum over the length of all found patterns. It's not very efficient though, since you can imagine running over the same substrings over and over.Enter the magic of the shortcut links: Your trie contains a node representing the prefix
A B C
(from theA B C E
pattern) and your query matched that prefix. You then know that the query minus the exhaustedA
starts withB C
. You also know in advance, thatB C D E
hasB C
as a prefix. So your trie can be given a shortcut from the nodeA B C
to theB C
node. You do not need to start over, but can continue directly withD
from your query. And thenE
and you matchedB C D E
. Since there is no longer match forF
available, you need to rinse and repeat.Again, the shortcut from the
B C D E
node to theC D E
node allows you to recognizeC D E F G
within two more steps. By now, you have discoveredA B
,B C D E
andC D E F G
while only looking at the input charactersA B C D E F G
once.For completeness, after you matched
C D E F G
, you would jump directly to matchingF G
. Since your best available shortcut (with the limited example words you gave) isF G
- there is nothing starting withD E F G
and nothing starting withE F G
, soF G
is your best bet and the shortcut points to that node.Special note: You may have to take special care to recognize sub-pattern words. For example, consider the database containing
A B C D E F G
andC D E
and your input isA B C D E F H
. You will end up in theA B C D E F
node and have no more shortcut available, so you'll start from the beginning withH
, but meanwhile you have to take care of having matchedC D E
.This approach, however, requires you to pre-process your entire database up front. On the plus side, you get very fast linear searches (linear in several parameters - you can read up the details on the linked wikipedia page).