Given that the worst case is O(N), there are some very nice micro-optimizations.
The naive method performs a character comparison and an end-of-text comparison for each character.
Using a sentinel (i.e. a copy of the target character at the end of the text) reduces the number of comparisons to 1 per character.
At the bit-twiddling level there is:

#define haszero(v) ( ((v) - 0x01010101UL) & ~(v) & 0x80808080UL )
#define hasvalue(x, n) ( haszero((x) ^ (~0UL / 255 * (n))) )

to know whether any byte in a word (x) has a specific value (n).
The sub-expression v - 0x01010101UL evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80.
The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v does not have its high bit set (so the byte was less than 0x80).
ANDing these two sub-expressions (haszero) yields the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second (April 27, 1987 by Alan Mycroft).
Now we can XOR the value to test (x) with a word that has been filled with the byte value we're interested in (n). Because XORing a value with itself yields zero, every byte of the result is zero exactly where the corresponding byte of x equals n, and nonzero otherwise, so we can pass the result to haszero.
This is often used in a typical strchr implementation. (Stephen M Bennet suggested this on December 13, 2009; further details in the well-known Bit Twiddling Hacks collection.)
PS: regarding the comment that "this code is broken for any combination of 1111's next to a 0":
The hack passes the brute force test (just be patient):
#include <cstdint>
#include <iostream>
#include <limits>

bool haszero(std::uint32_t v)
{
    return (v - std::uint32_t(0x01010101)) & ~v & std::uint32_t(0x80808080);
}

bool hasvalue(std::uint32_t x, unsigned char n)
{
    return haszero(x ^ (~std::uint32_t(0) / 255 * n));
}

// Reference implementation: check each of the four bytes explicitly.
bool hasvalue_slow(std::uint32_t x, unsigned char n)
{
    for (unsigned i(0); i < 32; i += 8)
        if (((x >> i) & 0xFF) == n)
            return true;
    return false;
}

int main()
{
    const std::uint64_t stop(std::numeric_limits<std::uint32_t>::max());
    for (unsigned c(0); c < 256; ++c)
    {
        std::cout << "Testing " << c << std::endl;
        for (std::uint64_t w(0); w <= stop; ++w)  // <=, so 0xFFFFFFFF is tested too
        {
            if (w && w % 100000000 == 0)
                std::cout << w * 100 / stop << "%\r" << std::flush;
            const bool h(hasvalue(w, c));
            const bool hs(hasvalue_slow(w, c));
            if (h != hs)
                std::cerr << "hasvalue(" << w << ',' << c << ") is " << h << '\n';
        }
    }
    return 0;
}
Lots of upvotes for an answer which assumes one character = one byte, which nowadays is no longer the norm
Thank you for the remark.
The answer was meant to be anything but an essay on multi-byte / variable-width encodings :-) (in all fairness that's not my area of expertise and I'm not sure it's what the OP was looking for).
Anyway, it seems to me that the above ideas/tricks could somewhat be adapted to multi-byte encodings (especially self-synchronizing ones):
- as noted in Johan's comment the hack can 'easily' be extended to work for double bytes or anything (of course you cannot stretch it too much);
- a typical function that locates a character in a multibyte character string can proceed along the same lines;
- the sentinel technique can be used with a little foresight.
I would also suggest pre-processing your database into a trie; let me extend the corresponding comment given by Christophe:
Take a look at the Aho-Corasick algorithm for the construction of such a trie. The general idea is to have a tree-based data structure that you can follow.
In your case, let's consider you also had the database entry A B C E. You start examining the trie with A (because that's the first character in your query string). There is such a node, because your valid patterns include A B and A B C E. You then look at the next character of your query, B, and find a first complete match, A B. Since you need the longest match, you of course want to continue. The node you reached via A B is the same node that continues on to form A B C E. So you continue with C, and then when you want to continue with D, there is no valid pattern left with a prefix of A B C D.
First off: you could just repeat this process starting at B, and so on for every letter of your query. That will give you the correct answer via the final maximum over the lengths of all found patterns. It's not very efficient, though, since you can imagine running over the same substrings over and over.
Enter the magic of the shortcut links: your trie contains a node representing the prefix A B C (from the A B C E pattern), and your query matched that prefix. You then know that the query minus the exhausted A starts with B C. You also know in advance that B C D E has B C as a prefix. So your trie can be given a shortcut from the node A B C to the B C node. You do not need to start over, but can continue directly with D from your query. Then E, and you have matched B C D E. Since there is no longer match available for F, you rinse and repeat.
Again, the shortcut from the B C D E node to the C D E node allows you to recognize C D E F G within two more steps. By now you have discovered A B, B C D E and C D E F G while looking at the input characters A B C D E F G only once.
For completeness, after you matched C D E F G, you would jump directly to matching F G. Since there is nothing starting with D E F G and nothing starting with E F G, the best available shortcut (with the limited example words you gave) is F G, so the shortcut points to that node.
Special note: you may have to take special care to recognize sub-pattern words. For example, consider the database containing A B C D E F G and C D E, and the input A B C D E F H. You will end up in the A B C D E F node with no more shortcut available, so you start from the beginning with H, but meanwhile you have to take care of having matched C D E along the way.
This approach, however, requires you to pre-process your entire database up front. On the plus side, you get very fast linear searches (linear in several parameters; you can read up on the details on the linked Wikipedia page).
Best Answer
This is somewhat of an XY answer but given you started with
It seems text search itself is a good, practical answer to your problem. The basic way of detecting plagiarisms would be the following:
This approach has several advantages over trying to diff strings:
If you think about Turn-it-in, their approach must be similar to this as they
The value that Turn-it-in and similar can add over just setting up a system like this yourself (which honestly would not be too hard) is