Algorithms – Best Algorithm for String Similarity

algorithmsstring-matching

I am designing a plugin to uniquely identify content on various web pages, based on addresses.

So I may have one address which looks like:

1 someawesome street, anytown, F100 211

later I may find this address in a slightly different format.

1 someawesome street, F100 211,

or perhaps as vague as

someawesome street F100

These are technically the same address, but with a level of similarity. I would like to a) generate a unique identifier for each address to perform lookups, and b) figure out when a very similar address shows up.

What algorithms / techniques / String metrics should I be looking at? Levenshtein distance seems like an obvious choice, but curious if there's any other approaches that would lend themselves here.

Best Answer

Levenstein's algorithm is based on the number of insertions, deletions, and substitutions in strings.

Unfortunately it doesn't take into account a common misspelling which is the transposition of 2 chars (e.g. someawesome vs someaewsome). So I'd prefer the more robust Damerau-Levenstein algorithm.

I don't think it's a good idea to apply the distance on whole strings because the time increases abruptly with the length of the strings compared. But even worse, when address components, like ZIP are removed, completely different addresses may match better (measured using online Levenshtein calculator):

1 someawesome street, anytown, F100 211       (reference) 
1 someawesome st.,anytown                     (difference of 15, same address)     
1 otherplaces street,anytown,F100211          (difference of 13, different ddress) 
1 sameawesome street, othertown, CA98200      (difference of 13, different ddress)
anytown, 1 someawesome street                 (28 different same address)
anytown, F100 211, 1 someawesome street       (37 different same address)

These effects tend to worsen for shorter street name.

So you'd better use smarter algorithms. For example, Arthur Ratz published on CodeProject an algorithm for smart text comparison. The algorithm doesn't print out a distance (it can certainly be enriched accordingly), but it identifies some difficult things such as moving of text blocks (e.g. the swap between town and street between my first example and my last example).

If such an algorithm is too general for your case, you should then really work by components and compare only comparable components. This is not an easy thing if you want to parse any address format in the world. But if the target is more specific, say US, it is certainly feasible. For example, "street", "st.", "place", "plazza", and their usual misspellings could reveal the street part of the address, the leading part of which would in principle be the number. The ZIP code would help to locate the town, or alternatively it is probably the last element of the address, or if you don't like guessing, you could look for a list of city names (e.g. downloading a free zip code database). You could then apply Damerau-Levenshtein on the relevant components only.

Related Solutions

Ny good search algorithm for a single character

It being understood that the worst case is O(N), there are some very nice micro-optimizations.

The naive method performs a character comparison and an end-of-text comparison for each character.

Using a sentinel (i.e. a copy of the target character at the end of the text) reduces the number of comparisons to 1 per character.

At the bit twiddling level there is:

#define haszero(v)      ( ((v) - 0x01010101UL) & ~(v) & 0x80808080UL )
#define hasvalue(x, n)  ( haszero((x) ^ (~0UL / 255 * (n))) )

to know if any byte in a word (x) has a specific value (n).

The subexpression v - 0x01010101UL, evaluates to a high bit set in any byte whenever the corresponding byte in v is zero or greater than 0x80.

The sub-expression ~v & 0x80808080UL evaluates to high bits set in bytes where the byte of v doesn't have its high bit set (so the byte was less than 0x80).

By ANDing these two sub-expressions (haszero) the result is the high bits set where the bytes in v were zero, since the high bits set due to a value greater than 0x80 in the first sub-expression are masked off by the second (April 27, 1987 by Alan Mycroft).

Now we can XOR the value to test (x) with a word that has been filled with the byte value in which we're interested (n). Because XORing a value with itself results in a zero byte and nonzero otherwise, we can pass the result to haszero.

This is often used in a typical strchr implementation.

(Stephen M Bennet suggested this on December 13, 2009. Further details in the well known Bit Twiddling Hacks).

this code is broken for any combination of 1111's next to a 0

The hack passes the brute force test (just be patient):

#include <iostream>
#include <limits>

bool haszero(std::uint32_t v)
{
  return (v - std::uint32_t(0x01010101)) & ~v & std::uint32_t(0x80808080);
}

bool hasvalue(std::uint32_t x, unsigned char n)
{
  return haszero(x ^ (~std::uint32_t(0) / 255 * n));
}

bool hasvalue_slow(std::uint32_t x, unsigned char n)
{
  for (unsigned i(0); i < 32; i += 8)
    if (((x >> i) & 0xFF) == n)
      return true;

  return false;
}

int main()
{
  const std::uint64_t stop(std::numeric_limits<std::uint32_t>::max());

  for (unsigned c(0); c < 256; ++c)
  {
    std::cout << "Testing " << c << std::endl;

    for (std::uint64_t w(0); w != stop; ++w)
    {
      if (w && w % 100000000 == 0)
        std::cout << w * 100 / stop << "%\r" << std::flush;

      const bool h(hasvalue(w, c));
      const bool hs(hasvalue_slow(w, c));

      if (h != hs)
        std::cerr << "hasvalue(" << w << ',' << c << ") is " << h << '\n';
    }
  }

  return 0;
}

Lots of upvotes for an answer which makes the assumption one chararacter=one byte, which is nowadays not the standard any more

Thank you for the remark.

The answer was meant to be anything but an essay on multi-byte / variable-width encodings :-) (in all fairness that's not my area of expertise and I'm not sure it's what the OP was looking for).

Anyway it seems to me that the above ideas/tricks could somewhat be adapted to MBE (especially self-synchronizing encodings):

as noted in Johan's comment the hack can 'easily' be extended to work for double bytes or anything (of course you cannot stretch it too much);
a typical function that locates a character in a multibyte character string:
- contains calls to strchr / strstr (e.g. GNUlib coreutils mbschr)
- expects them to be well tuned.
the sentinel technique can be used with a little foresight.

Algorithms – Best Methods for Detecting Plagiarism

This is somewhat of an XY answer but given you started with

read a body of text and compare it to search-engine results (from searching for substrings of the given text), with the goal of detecting plagiarism in, for example, academic papers.

It seems text search itself is a good, practical answer to your problem. The basic way of detecting plagiarisms would be the following:

Start with a corpus of documents that the target document could have been plagiarized.
Create, e.g., a Lucene based inverted index over those documents (through say Solr or Elasticsearch).
Split your target document into a set of phrases (e.g. by breaking off each sentence / sub-sentence / every n words).
Search your corpus for each phrase. You will return a (possibly empty) set of documents that that phrase could have been plagiarized from (and the location(s) in each document it was possibly taken from).
Collect all of these potential instances of plagiarism. If this exceeds more than a small threshold of phrases, alarm the target as probably being plagiarized.

This approach has several advantages over trying to diff strings:

It allows you to pinpoint exactly what in the target document might have been plagiarized and where it could have come from. This will allow humans reviewing the output to have visibility and make intelligent decisions on the output.
A good indexing solution will buy you the ability to work around misspellings and different stop words / tiny differences in phrasing.
A good indexing solution will scale very well.
Having a self-managed corpus will behave much better than searching the internet. The internet is such a wild and unruly place that you are likely to get spurious matches and miss out on important matches. That is, Google may catch students copying from Wikipedia, but it is also liable to falsely accuse people of copying from random blogs if you are not very, very careful. It is also liable to miss things like ArXiv papers in the field, essays students can buy from shady websites, past essays written from other students, that are very realistic sources of plagiarism.

If you think about Turn-it-in, their approach must be similar to this as they

Tell you where the essay could have been plagiarized
Can include past-papers / non-wiki & co. sourcing.

The value that Turn-it-in and similar can add over just setting up a system like this yourself (which honestly would not be too hard) is

Size and quality of their reference corpus
Development time of their UI
Tuning of their indexing and searching
Sophistication in how they determine phrases and their thresholds for likely plagiarism.

Best Answer

Related Solutions

Ny good search algorithm for a single character

Algorithms – Best Methods for Detecting Plagiarism

Related Topic