Regex – Fuzzy regular expressions

fuzzy-comparison perl regex

I am looking for a way to do a fuzzy match using regular expressions. I'd like to use Perl, but if someone can recommend another way to do this, that would be helpful too.

As an example, I want to match a string on the words "New York" preceded by a 2-digit number. The difficulty is that the text comes from OCR of a PDF, so I want to do a fuzzy match. I'd like to match:

12 New York
24 Hew York
33 New Yobk

and other "close" matches (in the sense of the Levenshtein distance), but not:

aa New York
11 Detroit

Obviously, I will need to specify the allowable distance ("fuzziness") for the match.

As I understand it, I cannot use the String::Approx Perl module to do this, because I need to include a regular expression in my match (to match the preceding digits).
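
As I understand it, amatch() takes a literal pattern plus a list of modifiers (such as a maximum number of edits), not a regular expression, so something like this (untested) has no way to express the digit prefix:

    use String::Approx 'amatch';

    my @lines = ('12 New York', '24 Hew York', '11 Detroit');

    # Fuzzy-match the literal string "New York" with up to 2 edits;
    # there is no way to say "preceded by a 2-digit number" here.
    my @close = amatch('New York', ['2'], @lines);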

Also, I should note that this is a very simplified example of what I'm really trying to match, so I'm not looking for a brute-force approach.

Edited to add:

Okay, my first example was too simple. I didn't mean for people to get hung up on the preceding digits — sorry about the bad example. Here's a better example. Consider this string:

ASSIGNOR, BY MESHS ASSIGN1IBNTS, TO ALUSCHALME&S MANOTAC/rURINGCOMPANY, A COBPOBATlOH OF DELAY/ABE.

What this actually says is:

ASSIGNOR, BY MESNE ASSIGNMENTS, TO ALLIS-CHALMERS MANUFACTURING COMPANY, A CORPORATION OF DELAWARE

What I need to do is extract the phrases "ALUSCHALME&S MANOTAC/rURINGCOMPANY" and "DELAY/ABE". (I realize this might seem like madness. But I'm an optimist.) In general, the pattern will look something like this:

/Assignor(, by mesne assignments,)? to (company name), a corporation of (state)/i

where the matching is fuzzy.

Best Answer

If you have one pattern and you want to find its best match in a collection of texts, you can try q-gram distance. It is quite easy to implement and to adapt to special needs.

Your second description was actually helpful here, because the pattern and the texts should be fairly long. q-gram distance does not work well with short words like "York", but if your typical pattern is a whole address, it should be fine.

Try it like this (a rough Perl sketch follows the list):

  • transform your texts and patterns into a reduced character set: for example uppercase-only, whitespace collapsed to a single space between words, and all other symbols replaced by "#" or something similar.
  • choose a q-gram length to work with; try 3 or 2. We'll call this q, say q=3.
  • then, build a q-gram profile of each text:
  • split each text into its q-grams, i.e. NEW_YORK becomes [NEW, EW_, W_Y, _YO, YOR, ORK], and store this away with each text.
  • when you then search for your pattern, do the same with the pattern,
  • loop through your q-gram database of texts and
    • count, for each pattern/text pair, how many q-grams they share.
    • each hit raises the score by 1.
  • the texts with the highest score(s) are your best hits.
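
If you want to see it in code, here is a rough sketch in Perl of the basic scheme, assuming everything fits in memory; the helper names (normalize, profile, score) are just for illustration:

    use strict;
    use warnings;

    my $q = 3;    # q-gram length

    # Reduce texts and patterns to a common character set:
    # uppercase, one space between words, all other symbols as '#'.
    sub normalize {
        my ($s) = @_;
        $s = uc $s;
        $s =~ s/\s+/ /g;
        $s =~ s/[^A-Z0-9 ]/#/g;
        return $s;
    }

    # Build a q-gram profile: a hash from q-gram to its count.
    sub profile {
        my ($s) = @_;
        my %p;
        $p{ substr($s, $_, $q) }++ for 0 .. length($s) - $q;
        return \%p;
    }

    # Score one pattern against one text: +1 for every shared q-gram.
    sub score {
        my ($pat, $txt) = @_;
        return scalar grep { exists $txt->{$_} } keys %$pat;
    }

    my @texts = (
        'ASSIGNOR, BY MESHS ASSIGN1IBNTS, TO ALUSCHALME&S MANOTAC/rURINGCOMPANY, A COBPOBATlOH OF DELAY/ABE.',
        '11 Detroit',
    );
    my @profiles = map { profile(normalize($_)) } @texts;

    my $pattern = profile(normalize(
        'Assignor, by mesne assignments, to Allis-Chalmers Manufacturing Company, a corporation of Delaware'
    ));

    # Rank the texts by score; the first entry is the best hit.
    my @ranked = sort { score($pattern, $profiles[$b]) <=> score($pattern, $profiles[$a]) } 0 .. $#texts;
    printf "%-40.40s score %d\n", $texts[$_], score($pattern, $profiles[$_]) for @ranked;
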

If that works, you can tweak the algorithm further (a sketch of two of these tweaks follows the list):

  • pad all your texts (and the pattern, before searching) with q-1 special characters at the start and end, so that even short words get a decent profile. For example, NEW YORK becomes ^^NEW YORK$$.
  • You can also play around with replacing all consonants with "x" and all vowels with "o", and so on. Experiment with a few character classes this way, or even create super-symbols by replacing groups of characters with a single one, e.g. CK becomes K, or SCH becomes $.
  • when a q-gram hit raises the score, you can weight that increment of 1 by other factors, such as the length difference between the text and the pattern.
  • store both 2-grams and 3-grams and, when counting, weight them differently.
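
For example, the padding and the mixed 2-/3-gram weighting could look roughly like this; the inputs are assumed to be already normalized as above, and the helper names and weights are just examples:

    # Pad with q-1 special characters so even short words yield q-grams.
    sub pad {
        my ($s, $q) = @_;
        return ('^' x ($q - 1)) . $s . ('$' x ($q - 1));
    }

    # Profile for an explicit q, so 2-grams and 3-grams can live side by side.
    sub profile_q {
        my ($s, $q) = @_;
        my %p;
        $p{ substr($s, $_, $q) }++ for 0 .. length($s) - $q;
        return \%p;
    }

    # Mix 2-gram and 3-gram hits, weighting the 3-gram hits higher.
    sub weighted_score {
        my ($pattern, $text) = @_;
        my $score = 0;
        for my $q (2, 3) {
            my $weight = $q == 3 ? 2 : 1;    # arbitrary example weights
            my $pp = profile_q(pad($pattern, $q), $q);
            my $tp = profile_q(pad($text,    $q), $q);
            $score += $weight * grep { exists $tp->{$_} } keys %$pp;
        }
        return $score;
    }
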

Note that this algorithm, in the basic form described here, does not have a good running time during search, i.e. O(|T|*|P|) (with |T| and |P| the total lengths of your texts and your pattern), because it loops over all your texts and, for each one, over the pattern's q-grams. So it is only practical for a medium-sized text base. With some thought you can build an advanced index structure over the q-grams (for example using hash tables), which can make it practical for huge text bases as well (see the sketch below).
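
One way to read that last suggestion: build an inverted index from each q-gram to the IDs of the texts containing it, so a search only has to touch texts that share at least one q-gram with the pattern. A possible sketch, reusing @texts and the normalize() and profile() helpers from the first sketch:

    # Inverted index: q-gram => array of IDs of texts that contain it.
    my %index;
    for my $id (0 .. $#texts) {
        my $prof = profile(normalize($texts[$id]));
        push @{ $index{$_} }, $id for keys %$prof;
    }

    # Search touches only the texts that share q-grams with the pattern.
    sub search {
        my ($pattern) = @_;
        my %score;
        my $pp = profile(normalize($pattern));
        for my $gram (keys %$pp) {
            $score{$_}++ for @{ $index{$gram} || [] };
        }
        # Text IDs, best score first.
        return sort { $score{$b} <=> $score{$a} } keys %score;
    }

    my @hits = search('Assignor, by mesne assignments, to Allis-Chalmers');
    print "$texts[$_]\n" for @hits;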