Python Strings – Complex String Matching with FuzzyWuzzy

pythonstrings

I'm attempting to write a process that matches obscure strings to a single 'master string' for further processing. I have a lot of data that looks something like this:

Basketball
Basket Ball
Football
BasketBallR
BBall
BBall - r
FootB

…and so on. These need to be mapped to a master record like so:

Basketball       = Basket Ball, BBall
Basketball - R   = BasketBallR, BBall - r

I also have instances of data resembling this format:

Football -r
FootBall - r-g/H,Q,HH

These situations need to be separated into different categories before being mapped. For example FootBall - r-g/H,Q,HH should be:

Football - r
Football - g
Football - H
Football - Q
Football - HH

At this point, it still needs to be mapped to a master record…

I've tried several different combinations of fuzzywuzzy matching methods, Levenshtein Distance measurements, regex, etc. and can't seem to find a reliable method to logically associate different naming styles of a single item with a master name.

I'm throwing my hands up in desperation. Are there any existing python resources than can help sort out my problem? Are there other options? Can anybody point out an obvious option that I might have overlooked?

Basically, any suggestion, solution, resource or alternative method is greatly appreciated.

Best Answer

Fortunately, I'm not one to give up easily. Through reaching out to other sources/communities I found Google Refine which (amazingly) completely resolves my matching issues 90% of the time, and leaves the remaining 10% incredibly manageable and easy to manually resolve. I hope this might help other people facing similar issues.

Related Topic