General Address Parser for Freeform Text

algorithmgisparsingstreet-address

We have a program that displays map data (think Google Maps, but with much more interactivity and custom layers for our clients).

We allow navigation via a set of combo boxes that prefill certain fields with a bunch of data (ie: Country: Canada, the Province field is filled in. Select Ontario, and a list of Counties/Regions is filled in. Select a county/region, and a city is filled in, etc…).

While this guarantees accurate addresses, it's a pain for the users if they don't know where a street address or a city are located (ie, which county/region is kitchener in?).

So we are looking at trying to do an address parser with a freeform text field.

The user could enter something like this (similar to Google Maps, Bing Maps, etc…):
22 Main St, Kitchener, On

And we could compartmentalize it into sections and do lookups on the data and get to the point they are looking for (or suggest alternatives).

The problem with this is that how do we properly compartmentalize information? How do we break up the sections and find possible matches? I'm guessing we wouldn't be guaranteed that the user would enter data in a format we always expected (obviously). A follow up to this would be how to present the data if we don't find an exact match (or find multiple exact matches… two cities with the same street name in different counties, for example).

We have a ton of data available in the mapping data (mapinfo tab format mostly). So we can do quick scans of street names, cities, states, etc. But I'm not sure about the best way to go about approaching this problem. Sure, using Google Maps would be nice, bue most of our clients are in closed in networks where outside access is not usually allowed and most aren't willing to rely on google maps (since it doesn't contain as much information as they need, such as custom map layers). They could, obviously, go to google and get the proper location then move to our software, but this would time consuming and speed of the process can be quite important.

Best Answer

This is essentially a class of the Named Entity Resolution problem. NER on Wikipedia

The best way to approach this is to parse the address using a language transducer to identify various constructs - an approach is similar to using regular expressions with a finite state machine.

I've had great success with the Java NLP and Machine learning framework called GATE, and their transducer lib is called Jape. Check out their GUI, and use that to write some Java code for it!

Their built in examples should get you started with the basics, and you can then extend it as needed. Essentially, it compartmentalizes text into components using the rules and the rule engine, so something like,

Xyz, Blah St,
Foo City, 11110, CA

would be translated to,

Place: Xyz
Street: Blah St
City: Foo
...

And then you can use your database of locations to do matches.

Jape also supports dictionary lookups, apart from rules - so if you already have "Blah St" in your database, and it has 2 parents - city Foo and Bar - you just disambiguate by parsing the next line.

Edit: GATE includes a tool called ANNIE - an information extraction system, that can be played around with to identify addresses. This uses some built in Jape rules that you can build upon.