Domain-specific language for text search/processing

dsl, regular expressions, text processing

I work for an organization that does a lot of work with government data. On a couple of different projects we've abstracted common text search/manipulation operations into reusable libraries, for things like standardizing the way politicians' names are displayed (e.g., transforming "MCDONALD, BOB (R-VA)" into "Bob McDonald (R-VA)") or finding legal citations in text (e.g., finding occurrences of things like "1 U.S.C. 7", determining that it's a US Code citation, and returning a structure that says it refers to section 7 of title 1). These are relatively simple operations, and lots of collaborators in our space would like to use them, but we end up having to pick a language in which to implement each (the former is in Python; the latter, JavaScript). That freezes out potential consumers/contributors who work in other languages and don't want to resort to hacks like shelling out to a Node process to handle their text. This all seems like a shame, because what we're expressing is so simple and ought, one would think, to be pretty easy to share.
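To make the two operations concrete, here is a rough Python sketch of what they boil down to. The function names, the regular expressions, and the "Mc" casing rule are all made up for illustration; they are not the actual APIs or patterns from our libraries, and real names and citations have many more edge cases. The citation is read in the usual title-then-section order.

```python
import re

# Hypothetical pattern for "<title> U.S.C. <section>", e.g. "1 U.S.C. 7".
USC_PATTERN = re.compile(r"\b(\d+)\s+U\.S\.C\.\s+(\d+[A-Za-z]*)")

# Hypothetical pattern for "LAST, FIRST (P-ST)", e.g. "MCDONALD, BOB (R-VA)".
NAME_PATTERN = re.compile(r"^\s*([^,]+),\s*(.+?)\s*\(([A-Z])-([A-Z]{2})\)\s*$")


def find_usc_citations(text):
    """Return one dict per US Code citation found in the text."""
    return [
        {"type": "usc", "title": m.group(1), "section": m.group(2)}
        for m in USC_PATTERN.finditer(text)
    ]


def _fix_case(word):
    """Capitalize a word, with a naive special case for 'Mc' prefixes."""
    word = word.capitalize()
    if word.startswith("Mc") and len(word) > 2:
        word = "Mc" + word[2:].capitalize()
    return word


def standardize_name(raw):
    """Turn 'LAST, FIRST (P-ST)' into 'First Last (P-ST)'.

    Input that does not match the expected shape is returned unchanged.
    """
    m = NAME_PATTERN.match(raw)
    if m is None:
        return raw
    last, first, party, state = m.groups()
    first = " ".join(_fix_case(w) for w in first.split())
    last = " ".join(_fix_case(w) for w in last.split())
    return f"{first} {last} ({party}-{state})"
```

Both functions are just a regex plus a little restructuring of the captured groups, which is why it feels like they should be expressible in something language-neutral.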

What would be ideal is a tiny DSL that could express a few basic text processing operations: regular expression search/replace, a few list-processing operations like map and filter, and the ability to store results in JSON-ish data structures (maps and lists). It would also need a mechanism to either translate the DSL into, or allow it to be consumed from, the higher-level languages we and our collaborators actually want to work with (Python, JS, Ruby, and PHP are probably the main ones). Does anything like this exist?

I've considered building one myself… maybe a declarative thing on top of something like YAML, or maybe a tiny subset of Scheme or Lua, or maybe something entirely invented for this purpose. But I wanted to see if anything was already out there first.
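For what it's worth, the declarative option might look something like the sketch below: the rules live in a plain data file, and each host language ships a small interpreter for it. Everything here is hypothetical (the rule format, the `pattern` and `fields` keys, and the `compile_rules` / `apply_rules` names). I've used JSON rather than YAML only so the example runs with the Python standard library, and numbered capture groups rather than named ones because named-group syntax differs between regex engines.

```python
import json
import re

# Hypothetical rule file. In practice this would be a separate .json (or
# YAML) document shared by the Python, JS, Ruby, and PHP interpreters.
RULES_JSON = r'''
{
  "usc": {
    "pattern": "(\\d+)\\s+U\\.S\\.C\\.\\s+(\\d+)",
    "fields": ["title", "section"]
  }
}
'''


def compile_rules(spec):
    """Compile each rule's pattern once; keep its field names alongside."""
    return {
        name: (re.compile(rule["pattern"]), rule["fields"])
        for name, rule in spec.items()
    }


def apply_rules(rules, text):
    """Run every rule over the text and return a list of JSON-ish records."""
    results = []
    for name, (pattern, fields) in rules.items():
        for m in pattern.finditer(text):
            record = {"type": name}
            # Pair the n-th field name with the n-th capture group.
            record.update(zip(fields, m.groups()))
            results.append(record)
    return results


rules = compile_rules(json.loads(RULES_JSON))
citations = apply_rules(rules, "See 1 U.S.C. 7 and 5 U.S.C. 552.")
```

The obvious weakness is that this only stays portable as long as the patterns stick to the regex features all the target engines share, and anything beyond "match and label the groups" (like the name-casing rules) would need more vocabulary in the rule format.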

Best Answer

The best language I'm aware of specific to text search and processing is awk. If awk doesn't meet your needs, it's likely nothing will unless you create it yourself.

However, if you do need to make your own, you don't need to start completely from scratch for each language. You can use a parser generator like ANTLR, which can generate parsers in several target languages from a single grammar, or build the implementation in one language and call it from the others through each language's native (foreign-function) interface.
