A colleague and I have recently argued over whether a pure regex is capable of fully encapsulating the csv format, such that it is capable of parsing all files with any given escape char, quote char, and separator char.
The regex need not be capable of changing these chars after creation, but it must not fail on any other edge case.
I have argued that this is impossible for just a tokenizer. The only regex that might be able to do this is a very complex PCRE style that moves beyond just tokenizing.
I am looking for something along the lines of:
… the csv format is a context free grammar and as such, it is impossible to parse with regex alone …
Or am I wrong? Is it possible to parse csv with just a POSIX regex?
For example, if both the escape char and the quote char are ", then these two lines are valid csv:
"""this is a test.""",""
"and he said,""What will be, will be."", to which I replied, ""Surely not!""","moving on to the next field here..."
Best Answer
Nice in theory, terrible in practice
By CSV I'm going to assume you mean the convention as described in RFC 4180.
While matching basic CSV data is trivial:
Note: BTW, it's a lot more efficient to use .split('\n') followed by a per-line .split(',') for very simple and well-structured data like this. Regular expression engines work as an NDFSM (Non-Deterministic Finite State Machine) that wastes a lot of time backtracking once you start adding edge cases like escape chars.
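As a rough illustration of the split-based approach mentioned in the note, here is a minimal sketch. The helper name `parseSimpleCsv` and its `separator` parameter are made up for this example, and it assumes that no field contains a quote, a separator, or a line break.

```javascript
// Hypothetical helper: only correct for "simple" data, i.e. no quoted
// fields, no escaped characters, and no line breaks inside values.
function parseSimpleCsv(text, separator = ',') {
  return text
    .split('\n')                            // one entry per line
    .filter((line) => line.length > 0)      // drop the empty entry after a trailing newline
    .map((line) => line.split(separator));  // one entry per field
}

const simpleRows = parseSimpleCsv('id,name\n1,alice\n2,bob\n');
console.log(simpleRows);
```

As soon as a value is quoted or contains the separator, this breaks, which is where the edge cases discussed in the rest of the answer come in.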
For example here's the most comprehensive regular expression matching string I've found:
It reasonably handles single and double quoted values, but not newlines in values, escaped quotes, etc.
Source: Stack Overflow - How can I parse a string with JavaScript
It becomes a nightmare once the common edge cases are introduced, like...
The newline-as-value edge case alone is enough to break 99.9999% of the RegEx-based parsers found in the wild. The only 'reasonable' alternative is to use RegEx matching for basic control/non-control character (i.e. terminal vs non-terminal) tokenization, paired with a state machine for the higher-level analysis.
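To make the "tokenize with a regex, analyze with a state machine" idea concrete, here is a minimal sketch. It is not the jquery-CSV implementation: the function name `parseCsv` and its options are hypothetical. It assumes single-character quote and separator, and that the escape char equals the quote char (the doubled-quote convention of RFC 4180).

```javascript
// Sketch only: the regex splits the input into control tokens (quote,
// separator, line break) and runs of ordinary text; a small state machine
// decides what each token means. All names here are hypothetical.
function parseCsv(text, { separator = ',', quote = '"' } = {}) {
  // Escape characters that are special in a regex (including inside a class).
  const escapeRe = (c) => c.replace(/[.*+?^${}()|[\]\\-]/g, '\\$&');
  const q = escapeRe(quote);
  const s = escapeRe(separator);
  const tokenRe = new RegExp(`${q}|${s}|\\r\\n|\\n|\\r|[^${q}${s}\\r\\n]+`, 'g');

  const rows = [];
  let row = [];
  let field = '';
  let dirty = false;       // true while the current row has unflushed tokens
  let state = 'unquoted';  // 'unquoted' | 'quoted' | 'quoteSeen'
  const endField = () => { row.push(field); field = ''; };
  const endRow = () => { endField(); rows.push(row); row = []; dirty = false; };

  for (const [token] of text.matchAll(tokenRe)) {
    dirty = true;
    const isNewline = token === '\n' || token === '\r' || token === '\r\n';
    if (state === 'quoted') {
      // Inside quotes, everything except a quote is literal data,
      // including separators and line breaks.
      if (token === quote) state = 'quoteSeen';
      else field += token;
    } else if (token === quote) {
      // A quote directly after a closing quote is an escaped quote.
      if (state === 'quoteSeen') field += quote;
      state = 'quoted';
    } else {
      state = 'unquoted';
      if (token === separator) endField();
      else if (isNewline) endRow();
      else field += token;
    }
  }
  // Flush a last row that has no trailing line break. An unterminated
  // quote is tolerated silently here; a real parser should report it.
  if (dirty) endRow();
  return rows;
}

console.log(parseCsv('"""this is a test.""",""'));
console.log(parseCsv('"line one\nline two",x\n'));
```

The regex never has to understand nesting or context. It only classifies characters, so the newline-as-value case reduces to "a line break token seen while in the quoted state is data".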
Source: Experience otherwise known as extensive pain and suffering.
I am the author of jquery-CSV, the only JavaScript-based, fully RFC-compliant CSV parser in the world. I have spent months tackling this problem, speaking with many intelligent people, and trying a ton of different implementations, including 3 full rewrites of the core parser engine.
tl;dr - Moral of the story: PCRE alone sucks for parsing anything but the most simple and strict regular (i.e. Type-III) grammars. It is, however, useful for tokenizing terminal and non-terminal strings.