Can CSV Format Be Defined by Regex?

csvparsingregular expressions

A colleague and I have recently argued over whether a pure regex is capable of fully encapsulating the csv format, such that it is capable of parsing all files with any given escape char, quote char, and separator char.

The regex need not be capable of changing these chars after creation, but it must not fail on any other edge case.

I have argued that this is impossible for just a tokenizer. The only regex that might be able to do this is a very complex PCRE style that moves beyond just tokenizing.

I am looking for something along the lines of:

… the csv format is a context free grammar and as such, it is
impossible to parse with regex alone …

Or am I wrong? Is it possible to parse csv with just a POSIX regex?

For example, if both the escape char and the quote char are ", then these two lines are valid csv:

"""this is a test.""",""
"and he said,""What will be, will be."", to which I replied, ""Surely not!""","moving on to the next field here..."

Best Answer

Nice in theory, terrible in practice

By CSV I'm going to assume you mean the convention as described in RFC 4180.

While matching basic CSV data is trivial:

"data", "more data"

Note: BTW, it's a lot more efficient to use a .split('/n').split('"') function for very simple and well-structured data like this. Regular Expressions work as a NDFSM (Non-Deterministic Finite State Machine) that wastes a lot of time backtracking once you start adding edge cases like escape chars.

For example here's the most comprehensive regular expression matching string I've found:

re_valid = r"""
# Validate a CSV string having single, double or un-quoted values.
^                                   # Anchor to start of string.
\s*                                 # Allow whitespace before value.
(?:                                 # Group for value alternatives.
  '[^'\\]*(?:\\[\S\s][^'\\]*)*'     # Either Single quoted string,
| "[^"\\]*(?:\\[\S\s][^"\\]*)*"     # or Double quoted string,
| [^,'"\s\\]*(?:\s+[^,'"\s\\]+)*    # or Non-comma, non-quote stuff.
)                                   # End group of value alternatives.
\s*                                 # Allow whitespace after value.
(?:                                 # Zero or more additional values
  ,                                 # Values separated by a comma.
  \s*                               # Allow whitespace before value.
  (?:                               # Group for value alternatives.
    '[^'\\]*(?:\\[\S\s][^'\\]*)*'   # Either Single quoted string,
  | "[^"\\]*(?:\\[\S\s][^"\\]*)*"   # or Double quoted string,
  | [^,'"\s\\]*(?:\s+[^,'"\s\\]+)*  # or Non-comma, non-quote stuff.
  )                                 # End group of value alternatives.
  \s*                               # Allow whitespace after value.
)*                                  # Zero or more additional values
$                                   # Anchor to end of string.
"""

It reasonably handles single and double quoted values, but not newlines in values, escaped quotes, etc.

Source: Stack Overflow - How can I parse a string with JavaScript

It's becomes a nightmare once the common edge-cases are introduced like...

"such as ""escaped""","data"
"values that contain /n newline chars",""
"escaped, commas, like",",these"
"un-delimited data like", this
"","empty values"
"empty trailing values",        // <- this is completely valid
                                // <- trailing newline, may or may not be included

The newline-as-value edge case alone is enough to break 99.9999% of the RegEx based parsers found in the wild. The only 'reasonable' alternative is to use RegEx matching for basic control/non-control character (ie terminal vs non-terminal) tokenization paired with a state machine used for higher level analysis.

Source: Experience otherwise known as extensive pain and suffering.

I am the author of jquery-CSV, the only javascript based, fully RFC-compliant, CSV parser in the world. I have spent months tackling this problem, speaking with many intelligent people, and trying a ton if different implementations including 3 full rewrites of the core parser engine.

tl;dr - Moral of the story, PCRE alone sucks for parsing anything but the most simple and strict regular (Ie Type-III) grammars. Albeit, it's useful for tokenizing terminal and non-terminal strings.

Related Topic