Best way to handle delimited files

delimited filesfile handlingfile structure

So typically a CSV file uses a comma and the return character as it's field and line delimiters.

This runs into obvious problems with text which can contain both these characters.

Obviously there are options there (escaping them) but how do people handle this? Use different characters – pipes or tildas? Escape them? Not use delimited files, after all it's 2010 and we have XML now?

Looking at least effort for a decent chance of not seeing problems.

(Just to be clear, this is a question out of curiosity rather than something more solid – it's something I've come up against time and time again playing with data, always got round it but it normally feels a bit, well, dirty, and wondered what other people's experience was).

Best Answer

According to Wikipedia:

Fields with embedded commas must be enclosed within double-quote characters.

And furthermore:

Fields with embedded double-quote characters must be enclosed within double-quote characters, and each of the embedded double-quote characters must be represented by a pair of double-quote characters.

I don't know who invented that, but it effectively shows that eventually you have to escape. It's the only solid solution. Everything else is just duct tape on top of duct tape: maybe works for now, but eventually you'll bump on a case where you need an exception to the exception of an exception, and it doesn't take long before your mudball of rules is way more complex than a simple escape character solution had been.

It seems that CSV creators first tried to avoid escaping commas by coming up with double-quoted special syntax, which allowed saving commas, but then someone wanted to save double-quote characters too, so they had to escape at that point - funnily using the double-quote as an escape character. Had they decided to escape properly in the first place, the syntax would be simpler now.

Related Topic