File Handling – Reasons to Open Files in Text Mode

Tags: file handling, parsing, portability

(Almost-)POSIX-compliant operating systems and Windows distinguish between 'binary mode' and 'text mode' file I/O. Binary mode passes data between the actual file or stream and the application untouched, while text mode 'translates' the contents to a standard format in a platform-specific manner: in C, native line endings are transparently mapped to '\n', and some platforms (CP/M, DOS and Windows) treat a byte with value 0x1A (Ctrl-Z) as end-of-file.
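
For concreteness, a minimal sketch of the two modes in standard C (the file name example.txt is just a placeholder):

    #include <stdio.h>

    int main(void)
    {
        /* Text mode: on Windows, "\r\n" in the file arrives in the
         * program as '\n', and a 0x1A (Ctrl-Z) byte is treated as
         * end-of-file.  On POSIX systems both modes behave the same. */
        FILE *text = fopen("example.txt", "r");

        /* Binary mode ("b"): bytes pass through untranslated on
         * every platform. */
        FILE *bin = fopen("example.txt", "rb");

        if (text) fclose(text);
        if (bin)  fclose(bin);
        return 0;
    }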

These transformations seem a little useless to me.

People share files between computers running different operating systems. Because text mode handles the same bytes differently from platform to platform, an application for which this matters would probably use binary mode instead.

As an example: Windows ends a line with the sequence CR LF, but UNIX text mode does not treat the CR as part of the line ending, so applications have to filter that noise out themselves. Classic Mac OS used a bare CR as its line ending, so neither UNIX nor Windows text mode would understand its files. If this matters, a portable application would probably implement the line parsing itself instead of relying on text mode.
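
A minimal sketch of that do-it-yourself approach, reading in binary mode and stripping any trailing CR/LF (input.txt is a placeholder name):

    #include <stdio.h>
    #include <string.h>

    /* Strip trailing '\r' and/or '\n', so "line\r\n" (Windows) and
     * "line\n" (UNIX) both become "line".  Bare-CR files (old Mac)
     * would still need a custom reader, since fgets() never sees a
     * '\n' to split on. */
    static void chomp(char *line)
    {
        size_t n = strlen(line);
        while (n > 0 && (line[n - 1] == '\n' || line[n - 1] == '\r'))
            line[--n] = '\0';
    }

    int main(void)
    {
        char buf[4096];
        FILE *f = fopen("input.txt", "rb");  /* binary: no translation */
        if (!f)
            return 1;
        while (fgets(buf, sizeof buf, f)) {
            chomp(buf);
            printf("got line: [%s]\n", buf);
        }
        fclose(f);
        return 0;
    }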

Implementing newline interpretation in the parser might also avoid some of the overhead of text mode: the C runtime has to rewrite (and possibly resize) its buffers before returning them to the application, whereas an application that parses the stream anyway can often fold the newline handling into work it is already doing.

So, my question is: is there any good reason to still rely on the host OS to translate line endings and to truncate files at Ctrl-Z?

Best Answer

Without the translation, every Unix text-processing program would have to recognize just '\n' as the end-of-line marker, and every Windows text-processing program would have to recognize '\r' followed by '\n'. (And pre-OS X Mac programs would have to recognize a bare '\r'.) And any program that writes text would have to explicitly write the local end-of-line marker, which means it would have to know which OS it's running on.

And that just covers the relatively simple cases where an end-of-line is indicated by a sequence of characters. Other schemes are still around (though less common these days): IBM mainframes and VMS, for example, use record-oriented file formats in which lines aren't delimited by characters at all.

With the translation, programs can just treat text as text, and we don't need three or more different "hello, world" programs.
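
That is exactly what the translation buys. The canonical example below writes a single '\n' and lets the C runtime emit whatever the platform expects ("\r\n" on Windows, '\n' on UNIX), so the same source works everywhere unchanged:

    #include <stdio.h>

    int main(void)
    {
        /* stdout is a text stream: the '\n' is translated to the
         * local end-of-line convention by the C runtime. */
        printf("hello, world\n");
        return 0;
    }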

This does sometimes cause problems when you need to process a foreign file (a Windows text file that's been copied to a Unix system, or vice versa). Cygwin, a Unix-like environment running under Windows, is a rich source of such issues. But usually the best solution is to translate the file before processing it. And most of the time, programs deal with text files that were created on the same OS anyway.

It's better to write one program that translates between formats than to require every program to deal with all the different formats. And inevitably someone would write a tool that understands the Unix and Windows formats but breaks when confronted with an old Mac text file, and someone else would get the interpretation just a little bit wrong because the wheel they reinvented wasn't perfectly round.
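
A minimal sketch of such a one-shot translator, normalizing CRLF (Windows), bare CR (old Mac) and LF (UNIX) to a single '\n'; on Windows, stdin and stdout would additionally have to be switched to binary mode (e.g. with _setmode), which is omitted here:

    #include <stdio.h>

    int main(void)
    {
        int c;
        while ((c = getchar()) != EOF) {
            if (c == '\r') {
                int next = getchar();
                if (next != '\n' && next != EOF)
                    ungetc(next, stdin);  /* bare CR: old Mac ending */
                putchar('\n');            /* CRLF and CR become LF */
            } else {
                putchar(c);
            }
        }
        return 0;
    }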
