Compiler Design – How ASCII Code Associations Are Stored and Retrieved


I was reading about compilers and was given an example of creating a basic compiler that recognizes escape sequences without referencing ASCII.

Somebody suggested that once I compile a piece of code with ASCII numbers, I can then recompile a different snippet of code without actually referencing the ASCII number, and can instead use the escape character explicitly.

I'm confused about how you get from the ASCII number to an actual escape sequence; there seems to be a missing step somewhere in the explanation.

Best Answer

The puzzle becomes quite a bit easier once you understand a fundamental truth about data storage in computers. The truth is this:

Characters do not exist.

Computers cannot deal with characters. Printer drivers, video outputs, teletypes, etc. can generate the shapes of letters and thereby provide the necessary illusion that someone is writing something for you to read, but it is an illusion nevertheless. Fundamentally, computers store numbers and nothing else. In order to manage letters, words, and texts, a common encoding has to be agreed on by all users, and it is this character encoding that associates, e.g., "A" with 65.
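
A minimal C sketch of this point (assuming an ASCII-compatible platform, where the encoding maps "A" to 65):

```c
#include <stdio.h>

int main(void) {
    /* On an ASCII system the character literal 'A' *is* the number 65;
       no separate "character" is stored anywhere. */
    char c = 'A';
    printf("as a character: %c\n", c);  /* the display convention draws an A */
    printf("as a number:    %d\n", c);  /* the very same byte, printed as 65 */
    printf("'A' == 65? %s\n", c == 65 ? "yes" : "no");
    return 0;
}
```

The same byte in memory is rendered either way; only the format specifier tells printf which illusion to produce.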

It follows that the encoding has to be chosen, implemented, and adhered to long before you write a compiler. Every program that is to print user-readable output must be able to store a 65 and rely on the convention that the OS display driver will show an "A" in its place, and so on. In other words, whatever operating system you write a compiler on already has a convention for associating numbers with characters, and the newly written compiler will simply use the same convention that all existing programs do.

In the case of escape sequences, it's often something like "a \n should correspond to a newline character, a \r to a carriage return character". Those two characters are - by the already existing convention - 10 and 13, so it's enough to simply put those constants in to do the right thing. The compiler doesn't know these values any more than a text editor does. An editor accepts the number 65 (as sent by the keyboard driver when you press "A") and stores it on disk, and later it retrieves it from disk and the display driver simulates an "A" on the screen in its place. This works because the keyboard driver and the display driver agree on how to interpret the numbers in your text file.
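
As a sketch of what that looks like inside a compiler (the function translate_escape is made up for illustration, not taken from any real compiler), the escape-handling part of a lexer can be little more than a table of constants:

```c
#include <stdio.h>

/* Hypothetical lexer helper: translate the character following a
   backslash in a string literal into the numeric value the compiled
   program will store. The compiler merely emits these constants; it
   never "understands" newlines itself. */
int translate_escape(char esc) {
    switch (esc) {
    case 'n':  return 10;   /* newline, by the pre-existing ASCII convention */
    case 'r':  return 13;   /* carriage return */
    case 't':  return 9;    /* horizontal tab */
    case '\\': return 92;   /* the backslash itself */
    case '0':  return 0;    /* NUL */
    default:   return esc;  /* unknown escape: pass the character through */
    }
}

int main(void) {
    printf("\\n -> %d\n", translate_escape('n'));  /* prints: \n -> 10 */
    printf("\\r -> %d\n", translate_escape('r'));  /* prints: \r -> 13 */
    return 0;
}
```

Note that nothing here requires the compiler to "know" about newlines as such; the constants 10 and 13 were baked in by whoever wrote this function, following the convention already in force on the platform.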

It's the same with a compiler and the language runtime. Programming with characters works because the compiler accepting your source code and the runtime and I/O hardware that run your executable agree on how to map characters to numbers and back. If you port a compiler to a different platform with a different character encoding (rare today, but a common task when compilers were invented), this will always involve at least some hackery to adapt this mapping. (This is one of the times when you realize just how lucky we are that such fundamental issues have largely been standardized and our generation can usually forget about them - but the issues are still there, and compiler writers need to understand them.)
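
To make the porting point concrete, here is a small illustrative fragment comparing ASCII and EBCDIC values for a few characters (EBCDIC being the classic example of a different encoding; a real conversion table covers all 256 codes):

```c
#include <stdio.h>

/* The same source character has a different numeric value on the
   target platform; a ported compiler must carry a mapping like this. */
struct mapping { char ch; int ascii; int ebcdic; };

static const struct mapping table[] = {
    { 'A', 65, 193 },
    { 'B', 66, 194 },
    { '0', 48, 240 },
    { ' ', 32,  64 },
};

int main(void) {
    for (size_t i = 0; i < sizeof table / sizeof table[0]; i++)
        printf("'%c': ASCII %3d, EBCDIC %3d\n",
               table[i].ch, table[i].ascii, table[i].ebcdic);
    return 0;
}
```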