The most unusual aspect of Python is that whitespace is significant: instead of block delimiters (the braces "{}" of the C family of languages), indentation indicates where blocks begin and end.
How can the Python interpreter recognize code blocks?
Python – How Python Interpreter Recognizes Code Blocks
Tags: interpreters, language-design, python
Related Solutions
Your question (as your final paragraph hints) is not really about the lexer; it is about the correct design of the interface between the lexer and the parser. As you might imagine, there are many books about the design of lexers and parsers. I happen to like the parser book by Dick Grune, but it may not be a good introductory book. I happen to intensely dislike the C-based book by Appel, because the code is not usefully extensible into your own compiler (because of the memory management issues inherent in the decision to pretend C is like ML). My own introduction was the book by PJ Brown, but it's not a good general introduction (though quite good for interpreters specifically). But back to your question.
The answer is, do as much as you can in the lexer without needing to use forward- or backward-looking constraints.
This means that (depending of course on the details of the language) you should recognise a string as a '"' character, followed by a sequence of not-'"' characters, and then another '"' character. Return that to the parser as a single unit. There are several reasons for this, but the important ones are:
- This reduces the amount of state the parser needs to maintain, limiting its memory consumption.
- This allows the lexer implementation to concentrate on recognising the fundamental building blocks and frees the parser up to describe how the individual syntactic elements are used to build a program.
Very often parsers can take immediate actions on receiving a token from the lexer. For example, as soon as IDENTIFIER is received, the parser can perform a symbol table lookup to find out if the symbol is already known. If your parser also parses string constants as QUOTE (IDENTIFIER SPACES)* QUOTE you will perform a lot of irrelevant symbol table lookups, or you will end up hoisting the symbol table lookups higher up the parser's tree of syntax elements, because you can only do it at the point you're now sure you are not looking at a string.
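As a hypothetical sketch of that parser-side action (the token names and symbol-table shape are invented for illustration): because string contents arrive as one STRING token, the words inside the quotes never reach the symbol-table code at all.

```python
# Hypothetical parser-side action, assuming the lexer has already merged
# each string literal into a single STRING token.
symbol_table = {}

def on_token(kind, text):
    if kind == "IDENTIFIER":
        # Safe to look up immediately: this cannot be a word inside a string.
        symbol_table.setdefault(text, {"declared": False})
    # STRING arrives as one token, so the words it contains trigger no lookups

on_token("IDENTIFIER", "x")
on_token("STRING", '"hello world"')  # neither "hello" nor "world" is looked up
print(symbol_table)                  # {'x': {'declared': False}}
```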
To restate what I'm trying to say, but differently, the lexer should be concerned with the spelling of things, and the parser with the structure of things.
You might notice that my description of what a string looks like seems a lot like a regular expression. This is no coincidence. Lexical analysers are frequently implemented in little languages (in the sense of Jon Bentley's excellent Programming Pearls book) which use regular expressions. I'm just used to thinking in terms of regular expressions when recognising text.
Regarding your question about whitespace, recognise it in the lexer. If your language is intended to be pretty free-format, don't return WHITESPACE tokens to the parser, because it will only have to throw them away; otherwise your parser's production rules will essentially be spammed with noise: things to recognise just to throw away.
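Putting these points together (strings recognised as single tokens, whitespace consumed in the lexer and never passed on), here is a minimal regex-based lexer sketch in Python; the token names and the toy grammar are invented for illustration:

```python
import re

# A minimal lexer sketch; token names and this tiny grammar are invented,
# not taken from any particular language.
TOKEN_SPEC = [
    ("STRING",     r'"[^"]*"'),         # a ", a run of not-", another "
    ("NUMBER",     r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OP",         r"[=+\-*/]"),
    ("WHITESPACE", r"\s+"),             # recognised here, never emitted
]
MASTER = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))

def tokens(text):
    for m in MASTER.finditer(text):
        if m.lastgroup != "WHITESPACE":  # free-format: discard whitespace
            yield (m.lastgroup, m.group())

print(list(tokens('x = "hello world" + 42')))
# [('IDENTIFIER', 'x'), ('OP', '='), ('STRING', '"hello world"'),
#  ('OP', '+'), ('NUMBER', '42')]
```

Note how the string literal comes out as one STRING token and the whitespace never reaches the caller; the parser downstream only ever sees meaningful units.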
As for what that means about how you should handle whitespace when it is syntactically significant, I'm not sure I can make a judgment for you that will really work well without knowing more about your language. My snap judgment is to avoid cases where whitespace is sometimes important and sometimes not, and use some kind of delimiter (like quotes). But, if you can't design the language any which way you prefer, this option may not be available to you.
There are other ways to design language parsing systems. Certainly there are compiler construction systems that allow you to specify a combined lexer and parser system (I think the Java version of ANTLR does this) but I have never used one.
Last a historical note. Decades ago, it was important for the lexer to do as much as possible before handing over to the parser, because the two programs would not fit in memory at the same time. Doing more in the lexer left more memory available to make the parser smart. I used to use the Whitesmiths C Compiler for a number of years, and if I understand correctly, it would operate in only 64KB of RAM (it was a small-model MS-DOS program) and even so it translated a variant of C that was very very close to ANSI C.
At the risk of giving a "me too" answer, if you try it you'll see...
If you study computer languages, you're likely to get the impression that it's at least half about parsing. If you learn Lisp, you'll realize parsing the surface syntax is nothing more than a convenience for people (like most of us) who don't like Lots of Irritating Single Parentheses.
Then you could realize that a big price was paid for that convenience. In Lisp it is trivial for a program to build another program and execute it. In other languages it is an advanced technique, like doing multiplication in Roman numerals.
Of course, nearly everybody would ask "who needs to do that?" Well, you could very well see that it opens up a whole vista of things you never even realized you couldn't do before. You can do it in other languages, but not nearly so easily.
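For what it's worth, here is roughly what runtime program construction looks like in Python, the language of the original question; it works, but note how it goes through source text and a compiler rather than manipulating the program as data directly:

```python
# Runtime program construction in Python: build source text, compile it,
# execute it. Workable, but a far cry from Lisp, where the "source" is an
# ordinary list you manipulate directly.
source = "\n".join(f"print({n} * {n})" for n in range(1, 4))
program = compile(source, "<generated>", "exec")
exec(program)  # prints 1, 4, 9
```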
INSERTED to answer Izkata's comment:
- The SHRDLU natural-language-understanding program worked by translating an English statement or question into a program in a Lisp dialect called MICRO-PLANNER and executing it.
- Programs that manipulate programs, for example to simplify them or prove them correct are naturally written in Lisp.
- I used program generation in a program to comprehend visual scenes, where it had to deal with all the symmetries possible in 3-dimensional objects, without multiplying the code.
- Anything having to do with logic and theorem-proving deals in manipulating logical expressions, which are a form of program.
- Symbolic math, such as symbolic integral or differential calculus, involves manipulating math expressions, which are like miniature programs.
- Any problem involving code generation, or the more highbrow term "partial evaluation", is natural in Lisp. I did this for a database bridge program a long time ago. I did it in C, which was not as easy as Lisp, but I got the idea from Lisp. That was regarded as a technique that almost nobody could do at that time (especially the COBOL-heads). Maybe more now, I hope.
... that's just a few ...
Then you realize some things that are considered "modern" today have been old-hat in Lisp for 40-some years. Like functional programming. Like garbage collection. Like closures.
That's not to say modern languages don't have new good ideas, like OOP, etc. But if you learn Lisp it will broaden your perspective.
Best Answer
The Python documentation, in the Lexical Analysis section, describes briefly how the indentation parsing works. In short, the tokeniser generates special INDENT and DEDENT tokens that are used by the parser when deciding where blocks of code start and end. These tokens (roughly) correspond to the "{" and "}" tokens in C-like languages.
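You can watch those tokens being generated with the standard library's tokenize module, for example:

```python
import io
import tokenize

src = "if x:\n    y = 1\nz = 2\n"

# generate_tokens takes a readline callable and yields TokenInfo tuples.
for tok in tokenize.generate_tokens(io.StringIO(src).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))

# Among the output you will see INDENT just before "y = 1" and a matching
# DEDENT just before "z = 2": exactly the tokens the parser uses to find
# block boundaries.
```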