Your question (as your final paragraph hints) is not really about the lexer; it is about the correct design of the interface between the lexer and the parser. As you might imagine, there are many books about the design of lexers and parsers. I happen to like the parser book by Dick Grune, but it may not be a good introductory book. I happen to intensely dislike the C-based book by Appel, because the code is not usefully extensible into your own compiler (because of the memory management issues inherent in the decision to pretend C is like ML). My own introduction was the book by PJ Brown, but it's not a good general introduction (though quite good for interpreters specifically). But back to your question.
The answer is, do as much as you can in the lexer without needing to use forward- or backward-looking constraints.
This means that (depending of course on the details of the language) you should recognise a string as a `"` character, followed by a sequence of not-`"` characters, and then another `"` character. Return that to the parser as a single unit. There are several reasons for this, but the important ones are:
- This reduces the amount of state the parser needs to maintain, limiting its memory consumption.
- This allows the lexer implementation to concentrate on recognising the fundamental building blocks and frees the parser up to describe how the individual syntactic elements are used to build a program.
Very often, parsers can take immediate actions on receiving a token from the lexer. For example, as soon as IDENTIFIER is received, the parser can perform a symbol table lookup to find out if the symbol is already known. If your parser instead parses string constants as QUOTE (IDENTIFIER SPACES)* QUOTE, you will perform a lot of irrelevant symbol table lookups, or you will end up hoisting the symbol table lookups higher up the parser's tree of syntax elements, because you can only do them at the point where you're sure you are not looking at a string.
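To make that concrete, here is a minimal sketch of a parser acting on each IDENTIFIER token as it arrives. All the names here (`SymbolTable`, `lookupOrInsert`, and so on) are illustrative, not from any real compiler:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: because the lexer delivers whole IDENTIFIER tokens,
// the parser can do the symbol table lookup immediately, with no need to
// wait and see whether the characters are really part of a string literal.
class SymbolTable {
    private final Map<String, Integer> symbols = new HashMap<>();
    private int nextSlot = 0;

    // Return the existing slot for a name, or intern it as a new symbol.
    int lookupOrInsert(String name) {
        return symbols.computeIfAbsent(name, n -> nextSlot++);
    }

    boolean isKnown(String name) {
        return symbols.containsKey(name);
    }
}

public class ParserActionSketch {
    public static void main(String[] args) {
        SymbolTable table = new SymbolTable();
        // Pretend these IDENTIFIER tokens arrive one at a time from the lexer.
        String[] identifierTokens = {"x", "y", "x"};
        for (String ident : identifierTokens) {
            boolean seenBefore = table.isKnown(ident);
            int slot = table.lookupOrInsert(ident);
            System.out.println(ident + " -> slot " + slot
                    + (seenBefore ? " (already known)" : " (new symbol)"));
        }
    }
}
```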
To restate what I'm trying to say, but differently, the lexer should be concerned with the spelling of things, and the parser with the structure of things.
You might notice that my description of what a string looks like seems a lot like a regular expression. This is no coincidence. Lexical analysers are frequently implemented in little languages (in the sense of Jon Bentley's excellent Programming Pearls book) which use regular expressions. I'm just used to thinking in terms of regular expressions when recognising text.
Regarding your question about whitespace, recognise it in the lexer. If your language is intended to be fairly free-format, don't return WHITESPACE tokens to the parser, which would only have to throw them away: your parser's production rules would be cluttered with noise, things recognised just to be discarded.
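Both points together can be sketched in a few lines. This toy lexer (names are illustrative, not from any real tool) returns a whole string literal as one token and never hands whitespace to the parser:

```java
import java.util.ArrayList;
import java.util.List;

// Toy lexer sketch: whitespace is consumed here and never becomes a token,
// and a string literal is recognised as one unit -- a '"', a run of
// not-'"' characters, and a closing '"'.
public class TinyLexer {
    public static List<String> lex(String input) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;                       // whitespace never reaches the parser
            } else if (c == '"') {
                int start = i++;
                while (i < input.length() && input.charAt(i) != '"') i++;
                if (i == input.length())
                    throw new IllegalArgumentException("unterminated string");
                i++;                       // consume the closing quote
                tokens.add("STRING(" + input.substring(start, i) + ")");
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) i++;
                tokens.add("IDENTIFIER(" + input.substring(start, i) + ")");
            } else {
                tokens.add("PUNCT(" + c + ")");
                i++;
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lex("print \"hello world\" ;"));
        // [IDENTIFIER(print), STRING("hello world"), PUNCT(;)]
    }
}
```

Note that the parser receives `STRING("hello world")` as a single token: it never sees the identifiers and spaces inside the quotes, which is exactly the point.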
As for how you should handle whitespace when it is syntactically significant, I'm not sure I can make a judgment for you that will really work well without knowing more about your language. My snap judgment is to avoid cases where whitespace is sometimes important and sometimes not, and to use some kind of delimiter (like quotes). But if you can't design the language however you prefer, this option may not be available to you.
There are other ways to design language parsing systems. Certainly there are compiler construction systems that allow you to specify a combined lexer and parser (I think the Java version of ANTLR does this), but I have never used one.
Lastly, a historical note. Decades ago, it was important for the lexer to do as much as possible before handing over to the parser, because the two programs would not fit in memory at the same time. Doing more in the lexer left more memory available to make the parser smart. I used the Whitesmiths C Compiler for a number of years, and if I understand correctly, it could operate in only 64KB of RAM (it was a small-model MS-DOS program), and even so it translated a variant of C that was very, very close to ANSI C.
I think this is a great question. Unit testing, micro-testing, TDD, and BDD are difficult, sometimes confusing, and often not taken seriously. So kudos for trying and asking for help!
I am by no means an expert, but I did just take a formal class and am also part of a small team that is leading an initiative to roll out unit testing and TDD within our organization.
Based on my experience and what I learned in my class, here is what I think you need to know in order to start:
- Consider that there are different types of testing and they accomplish different goals. There are lots of things that unit testing is simply a poor or incomplete solution for, which can lead one to think that unit testing is a waste of time. However, I view it as complementary to integration testing and functional testing.
- Don't stress out about coverage. There is a time and place to think about code coverage, but if you worry about it right now, you will waste time unit testing accessors and mutators, which provides no business value.
- Don't think about testing classes or methods, but think about testing behaviors. What does this class do? What does it not do? (This is part of why the term BDD is used these days)
- Test interesting behaviors. What qualifies as interesting is up to you. Many experts argue that a bug is a missing test, so anything you feel could reasonably break is worth testing. Would you write a test for `i++`? Probably not. Would you write a test for `incrementI()`? Maybe. If it's a calculator app, I probably would. If it's a web app, then I wouldn't sweat something that simple.
- Tests are not just about making sure your code works. They serve as an executable specification. Meaning, if somebody needs to understand the requirements of a class, they can look at the test.
- It is ok for a test to assume that a dependency works. If that worries you, then maybe you should unit test the dependency as well. But keep it separate. For example, one could argue that testing the data processing without a real file is bad because you don't know if the file parsing works. But if you have a unit test which verifies that file parsing works, and a unit test that verifies the processing of the file's data works, then clearly both work. Granted, this does not prove that they work together. But that's ok. That is integration testing.
- Unit tests should be repeatable and isolated. A unit test that makes a web service call, then writes to a database, and then prints some result to the console is not a unit test. It's a test, sure, but not a unit test. Don't get so hung up on testing everything that you write fragile tests which do a ton of work. So if a method does those things, how do you unit test it? This leads to the next topic...
- Separating out dependencies - Write small units which can be called independently, with dependencies passed in. If you design so that the processing is independent of the source file itself, you can test the processing logic without needing files. For example, instead of the file processing taking a `java.io.File` object, parsing it, and doing all the work, perhaps it could take a `java.io.InputStream`, parse it, and do all the work? Or a method which reads the file's contents into an array or other data structure, then passes that into the processing?
- Mocking out dependencies - ultimately, somebody needs to talk to the external world. Ideally, you will separate your dependencies, but sometimes you reach points where the complexity introduced by doing so is counter-productive. For example, if a method several levels down the call stack needs to talk to the database and separating that out is unreasonable, then mock it out with a fake implementation that returns test data. I've gone as far as creating a `MockDatabase` class before, which lets me map stored procedure names to hard-coded, in-memory result sets.
- Realize when you have mocked so much that unit testing is pointless. If you have mocked out everything a method does, then you are left testing an empty method. If that is the case, maybe that method is not interesting to test? Trust that an integration test, automated functional test, or even manual testing will catch any problems with it, and don't worry about unit testing it.
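The "separating out dependencies" point above can be sketched like this. The class and method names (`LineCounter`, `countNonBlankLines`) are made up for illustration; the technique is the real content: because the processing takes an `InputStream` rather than a `File`, a unit test can feed it an in-memory stream and never touch the file system.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical example: the processing logic depends only on InputStream,
// so production code can pass a FileInputStream while a unit test passes
// a ByteArrayInputStream built from a literal string.
public class LineCounter {
    // Counts non-blank lines from any stream source.
    public static int countNonBlankLines(InputStream in) throws IOException {
        int count = 0;
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Unit-test style usage: no file involved anywhere.
        String data = "alpha\n\nbeta\n  \ngamma\n";
        InputStream fake = new ByteArrayInputStream(data.getBytes(StandardCharsets.UTF_8));
        System.out.println(countNonBlankLines(fake)); // prints 3
    }
}
```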
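The `MockDatabase` idea can be sketched as well. Everything here (the `Database` interface, the procedure name, the method under test) is invented for illustration; the point is that the code under test only sees an interface, and the test supplies a fake that returns canned, in-memory rows instead of talking to a server.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface that the real database access layer would implement.
interface Database {
    List<Map<String, Object>> callProcedure(String procName);
}

// Fake implementation: maps stored procedure names to hard-coded,
// in-memory "result sets".
class MockDatabase implements Database {
    private final Map<String, List<Map<String, Object>>> cannedResults = new HashMap<>();

    void stub(String procName, List<Map<String, Object>> rows) {
        cannedResults.put(procName, rows);
    }

    @Override
    public List<Map<String, Object>> callProcedure(String procName) {
        List<Map<String, Object>> rows = cannedResults.get(procName);
        if (rows == null) throw new IllegalStateException("no stub for " + procName);
        return rows;
    }
}

public class MockDatabaseSketch {
    // Code under test: it only cares that *a* Database answers. In a unit
    // test, that Database is the mock above, not a real server.
    static int countActiveUsers(Database db) {
        return (int) db.callProcedure("sp_get_users").stream()
                .filter(row -> Boolean.TRUE.equals(row.get("active")))
                .count();
    }

    public static void main(String[] args) {
        MockDatabase db = new MockDatabase();
        db.stub("sp_get_users", List.of(
                Map.of("name", "ann", "active", true),
                Map.of("name", "bob", "active", false)));
        System.out.println(countActiveUsers(db)); // prints 1
    }
}
```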
I could go on forever. Hopefully that helps.
Remember, this is not easy. If it was easy, everybody would do it.
Best Answer
Your grammar probably has some rules for each token on how it can be produced (for example, that a `{` signifies a BLOCK_START token, or that a string-literal token is delimited by `"` characters). Start writing tests for those rules and verify that your lexer produces the correct token in each case.

Once you have a test for each single token, you can add a few tests for interesting combinations of tokens. Focus here on token combinations that would reveal an error in your lexer. The token combinations don't have to make sense to a parser for your language, so it is entirely valid to use `+++++12` as input and expect the tokens INCREMENT, INCREMENT, PLUS, INTEGER_LITERAL(12) as output.

And finally, make sure you have some tests for faulty inputs, where the lexer will not be able to recognize a token. Although I mention them last, they certainly don't have to be the last tests you create. You could just as well start with these.