How are comments usually parsed?


How are comments generally treated in programming languages and markup? I am writing a parser for some custom markup language and want to follow the principle of least surprise, so I'm trying to determine the general convention.

For example, should a comment embedded within a token 'interfere' with the token or not? Generally, is something like:

Sys/* comment */tem.out.println()

valid?

Also, if the language is sensitive to new lines and a comment spans a new line, should the new line be considered or not? That is, should

stuff stuff /* this is comment
this is still comment */more stuff 

be treated as

stuff stuff more stuff

or

stuff stuff
more stuff

?

I know what a few specific languages do, and I am not looking for opinions. What I want to know is whether there is a general consensus on what is expected of a markup language with regard to comments, tokens, and new lines.


My particular context is a wiki-like markup.

Best Answer

Usually comments are scanned (and discarded) as part of the tokenization process, but before parsing. A comment works like a token separator even in the absence of whitespace around it.

As you point out, the C specification explicitly states that comments are replaced by a single space. It is just specification-lingo though, since a real-world parser will not actually replace anything, but will just scan and discard a comment the same way it scans and discards whitespace characters. But it explains in a simple way that a comment separates tokens the same way a space would.
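As a rough illustration of "scan and discard", here is a minimal sketch of a tokenizer in Python. The token rules and the name `tokenize` are assumptions made up for this example, not taken from any particular language; the point is only that a comment ends the token before it, exactly as whitespace does.

```python
import re

# Hypothetical token rules for a tiny language; the names are illustrative.
TOKEN_RE = re.compile(r"""
    (?P<comment>/\*.*?\*/)      # block comment, closed by the first */
  | (?P<space>\s+)              # whitespace
  | (?P<word>[A-Za-z_]\w*)      # identifier
  | (?P<punct>[^\sA-Za-z_])     # any other single character
""", re.VERBOSE | re.DOTALL)

def tokenize(text):
    """Return the tokens of text, discarding comments and whitespace alike."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        if match.lastgroup in ("comment", "space"):
            continue  # skipped, but the previous token has already ended
        tokens.append(match.group())
    return tokens

print(tokenize("Sys/* comment */tem.out.println()"))
# ['Sys', 'tem', '.', 'out', '.', 'println', '(', ')']
```

Under these rules `Sys` and `tem` come out as two separate identifiers, so the example from the question would not be read as `System`.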

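The content of a comment is ignored, so in most languages line breaks inside a multiline comment have no effect. Languages that are sensitive to line breaks (such as Python and Visual Basic) usually do not have multiline comments. JavaScript is one exception, and it handles the case specially: a multiline comment that contains a line break is itself treated as a line break for the purposes of automatic semicolon insertion. For example:

return /*
       */ 17

is equivalent to

return
17

not

return 17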

Single-line comments preserve the line break, i.e.

return // single line comment
    17

is equivalent to

return
17

not

return 17
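For a line-sensitive markup, the two possible treatments of a block comment that spans a line break can be compared with a small sketch in Python. The function name `strip_comments` and the flag `block_newline_counts` are hypothetical, and the sketch assumes C-style `/* */` and `//` comments and ignores string literals.

```python
import re

# Block comments (closed by the first */) or line comments up to the end of line.
COMMENT_RE = re.compile(r"/\*.*?\*/|//[^\n]*", re.DOTALL)

def strip_comments(text, block_newline_counts=False):
    """Replace each comment with a space, or with a newline when
    block_newline_counts is set and the block comment spans a line break."""
    def replacement(match):
        comment = match.group()
        if block_newline_counts and comment.startswith("/*") and "\n" in comment:
            return "\n"
        return " "
    return COMMENT_RE.sub(replacement, text)

source = "stuff stuff /* this is comment\nthis is still comment */more stuff"
print(repr(strip_comments(source)))
print(repr(strip_comments(source, block_newline_counts=True)))
```

With the flag off, the two lines from the question are joined into one; with it on, they stay separate. A line comment never swallows its line break in either mode, because the pattern stops before the newline.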

Since comments are scanned but not parsed, they tend not to nest. So

 /*  /* nested comment */ */

is a syntax error, since the comment is opened by the first /* and closed by the first */, which leaves the final */ outside the comment.
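The non-nesting behaviour falls out of how a simple scanner looks for the end of a comment. The following Python sketch uses a hypothetical helper, `skip_block_comment`, that searches for the first closing delimiter without counting openers.

```python
def skip_block_comment(text, start):
    """Return the index just past the comment that opens at text[start:start+2].

    The scanner does not count nested openers; it stops at the first "*/".
    """
    assert text.startswith("/*", start)
    end = text.find("*/", start + 2)
    if end == -1:
        raise SyntaxError("unterminated comment")
    return end + 2

source = "/*  /* nested comment */ */"
rest = source[skip_block_comment(source, 0):]
print(repr(rest))  # the leftover text is what the parser then rejects
```

A language that wants nested comments has to keep a depth counter instead, which is why nesting is a deliberate design choice and not the default.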