Java – Why For-Each Uses Colon Instead of ‘in’

java

From the Java 5 language guide:

When you see the colon (:) read it as "in".

Why not use in in the first place then?

This has been bugging me for years, because it's inconsistent with the rest of the language.
For example, Java uses the keywords implements, extends and super for relations between types, rather than symbols as in C++, Scala or Ruby.

In Java, the colon is used in five contexts.
Three of them are inherited from C, and the other two were endorsed by Joshua Bloch.
At least, that is what he says in his talk "The Closures Controversy".
It comes up when he criticises the use of a colon for mapping as inconsistent with for-each semantics.
To me that seems odd, because it is the for-each that abuses the expected patterns,
such as list_name/category: elements or label/term: meaning.

I've snooped around the JCP and the JSR, but found no sign of a mailing list.
Google turns up no discussions of this matter,
only newbies confused by the meaning of the colon in for.


Main arguments against in provided so far:

  • requires new keyword; and
  • complicates lexing.

Let's look at relevant grammar definitions:

statement
    :   'for' '(' forControl ')' statement
    |   ...
    ;

forControl
    :   enhancedForControl
    |   forInit? ';' expression? ';' forUpdate?
    ;

enhancedForControl
    :   variableModifier* type variableDeclaratorId ':' expression
    ;

Changing ':' to 'in' here doesn't bring additional complexity or require a new keyword.

Best Answer

Normal parsers as they are generally taught have a lexer stage before the parser touches the input. The lexer (also “scanner” or “tokenizer”) chops the input into small tokens that are annotated with a type. This allows the main parser to use tokens as terminal elements rather than having to treat each character as a terminal, which leads to noticeable efficiency gains. In particular, the lexer can also remove all comments and white space. However, a separate tokenizer phase means that keywords cannot also be used as identifiers (unless the language supports stropping which has somewhat fallen out of favour, or prefixes all identifiers with a sigil like $foo).

Why? Let's assume we have a simple tokenizer that understands the following tokens:

FOR = 'for'
LPAREN = '('
RPAREN = ')'
IN = 'in'
IDENT = /\w+/
COLON = ':'
SEMICOLON = ';'

The tokenizer will always match the longest token, and prefer keywords over identifiers. So interesting will be lexed as IDENT:interesting, but in will be lexed as IN, never as IDENT:in. A code snippet like

for(var in expression)

will be translated to the token stream

FOR LPAREN IDENT:var IN IDENT:expression RPAREN

So far, that works. But any variable in would be lexed as the keyword IN rather than a variable, which would break code. The lexer does not keep any state between the tokens, and cannot know that in should usually be a variable except when we are in a for loop. Also, the following code should be legal:

for(in in expression)

The first in would be an identifier, the second would be a keyword.
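To make the problem concrete, here is a minimal sketch of such a stateless tokenizer in Java. The class name SimpleLexer and the string-based token representation are my own assumptions for illustration, not anything from javac. It matches the longest word first and then prefers keywords over identifiers, so it has no way to treat the first in as a variable name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stateless lexer for the token set listed above.
class SimpleLexer {
    // Optional whitespace, then either a whole word (longest match) or one symbol.
    private static final Pattern TOKEN = Pattern.compile("\\s*(\\w+|[():;])");

    static List<String> lex(String input) {
        String src = input.trim();
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(src);
        int pos = 0;
        while (pos < src.length()) {
            m.region(pos, src.length());
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("unexpected character at " + pos);
            }
            tokens.add(classify(m.group(1)));
            pos = m.end();
        }
        return tokens;
    }

    // Keywords win over identifiers; the lexer has no idea where in the
    // program it is, so "in" is always IN.
    private static String classify(String text) {
        switch (text) {
            case "for": return "FOR";
            case "in":  return "IN";
            case "(":   return "LPAREN";
            case ")":   return "RPAREN";
            case ":":   return "COLON";
            case ";":   return "SEMICOLON";
            default:    return "IDENT:" + text;
        }
    }
}
```

With this sketch, lexing for(in in expression) gives FOR LPAREN IN IN IDENT:expression RPAREN, so the parser never gets to see an identifier named in.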

There are two reactions to this problem:

Contextual keywords are confusing, let's reuse keywords instead.

Java has many reserved words, some of which have no use except for providing more helpful error messages to programmers switching to Java from C++. Adding new keywords breaks code. Adding contextual keywords is confusing to a reader of the code unless they have good syntax highlighting, and makes tooling difficult to implement because they'll have to use more advanced parsing techniques (see below).

When we want to extend the language, the only sane approach is to use symbols that previously were not legal in the language. In particular, these can't be identifiers. With the foreach loop syntax, Java reused the existing : token with a new meaning. With lambdas, Java added a -> token which could not previously occur in any legal program (--> would still be lexed as '--' '>' which is legal, and -> might have previously been lexed as '-', '>', but that sequence would be rejected by the parser).
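The --> case is easy to try out. In the following small sketch (the class and method names are made up for the example), the characters --> are split by longest-match lexing into a post-decrement followed by a greater-than comparison, so the code is ordinary Java and has nothing to do with lambdas.

```java
// Hypothetical demo: "x --> 0" is lexed as  x  --  >  0, i.e. (x--) > 0.
class ArrowLexing {
    static int countdownSteps(int start) {
        int x = start;
        int steps = 0;
        while (x --> 0) {   // compare the old value of x with 0, then decrement
            steps++;
        }
        return steps;
    }
}
```

Calling countdownSteps(3) runs the loop body three times, because the comparison sees 3, 2 and 1 before it sees 0.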

Contextual keywords simplify languages, let's implement them

Lexers are indisputably useful. But instead of running a lexer before the parser, we can run them in tandem with the parser. Bottom-up parsers always know the set of token types that would be acceptable at any given location. The parser can then request the lexer to match any of these types at the current position. In a for-each loop, the parser would be at the position denoted by · in the (simplified) grammar after the variable has been found:

for_loop = for_loop_cstyle | for_each_loop
for_loop_cstyle = 'for' '(' declaration · ';' expression ';' expression ')'
for_each_loop = 'for' '(' declaration · 'in' expression ')'

At that position, the legal tokens are SEMICOLON or IN, but not IDENT. A keyword in would be entirely unambiguous.
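As a rough sketch of that idea, the lexer below only classifies a lexeme against the token types the parser says it would accept at that point. Everything here is an assumption made for illustration: the names ContextualLexer and lexForEachHeader, the hard-coded list of expected sets standing in for a real parser, and the simplification that a declaration or expression is a single word. It needs Java 9 or later for List.of.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical lexer that is told by the parser which token types are acceptable.
class ContextualLexer {
    enum Type { FOR, LPAREN, RPAREN, IN, SEMICOLON, IDENT }

    private static final Pattern RAW = Pattern.compile("\\w+|[();]");

    // EnumSet iterates in declaration order, so keywords are tried before IDENT.
    static Type classify(String text, EnumSet<Type> acceptable) {
        for (Type t : acceptable) {
            if (matches(t, text)) return t;
        }
        throw new IllegalArgumentException(
            "unexpected '" + text + "', expected one of " + acceptable);
    }

    private static boolean matches(Type t, String text) {
        switch (t) {
            case FOR:       return text.equals("for");
            case LPAREN:    return text.equals("(");
            case RPAREN:    return text.equals(")");
            case IN:        return text.equals("in");
            case SEMICOLON: return text.equals(";");
            default:        return text.matches("\\w+"); // IDENT
        }
    }

    // Stand-in for a parser walking  'for' '(' IDENT 'in' IDENT ')'.
    static List<String> lexForEachHeader(String input) {
        List<EnumSet<Type>> expected = List.of(
            EnumSet.of(Type.FOR),
            EnumSet.of(Type.LPAREN),
            EnumSet.of(Type.IDENT),               // the loop variable
            EnumSet.of(Type.IN, Type.SEMICOLON),  // the position marked by the dot
            EnumSet.of(Type.IDENT),
            EnumSet.of(Type.RPAREN));
        List<String> out = new ArrayList<>();
        // find() skips whitespace (and, in this sketch, any unrecognised character).
        Matcher m = RAW.matcher(input);
        for (EnumSet<Type> acceptable : expected) {
            if (!m.find()) throw new IllegalArgumentException("input ended early");
            Type t = classify(m.group(), acceptable);
            out.add(t == Type.IDENT ? "IDENT:" + m.group() : t.name());
        }
        return out;
    }
}
```

Here for(in in expression) comes out as FOR LPAREN IDENT:in IN IDENT:expression RPAREN: the same two characters become an identifier where only IDENT is acceptable and a keyword where only IN or SEMICOLON is.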

In this particular example, top-down parsers wouldn't have a problem either since we can rewrite the above grammar to

for_loop = 'for' '(' declaration · for_loop_rest ')'
for_loop_rest =  · ';' expression ';' expression
for_loop_rest = · 'in' expression

and all the tokens necessary for the decision can be seen without backtracking.
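A hand-written recursive-descent parser for the rewritten grammar could look roughly like this. It is only a sketch under simplifying assumptions: tokens arrive as plain strings, a declaration or expression is a single word, and the class name ForLoopParser and the returned labels are invented for the example. The point is that parseForLoopRest decides between the two alternatives by peeking at exactly one token.

```java
import java.util.List;

// Hypothetical top-down parser for the simplified for_loop grammar.
class ForLoopParser {
    private final List<String> tokens;
    private int pos = 0;

    ForLoopParser(List<String> tokens) { this.tokens = tokens; }

    private String peek() {
        return pos < tokens.size() ? tokens.get(pos) : "<eof>";
    }

    private void expect(String text) {
        if (!peek().equals(text)) {
            throw new IllegalStateException(
                "expected '" + text + "' but saw '" + peek() + "'");
        }
        pos++;
    }

    // Simplification: a declaration or an expression is one word.
    private void word() {
        if (!peek().matches("\\w+")) {
            throw new IllegalStateException("expected a word but saw '" + peek() + "'");
        }
        pos++;
    }

    // for_loop = 'for' '(' declaration for_loop_rest ')'
    String parseForLoop() {
        expect("for");
        expect("(");
        word();
        String kind = parseForLoopRest();
        expect(")");
        return kind;
    }

    // for_loop_rest = ';' expression ';' expression  |  'in' expression
    private String parseForLoopRest() {
        switch (peek()) {          // one token of lookahead, no backtracking
            case ";":
                pos++; word(); expect(";"); word();
                return "c-style";
            case "in":
                pos++; word();
                return "for-each";
            default:
                throw new IllegalStateException(
                    "expected ';' or 'in' but saw '" + peek() + "'");
        }
    }
}
```

Because the declaration has already been consumed by the time the parser looks for ';' or 'in', a loop variable that happens to be called in does not confuse it.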

Consider usability

Java has always tended towards semantic and syntactic simplicity. For example, the language doesn't support operator overloading because it would make code far more complicated. So when deciding between in and : for a for-each loop syntax, we have to consider which is less confusing and more apparent to users. The extreme case would probably be

for (in in in in())
for (in in : in())

(Note: Java has separate namespaces for type names, variables, and methods. I think this was a mistake, mostly. This does not mean later language design has to add more mistakes.)
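To show that the colon variant of the extreme case is not purely hypothetical, here is a small sketch in which the same name is a type, a variable and a method at once. The class NamespaceDemo, the nested type in and the method in() are all made up for illustration, and the lowercase class name is deliberately unconventional. It needs Java 9 or later for List.of.

```java
import java.util.List;

// Hypothetical demo of Java's separate namespaces for types, variables and methods.
class NamespaceDemo {
    // A type named "in".
    static class in {
        final String label;
        in(String label) { this.label = label; }
    }

    // A method named "in".
    static List<in> in() {
        return List.of(new in("a"), new in("b"));
    }

    static String run() {
        StringBuilder sb = new StringBuilder();
        // type "in", variable "in", method call "in()"
        for (in in : in()) {
            sb.append(in.label);   // here "in" refers to the loop variable
        }
        return sb.toString();
    }
}
```

Even in this contrived loop header the colon still marks where the declaration ends and the iterated expression begins.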

Which alternative provides clearer visual separations between the iteration variable and the iterated collection? Which alternative can be recognized more quickly when you glance at the code? I've found that separating symbols are better than a string of words when it comes to these criteria. Other languages have different values. E.g. Python spells out many operators in English so that they can be read naturally and are easy to understand, but those same properties can make it quite difficult to understand a piece of Python at a glance.