Incorporating functions into a Shunting-Yard algorithm implementation

algorithmsgrammarparsingprogramming-languagesshunting-yard

tl;dr What would be a simple way of incorporating functions into a Shunting-Yard algorithm implementation?

If only expressions like function(arg1, arg2, arg3) were allowed (where function is some builtin function), then it would be really easy, since I could just treat function like an operator. But consider a case where the user defines their own function like f = function, then calls f(arg1, arg2, arg3). In this case I would need a strongly-typed AST to detect at compile time what f's type is in order to see that the proceeding tokens ((arg1, arg2, arg3)) are actually a function call, and not just a construction of a tuple.

Even worse, consider (f)() where f is a user-defined nullary function. Then when I get to f, even if I know that it's a function, the next token will be ), which is not the start of a valid function call. What about (l[i])(), where l is a list of functions?

At the most general level, I understand grammatically that when we have a statement like [expression], "(", [expression], ")", then we know that we're calling a function. However, I'm not quite sure how to check this without implementation an AST (which, for simplicity's sake, I would rather not do).

I could store a list of all operator and "bracket" tokens, and then when I reach the "(" in a supposed function call I just check whether the last non-bracket token was an operator. If it was an operator then the "(" represents a subexpression, like in 5 * (3 - 8). If it wasn't an operator, then the "(" represents a function call. However, this method feels easily broken. For example, what if there where some operator $ that was "unary left-associative", so that (expression $)(args) was valid? Then the algorithm would fail unless I had special checking for $. What if there was a comment between the function and the function call, like function \* comment *\ (args)? Or even worse, something like

function \\ lol the last token in this comment is an operator +
    (args)

These would require implementing handlers a lot of special cases, and I'm wondering if there's a better way of doing it.

Best Answer

I use a two state, two stack model similar to the Shunting Yard algorithm. The two states are the unary state and the binary state. In each state, you handle tokens (or issue errors) and then either continue the same state or switch to the other state.

In the unary state, if you find an open paren, (, it is operator grouping, and, you stay in the unary state. (In the unary state, an identifier is a variable reference you push it onto the operand stack, and switch to the binary state; a minus, -, is a unary negation, and you stay in unary state.)

In the binary state, if you find an open paren, it is a function invocation (and you switch to the unary state). Differentiating between the two states also help you deal with unary and pre/post-fix operators. (Here a minus, -, is a binary subtraction and you switch to unary state.)


ok, for the expression (*p[i++])(), p is a pointer to pointer to function, right? So,

enter image description here

(Note that the empty parameter list expression should only be allowed when function calls are actually present; there are various ways and places for detecting the various possible errors.)

The actual tokens handled in unary state and in binary state are determined by the language you're trying to parse. I allow separate code sections for the unary state and the binary state. There seems to be no point in mixing them and then you don't need a state variable either (the code section just knows which one it is).

Related Topic