Formatting SQL Based on an AST

code generationformattingsql

I am working on my .NET open source project for generating SQL

Initially, I wrote the project so it spit out exactly one SQL format… ugly. I started using this project in the real world and decided it would be nice if I could generate pretty SQL in debug mode and ugly SQL in release mode.

My first go involved a lot of if/else loops within my command generation code. It quickly got to be too much to handle. So, I started looking for another approach that would allow me to plug-in different formatting strategies. I also looked at other SQL formatting libraries but most were specific to one dialect.

Right now, I am playing with the idea of generating a tree of tokens, similar to an abstract syntax tree (AST) that a compiler works with. I was hoping to navigate the tree and generate text based on where I am in the tree. Now that I've built these trees, I'm not sure how to write the code to visit these nodes.

I feel like I'm going backwards. Instead of parsing SQL, generating an AST and spitting out SQL objects, I am allowing someone to build SQL objects, then generate an AST and spitting out SQL. When someone uses my library to build a SELECT statement, they specify the values to select, the tables to select from, filters, ordering, etc. But once I have generated an AST… I have less information than I originally started with! Only by asking "Is the next token the keyword FROM?" can I get a picture of where I'm at. I have to write a bunch of if/elif/else statements to figure out what comes next. At least when I had the SQL objects I could use polymorphism to avoid nasty branching.

So, perhaps I missed a day in compiler theory or perhaps it's because I'm trying to go backwards. But, for whatever reason, my next step is really unclear to me. I think the AST route was in the right direction even if it does mean losing information. I was able to write a really stupid formatter that puts spaces between every token for now. It's really when it comes to getting fancy that I'm overwhelmed.

If anyone has some good resources for going backwards, it would probably be really helpful.

Best Answer

I had a chance to look at your actual code. It looks like you're most of the way there but this "tree of tokens" concept is getting in the way. It turns out you probably don't need it at all (specifically, I'm referring to the stuff in the Expressions/ folder).

For example, in Join, you have a method called GetDeclarationExpression that returns an IExpressionItem that is a tree of tokens. What I'm suggesting is that you can simplify this and instead directly return a String which is the actual SQL generated from the join. In concrete terms, something like the following:

    String IJoinItem.GetDeclarationExpression(CommandOptions options)
    {
        // [ "(" ] <Left> <Combiner> <Right> [ "ON" <Filter> ] [ ")" ]
        StringBuilder expression = new StringBuilder();
        if (WrapInParentheses ?? options.WrapJoinsInParentheses)
        {
            expression.Append("(");
        }
        expression.Append(_leftHand.GetDeclarationExpression(options));
        expression.Append(" ")
        expression.Append(GetJoinNameExpression(options));
        expression.Append(" ")
        expression.Append(_rightHand.GetDeclarationExpression(options));
        expression.Append(" ")
        expression.Append(GetOnExpression(options));
        if (WrapInParentheses ?? options.WrapJoinsInParentheses)
        {
            expression.Append(")");
        }
        return expression.ToString();
    }

Of course, all other methods used during the AST traversal would also need to be modified to return String instead of IExpressionItem as needed. Each method that is called during the tree traversal would be responsible for generating the actual SQL string, which will all eventually get concatenated together.

Related Topic