Quantitatively comparing AST shapes

Tags: big data, machine learning, parsing, syntax

How could one compare the shape of abstract syntax trees of similar source code programs (C, C++, Go, or anything compiled with GCC…)?

I guess that plagiarism detection on source code would use such techniques, but I have no idea what they would be called…

For example, unification could be used to compare ASTs, but it gives only a boolean answer. I'm looking for some technique that gives a numerical "distance", or some kind of numerical vector (to be fed later to, e.g., machine learning or classification algorithms, or some other big data thing).

Any references to big data or machine learning approaches on large sets of source code are welcome too.

(Sorry for such a broad or fuzzy question; I don't know what terminology to use.)

I don't simply want to compare two ASTs or programs. I want to process a large set of programs (e.g. half of a Debian distribution's source code) and find similar routines inside it. I already have MELT to work on GCC internal representations (Gimple) and I want to build on that, hence store several metrics (which ones? cyclomatic complexity is probably not enough) in e.g. some database and compare & process them…

Addenda: I found out about the MOSS system & paper, but it does not seem to care about syntactic shape at all. I am also looking into tree edit distance.

I also found (thanks to Jérémie Salvucci) Michel Chilowicz's PhD thesis (in French, November 2010) on Looking for Similarity in Source Code.

Best Answer

One approach would be to compile the source to XML and then look at how different the two pieces of source are. For example, in the Java world, the static analysis tool PMD uses this as its approach to looking for things to warn about.

class Example {
 void bar() {
  while (baz)
   buz.doSomething();
 }
}

gets 'compiled' to:

CompilationUnit
 TypeDeclaration
  ClassDeclaration:(package private)
   UnmodifiedClassDeclaration(Example)
    ClassBody
     ClassBodyDeclaration
      MethodDeclaration:(package private)
       ResultType
       MethodDeclarator(bar)
        FormalParameters
       Block
        BlockStatement
         Statement
          WhileStatement
           Expression
            PrimaryExpression
             PrimaryPrefix
              Name:baz
           Statement
            StatementExpression:null
             PrimaryExpression
              PrimaryPrefix
               Name:buz.doSomething
              PrimarySuffix
               Arguments

At that point you would be comparing code by saying "the difference between this code and that code is that this name is different." As the above is actually XML, this could be done with any number of existing XML comparison tools. Or, if you were after a number, you could apply a tree edit distance algorithm to it (related SO question).

Another approach is to look at the 'signature' of the code shape. The Signature Survey was done by Ward Cunningham.
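To make the "number" option concrete, here is a minimal sketch of a tree distance. It is not the general tree edit distance from the linked question: it is the simpler top-down variant in which a node can be relabelled and whole subtrees can be inserted or deleted, each node costing 1. The `(label, children)` tuple representation and the three sample trees are assumptions made for illustration, loosely modelled on the WhileStatement fragment above.

```python
def size(tree):
    """Number of nodes in a (label, children) tree."""
    label, children = tree
    return 1 + sum(size(child) for child in children)


def tree_distance(a, b):
    """Top-down distance: relabel a node, or insert/delete a whole subtree."""
    (label_a, kids_a), (label_b, kids_b) = a, b
    relabel = 0 if label_a == label_b else 1
    rows, cols = len(kids_a), len(kids_b)
    # table[i][j]: cost of turning the first i children of a
    # into the first j children of b.
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(1, rows + 1):
        table[i][0] = table[i - 1][0] + size(kids_a[i - 1])
    for j in range(1, cols + 1):
        table[0][j] = table[0][j - 1] + size(kids_b[j - 1])
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            table[i][j] = min(
                table[i - 1][j] + size(kids_a[i - 1]),   # delete a subtree
                table[i][j - 1] + size(kids_b[j - 1]),   # insert a subtree
                table[i - 1][j - 1]
                + tree_distance(kids_a[i - 1], kids_b[j - 1]),  # match children
            )
    return relabel + table[rows][cols]


# Hypothetical trees, hand-written for this example.
call = ("Statement", [("Name:buz.doSomething", []), ("Arguments", [])])
original = ("WhileStatement", [("Expression", [("Name:baz", [])]), call])
renamed = ("WhileStatement", [("Expression", [("Name:qux", [])]), call])
extended = ("WhileStatement", [("Expression", [("Name:baz", [])]), call,
                               ("Statement", [("Name:log", [])])])

print(tree_distance(original, renamed))   # 1: one name relabelled
print(tree_distance(original, extended))  # 2: one two-node subtree inserted
```

If identifier names should not count, strip them from the labels before comparing, so that only the node kinds (the "shape") contribute to the distance.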

That legend is a bit hard to read:

14m means 14 methods
294L is 294 lines.
. is a non-blank line
' is a comment
| (green) is a single line if statement.
(.) (green) is a single statement inside an if block
[(.)] (brown) is a single statement inside of an if inside a loop.
{.} is a method with a single statement.
[.] (red) is a single statement inside a loop
([.]) (dark red) is a single statement inside a loop inside an if block.

Comparing two pieces of code then comes down to computing the edit distance between two strings over a very limited alphabet.
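As a rough sketch of that idea, the following computes the Levenshtein edit distance between two signature strings and turns it into a similarity score between 0 and 1. The two signatures are made up for the example (they are not output of the actual Signature Survey tool), and the normalisation by the longer length is just one possible choice.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit costs)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + (char_a != char_b)  # substitute or keep
            ))
        previous = current
    return previous[-1]


def similarity(a, b):
    """1.0 for identical signatures, approaching 0.0 as they diverge."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - edit_distance(a, b) / longest


# Hypothetical signatures: three methods each, the second method differs.
sig_a = "{.}{[.]}{(.)}"
sig_b = "{.}{[(.)]}{(.)}"

print(edit_distance(sig_a, sig_b))  # 2: "(" and ")" were inserted
print(similarity(sig_a, sig_b))
```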

Related Solutions

Java Parsing – AST Construction in LL1 Non-Recursive Parser

It happens just as the cited text from the book explains: when you expand a nonterminal via its grammar rule (given by M[X, a]), then you can create a corresponding node.

Say you have rules of the following form:

Term -> Factor Term'
Term' -> * Term | / Term | ε

Factor -> x | y | ... (simplified for individual numbers, letters, what-have-you)

Then, once you expand Term -> Factor Term' you can create a Term node with two child nodes. When you successfully parse the first number via the Factor -> ... rule (this is the first if in your example code now) you can attribute this number to the already created Factor node.

Next, you expand for example Term' -> * Term via M[Term',*] and create a new Term node.

Continuing, you parse the * and annotate your Term' node with it, then expand Term -> Factor Term' once more, creating two more nodes. You then parse a Factor and attach its number to the second Factor node. Finally, at the end of input, you parse Term' via the epsilon production (M[Term',$] = ε), which tells you that you can remove that Term' node (though that may be optional).

What you end up with for an input string like 3*4 is then a tree like this:

Term ( Factor(3), Term' (*, Term ( Factor(4) ) ) )

In a post-processing step, you can simplify the resulting tree. Nonterminals like Term' exist only because the grammar was rewritten to remove left recursion, and they do not belong in the final AST. Reversing that grammar transformation on your tree gives something like this:

Mult (Number(3), Number(4))
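The whole procedure can be sketched as a small table-driven parser. This is only an illustration under several assumptions: the `Node` class and the `production`, `parse`, `show` and `to_ast` names are invented here, Factor is restricted to single digits, and the parse table M is written as a function rather than a real table. Because the grammar above is right-recursive, the simplified tree nests to the right; getting left-associative division would need an extra rearranging step that is not shown.

```python
NONTERMINALS = {"Term", "Term'", "Factor"}


class Node:
    def __init__(self, symbol):
        self.symbol = symbol
        self.children = []


def production(nonterminal, lookahead):
    """Right-hand side M[nonterminal, lookahead], or None for an error entry."""
    if nonterminal == "Term" and lookahead.isdigit():
        return ["Factor", "Term'"]
    if nonterminal == "Term'" and lookahead in ("*", "/"):
        return [lookahead, "Term"]
    if nonterminal == "Term'" and lookahead == "$":
        return []  # epsilon production
    if nonterminal == "Factor" and lookahead.isdigit():
        return [lookahead]
    return None


def parse(text):
    tokens = list(text) + ["$"]
    root = Node("Term")
    stack, pos = [root], 0
    while stack:
        node = stack.pop()
        lookahead = tokens[pos]
        if node.symbol in NONTERMINALS:
            rhs = production(node.symbol, lookahead)
            if rhs is None:
                raise SyntaxError(f"unexpected {lookahead!r} at {pos}")
            # Expanding a nonterminal is the moment its child nodes are created.
            node.children = [Node(symbol) for symbol in rhs]
            stack.extend(reversed(node.children))
        elif node.symbol == lookahead:
            pos += 1  # terminal matched, consume it
        else:
            raise SyntaxError(f"expected {node.symbol!r} at {pos}")
    if tokens[pos] != "$":
        raise SyntaxError(f"trailing input at {pos}")
    return root


def show(node):
    """Render the parse tree, dropping nonterminals that derived epsilon."""
    if node.symbol not in NONTERMINALS:
        return node.symbol
    kept = [c for c in node.children
            if c.symbol not in NONTERMINALS or c.children]
    return f"{node.symbol}({', '.join(show(c) for c in kept)})"


def to_ast(term):
    """Post-processing: fold Term/Term' back into operator nodes."""
    factor, rest = term.children
    left = ("Number", int(factor.children[0].symbol))
    if not rest.children:
        return left
    operator, right = rest.children
    name = "Mult" if operator.symbol == "*" else "Div"
    return (name, left, to_ast(right))


tree = parse("3*4")
print(show(tree))    # Term(Factor(3), Term'(*, Term(Factor(4))))
print(to_ast(tree))  # ('Mult', ('Number', 3), ('Number', 4))
```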

Visitor Pattern – Traversing an AST Using Visitors

Who is responsible for the traversal depends largely on the analysis you want to do in your visitors and on the details of the language structure, and partly on personal preference.

In particular, if the visitor for a parent node needs to take an action halfway through the processing of its children, then you must put the traversal logic in the visitor. One example is a language construct where a newly introduced variable is available only in some of the child nodes of the node that introduces it.
Another case is when you need a mixture of pre-order and post-order traversal. With traversal in the nodes, each node must call the visitor twice, once before and once after the children. In that case, it might be easier to let the visitor do the traversal.

Otherwise, it is mostly a matter of preference. The traversal can be either in the nodes or in the visitor.
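Here is a minimal sketch of the scoping case, with the traversal placed in the visitor. The node classes (`Var`, `Add`, `Let`) and the `UndefinedVariables` visitor are made up for this example. The point is `visit_let`: it has to bind the new name after visiting the value but before visiting the body, which a fixed traversal inside the nodes could not express without extra callbacks.

```python
from dataclasses import dataclass


@dataclass
class Var:
    name: str

    def accept(self, visitor):
        return visitor.visit_var(self)


@dataclass
class Add:
    left: object
    right: object

    def accept(self, visitor):
        return visitor.visit_add(self)


@dataclass
class Let:
    """let name = value in body; name is visible only inside body."""
    name: str
    value: object
    body: object

    def accept(self, visitor):
        return visitor.visit_let(self)


class UndefinedVariables:
    """Collects variable uses that are not in scope; the visitor drives traversal."""

    def __init__(self):
        self.scope = []
        self.undefined = []

    def visit_var(self, node):
        if node.name not in self.scope:
            self.undefined.append(node.name)

    def visit_add(self, node):
        node.left.accept(self)
        node.right.accept(self)

    def visit_let(self, node):
        node.value.accept(self)       # the new name is not visible here yet
        self.scope.append(node.name)  # action halfway through the children
        node.body.accept(self)
        self.scope.pop()              # the name goes out of scope again


# let x = x in x + y  -> the first x and the y are undefined
program = Let("x", Var("x"), Add(Var("x"), Var("y")))
checker = UndefinedVariables()
program.accept(checker)
print(checker.undefined)  # ['x', 'y']
```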
