Quantitatively comparing AST shapes

big datamachine learningparsingsyntax

How could one compare the shape of abstract syntax trees of similar source code programs (C, C++, Go, or anything compiled with GCC…)?

I guess that plagiarism detection on source code would use such techniques, but I have no idea of how would that be called…

For example, unification could be used to compare AST, but it gives only a boolean answer. I'm seeking for some technique giving some numerical "distance", or some kind of numerical vectors (to be later feed up e.g. to machine learning or classification algorithms, or some other big data thing).

Any references to big data or machine learning approaches on large set of source code is welcome too.

(Sorry for such a broad or fuzzy question, I don't know what terminology to use)

I don't simply want to compare two ASTs or programs. I want to process a large set of programs (e.g. half of a Debian distribution source code) and find inside it similar routines. I already have MELT to work on GCC internal representations (Gimple) and I want to leverage above that, hence store several metrics (which ones? cyclomatic complexity is probably not enough) in e.g. some database and compare & process them…

Addenda: Found about the MOSS system & paper, but it does not seem to care about syntactic shape at all. Also looking into tree edit distance.

Found also (thanks to Jérémie Salvucci) Michel Chilowicz's PhD thesis (in French, november 2010) on Looking for Similarity in Source Code

Best Answer

One approach would be to compile the source to XML and then look at how different the two bits of source are. For example, in the Java world, the static analysis tool pmd does this as its approach to looking for things to warn about.

class Example {
 void bar() {
  while (baz)
   buz.doSomething();
 }
}

gets 'compiled' to:

CompilationUnit
 TypeDeclaration
  ClassDeclaration:(package private)
   UnmodifiedClassDeclaration(Example)
    ClassBody
     ClassBodyDeclaration
      MethodDeclaration:(package private)
       ResultType
       MethodDeclarator(bar)
        FormalParameters
       Block
        BlockStatement
         Statement
          WhileStatement
           Expression
            PrimaryExpression
             PrimaryPrefix
              Name:baz
           Statement
            StatementExpression:null
             PrimaryExpression
              PrimaryPrefix
               Name:buz.doSomething
              PrimarySuffix
               Arguments

And that point you would be comparing code by saying "the difference between this code and that code is that this name is different." As the above is actually xml, this could be done with any number of xml comparison tools that exist. Or if you were after a number, one could apply a tree edit distance algorithm on it (related SO question).


Another approach is to look at the 'signature' of the code shape. The Signature Survey was done by Ward Cunningham

alt text

That legend is a bit hard to read:

  • 14m means 14 methods
  • 294L is 294 lines.
  • . is a non blank line
  • ' is a comment
  • | (green) is a single line if statement.
  • (.) (green) is a single statement inside an if block
  • [(.)] (brown) is a single statement inside of an if inside a loop.
  • {.} is a method with a single statement.
  • [.] (red) is a single statement inside a loop
  • ([.]) (dark red) is a single statement inside a loop inside an if block.

Comparing two sets of code then is looking at the edit distance between two strings with a very limited language.

Related Topic