Compiler – Why Native Machine Code Can’t Be Easily Decompiled

binary, compiler, decompile, machine-code

With bytecode-based virtual machine languages like Java, VB.NET, C#, ActionScript 3.0, etc., you sometimes hear about how easy it is to just download a decompiler off the Internet, run the bytecode through it one good time, and oftentimes come up with something not too far from the original source code in a matter of seconds. Supposedly this sort of language is particularly vulnerable to that.

I've recently started wondering why you don't hear more about this concerning native binary code, when you at least know which language it was written in originally (and thus, which language to try to decompile into). For a long time, I figured it was just because native machine language is so much crazier and more complex than typical bytecode.

But what does bytecode look like? It looks like this:

1000: 2A 40 F0 14
1001: 2A 50 F1 27
1002: 4F 00 F0 F1
1003: C9 00 00 F2

And what does native machine code look like (in hex)? It, of course, looks like this:

1000: 2A 40 F0 14
1001: 2A 50 F1 27
1002: 4F 00 F0 F1
1003: C9 00 00 F2

And the instructions come from a somewhat similar frame of mind:

1000: mov EAX, 20
1001: mov EBX, loc1
1002: imul EAX, EBX
1003: push ECX

So, given the language to try to decompile some native binary into, say C++, what's so hard about it? The only two ideas that immediately come to mind are 1) it really is that much more intricate than bytecode, or 2) something about the fact that operating systems tend to page programs and scatter their pieces in memory causes too many problems. If one of those possibilities is correct, please explain. But either way, why do you basically never hear about this?

NOTE

I'm about to accept one of the answers, but I want to kind of mention something first. Almost everybody is referring back to the fact that different pieces of original source code might map to the same machine code; local variable names are lost, you don't know what type of loop was originally used, etc.

However, examples like the two just mentioned are kind of trivial in my eyes. Some of the answers, though, state that the difference between machine code and the original source is drastically greater than something this trivial.

But for example, when it comes down to things like local variable names and loop types, bytecode loses this information as well (at least for ActionScript 3.0). I've pulled that stuff back through a decompiler before, and I didn't really care whether a variable was called strMyLocalString:String or loc1. I could still look in that small, local scope and see how it's being used without much trouble. And a for loop is pretty much the same exact thing as a while loop, if you think about it. Also, even when I ran the source through irrFuscator (which, unlike secureSWF, doesn't do much more than just randomize member variable and function names), it still looked like you could just start isolating certain variables and functions in smaller classes, figure out how they're used, assign your own names to them, and work from there.
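To illustrate that last point with a sketch of my own (the function names and the use of an optimizing build such as g++ -O2 -S are my assumptions, not anything from the question), the following two functions typically compile to identical machine code, so nothing in the binary records which loop keyword was used:

// Two loops a decompiler generally cannot tell apart: with optimization
// enabled, both functions usually produce the same assembly.
int sum_for(const int *a, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += a[i];
    return total;
}

int sum_while(const int *a, int n) {
    int total = 0;
    int i = 0;
    while (i < n) {
        total += a[i];
        ++i;
    }
    return total;
}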

In order for this to be a big deal, the machine code would need to lose a lot more information than that, and some of the answers do go into this.

Best Answer

At every step of compilation you lose information that is irrecoverable. The more information you lose from the original source, the harder it is to decompile.

You can create a useful decompiler for byte-code because a lot more information is preserved from the original source than is preserved when producing the final target machine code.

The first step of a compiler is to turn the source into some form of intermediate representation, often represented as a tree. Traditionally this tree does not contain non-semantic information such as comments, white-space, etc. Once this is thrown away, you cannot recover the original source from that tree.
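As a rough sketch of what such a tree might hold (the node layout here is invented for illustration, not taken from any particular compiler), an assignment like x = a + 2 * b; could be stored along these lines, with no trace of the comments or layout that surrounded it in the source:

#include <memory>
#include <string>
#include <vector>

// Toy AST node: only the semantic content survives.  Comments, whitespace,
// and the original spelling of the expression are simply never stored.
struct Node {
    std::string kind;                        // "assign", "add", "mul", "var", "const"
    std::string value;                       // identifier name or literal text
    std::vector<std::unique_ptr<Node>> kids; // children / operands
};

// x = a + 2 * b;   becomes roughly:
// assign(var "x", add(var "a", mul(const "2", var "b")))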

The next step is to render the tree into some form of intermediate language that makes optimizations easier. There are quite a few choices here, and each compiler infrastructure has its own. Typically, however, information like local variable names and larger control-flow structure (such as whether you used a for or a while loop) is lost. Some important optimizations typically happen here: constant propagation, invariant code motion, function inlining, etc. Each of these transforms the representation into one that has equivalent functionality but looks substantially different.
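To make those transformations concrete, here is a hedged example of my own (not from the answer): after inlining and constant propagation, the optimizer is free to reduce the first version to something like the second, and only the reduced form ever reaches the code generator.

// What was written:
static int scale(int x) { return x * 4; }

int area(int w) {
    const int height = 3;          // constant gets propagated
    return scale(w) * height;      // call gets inlined
}

// What the optimizer may hand to the code generator -- equivalent in
// behavior, but the helper function, the named constant, and the original
// structure are gone:
int area_optimized(int w) {
    return w * 12;
}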

The step after that is to generate the actual machine instructions, which might involve what are called "peep-hole" optimizations that produce optimized versions of common instruction patterns.
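As a toy illustration of the idea (the rewrite rule below is a classic one, but the pass itself is a made-up sketch), a peephole pass scans short windows of emitted instructions and rewrites recognizable patterns in place:

#include <string>
#include <vector>

// Minimal peephole pass over textual instructions: replace "mov eax, 0"
// with the shorter, equivalent "xor eax, eax".  Real backends match
// multi-instruction windows, but the effect is the same: the instruction
// sequence the earlier stages produced is rewritten and lost.
std::vector<std::string> peephole(std::vector<std::string> code) {
    for (auto &insn : code) {
        if (insn == "mov eax, 0")
            insn = "xor eax, eax";
    }
    return code;
}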

At each step you lose more and more information until, at the end, you lose so much that it becomes impossible to recover anything resembling the original code.

Byte-code, on the other hand, typically saves the interesting and transformative optimizations for the JIT phase (the just-in-time compiler), when the target machine code is produced. Byte-code contains a lot of meta-data, such as local variable types and class structure, to allow the same byte-code to be compiled for multiple target machines. None of this information is necessary in a compiled C++ program, and it is discarded in the compilation process.

There are decompilers for various target machine codes, but they often do not produce useful results (something you can modify and then recompile) because too much of the original source is lost. If you have debug information for the executable, you can do an even better job; but if you have debug information, you probably have the original source too.
