Bob Stout Refutation
Program-Transformation.Org: The Program Transformation Wiki
This page is my refutation of a Frequently Asked Question answer on decompilation. The original page is difficult to find now in its complete form, so I have archived it
here. As with the other
refutation page,
I sincerely mean no disrespect to Bob Stout, Jerry Coffin, or other commentators.
This sort of attitude to decompilation is unfortunately common.
Ultimately, the naysayers may be correct in a very broad sense, i.e. it may be that
machine code decompilers will never be easy enough to use to become mainstream.
However, they are wrong on many technical details, so I'll take this opportunity to
set the facts straight. The reader can decide whether machine code decompilation will
eventually become useful to them.
Text from the FAQ answer (the text I'm refuting) will appear in
red; my refutation will be in ordinary text.
G.3.17 decompil.txt
+++Date last modified: 05-Jul-1997
Question:
Is there any hope of a decompiler that would convert an executable program
into C/C++ code?
Answer:
Don't hold your breath. Think about it... For a decompiler to work
properly, either 1) every compiler would have to generate substantially
identical code, even with full optimization turned on,
If you are expecting the original source code back, certainly that is impossible.
Even the original source code with comments removed and generic variable and function
names is impossible. But why demand the original source code? Surely any equivalent
source code will do just as well. The more readable the source code, the better.
It is conceivable that a decompiler could produce source code that is more readable
than the original source code, if the original was poorly written in the first place.
There are indeed many ways that compilers can express the semantics
of a source program; for correct compilations, all of these will be equivalent.
A correct decompiler will produce equivalent source code. Very likely, the source code
produced from an optimised version of a program will be different from the source
code generated from an unoptimised binary file. Ironically, optimised binary code
is easier to deal with in some ways (e.g. less dead code to eliminate).
or 2) it would have to
recognize the individual output of every compiler's code generator.
You seem to be assuming that a decompiler has to "reverse" the operation of the
compiler. That is certainly one approach, but general decompilers don't work with
patterns in the compiled code. Instead, they apply a series of program transformations.
The original executable file is a program, albeit one expressed in a binary, human-unfriendly
language. Some transformations (e.g. recovering parameters) require extensive
analysis. Others, e.g. converting a goto statement into a do-while loop, are very
standard graph transformations that are routinely applied to source code.
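To make the goto-to-do-while example concrete, here is a minimal sketch (the function
and variable names are invented for illustration): the first version shows the raw
control flow as it might be recovered from a binary, the second shows the result of
the standard structuring transformation.

/* Raw control flow as it might first be recovered from the binary.
   Assumes n >= 1, as the original branch structure implies. */
int sum_raw(const int *buf, int n)
{
    int i = 0, total = 0;
L1:
    total += buf[i];
    i++;
    if (i < n)
        goto L1;
    return total;
}

/* The same code after the standard graph transformation into a do-while. */
int sum_structured(const int *buf, int n)
{
    int i = 0, total = 0;
    do {
        total += buf[i];
        i++;
    } while (i < n);
    return total;
}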
A few of these transformations will depend on patterns in the input executable;
these are "idioms": constructions whose intent is not readily apparent from the
instruction-level semantics, e.g. using subtract with carry to effect a min() or max()
function. Idioms are usually not specific to a particular compiler or optimisation
level (though some are more or less likely to appear at particular optimisation
levels).
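As a concrete illustration of the subtract-with-carry idiom (a hypothetical sketch;
the exact instruction sequence varies), here is how an unsigned min() might look when
the sub/sbb/and/add sequence is translated literally, and what the idiom rule
recovers:

/* Literal translation of a typical sub/sbb/and/add sequence (unsigned values). */
unsigned min_raw(unsigned a, unsigned b)
{
    unsigned diff = a - b;                 /* sub: borrows (sets carry) if a < b */
    unsigned mask = (a < b) ? ~0u : 0u;    /* sbb reg,reg: -1 on borrow, else 0  */
    return (diff & mask) + b;              /* and, add                           */
}

/* What the idiom is recognised as. */
unsigned min_recovered(unsigned a, unsigned b)
{
    return a < b ? a : b;
}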
If the first case were to be correct, there would be no more need for
compiler benchmarks since every one would work the same.
Agreed. So that's not the case.
For the second case
to be true would require an immensely complex program that had to change with
every new compiler release.
No, again, this is assuming that decompilers have to reverse the operation of
compilers, with each compiler and optimisation level as a separate case.
OK, so what about specific decompilers for specific compilers - say a
decompiler designed to only work on code generated by, say, BC++ 4.52? ...
Again, this is still making an invalid assumption.
OK, let's simplify further and specify that you only want to support one
specific compiler and you want to decompile to the most logical source code
without trying to interpret the optimization. What then?
You've actually touched on a valid consideration here: what does the "most
logical source code" really mean? The shortest? The most readable? The "nicest"?
Without being able to measure the quality of decompiled output, it's difficult to
measure progress. You could also look at it this way: any valid source code will
do. Just use source-to-source transformations (programmers do this all the time
when they refactor their code), add comments, and rename variables and functions,
until the result is as pretty as you need.
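As a hypothetical sketch of that clean-up process (all names invented), compare a
function as a decompiler might emit it with the same function after renaming and
commenting:

/* As a decompiler might emit it, with generic names. */
int proc_12(int *param1, int param2)
{
    int local1 = 0;
    int local2;
    for (local2 = 0; local2 < param2; local2++)
        local1 += param1[local2];
    return local1;
}

/* The same function after a human (or tool) renames and comments it. */
/* Sum the first 'count' elements of 'values'. */
int sum_array(int *values, int count)
{
    int total = 0;
    int i;
    for (i = 0; i < count; i++)
        total += values[i];
    return total;
}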
A good optimizer can
and will substantially rewrite the internals of your code, so what you get
out of your decompiler will be, not only cryptic, but in many cases, riddled
with goto statements and other no-no's of good coding practice. At this
point, you have decompiled source, but what good is it?
Again, a good point, but the details are incorrect. Certainly, optimised machine
code can be more difficult for a human to read (e.g. as a disassembly), but that
doesn't necessarily imply that by the time it is decompiled it is any harder to
read. A good decompiler will (without specifically trying to do this) "unoptimise"
the code. As a trivial example, a subexpression kept in a register will be
replaced with the original subexpression by expression propagation. The original
assignment to the register will then disappear with dead code elimination.
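A minimal sketch of that "unoptimising" effect (the names r1, x and y are invented):
the compiler kept a subexpression in a register, and propagation plus dead code
elimination recover the natural form.

/* Literal translation: the compiler kept x*y in a register (r1 here). */
int f_raw(int x, int y)
{
    int r1 = x * y;      /* register copy of the common subexpression */
    int a = r1 + 1;
    int b = r1 - 1;
    return a * b;
}

/* After expression propagation, r1 is replaced by x*y at each use, and the
   now-dead assignments disappear with dead code elimination. */
int f_clean(int x, int y)
{
    return (x * y + 1) * (x * y - 1);
}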
As for gotos, that's a structuring issue. Structuring is a largely solved problem.
You can eliminate gotos in any code, including unstructured code, at the cost of
either code duplication or adding booleans. The exact same thing applies at the
source code level. Even the worst spaghetti source code with gotos can be transformed
into something without gotos. In a few rare cases, it may be clearer to leave a
few gotos there. You can parameterise the structuring stage of a decompiler to
generate any of these three options (i.e. allow duplicate code, generate booleans,
or allow limited gotos to remain).
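Here is a hypothetical sketch of the second option, removing a goto that exits a
nested loop by introducing a boolean (all names are invented):

#include <stdbool.h>

/* Unstructured form, as it might come out of the control flow graph. */
int find_raw(int m[10][10], int key)
{
    int i, j;
    for (i = 0; i < 10; i++)
        for (j = 0; j < 10; j++)
            if (m[i][j] == key)
                goto found;
    return -1;
found:
    return i * 10 + j;
}

/* Structured form: the goto is replaced by a boolean flag. */
int find_structured(int m[10][10], int key)
{
    bool found = false;
    int result = -1;
    int i, j;
    for (i = 0; i < 10 && !found; i++)
        for (j = 0; j < 10 && !found; j++)
            if (m[i][j] == key) {
                result = i * 10 + j;
                found = true;
            }
    return result;
}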
Code quality is very important, and quality can deteriorate for a variety of reasons,
but these aren't the right reasons. It is my (as yet unproven) belief that
reasonable quality code can be generated for almost any program. You can of course
always construct a program that will defeat any particular decompiler. Can that decompiler
always be modified to accept all existing accepted programs plus this constructed one?
I suspect so, but there are presumably an infinite number of programs that could be
constructed to break decompilers, so there will never be a perfect decompiler.
Also note carefully my reference to source modules. One characteristic of C
is that it becomes largely unreadable unless broken into easily maintainable
source modules (.C files). How will the decompiler deal with that? It could
either try to decompile the whole program into some mammoth main() function,
losing all modularity, or it could try to place each called function into its
own file. The first way would generate unusable chaos and the second would
run into problems where the original source had files with multiple functions
using static data and/or one or more functions calling one or more static
functions. A decompiler could make static data and/or functions global but
only at the expense of readability (which would already be unacceptable).
You can generate one large source program with all functions in it (not just one
large main function, that's just crazy). Humans can chop the large file into pieces,
copying appropriate pieces into appropriately named header files.
Indeed, I don't see this as being impossible to automate, either. It's just a book-keeping
problem. Computers are good at that. So I don't see this as any kind of show-stopper.
Also, remember that commercial applications often code the most difficult
or time-critical functions in assembler which could prove almost impossible
to decompile into a C equivalent.
Again, this seems to be assuming that you are reversing the operation of a compiler.
Hand written machine code is almost indistinguishable from compiled C code in the
input binary, and decompiles just the same. In fact, parts of the original program
can be written in a variety of procedural languages (e.g. C, C++, Pascal, Modula, Ada)
and all these pieces can be decompiled into a suitably expressive language (e.g. C).
Decompiling declarative or functional programming code is a different matter.
Closely related to the issue of modularity is that of library code.
Consider the ubiquitous "Hello world" program. After compilation it contains
about 10 bytes of compiled source, about a dozen bytes of data, and anywhere
from 5-10K (depending on compiler, target, memory model, etc.) of start up
and library code.
Yes, this is a problem, but it is solvable. In this case, you unfortunately do have
to be compiler specific, and recognise statically compiled library functions as
binary patterns. The commercial disassembler IDA Pro does this already, showing
that it is practical to do. The dcc decompiler applied this technique also.
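A minimal sketch of what such binary pattern matching might look like (the signature
structure and wildcard scheme are invented for illustration; IDA Pro and dcc each use
their own representations):

#include <stddef.h>

typedef struct {
    const char          *name;   /* library function to report, e.g. "strlen" */
    const unsigned char *bytes;  /* expected code bytes at the entry point    */
    const unsigned char *mask;   /* 0xFF = byte must match, 0x00 = wildcard   */
    size_t               len;
} LibSignature;

/* Return the library name if the code at 'code' matches the signature. */
const char *match_signature(const unsigned char *code, size_t code_len,
                            const LibSignature *sig)
{
    size_t i;
    if (code_len < sig->len)
        return NULL;
    for (i = 0; i < sig->len; i++)
        if ((code[i] & sig->mask[i]) != (sig->bytes[i] & sig->mask[i]))
            return NULL;
    return sig->name;
}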
Again, the situation with C++ would be orders of magnitude more complex
trying to make sense of the compiled code once the O-O structures and
relationships had been compiled into oblivion. Even if you take the simple
approach and decompile C++ into C, would anyone like to try and trace through
the source to figure out a cout call which adds another 7-10K of overhead
vis-a-vis a printf() call? I sure wouldn't!!!
C++ certainly adds more library functions, and there is the issue of name mangling,
which is compiler specific. All solvable.
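For a flavour of what undoing name mangling involves, here is a deliberately tiny
sketch that handles only a small subset of one scheme (the Itanium C++ ABI used by
modern g++; the compilers of the FAQ's era, such as BC++, used different but equally
mechanical schemes):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Decode symbols of the restricted form _ZN<len>Class<len>memberE...,
   e.g. "_ZN3Foo3barEi" -> "Foo::bar". Returns a malloc'd string or NULL. */
char *demangle_subset(const char *sym)
{
    char out[256] = "";
    const char *p;
    if (strncmp(sym, "_ZN", 3) != 0)
        return NULL;
    p = sym + 3;
    while (*p >= '1' && *p <= '9') {
        char *end;
        long n = strtol(p, &end, 10);
        if (n <= 0 || strlen(end) < (size_t)n ||
            (size_t)n >= sizeof(out) - strlen(out) - 3)
            return NULL;
        if (out[0] != '\0')
            strcat(out, "::");
        strncat(out, end, (size_t)n);
        p = end + n;
    }
    return (*p == 'E' && out[0] != '\0') ? strdup(out) : NULL;
}

int main(void)
{
    char *s = demangle_subset("_ZN3Foo3barEi");
    printf("%s\n", s ? s : "(not recognised)");   /* prints Foo::bar */
    free(s);
    return 0;
}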
More problematic are templates. I believe that this can be solved with code clone
analysis (a standard source code technique), plus some high level pattern matching.
User templates could similarly be handled with clone analysis.
So what do you have? For a small program, you'd wind up trying to decipher
what is mostly library source. For a large program, you'd wind up with either
1) one humongous main(), or 2) lots of arbitrary single-function modules
from which all notions of static data and functions would have been lost
(contributing to a vast pool of global data), which would still include
decompiled source for all the library objects as well. In any scenario, is
any of this useful? Probably not.
With the recognition of static library functions (which, incidentally, is a valuable
source of type information) and the clone analysis etc., you would end up with little
or no library source. Not decompiling the library code also means not seeing
data specific to the libraries.
While we've touched on the topic of library code, here's yet another reason
that C and C++ are particularly difficult to de-compile: macros.
For instance, if I have something like:
while (EOF != (ch = getchar())) {
    if (isupper(ch))
        putchar(ch);
}
getchar, EOF, putchar and isupper are all typically macros, something like:
#define EOF -1
#define isupper(x) (__types[(unsigned char)x+1] && __UPPER)
#define getchar() (getc(stdin))
#define putchar(c) (putc((c),stdout)
#define getc(s) ((s)->__pos<(s)->__len? \
(s)->__buf[__pos++]: \
filbuf(s))
#define putc(c,s) ((s)->__pos<(s)->__len? \
(s)->__buf[__pos++]=(c): \
putbuf((s),(c)))
Yes, this is a nuisance. Note that a correct decompiler will produce correct, working source
code, so that's something. The common cases can be covered with high level pattern
matching. Decompilers already have to recognise a lot of patterns (e.g. X + 0 => X), so
this is just a few more high level patterns to add to the list. It is my belief that
a good decompiler has to have these "rules" organised in an easy to read file, so that
the list can readily be extended.
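As a hypothetical sketch of how such a rule list might be organised (the pattern
notation and the table form are invented; a real decompiler might read these from a
text file instead):

/* Each rule maps a pattern in the recovered code to a more readable form. */
typedef struct {
    const char *pattern;      /* what to look for                       */
    const char *replacement;  /* the equivalent, more readable form     */
} RewriteRule;

static const RewriteRule rules[] = {
    /* Low-level algebraic simplifications. */
    { "X + 0",           "X" },
    { "X * 1",           "X" },
    /* Higher-level library idioms, covering macros like those listed above. */
    { "getc(stdin)",     "getchar()" },
    { "putc(C, stdout)", "putchar(C)" },
    { "S->__pos < S->__len ? S->__buf[S->__pos++] : filbuf(S)", "getc(S)" },
};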
Finally, stdin and stdout are generally just items in an array of FILE
pointers something like:
FILE __iobuf[20];
FILE *stdin = __iobuf; // This part is done silently by the
FILE *stdout = __iobuf + 1; // compiler, without actual source code
FILE *stderr = __iobuf + 2;
Even if you just expand the macros and never actually compile the code at
all, you end up with something that's basically unreadable. However, this is
what actually gets fed to the compiler, so it's also the absolute best you could
ever hope for from a perfect de-compiler.
Again, only if you assume that you have to reverse the compiler. I believe that
these can be handled with a few more high level patterns. Note that these
high level patterns are not compiler specific; they should work for all
compilers, so you only have to fix it once.
C++ of course adds in-line functions and after an optimizer runs across
things, the code from the in-line function may well be mixed in with
surrounding code, making it nearly impossible to extract the function from
the code that calls it.
I believe that clone analysis handles this issue. Compilers already do something
similar when they inline the code; they have to change parameters into variables,
for example. Maintainers of legacy code regularly look for clones, and convert them
into parameterised calls. I can't believe that inlining can't be undone.
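A hypothetical sketch of that idea (all names invented): the same computation appears
inlined in two callers, clone analysis spots the duplication, and the shared body is
extracted into a parameterised call.

/* As recovered: an absolute-value computation was inlined in both callers. */
int dist_raw(int a, int b)
{
    int d = a - b;
    if (d < 0)
        d = -d;              /* inlined copy #1 */
    return d;
}

int range_raw(int lo, int hi)
{
    int d = hi - lo;
    if (d < 0)
        d = -d;              /* inlined copy #2: a clone of the code above */
    return d;
}

/* After clone analysis: the shared body becomes one parameterised function. */
int my_abs(int x)                { return x < 0 ? -x : x; }
int dist_clean(int a, int b)     { return my_abs(a - b); }
int range_clean(int lo, int hi)  { return my_abs(hi - lo); }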
Inlining is actually a help when it is applied to constructors; it helps to find
the virtual table for a function call, and hence the class that the callee
belongs to, and hence the type of the object pointer.
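A hypothetical sketch of why this helps (the struct layout and names are invented):
when the constructor is inlined into the caller, the store of the vtable pointer
appears right there, telling the decompiler the object's class.

#include <stdlib.h>

struct vtable { void (*draw)(void *self); };
extern const struct vtable Circle_vtable;   /* located in the binary's data */

/* What the decompiler might see where 'new Circle' had its constructor inlined. */
void *make_circle_raw(void)
{
    void **obj = malloc(2 * sizeof(void *));  /* object size from the allocation  */
    if (obj != NULL)
        obj[0] = (void *)&Circle_vtable;      /* the store that reveals the class */
    /* ... remaining inlined constructor code would initialise the fields ... */
    return obj;
}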
Removing calls to constructors and destructors (when added automatically by the compiler)
is another issue. I suspect it can be solved with language dependent high level patterns.
There are only a few formats in use for vtables,
which would help in preserving virtual functions, but inline functions would
be lost, so you'd typically end up with hundreds of times that code would be
directly accessing variables in other classes.
Virtual function calls are a significant problem, but they can be analysed with
techniques such as expression propagation and data flow analysis (a sketch of
such a resolution follows the list below). Assuming that clone analysis can
convert inlined calls back to standard calls, you would end up with either of
two things:
1) if there are virtual calls to the same callee, it would be recognised as a
class member function, as you would want.
2) if there are no virtual calls to that callee, you would end up with a
non-class function that takes a member of that class as its first parameter.
That would be trivial to convert to a member function (automatically or
manually).
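Here is the sketch promised above (names invented, continuing the Circle example):
the indirect call through a vtable slot, and the direct call it resolves to once the
known vtable value has been propagated.

extern void Circle_draw(void *self);   /* hypothetical recovered routine */

/* As first recovered: an indirect call through slot 0 of the vtable. */
void draw_raw(void **obj)
{
    void (**vtable)(void *) = obj[0];  /* load the vtable pointer from the object */
    vtable[0](obj);                    /* call through slot 0                     */
}

/* After propagating the known vtable and reading its slot 0 entry. */
void draw_resolved(void **obj)
{
    Circle_draw(obj);                  /* recognised as Circle::draw(this)        */
}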
Like I said, don't hold your breath. As technology improves to where
decompilers may become more feasible, optimizers and languages (C++, for
example, would be a significantly tougher language to decompile than C) also
conspire to make them less likely.
I don't believe that optimisation poses a problem for decompilers, so I certainly
don't believe that compilers will "stay a jump ahead" of decompilers for that
reason.
For years Unix applications have been distributed in shrouded source form
(machine but not human readable -- all comments and whitespace removed,
variables names all in the form OOIIOIOI, etc.), which has been a quite
adequate means of protecting the author's rights. It's very unlikely that
decompiler output would even be as readable as shrouded source.
Yes, and there is also obfuscated source code, e.g. see the annual Obfuscated C Code
Contest. Certainly, I have myself generated decompiled source code that is much easier
to read than obfuscated or shrouded code.
There is a big difference between obfuscated
code and automatically generated code. With obfuscated code, you use deliberately
crazy techniques such as compressing strings, passing negative integers as the
first parameter of main, and so on. With automatically generated code, it
(can be) just the same as real source code, just with generic names and no
comments. Perhaps in the far future, machines will be able to generate more
useful names, and add useful comments. Presumably, these names will always be
consistent, and the comments always correct, unlike human written source code.
Shrouded code seems to me to be readily converted into more readable forms, though I've
never tried. A "pretty printer" program would convert it into well-indented code.
If nothing else has been done to obfuscate the code, then pretty-printed shrouded
code would presumably look similar to automatically decompiled code.
What's so wrong with that? Sure, it's hard to read; a lot of real source code is
hard to read. You can compile it, so that opens up a whole range of applications
right there (e.g. port to other architecture, optimise, make minor changes, fix
bugs, scan more readily for vulnerability issues, etc). Plus, you have the
opportunity to make the code maintainable, by adding comments, changing names,
etc. You could not realistically do this with the binary program.
Despite it not being a magic bullet, there are plenty of applications for
machine code decompilation.
A general purpose decompiler is the Holy Grail of tyro programmers.
This is probably true; novice programmers do dream of a general purpose
machine code decompiler. But it's a dream that could well come true; I
happen to believe that it's only a matter of time and hard work.
Perhaps I could reply in
a similar vein that ill-conceived answers such as the one I have refuted above are the
result of rookie investigation of the surprisingly complex subject of decompilation.
[by Bob Stout & Jerry Coffin]
Refutation by MikeVanEmmerik.