Bob Stout On Decompilation
Program-Transformation.Org: The Program Transformation Wiki
This page was rescued from Google's cache of
http://orion.planet.de/~jan/Snippets.9707/_g0311.html .
G.3.17 decompil.txt
+++Date last modified: 05-Jul-1997
Question:
Is there any hope of a decompiler that would convert an executable program
into C/C++ code?
Answer:
Don't hold your breath. Think about it... For a decompiler to work
properly, either 1) every compiler would have to generate substantially
identical code, even with full optimization turned on, or 2) it would have to
recognize the individual output of every compiler's code generator.
If the first case were true, there would be no more need for compiler
benchmarks, since every compiler would work the same. For the second case
to be true would require an immensely complex program that had to change with
every new compiler release.
OK, so what about specific decompilers for specific compilers - say, a
decompiler designed to work only on code generated by BC++ 4.52? This gets
us right back to the optimization issue. Code written for clarity and
understandability is often inefficient. Code written for maximum performance
(speed or size) is often cryptic (at best!). Add to this the fact that all
modern compilers have a multitude of optimization switches to control which
optimization techniques to enable and which to avoid. The bottom line is
that, for a reasonably large, complex source module, you can get the compiler
to produce a number of different object modules simply by changing your
optimization switches, so your decompiler will also have to be a deoptimizer
which can automagically recognize which optimization strategies were enabled
at compile time.
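To make that concrete, here's a sketch (hand-written illustration, not the
output of any real decompiler, with invented names) of how one function
might come back differently depending on which switches were in effect:

/* What the programmer wrote: */
int sum(const int *a, int n)
{
      int i, total = 0;

      for (i = 0; i < n; i++)
            total += a[i];
      return total;
}

/* Plausible decompilation of an unoptimized build - the counter
   and the array indexing survive more or less intact: */
int func_1a(const int *a, int n)
{
      int i = 0, t = 0;

      while (i < n) {
            t += a[i];
            i++;
      }
      return t;
}

/* Plausible decompilation of an optimized build of the *same*
   function - the optimizer strength-reduced the indexing into
   pointer arithmetic, so the counter is simply gone: */
int func_1b(const int *p, int n)
{
      const int *end = p + n;
      int t = 0;

      while (p < end)
            t += *p++;
      return t;
}

A deoptimizing decompiler would have to recognize that func_1a and func_1b
came from the same source before it could hand back anything like the
original.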
OK, let's simplify further and specify that you only want to support one
specific compiler and you want to decompile to the most logical source code
without trying to interpret the optimization. What then? A good optimizer can
and will substantially rewrite the internals of your code, so what you get
out of your decompiler will be, not only cryptic, but in many cases, riddled
with goto statements and other no-no's of good coding practice. At this
point, you have decompiled source, but what good is it?
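Here's the flavor of it (again a hand-written sketch of typical
branch-target-style output, not taken from any actual tool):

#include <ctype.h>

/* What the programmer wrote: */
int count_upper(const char *s)
{
      int n = 0;

      while (*s) {
            if (isupper((unsigned char)*s))
                  n++;
            s++;
      }
      return n;
}

/* What a decompiler reconstructing from branch targets might
   emit - semantically identical, structurally opaque: */
int func_2(const char *s)
{
      int n = 0;

L1:   if (*s == 0) goto L3;
      if (!isupper((unsigned char)*s)) goto L2;
      n++;
L2:   s++;
      goto L1;
L3:   return n;
}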
Also note carefully my reference to source modules. One characteristic of C
is that it becomes largely unreadable unless broken into easily maintainable
source modules (.C files). How will the decompiler deal with that? It could
either try to decompile the whole program into some mammoth main() function,
losing all modularity, or it could try to place each called function into its
own file. The first way would generate unusable chaos and the second would
run into problems where the original source had files with multiple functions
using static data and/or one or more functions calling one or more static
functions. A decompiler could make static data and/or functions global, but
only at the expense of readability (which would already be unacceptable).
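For example (hypothetical names throughout), a module that keeps its state
private comes back with that privacy erased:

/* Original module - the counter and its helper are invisible
   outside this one .C file: */
static int count = 0;

static void bump(void)
{
      count++;
}

int next_id(void)
{
      bump();
      return count;
}

/* Decompiled version - everything lands at file scope under
   made-up names, and nothing marks data_1 or func_3 as having
   been private: */
int data_1 = 0;

void func_3(void)
{
      data_1++;
}

int func_4(void)
{
      func_3();
      return data_1;
}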
Also, remember that commercial applications often code the most difficult
or time-critical functions in assembler which could prove almost impossible
to decompile into a C equivalent.
Closely related to the issue of modularity is that of library code.
Consider the ubiquitous "Hello world" program. After compilation it contains
about 10 bytes of compiled code, about a dozen bytes of data, and anywhere
from 5-10K (depending on compiler, target, memory model, etc.) of start up
and library code. This is a great example since printf() also calls lots of
other library functions of its own! Once the decompiler has assigned names to
the dozen or so functions in its output, the fun starts when you have to
figure out which arbitrarily-named function is really printf() and which
other functions are library helper functions that it calls. The bottom line
here is that in order to do so, you'd have to know enough about writing C
libraries to be able to recognize the code for printf() when you see it.
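The raw output would look something like this sketch (the names are invented
in the style common to disassemblers; no real tool's output is being quoted):

/* Nearly everything here is start-up and stdio internals; the
   only line the programmer wrote is the call inside func_17().
   Now - which of these is printf()? */
int  func_21(const char *, ...);      /* printf()?          */
int  func_22(const char *, void *);   /* a format helper?   */
int  func_23(int, void *);            /* a buffered putc?   */
void func_24(void *);                 /* a buffer flush?    */

int func_17(void)                     /* main()?            */
{
      func_21("Hello world\n");
      return 0;
}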
Again, the situation with C++ would be orders of magnitude more complex
trying to make sense of the compiled code once the O-O structures and
relationships had been compiled into oblivion. Even if you take the simple
approach and decompile C++ into C, would anyone like to try and trace through
the source to figure out a cout call which adds another 7-10K of overhead
vis-a-vis a printf() call? I sure wouldn't!!!
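For the morbidly curious, here's a sketch of what one cout line might look
like decompiled to C (the mangled names are invented for illustration; real
name-mangling schemes differ from compiler to compiler):

/* Original C++:  cout << "Hello" << endl;  */

struct ostream;
extern struct ostream cout;
struct ostream *__ls__7ostreamPCc(struct ostream *, const char *);
struct ostream *endl__FR7ostream(struct ostream *);
struct ostream *__ls__7ostreamPF(struct ostream *,
      struct ostream *(*)(struct ostream *));

/* Hypothetical decompiled C - each << becomes a call through a
   mangled operator function, threading the stream along: */
void func_9(void)
{
      __ls__7ostreamPF(__ls__7ostreamPCc(&cout, "Hello"),
                       endl__FR7ostream);
}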
So what do you have? For a small program, you'd wind up trying to decipher
what is mostly library source. For a large program, you'd wind up with either
1) one humongous main(), or 2) lots of arbitrary single-function modules
from which all notions of static data and functions would have been lost
(contributing to a vast pool of global data), which would still include
decompiled source for all the library objects as well. In any scenario, is
any of this useful? Probably not.
While we've touched on the topic of library code, here's yet another reason
that C and C++ are particularly difficult to de-compile: macros.
For instance, if I have something like:
int ch;

while (EOF != (ch = getchar())) {
      if (isupper(ch))
            putchar(ch);
}
getchar, EOF, putchar and isupper are all typically macros, something like:
#define EOF (-1)
#define isupper(x) (__types[(unsigned char)(x)+1] & __UPPER)
#define getchar() (getc(stdin))
#define putchar(c) (putc((c),stdout))
#define getc(s) ((s)->__pos<(s)->__len? \
                 (s)->__buf[(s)->__pos++]: \
                 filbuf(s))
#define putc(c,s) ((s)->__pos<(s)->__len? \
                   (s)->__buf[(s)->__pos++]=(c): \
                   putbuf((s),(c)))
Finally, stdin and stdout are generally just pointers into an array of FILE
structures, something like:
FILE __iobuf[20];
FILE *stdin = __iobuf; // This part is done silently by the
FILE *stdout = __iobuf + 1; // compiler, without actual source code
FILE *stderr = __iobuf + 2;
Even if you just expand the macros and never actually compile the code at
all, you end up with something that's basically unreadable. However, this is
what actually gets fed to the compiler, so it's also the absolute best you
could ever hope for from a perfect de-compiler.
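For instance, expanding the loop above by hand with those definitions (a
sketch - real library headers differ in the details) gives roughly:

int ch;

while ((-1) != (ch = ((stdin)->__pos < (stdin)->__len ?
                      (stdin)->__buf[(stdin)->__pos++] :
                      filbuf(stdin)))) {
      if ((__types[(unsigned char)(ch)+1] & __UPPER))
            ((stdout)->__pos < (stdout)->__len ?
             (stdout)->__buf[(stdout)->__pos++] = (ch) :
             putbuf((stdout),(ch)));
}

/* ...and a decompiler wouldn't even see the names stdin and
   stdout - just __iobuf and __iobuf + 1. */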
C++ of course adds in-line functions and after an optimizer runs across
things, the code from the in-line function may well be mixed in with
surrounding code, making it nearly impossible to extract the function from
the code that calls it. There are only a few formats in use for vtables,
which would help in preserving virtual functions, but inline functions would
be lost, so you'd typically end up with hundreds of places where code
directly accesses variables in other classes.
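To illustrate what "preserving virtual functions" amounts to, here's a
sketch (hypothetical layout and names - real vtable formats vary by
compiler) of what a decompiler could plausibly recover:

/* An object with virtual functions compiles down to a hidden
   pointer to a table of function pointers: */
struct vtbl_0 {
      void (*func_5)(void *);     /* was it draw()? move()?   */
      int  (*func_6)(void *);     /* the names are long gone  */
};

struct obj_0 {
      const struct vtbl_0 *vptr;  /* hidden, compiler-added   */
      int data_2;                 /* member names lost too    */
};

/* A virtual call like p->draw() comes back as an indirect
   call through the table: */
void call_virtual(struct obj_0 *p)
{
      p->vptr->func_5(p);
}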
Like I said, don't hold your breath. As technology improves to the point
where decompilers may become more feasible, optimizers and languages (C++,
for example, would be a significantly tougher language to decompile than C)
also conspire to make them less likely.
For years Unix applications have been distributed in shrouded source form
(machine but not human readable -- all comments and whitespace removed,
variable names all in the form OOIIOIOI, etc.), which has been a quite
adequate means of protecting the author's rights. It's very unlikely that
decompiler output would even be as readable as shrouded source.
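For a taste of shrouding, here's an ordinary function next to a
hand-shrouded version (illustrative only - real shrouding tools differ):

/* Before: */
int clamp(int value, int lo, int hi)
{
      if (value < lo)
            return lo;
      if (value > hi)
            return hi;
      return value;
}

/* After - the same code, stripped of every human cue: */
int OOIIOIOI(int OIOIOOII,int OOIOIIOO,int OIIOOIOI){if(OIOIOOII<
OOIOIIOO)return OOIOIIOO;if(OIOIOOII>OIIOOIOI)return OIIOOIOI;
return OIOIOOII;}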
A general purpose decompiler is the Holy Grail of tyro programmers.
[by Bob Stout & Jerry Coffin]