Let's understand what a Disassembler is.
It converts the byte codes of a class file, i.e. the machine code of the JVM, back into Java source code. (Strictly speaking, a tool that recovers source code is usually called a decompiler; we keep the name Disassembler here.) The Disassembler is not a separate program. It is a few functions added to our main program, which reads the class file. So this is the right time to understand what these functions do.

NOTE: This Disassembler has its own limitations. It only displays variables and their initializations, arithmetic operations on these variables, strings, for loops, if statements, nested if and for statements, and methods with their signatures. But you can very well handle other opcodes in the functions to convert them into Java code. This is our attempt to help people understand how a Java Disassembler can be written.

At this point all the information in the class file is already stored in the structures. Now we will arrange this information to get the Java source code we want. In source.cpp, seven functions carry out the disassembling: methsign(), v(), cview(), method(), arrange(), rtype() and catstr().

Look at the last three lines of main(). There is a for statement with a call to cview() inside it. cview() is called once for each method in the class file, since meth_c holds the number of methods in the class file.

Let's look at cview(), as it is the first function called from main(). It takes one argument, a pointer to the structure "meth", which holds the information about a method, i.e. its signature, return type and so on. The source code of a method is also stored as byte codes, and c_len is the code length. It is a member of the structure cinfo, which holds the information about the code of that method. The code itself is stored in the array 'c'. We start by displaying the return type of the method, the name of the method and its arguments, if any. This is done by the function methsign().
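To make the driver loop described above concrete, here is a rough sketch of how main() might hand each method to cview(). The structure layouts, the field names other than c_len and c, the helper view_all() and the body of cview() are all assumptions made for illustration; the real definitions in source.cpp may well differ.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical layout: the real structures in source.cpp may have more fields.
struct cinfo {
    unsigned long c_len = 0;          // code length in bytes
    std::vector<unsigned char> c;     // the method's byte codes
};

struct meth {
    int name = 0;   // assumed: constant-pool index of the method name
    int sign = 0;   // assumed: constant-pool index of the method signature
    cinfo code;
};

// Stand-in for cview(): the real function displays the method's Java source;
// this one only reports what it was given.
std::string cview(const meth* m) {
    return "method name#" + std::to_string(m->name) +
           " sign#" + std::to_string(m->sign) +
           " code bytes: " + std::to_string(m->code.c_len);
}

// Mirrors the loop at the end of main(): one cview() call per method.
std::vector<std::string> view_all(const std::vector<meth>& methods) {
    std::vector<std::string> out;
    const std::size_t meth_c = methods.size();   // number of methods in the class file
    for (std::size_t i = 0; i < meth_c; ++i)
        out.push_back(cview(&methods[i]));
    return out;
}
```

The point is only the shape of the loop: meth_c iterations, each passing a pointer to one method's structure.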
methsign() takes three arguments: the name of the method, its signature and an array to receive the result. The question that arises is: when we already have the name and signature of the method in the structure 'cinfo', why can't we display them directly? The problem is that the information is in raw form, i.e. the method name and the signature are in different variables. Moreover, the signature has the form "()V", with the return type at the end, whereas we want the output to read "return type, method name, (arguments)". Hence the function methsign(), which arranges and concatenates the strings to give the desired result. The arguments mi.name and mi.sign are actually ints, and as usual we have to use get() to extract the actual strings from the constant pool. The arrangement is then done by several calls to strcpy() and strcat(), and the required output is stored in 'ss'. Here 'ss' is a reference to proto in cview(), and the next line, a printf(), prints the desired output, e.g. void abc(int x, int y).

Hey, what is the function rtype() doing in methsign()? If you look carefully at the output of bytes.cpp, you will find that the return type in the method signature (in the constant pool) is just one character, i.e. 'V' for void, 'I' for int and so on. To convert this character into its equivalent word we have the function rtype(). It is passed the character and a variable that gives the number of parameters passed to the method as arguments. It returns the actual type name, i.e. void, int, double etc.

Now we have to display the actual code. The big problem is that the code is in byte form. Each JVM instruction is identified by a one-byte opcode, so there can be at most 256 of them, each a number between 0 and 255. For example, 177 means return void, 16 means push a one-byte signed integer onto the stack, and so on. So these instructions, in the form of byte codes, have to be decoded to obtain the equivalent Java code.
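Going back to methsign() and rtype(), the rearrangement of a signature can be sketched roughly as below. This is not the code from source.cpp: the originals use strcpy()/strcat() and a result array, while this version returns std::string, and rtype() here takes only the type character. It also assumes the signature contains primitive types only. Parameter names are not stored in the signature, so only the types are shown.

```cpp
#include <cstddef>
#include <string>

// Map one signature character to its Java type name
// (simplified stand-in for rtype(); primitive types only).
std::string rtype(char c) {
    switch (c) {
        case 'V': return "void";
        case 'I': return "int";
        case 'D': return "double";
        case 'J': return "long";
        case 'F': return "float";
        case 'Z': return "boolean";
        case 'B': return "byte";
        case 'C': return "char";
        case 'S': return "short";
        default:  return "?";      // objects and arrays are not handled here
    }
}

// Turn a name and a raw signature such as "(II)V" into
// "return-type name(arg-types)" (simplified stand-in for methsign()).
std::string methsign(const std::string& name, const std::string& sign) {
    const std::size_t close = sign.find(')');
    if (sign.empty() || sign[0] != '(' ||
        close == std::string::npos || close + 1 >= sign.size())
        return "?";                               // malformed signature
    std::string out = rtype(sign[close + 1]) + " " + name + "(";
    for (std::size_t i = 1; i < close; ++i) {     // characters between '(' and ')'
        if (i > 1) out += ", ";
        out += rtype(sign[i]);
    }
    return out + ")";
}
```

With these definitions, methsign("abc", "(II)V") gives "void abc(int, int)".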
This decoding is done by the function method(). It takes five parameters: a pointer to the array that stores the code, the length of the code, a pointer to an array of the number of variables, a pointer to a pointer to char that stores the strings representing the decoded code, and the reference variable 'ii', which stores the number of lines of code. The most crucial part of method() is the switch statement, in which each instruction has its own case. For example, 3, 4, 5, 6, 7 and 8 (iconst_0 to iconst_5, which push the constants 0 to 5) belong to the same category and hence share the same program lines. NOTE: Not all 256 instructions are decoded in the program. But one can easily add cases as and when required, just by knowing what a particular instruction does. So write a Java program, get its bytes by running bytes.cpp and figure out how the code is stored in byte form.

If you go through method() carefully, you will find that within the various cases the functions catstr(), v() and arrange() are called several times. So let us look at these functions and figure out what they do.

First, the simple function v(). Instructions belonging to the same category share a case, but an instruction in that case may refer to an object variable or an integer variable. To display the appropriate name, "ii" for an integer variable and "oo" for an object, we have v(), which takes two parameters: a character, 'i' or 'o', and the number of the variable. Depending on the character, it returns a string with the appropriate name.

That's it; now for the second simple function, catstr(). We are reading each instruction (byte code) and displaying its equivalent Java code as text. So if we want to display int ii1=0; we have to take each string and display it one after another, i.e. first 'int', then 'ii1', then the operator '=' and then '0'. This is too complicated.
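Before moving on, the switch inside method() and the helper v() can be sketched as follows. This is a simplified illustration, not the function from source.cpp: it takes a std::vector instead of the five parameters described above, handles only a handful of opcodes (iconst_0 to iconst_5, bipush, istore_0 to istore_3 and return) and builds each line with ordinary std::string concatenation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// v(): "ii<n>" for an integer variable, "oo<n>" for an object variable.
std::string v(char kind, int n) {
    return std::string(kind == 'i' ? "ii" : "oo") + std::to_string(n);
}

// Simplified stand-in for method(): decode byte codes into lines of Java text.
std::vector<std::string> method(const std::vector<unsigned char>& c) {
    std::vector<std::string> lines;
    std::string pending;                       // text of the value on top of the stack
    bool declared[4] = {false, false, false, false};
    for (std::size_t i = 0; i < c.size(); ++i) {
        switch (c[i]) {
            case 3: case 4: case 5: case 6: case 7: case 8:   // iconst_0 .. iconst_5
                pending = std::to_string(c[i] - 3);
                break;
            case 16:                                          // bipush <byte>
                if (i + 1 >= c.size()) return lines;          // truncated code
                pending = std::to_string(static_cast<signed char>(c[++i]));
                break;
            case 59: case 60: case 61: case 62: {             // istore_0 .. istore_3
                const int n = c[i] - 59;
                // The first store to a variable is shown as a declaration.
                lines.push_back(std::string(declared[n] ? "" : "int ") +
                                v('i', n) + "=" + pending + ";");
                declared[n] = true;
                break;
            }
            case 177:                                         // return (void)
                lines.push_back("return;");
                break;
            default:                                          // not decoded in this sketch
                lines.push_back("// opcode " + std::to_string(c[i]));
        }
    }
    return lines;
}
```

For the bytes 3, 60, 16, 10, 61, 177 this produces the three lines int ii1=0; int ii2=10; and return;. New instructions are supported by adding further case labels, which is the extension route the text suggests.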
Instead, we can have a function which takes all these strings as parameters, concatenates them in order and stores the result as a single string. This is what catstr() does. It takes any number of arguments (strings), concatenates them, stores the result through the pointer given to it and returns the length of the string.

Now let's look at an interesting function called arrange(). Suppose we have an arithmetic statement in our Java program, say c = a + b. The JVM is a stack machine, so the compiler emits the bytes in postfix (reverse Polish) order, the same order that Dijkstra's shunting-yard algorithm produces: both operands are pushed first and the operator follows, which gives a b +. For d = a + b - c the output is a b + c - and so on. So we have to convert a b + back into a + b, and hence we have the function arrange(). It takes a pointer to the actual bytes as an argument. Each byte is examined to see whether it is a variable (number) or an operator. As soon as an operator is found, the bytes are rearranged to get the normal output a + b.

Now suppose the bytes hold the longer sequence a b + c - d * e +. First the leftmost operator, +, is found and placed between the first two variables, so the sequence becomes a+b c - d * e +. Next the operator '-' is taken; its left operand is now the whole of (a+b), so '-' is placed between (a+b) and c, and the sequence becomes a+b-c d * e +. Continuing this way gives (a + b - c) * d + e. Note the parentheses: because * binds tighter than + and -, the expression a + b - c * d + e is compiled to a different sequence, a b + c d * - e +, so operator precedence has to be taken into account when the pieces are put back together. This is how the arrange() function extracts the actual arithmetic expression.

This is what the Disassembler is all about. Though it is incomplete, one can always write more complicated Java programs, compile them, look at the actual byte codes and then add more functions to the Disassembler accordingly to make it more complete. NOTE: This was just an attempt to write a Disassembler. Anyone who knows the actual bits and bytes generated by the Java compiler can now go on to write a Java compiler.
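As a closing illustration, the ideas behind catstr() and arrange() can be sketched together. Neither function below is the one from source.cpp. This catstr() takes a std::string and an initializer list rather than a variable argument list, and this arrange() works on a space-separated string of tokens instead of raw bytes. The stack of partial expressions and the precedence rule for adding parentheses are assumptions about one reasonable way to do the rearrangement.

```cpp
#include <cstddef>
#include <initializer_list>
#include <sstream>
#include <string>
#include <vector>

// Concatenate any number of strings into dest and return the length.
std::size_t catstr(std::string& dest, std::initializer_list<std::string> parts) {
    dest.clear();
    for (const std::string& p : parts) dest += p;
    return dest.size();
}

// Convert a postfix token sequence such as "a b + c -" into infix text.
// Returns an empty string if the sequence is malformed.
std::string arrange(const std::string& postfix) {
    struct Item { std::string text; int prec; };  // prec: 3 operand, 2 for * /, 1 for + -
    std::vector<Item> stack;
    std::istringstream in(postfix);
    std::string tok;
    while (in >> tok) {
        const bool is_op = tok.size() == 1 &&
                           std::string("+-*/").find(tok[0]) != std::string::npos;
        if (!is_op) { stack.push_back({tok, 3}); continue; }
        if (stack.size() < 2) return "";          // operator without two operands
        Item rhs = stack.back(); stack.pop_back();
        Item lhs = stack.back(); stack.pop_back();
        const int prec = (tok == "*" || tok == "/") ? 2 : 1;
        // Parenthesize an operand that would otherwise bind differently.
        if (lhs.prec < prec)  lhs.text = "(" + lhs.text + ")";
        if (rhs.prec <= prec) rhs.text = "(" + rhs.text + ")";
        std::string joined;
        catstr(joined, {lhs.text, " ", tok, " ", rhs.text});
        stack.push_back({joined, prec});
    }
    return stack.size() == 1 ? stack.back().text : "";
}
```

Here arrange("a b + c - d * e +") returns "(a + b - c) * d + e", while arrange("a b + c d * - e +") returns "a + b - c * d + e". Keeping a precedence value beside each partial expression is what lets the function decide where parentheses are needed.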
Vijay Mukhi's Computer Institute
VMCI, B-13, Everest Building, Tardeo, Mumbai 400 034, India
Tel : 91-22-496 4335 /6/7/8/9
Fax : 91-22-307 28 59
e-mail : vmukhi@giasbm01.vsnl.net.in
http://www.vijaymukhi.com