The IL Disassembler
This book explains the internal
workings of a disassembler. The programs given in the book produce output
similar to that of ildasm, the disassembler written by Microsoft. The only
difference is that the source code of ildasm is not available. Our main
objective in this book is to write a series of programs that, step by step,
build up an understanding of the disassembler in its simplest form. The final
program has been tested against 5000 .Net files.
Without getting into any more
discussions, let's start with the disassembler right away. The output produced
by our program will be compared with that of ildasm at every step. This is to
verify the results and keep us on the right track.
a.cs
public class zzz
{
    public static void Main()
    {
    }
}
>ildasm /all /out:a.txt a.exe
Program a.cs is the smallest C#
program which on compiling gives the smallest .Net executable, a.exe. If you
fail to understand the above C# program or have forgotten how to compile a C#
program, we request you to stop reading this book now. This book assumes that
you know nothing about a disassembler but you must have a basic understanding
of the C# programming language.
Once the executable is created,
proceed further to write the first program in the series of the disassembler.
Program1.csc
using System;
using System.IO;
public class zzz
{
    int [] datadirectoryrva;
    int [] datadirectorysize;
    int subsystem;
    int stackreserve ;
    int stackcommit;
    int datad;
    int sectiona;
    int filea;
    int entrypoint;
    int ImageBase;
    FileStream mfilestream ;
    BinaryReader mbinaryreader ;
    long sectionoffset;
    short sections ;
    string filename;
    int [] SVirtualAddress ;
    int [] SSizeOfRawData;
    int [] SPointerToRawData ;
    public static void Main (string [] args)
    {
        try
        {
            zzz a = new zzz();
            a.abc(args);
        }
        catch ( Exception e)
        {
            Console.WriteLine(e.ToString());
        }
    }
    public void abc(string [] args)
    {
        ReadPEStructures(args);
        DisplayPEStructures();
    }
    public void ReadPEStructures(string [] args)
    {
        filename = args[0];
        mfilestream = new FileStream(filename ,FileMode.Open);
        mbinaryreader = new BinaryReader (mfilestream);
        // Byte 60 of the DOS header holds the offset of the PE header
        mfilestream.Seek(60, SeekOrigin.Begin);
        int startofpeheader = mbinaryreader.ReadInt32();
        mfilestream.Seek(startofpeheader, SeekOrigin.Begin);
        // The four signature bytes: 'P', 'E', 0, 0
        byte sig1,sig2,sig3,sig4;
        sig1 = mbinaryreader.ReadByte();
        sig2 = mbinaryreader.ReadByte();
        sig3 = mbinaryreader.ReadByte();
        sig4 = mbinaryreader.ReadByte();
        //First Structure
        short machine = mbinaryreader.ReadInt16();
        sections = mbinaryreader.ReadInt16();
        int time = mbinaryreader.ReadInt32();
        int pointer = mbinaryreader.ReadInt32();
        int symbols = mbinaryreader.ReadInt32();
        int headersize= mbinaryreader.ReadInt16();
        int characteristics = mbinaryreader.ReadInt16();
        // The section table follows the optional header
        sectionoffset = mfilestream.Position + headersize;
        //Second Structure
        int magic = mbinaryreader.ReadInt16();
        int major = mbinaryreader.ReadByte();
        int minor = mbinaryreader.ReadByte();
        int sizeofcode = mbinaryreader.ReadInt32();
        int sizeofdata = mbinaryreader.ReadInt32();
        int sizeofudata = mbinaryreader.ReadInt32();
        entrypoint = mbinaryreader.ReadInt32();
        int baseofcode = mbinaryreader.ReadInt32();
        int baseofdata = mbinaryreader.ReadInt32();
        ImageBase = mbinaryreader.ReadInt32();
        sectiona= mbinaryreader.ReadInt32();
        filea = mbinaryreader.ReadInt32();
        int majoros = mbinaryreader.ReadInt16();
        int minoros = mbinaryreader.ReadInt16();
        int majorimage = mbinaryreader.ReadInt16();
        int minorimage = mbinaryreader.ReadInt16();
        int majorsubsystem= mbinaryreader.ReadInt16();
        int minorsubsystem = mbinaryreader.ReadInt16();
        int verison = mbinaryreader.ReadInt32();
        int imagesize = mbinaryreader.ReadInt32();
        int sizeofheaders= mbinaryreader.ReadInt32();
        int checksum = mbinaryreader.ReadInt32();
        subsystem = mbinaryreader.ReadInt16();
        int dllflags = mbinaryreader.ReadInt16();
        stackreserve = mbinaryreader.ReadInt32();
        stackcommit = mbinaryreader.ReadInt32();
        int heapreserve = mbinaryreader.ReadInt32();
        int heapcommit = mbinaryreader.ReadInt32();
        int loader = mbinaryreader.ReadInt32();
        datad = mbinaryreader.ReadInt32();
        // Sixteen data directories, each an RVA followed by a size
        datadirectoryrva = new int[16];
        datadirectorysize = new int[16];
        for ( int i = 0 ; i <=15 ; i++)
        {
            datadirectoryrva[i] = mbinaryreader.ReadInt32();
            datadirectorysize[i] = mbinaryreader.ReadInt32();
        }
        // Index 14 is the CLR header; a size of zero means no .Net metadata
        if ( datadirectorysize[14] == 0)
            throw new System.Exception("Not a valid CLR file");
        mfilestream.Position = sectionoffset ;
        SVirtualAddress = new int[sections ];
        SSizeOfRawData = new int[sections ];
        SPointerToRawData = new int[sections ];
        for ( int i = 0 ; i < sections ; i++)
        {
            // Skip the 8 byte name and the 4 byte virtual size
            mbinaryreader.ReadBytes(12);
            SVirtualAddress[i] = mbinaryreader.ReadInt32();
            SSizeOfRawData[i] = mbinaryreader.ReadInt32();
            SPointerToRawData[i] = mbinaryreader.ReadInt32();
            // Skip the remaining 16 bytes of the 40 byte section header
            mbinaryreader.ReadBytes(16);
        }
    }
    public void DisplayPEStructures()
    {
        Console.WriteLine();
        Console.WriteLine("// Microsoft (R) .NET Framework IL Disassembler. Version 1.0.3328.4");
        Console.WriteLine("// Copyright (C) Microsoft Corporation 1998-2001. All rights reserved.");
        Console.WriteLine();
        Console.WriteLine("// PE Header:");
        Console.WriteLine("// Subsystem: {0}",subsystem.ToString("x8"));
        Console.WriteLine("// Native entry point address: {0}",entrypoint.ToString("x8"));
        Console.WriteLine("// Image base: {0}",ImageBase.ToString("x8"));
        Console.WriteLine("// Section alignment: {0}",sectiona.ToString("x8"));
        Console.WriteLine("// File alignment: {0}",filea.ToString("x8"));
        Console.WriteLine("// Stack reserve size: {0}",stackreserve.ToString("x8"));
        Console.WriteLine("// Stack commit size: {0}",stackcommit.ToString("x8"));
        Console.WriteLine("// Directories: {0}",datad.ToString("x8"));
        DisplayDataDirectory(datadirectoryrva[0] , datadirectorysize[0] , "Export Directory");
        DisplayDataDirectory(datadirectoryrva[1] , datadirectorysize[1] , "Import Directory");
        DisplayDataDirectory(datadirectoryrva[2] , datadirectorysize[2] , "Resource Directory");
        DisplayDataDirectory(datadirectoryrva[3] , datadirectorysize[3] , "Exception Directory");
        DisplayDataDirectory(datadirectoryrva[4] , datadirectorysize[4] , "Security Directory");
        DisplayDataDirectory(datadirectoryrva[5] , datadirectorysize[5] , "Base Relocation Table");
        DisplayDataDirectory(datadirectoryrva[6] , datadirectorysize[6] , "Debug Directory");
        DisplayDataDirectory(datadirectoryrva[7] , datadirectorysize[7] , "Architecture Specific");
        DisplayDataDirectory(datadirectoryrva[8] , datadirectorysize[8] , "Global Pointer");
        DisplayDataDirectory(datadirectoryrva[9] , datadirectorysize[9] , "TLS Directory");
        DisplayDataDirectory(datadirectoryrva[10] , datadirectorysize[10] , "Load Config Directory");
        DisplayDataDirectory(datadirectoryrva[11] , datadirectorysize[11] , "Bound Import Directory");
        DisplayDataDirectory(datadirectoryrva[12] , datadirectorysize[12] , "Import Address Table");
        DisplayDataDirectory(datadirectoryrva[13] , datadirectorysize[13] , "Delay Load IAT");
        DisplayDataDirectory(datadirectoryrva[14] , datadirectorysize[14] , "CLR Header");
        Console.WriteLine();
    }
    public void DisplayDataDirectory(int rva, int size , string ss)
    {
        string sfinal = "";
        sfinal = String.Format("// {0:x}" , rva);
        sfinal = sfinal.PadRight(12);
        sfinal = sfinal + String.Format("[{0:x}" , size);
        sfinal = sfinal.PadRight(21);
        sfinal = sfinal + String.Format("] address [size] of {0}:" , ss);
        if (ss == "CLR Header")
            sfinal = sfinal.PadRight(67);
        else
            sfinal = sfinal.PadRight(68);
        Console.WriteLine(sfinal);
    }
}
On compiling the above
program, program1.exe is generated. Now run the executable as
>Program1 a.exe
This command gives the
following output.
Output
// Microsoft (R) .NET Framework IL Disassembler. Version 1.0.3328.4
// Copyright (C) Microsoft Corporation 1998-2001. All rights reserved.
// PE Header:
// Subsystem: 00000003
// Native entry point address: 0000227e
// Image base: 00400000
// Section alignment: 00002000
// File alignment: 00000200
// Stack reserve size: 00100000
// Stack commit size: 00001000
// Directories: 00000010
// 0 [0 ] address [size] of Export Directory:
// 2228 [53 ] address [size] of Import Directory:
// 4000 [318 ] address [size] of Resource Directory:
// 0 [0 ] address [size] of Exception Directory:
// 0 [0 ] address [size] of Security Directory:
// 6000 [c ] address [size] of Base Relocation Table:
// 0 [0 ] address [size] of Debug Directory:
// 0 [0 ] address [size] of Architecture Specific:
// 0 [0 ] address [size] of Global Pointer:
// 0 [0 ] address [size] of TLS Directory:
// 0 [0 ] address [size] of Load Config Directory:
// 0 [0 ] address [size] of Bound Import Directory:
// 2000 [8 ] address [size] of Import Address Table:
// 0 [0 ] address [size] of Delay Load IAT:
// 2008 [48 ] address [size] of CLR Header:
Since time immemorial, the
first function to be called is Main. In this function, to begin with, an
instance of class zzz is created and then a non-static function abc is called
on it. The only reason for placing the bulk of our code in the abc function
is that the Main function is static. It cannot access instance variables until
an instance of its class has been created.
We promise that it is for
the first and the last time in this book that we will use names like zzz and a.
Henceforth we will abide by big meaningful names for variables/objects. Another
simple rule that we have adhered to is that if a variable is to be used by
another function, then it is made a global or an instance variable. Global in
the C# world is a no-no but in the C++ world is allowed. Therefore at times,
the names may sound legally wrong but they are morally right.
The abc function is given
an array of strings that holds the arguments passed to the program. In our
case, it is the name of the .Net executable that is to be disassembled. While
writing code, there are possibilities of making errors. A dialog box pops up
each time an unhandled exception is encountered, which at times gets extremely
irritating. For this reason, the code in Main is enclosed within a try catch
that simply displays the exception.
Now to understand the functioning
of abc.
The array variable args[0]
contains the name of the file to be disassembled which is saved in an instance
variable, filename.
The .Net world has a
million classes to handle files, of which we have presently used only two. The
first one is the FileStream class. The constructor of this class simply takes
two parameters, the filename and an enum FileMode. The enum specifies how the
file should be opened. This enum has six values (CreateNew, Create, Open,
OpenOrCreate, Truncate and Append) which decide whether the file is to be
opened, created, overwritten or appended to. In the good old days of C,
numbers or strings were used for discrete values, whereas the modern world
prefers enums instead. If you honestly ask us, we would prefer the
old days anytime, but we all have to move ahead with time, embrace the new and
forget the old ways.
Since the file is to be
opened, the value of Open in the enum is used. An exception is thrown if the
file does not exist. The handle to the file is stored in an instance variable
suitably named mfilestream. The only problem with the FileStream class is that
other than opening a file, it does nothing. It has a few rudimentary functions
that enable reading a byte from a file. However they are of no use to us since
our interest lies in reading a short or an int or a string from the file.
Therefore, another class BinaryReader, which permits reading primitive objects
like shorts, ints and longs from the file is used. The constructor of this
class requires the mfilestream handle. It is the BinaryReader class that will
be used and not the FileStream class in order to access the file.
The file format used by any
Windows application is called the PE or Portable Executable file format. Before
Windows evolved to become the big daddy of operating systems, the earlier king
of the hill was DOS. Each and every DOS executable file started with the two
bytes M and Z. This is how the DOS operating system would recognize an
executable file. The advent of Windows did not change the mindset of people,
who did not acknowledge the difference between the two operating systems.
Very often a Windows program was executed in the DOS environment.
DOS being a primitive
operating system normally checks the first two bytes and on not seeing the
magic numbers M and Z, it displays a confusing message ‘Bad Command or File
Name’. This led to some confusion, thus as a conscious decision, the makers of
the PE file format mandated that every PE file would start with a valid DOS
header. This header was then followed up with a program that printed a valid
error message if the program was to be executed in the DOS environment. The DOS
box of windows is a simulation of the original DOS.
The offset of the actual PE
header is stored at byte 60 (0x3c) of the file. This location holds an int, so
the four bytes starting there are read together and give the start of the PE
header. This offset is not a fixed value, as different compilers decide on the
error messages for the DOS program and thus change the length of the message.
Using the Seek method of the FileStream class, the file pointer is positioned
at byte 60 in the file. The second parameter of the Seek function is an enum
that takes three values, Begin, Current and End. These values decide whether
the number specified in the first parameter is an offset from the beginning of
the file, from the current file pointer or from the end of the file.
The file pointer is an
imaginary construct that points to the next byte to be read. The offset is
stored in a variable startofpeheader and its value normally is 128.
As mentioned earlier, this value can vary depending upon the compiler used. The
Seek method is used again to jump to the start of the PE header. The ReadByte
method of the BinaryReader class is then called four times to read the
signature one byte at a time. The magic number for a PE header is P and E
followed by two zero bytes, i.e. 'PE\0\0'.
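The two seeks described above can be tried without a real executable. The following is a minimal sketch, not part of the disassembler itself: the class name PeSignatureSketch, its method names and the 132 byte in-memory image are all made up for illustration. It reads the int at byte 60, jumps to that offset and checks for the bytes P, E, 0, 0.

```csharp
using System;
using System.IO;

public static class PeSignatureSketch
{
    // Returns the int stored at byte 60 (0x3c) of the DOS header.
    public static int ReadPeHeaderOffset(BinaryReader reader)
    {
        reader.BaseStream.Seek(60, SeekOrigin.Begin);
        return reader.ReadInt32();
    }

    // Returns true if the four bytes at the given offset are 'P', 'E', 0, 0.
    public static bool HasPeSignature(BinaryReader reader, int peHeaderOffset)
    {
        reader.BaseStream.Seek(peHeaderOffset, SeekOrigin.Begin);
        byte[] sig = reader.ReadBytes(4);
        return sig.Length == 4 && sig[0] == (byte)'P' && sig[1] == (byte)'E'
            && sig[2] == 0 && sig[3] == 0;
    }

    // Hypothetical test data: "MZ" at byte 0, the offset 128 at byte 60
    // and the signature at byte 128. Everything else is left as zero.
    public static byte[] BuildFakeImage()
    {
        byte[] image = new byte[132];
        image[0] = (byte)'M';
        image[1] = (byte)'Z';
        image[60] = 128;            // little-endian int 128
        image[128] = (byte)'P';
        image[129] = (byte)'E';
        return image;
    }

    public static void Main()
    {
        using (BinaryReader reader = new BinaryReader(new MemoryStream(BuildFakeImage())))
        {
            int offset = ReadPeHeaderOffset(reader);
            Console.WriteLine("PE header offset: {0}", offset);
            Console.WriteLine("Valid signature: {0}", HasPeSignature(reader, offset));
        }
    }
}
```

Program1 reads the four signature bytes but never compares them; a check along these lines would reject a file that is not a PE file at all.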
This magic number is
followed by a structure called the standard COFF header. COFF is the Common
Object File Format. The first two bytes or short is the machine or better still
the CPU type that this executable or image file can run on. An executable can
either run on the specified machine or a system that emulates it. The PE
specifications are available on the Microsoft site which specifies all possible
values that the various structures can have, hence we will not irk you with
these details.
In our case, the hex value
read is 0x14c, which stands for an Intel 32 bit machine. This value has not
been displayed in the output for the simple reason that ildasm does not display
it and we have decided to follow the ildasm program to a T. This value
is stored in a local variable called machine, not an instance variable.
The ReadInt16 method of the BinaryReader class is used to read a short, or two
bytes, from the file. Thus using the BinaryReader class saves us the hassle of
reading individual bytes and then multiplying and adding them to build the
number.
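To see what the BinaryReader saves us from, here is a small sketch that builds the same number both ways. The class name LittleEndianSketch and the helper CombineBytes are hypothetical; the only fact assumed is that the PE file stores multi-byte numbers with the low byte first.

```csharp
using System;
using System.IO;

public static class LittleEndianSketch
{
    // Combines two bytes by hand: the low byte comes first on disk,
    // the high byte is worth 256 times as much.
    public static int CombineBytes(byte low, byte high)
    {
        return low + high * 256;
    }

    public static void Main()
    {
        // The machine value 0x14c is stored as the bytes 0x4c, 0x01.
        byte[] data = { 0x4c, 0x01 };
        using (BinaryReader reader = new BinaryReader(new MemoryStream(data)))
        {
            short viaReader = reader.ReadInt16();
            int byHand = CombineBytes(data[0], data[1]);
            Console.WriteLine("{0:x} {1:x}", viaReader, byHand);   // prints "14c 14c"
        }
    }
}
```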
The second field is the
number of sections in the PE file. A PE file contains different types of
entities like code, data, resources etc. Each entity or section needs to be
stored in a different part of the PE file, therefore structures are used to
keep track of all of them. This short gives the number of sections and the
value received for our file is three. Some time later, the sections will have
to be saved in structures and hence the variable sections is an instance
variable. This is followed by the date time stamp, which records when this file
was created. The method ReadInt32 is used to extract this 4 byte value.
This is followed by a 4
byte entity that is a pointer or offset to the symbol table. The next int is
the number of symbols available. The value of the pointer to the symbol table
is zero, which means an absence of the symbol table. Symbol tables are present
only in obj or object files. In the good old days the compilers created an obj
file and linkers created an exe file from obj files. In the .Net world obj
files are obsolete and hence these two ints are always zero.
The next short is the size of
the header that follows this first header, called the image optional header.
The optional header is never seen in obj files and its size can vary, but so
far it has been a constant 224 bytes.
Then comes a field called
characteristics, which specifies the attributes of the file. The value received
is 0x10e.
Bit diagram
Individual bits in a byte
carry different pieces of information. The value 0xe or 14 has a bit pattern
wherein the 2nd, 3rd and 4th bits, counting from the right starting at 1, are
on.
Bit Diagram
This signifies that the
file is a valid executable (bit 2), that COFF line numbers are absent from the
file or have been stripped off (bit 3) and that the symbol table entries are
also absent (bit 4). A value of 0x100 signifies that the machine running the
executable is based on a 32 bit architecture. This value, which is the last
member of the structure, is not displayed by the ildasm utility.
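The bit tests described above can be sketched as follows. The class name CharacteristicsSketch and the descriptive strings are our own invention; the masks 0x2, 0x4, 0x8, 0x100 and 0x2000 are the ones this chapter mentions for the characteristics field.

```csharp
using System;
using System.Collections.Generic;

public static class CharacteristicsSketch
{
    // Returns a description for every bit that is set in the value.
    public static List<string> Decode(int characteristics)
    {
        List<string> names = new List<string>();
        if ((characteristics & 0x0002) != 0) names.Add("executable image");
        if ((characteristics & 0x0004) != 0) names.Add("line numbers stripped");
        if ((characteristics & 0x0008) != 0) names.Add("symbol table entries stripped");
        if ((characteristics & 0x0100) != 0) names.Add("32 bit machine");
        if ((characteristics & 0x2000) != 0) names.Add("dll");
        return names;
    }

    public static void Main()
    {
        // 0x10e is the value read from a.exe
        foreach (string name in Decode(0x10e))
            Console.WriteLine(name);
    }
}
```

For 0x10e the first four descriptions are printed and the dll line is not, since a.exe is an exe file.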
The section table begins
immediately after the image optional header, i.e. at the start of
the optional header plus the size of the optional header. The variable
sectionoffset has been used to store this value, so that it can be used to jump
to the section table as and when required.
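As a worked example of this arithmetic, the sketch below computes the section table offset from the two numbers seen so far. The class and method names are hypothetical. The 4 bytes of signature and the 20 bytes of the first structure are simply the sum of the reads made by Program1 before the optional header.

```csharp
using System;

public static class SectionTableSketch
{
    // signature (4) + first structure (2+2+4+4+4+2+2 = 20) + optional header
    public static long SectionTableOffset(int peHeaderOffset, int optionalHeaderSize)
    {
        const int signatureSize = 4;
        const int firstStructureSize = 20;
        return peHeaderOffset + signatureSize + firstStructureSize + optionalHeaderSize;
    }

    public static void Main()
    {
        // With the usual values of 128 and 224 the section table starts at byte 376.
        Console.WriteLine(SectionTableOffset(128, 224));   // prints 376
    }
}
```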
The first field of the
optional header is a short, which represents the magic number. This can take
one of two values: 0x10b if it follows the PE32 format, which presently is the
case, or 0x20b when the header is of the PE32+ format. The latter value is
generally seen when files use 64 bit addresses.
In the optional header, the
information is divided into three distinct parts. The first 28 bytes are part
of the standard PE header, the next 68 bytes apply to the Windows operating
system only and the final 128 bytes are for the data directories. The second
and third fields of the standard header are the major and minor linker version
numbers, which presently have a value of 6 and 0. This is followed by the size
of the code block in the exe file. The size of initialized and uninitialized
data follows next.
The displayed value of
0x227e is for the next field called entrypoint. This value is relative to where
the program is loaded in memory or image base. In our case, since the file is
an exe file, the instruction at this value becomes the first memory location
that gets executed by the Operating System. In case of a device driver, there
is no such specific function to be called, and hence it is the address of the
initialization function. A DLL does not have to have an entry point and thus
may have a value of 0.
The base of code and base
of data are similar to the entrypoint field, which reveal the code or data area
when loaded in memory, all relative to the image base. The ImageBase field is a
logical address that points to the area where the Operating System loads the
exe file.
Similar to our likes and
dislikes, the OS prefers a value of 0x00400000 as an address for executables;
for a DLL it is 0x10000000 and for Windows CE it is 0x00010000. These starting
addresses can be changed by supplying an option to the linker, and the value
must be a multiple of 64 K. Nevertheless, it is not advisable to experiment
with different values.
The next value of 0x2000, or 8192 bytes, is the section alignment. This value
signifies that even when a section has a size of 100 bytes, the OS will
allocate a minimum of 8192 bytes for it in memory. The rest of the bytes in
the memory area allocated remain unused. This section alignment is normally the
page size of the machine and is used for purposes of efficiency. Similar to the
section alignment is the file alignment field, which applies to the file stored
on disk. The file alignment is displayed as 0x200 or 512 bytes, which implies
that each section when stored on disk takes up at least 512 bytes. On disk, 512
bytes make up one sector.
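The effect of the two alignment values is a rounding up to the next multiple. A minimal sketch of that rounding, with a hypothetical class AlignmentSketch and helper AlignUp, applied to the 100 byte section of the example:

```csharp
using System;

public static class AlignmentSketch
{
    // Rounds size up to the next multiple of alignment.
    public static int AlignUp(int size, int alignment)
    {
        return (size + alignment - 1) / alignment * alignment;
    }

    public static void Main()
    {
        // A 100 byte section with the alignments read from a.exe
        Console.WriteLine(AlignUp(100, 0x2000));   // prints 8192, the space taken in memory
        Console.WriteLine(AlignUp(100, 0x200));    // prints 512, the space taken on disk
        Console.WriteLine(AlignUp(513, 0x200));    // prints 1024, one byte over needs a second sector
    }
}
```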
The next fields are the
major and minor version numbers of the Operating System, the image and the
subsystem. The next field called verison is reserved. The following field is
the size of the whole image, code and data plus headers, followed by a field
that only stores the size of all the headers including section headers. The
next field called checksum helps the Operating System detect whether the file
has been damaged or tampered with before it is loaded into memory.
The next field of subsystem
displayed by ildasm informs the Operating System of the minimum subsystem
required of it by the exe file. A value of 3 in our case means a console
subsystem, therefore no Graphical User Interface please; whereas a value of 2
would mean a Graphical User Interface system. The field dllflags applies to
DLL’s as the name signifies.
Following the field
dllflags are two fields that deal with the stack. The stack is an area in
memory which is used to pass parameters to functions and create local
variables. The stack memory is reused at the end of a function call and hence
it is short-term memory, whereas the heap area is for long duration. The first
field, stackreserve, is the total amount of memory reserved for the stack of
the application; the value seen is 0x100000 bytes or 1 MB. The second field,
stackcommit, is the amount of memory that is actually allocated to the stack at
the start; the value seen is 0x1000 bytes or 4 KB. Thus initially the stack
commit is allocated and once this gets used, one page at a time is allocated
dynamically, till the stack reserve is used up. The two fields after the stack
fields are not displayed as they deal with the heap area in memory. The
documentation is pretty candid that the loader field is obsolete.
The last field of the
optional header gives the number of data directories that follow. So far only a
value of 16 is seen. Let's now understand the concept of a data directory.
A data directory is nothing
but two fields, the first field is a location or what is technically called an
RVA (Relative Virtual Address) that gives information as to where some data
starts in memory. The second field is
size in bytes of the entity. These are stored back to back.
Two arrays of size 16 and
data type int are created to store the RVAs and sizes of each data directory
entry. If the data directory entry at index 14, i.e. the fifteenth entry, has a
size of zero, then it is conclusive that the executable file was not created by
a .Net compiler. In such a case, there is no reason to continue further, so the
program is made to throw an exception and quit gracefully. The
reasoning will be explained a couple of paragraphs down the road.
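The reading of the sixteen pairs and the check on index 14 can be isolated in a short sketch. The class DataDirectorySketch and its methods are hypothetical, and the 128 bytes of sample data are fabricated in memory so that only the CLR header entry is non-zero; the values 0x2008 and 0x48 are borrowed from the output of a.exe shown earlier.

```csharp
using System;
using System.IO;

public static class DataDirectorySketch
{
    public const int ClrHeaderIndex = 14;

    // Reads the RVA and size pairs that are stored back to back.
    public static void ReadDirectories(BinaryReader reader, int[] rva, int[] size)
    {
        for (int i = 0; i < rva.Length; i++)
        {
            rva[i] = reader.ReadInt32();
            size[i] = reader.ReadInt32();
        }
    }

    // A file without a CLR header entry was not produced by a .Net compiler.
    public static bool LooksLikeClrFile(int[] size)
    {
        return size[ClrHeaderIndex] != 0;
    }

    // Fabricated sample: sixteen entries, all zero except the CLR header.
    public static byte[] BuildSampleDirectories()
    {
        MemoryStream stream = new MemoryStream();
        BinaryWriter writer = new BinaryWriter(stream);
        for (int i = 0; i < 16; i++)
        {
            bool isClr = (i == ClrHeaderIndex);
            writer.Write(isClr ? 0x2008 : 0);   // RVA
            writer.Write(isClr ? 0x48 : 0);     // size
        }
        writer.Flush();
        return stream.ToArray();
    }

    public static void Main()
    {
        int[] rva = new int[16];
        int[] size = new int[16];
        using (BinaryReader reader = new BinaryReader(new MemoryStream(BuildSampleDirectories())))
        {
            ReadDirectories(reader, rva, size);
        }
        Console.WriteLine("CLR header at {0:x}, size {1:x}", rva[ClrHeaderIndex], size[ClrHeaderIndex]);
        Console.WriteLine("Looks like a CLR file: {0}", LooksLikeClrFile(size));
    }
}
```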
The section headers start
immediately after the data directories. However, we take no chances and use the
Position property of the FileStream class, to give the current position of the
imaginary file pointer. The Position property is read/write thus it not only
gives the details about the imaginary file pointer but also sets it to a new
position if need be.
The Seek method can be used
again, like before to jump to a part of the file, but as variety is the spice
of life, we set the Position property instead. The world of computer
programming lets us skin a cat multiple ways.
Not all the fields of the section headers are important to us; only three of
them are, so we create three arrays of ints to store these three fields.
The first field is the
virtual address or RVA of the section in memory (we remember our promise to
explain it), this is followed by the size of the section on disk and finally
the location on disk where the section is located. The size of a section header
is 40 bytes. The three fields of our interest start 12 bytes from the start of
the header, so using the ReadBytes function, the first 12 bytes are skipped.
Then, the next three fields are read into the array variables. Since the
remaining 16 bytes too have no significance, they are skipped as well. We could
have used the Seek function to jump over the 28 bytes that we are not
interested in. Then again, we decided to use a method that is easiest to
explain to you. The data directory and the section headers are now saved in
arrays.
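For readers who want to see the whole 40 byte layout rather than only the three fields kept by Program1, here is a sketch that reads one section header from fabricated data. The class SectionHeaderSketch is hypothetical and the sample values (.text, 0x2000, 0x400, 0x200) are made up for illustration, not taken from a.exe.

```csharp
using System;
using System.IO;
using System.Text;

public class SectionHeaderSketch
{
    public string Name;
    public int VirtualSize;
    public int VirtualAddress;
    public int SizeOfRawData;
    public int PointerToRawData;

    // Reads one 40 byte section header from the current position.
    public static SectionHeaderSketch Read(BinaryReader reader)
    {
        SectionHeaderSketch header = new SectionHeaderSketch();
        // 8 byte name, padded with zero bytes when shorter than 8 characters
        header.Name = Encoding.ASCII.GetString(reader.ReadBytes(8)).TrimEnd('\0');
        header.VirtualSize = reader.ReadInt32();
        header.VirtualAddress = reader.ReadInt32();
        header.SizeOfRawData = reader.ReadInt32();
        header.PointerToRawData = reader.ReadInt32();
        reader.ReadBytes(16);   // relocations, line numbers and characteristics
        return header;
    }

    // Fabricated header with made-up values.
    public static byte[] BuildSample()
    {
        MemoryStream stream = new MemoryStream();
        BinaryWriter writer = new BinaryWriter(stream);
        writer.Write(Encoding.ASCII.GetBytes(".text\0\0\0"));
        writer.Write(0x284);    // virtual size
        writer.Write(0x2000);   // virtual address
        writer.Write(0x400);    // size of raw data
        writer.Write(0x200);    // pointer to raw data
        writer.Write(new byte[16]);
        writer.Flush();
        return stream.ToArray();
    }

    public static void Main()
    {
        using (BinaryReader reader = new BinaryReader(new MemoryStream(BuildSample())))
        {
            SectionHeaderSketch h = Read(reader);
            Console.WriteLine("{0} rva={1:x} rawsize={2:x} rawpointer={3:x}",
                h.Name, h.VirtualAddress, h.SizeOfRawData, h.PointerToRawData);
        }
    }
}
```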
The next function
DisplayPEStructures finally displays these values on the console. The only
stumbling block here is that the output should match that of ildasm and just to
remind you ildasm displays its output in a formatted manner. What we have is
the Shared Source code, which comes with the source code of a disassembler and
not the actual code of ildasm. The code when executed in no sense displays the
output similar to that of ildasm. Thus we had no choice but to spend a lot of
time figuring out how many spaces need to be placed at different points in the
line.
A byte by byte comparison
with the output generated by the original ildasm program can surely indicate our
follies. Thus we decided to take this approach as otherwise there is no other
way of knowing whether the code we have written works or not. To pursue it further, we wrote our own file
compare program to check whether the output generated by our disassembler and
that of ildasm is the same, however you have an option of choosing any file
compare program to suit your needs.
After displaying a new
line, the version number of the disassembler is displayed. In our case the
version is 1.0.3328.4, however yours could be larger or smaller, so please make
the appropriate changes. Then the values of 8 variables, viz. subsystem,
entrypoint, ImageBase, sectiona, filea, the two stack variables and the number
of data directories, are displayed.
Initially, we have entered
the spaces manually for alignment purposes. Numeric variables are displayed in
decimal by default, so the ToString function is called with a format string
instead. There are a myriad of formats that can be put to use. The small x is
used for the hexadecimal numbering system with the alpha characters displayed
in small and not caps. The number 8 sets a minimum of eight digits, filling up
the left with zeroes.
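A few calls make the difference between the formats visible. The class name FormatSketch and the helper Hex8 are hypothetical; the ToString formats themselves are the standard numeric ones.

```csharp
using System;

public static class FormatSketch
{
    // Eight hexadecimal digits, lower case, padded with zeroes on the left.
    public static string Hex8(int value)
    {
        return value.ToString("x8");
    }

    public static void Main()
    {
        int entrypoint = 0x227e;
        Console.WriteLine(Hex8(3));                    // prints 00000003
        Console.WriteLine(Hex8(entrypoint));           // prints 0000227e
        Console.WriteLine(entrypoint.ToString("X8"));  // prints 0000227E
        Console.WriteLine(entrypoint.ToString());      // prints 8830
    }
}
```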
The first fifteen of the sixteen data
directories are displayed using a function DisplayDataDirectory. This function
takes the rva and size of the element in the array along with a string to denote
the name of the data directory. The prime objective of this function is to
format the output and display it in a certain manner.
The string sfinal does not
have to be initialized to an empty string, since it is assigned on the very
next line. However, we do so out of habit, since C# does not permit using an
uninitialized variable on the right hand side of the equal to sign or as a
return value.
Thereafter, using the
static Format function from the String class, the rva of the data directory is
formatted. The curly braces are the same format option used by the WriteLine
function and the 0 is the placeholder for the first parameter. The colon that
follows is used to specify the formatting. The small x is for hexadecimal
output.
The open square bracket [
must be placed at column 12, and hence the PadRight function is used to pad
the string with spaces to a total width of 12 characters. The entire line to be
displayed is built up in the string sfinal and then given to the WriteLine
function to display it in one go. Then using the Format function the size of
the data directory is appended, and the string is padded to a width of 21
characters to synchronize with the ildasm output. Thereafter, the name of the
data directory is appended. Now for some quirks. For some reason the last data
directory is not displayed; the second last is the CLR header.
For this data directory,
ildasm pads the line with spaces to a total of 67 characters, whereas for the
others the line is padded to 68 characters. For this purpose, an if statement
that checks the name of the data directory is introduced, which decides on the
width that the string is padded to before writing it out. To verify that every
byte displayed matches the output displayed by ildasm, we had to cater to
every space as well. Thus we had no choice but to spend lots of time getting
the spaces right. Now that the first program is over, its output can be
compared with that of the disassembler to check that it matches to a T.
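The padding logic can be checked in isolation by returning the line instead of printing it. This sketch repeats the steps of DisplayDataDirectory under the hypothetical names DirectoryLineSketch and FormatLine, and prints a bar after each line so that the trailing spaces become visible.

```csharp
using System;

public static class DirectoryLineSketch
{
    // Same steps as DisplayDataDirectory, but the line is returned.
    public static string FormatLine(int rva, int size, string name)
    {
        string line = String.Format("// {0:x}", rva).PadRight(12);
        line = (line + String.Format("[{0:x}", size)).PadRight(21);
        line = line + String.Format("] address [size] of {0}:", name);
        return line.PadRight(name == "CLR Header" ? 67 : 68);
    }

    public static void Main()
    {
        string import = FormatLine(0x2228, 0x53, "Import Directory");
        string clr = FormatLine(0x2008, 0x48, "CLR Header");
        Console.WriteLine("{0}| length {1}", import, import.Length);
        Console.WriteLine("{0}| length {1}", clr, clr.Length);
    }
}
```

The two lengths reported are 68 and 67, with the open bracket at index 12 and the close bracket at index 21 in both lines.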
Even though the .Net
documentation very clearly specifies that the MS-DOS stub should be exactly 128
bytes large, not all .Net compilers follow the documentation. This
documentation also specifies the values that most fields must have.
In the standard PE header
the Machine field must always be 0x14c. The Date Time field is the number of
seconds since 00:00:00 on 1st Jan 1970, and the Pointer to Symbol
table and number of symbols must always be 0. The final field Characteristics
has the following bits 0x2, 0x4, 0x8, 0x100 set and the rest 0. The bit 0x2000
is set for a dll and cleared for an exe file.
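Program1 reads the Date Time field into the variable time and then ignores it. A minimal sketch of turning such a value into a readable date follows. The class TimeStampSketch is hypothetical, the sample value of one billion seconds is arbitrary, and treating the stamp as UTC and as unsigned are assumptions of this sketch.

```csharp
using System;

public static class TimeStampSketch
{
    // Converts seconds since 00:00:00 on 1st Jan 1970 into a DateTime.
    // The int is treated as unsigned so that large stamps do not go negative.
    public static DateTime FromTimeStamp(int seconds)
    {
        DateTime epoch = new DateTime(1970, 1, 1, 0, 0, 0, DateTimeKind.Utc);
        return epoch.AddSeconds((uint)seconds);
    }

    public static void Main()
    {
        // An arbitrary sample stamp of one billion seconds
        DateTime stamp = FromTimeStamp(1000000000);
        Console.WriteLine(stamp.ToString("yyyy-MM-dd HH:mm:ss"));   // prints 2001-09-09 01:46:40
    }
}
```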
The PE standard header
fields are now set as follows. The Magic number is 0x10b. The Major and Minor
version numbers are 6 and 0. The Code and Data sizes have the same meanings as
explained earlier. The entry point RVA must point to bytes 0xff 0x25 followed
by an RVA of 0x4000000, or 0 for a DLL. The section that it falls in must have
the attributes execute and read. The Base of Code is 0x00400000 and 0 for a DLL
and the Base of Data is the data section.
Every exe file has a starting memory location that contains the first
executable instruction, which is called the entry point. Windows 98 for example
does not understand a native .Net executable and hence it is called a non-CLI
platform. The term CLI will be repeated a trillion times and its full form is
Common Language Infrastructure.
For an exe file, the first function to be called is _CorExeMain and for a dll
it is _CorDllMain, the code of which resides in the library mscoree.dll. It is
this function that understands a .Net executable, thus we believe that in
future this function will reside in the operating system. It is this function
that understands concepts like IL and metadata, which we will explain in due
course.
The Windows-specific fields have the following values. The image base as
mentioned earlier is 0x400000, the section and file alignment are 0x2000 and
0x200 respectively. The OS Major version is 4 and Minor version is 0. The User
Major and Minor versions are 0. The Sub-System Major version is 4 and Minor
version 0. The Reserved field is always 0. The Image Size is a size in bytes of
all headers plus padding and it has to be a multiple of Section Alignment. The
Header Size is the size of three headers, DOS, PE header and optional PE
header. This also includes padding and must be a multiple of the File Alignment
value. The Checksum and DLL flags must be zero and the Subsystem can take a
value of 2 or 3 only. The Stack reserve has a value of 1 MB and stack commit is
4 K. The heap Reserve and Commit have the same values also. The Loader flags
are 0 and the Number of Data Directories is 16.
Most of the data directories have an RVA value but with a size of 0. These are
the Export, Resource, Exception, Certificate, Debug, Copyright, Global Ptr, TLS
Table, Load Config, Bound Import, Delay Import table and the last that is
reserved. The four directories that may have some size are the Import, Base
Relocation, IAT and finally the CLI Header.
The section headers immediately follow the optional header, since there is no
entry in the PE headers that points to the section headers. Each section header
starts with the name of the section, which is 8 bytes large. Therefore there is
no terminating null when the length of the section name is 8 characters.
Normally section names start with a dot, for example, the section containing
code is called .text and that containing data is called .data. The second field
is called Virtual Size and it is a multiple of the section alignment. The field
stores the size of the section when the section is loaded in memory. The fourth
field is the SizeOfRawData. If the Virtual Size is greater than this fourth
field, the section is zero padded.
The third field VirtualAddress is an RVA and thus relative to the image base.
It determines where the section is loaded in memory. The Size of Raw Data is
the fourth field and it is the size of the initialized data on disk, thus a
multiple of the file alignment. As this field is rounded to the file alignment
and not section alignment like the virtual size, it cannot be greater than the
Virtual Size field. If the section contains only uninitialized data then the
value stored in this field is 0. The PointerToRawData field is the file offset
of the section's first page within the PE file and thus is a multiple of File
Alignment.
The next field is the Pointer to Relocations that is the rva of the relocation
section or .reloc. The Pointer to Line Numbers that follows is zero and the
Number of Relocations is the actual count of relocations. The second last field
is the Number of Line Numbers that is obviously zero. Finally there is the
Characteristics field, a set of flags that covers six possible attributes of
the section. These attributes decide whether the section carries executable
code, initialized data or uninitialized data, and whether it can be executed,
read or written.
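The layout of a section header described above can be sketched as a small reader. This is only an illustration under stated assumptions: the class SectionHeaderDemo, the method ReadSectionHeader and the sample bytes built by SampleHeader are all made up for this sketch and are not part of our disassembler.

```csharp
using System;
using System.IO;
using System.Text;

public static class SectionHeaderDemo
{
    // One 40-byte section header, field by field.
    public class SectionHeader
    {
        public string Name;
        public int VirtualSize;
        public int VirtualAddress;
        public int SizeOfRawData;
        public int PointerToRawData;
        public int PointerToRelocations;
        public int PointerToLinenumbers;
        public short NumberOfRelocations;
        public short NumberOfLinenumbers;
        public uint Characteristics;
    }

    // Reads one section header from the current position of the reader.
    public static SectionHeader ReadSectionHeader(BinaryReader r)
    {
        SectionHeader s = new SectionHeader();
        // The name is 8 bytes and is null padded, not null terminated.
        s.Name = Encoding.ASCII.GetString(r.ReadBytes(8)).TrimEnd('\0');
        s.VirtualSize = r.ReadInt32();
        s.VirtualAddress = r.ReadInt32();
        s.SizeOfRawData = r.ReadInt32();
        s.PointerToRawData = r.ReadInt32();
        s.PointerToRelocations = r.ReadInt32();
        s.PointerToLinenumbers = r.ReadInt32();
        s.NumberOfRelocations = r.ReadInt16();
        s.NumberOfLinenumbers = r.ReadInt16();
        s.Characteristics = r.ReadUInt32();
        return s;
    }

    // Builds a made-up .text header so the sketch runs without a file.
    public static byte[] SampleHeader()
    {
        MemoryStream ms = new MemoryStream();
        BinaryWriter w = new BinaryWriter(ms);
        w.Write(Encoding.ASCII.GetBytes(".text\0\0\0"));
        w.Write(0x2d4);        // VirtualSize (hypothetical value)
        w.Write(0x2000);       // VirtualAddress
        w.Write(0x400);        // SizeOfRawData
        w.Write(0x200);        // PointerToRawData
        w.Write(0);            // PointerToRelocations
        w.Write(0);            // PointerToLinenumbers
        w.Write((short)0);     // NumberOfRelocations
        w.Write((short)0);     // NumberOfLinenumbers
        w.Write(0x60000020u);  // contains code, executable, readable
        w.Flush();
        return ms.ToArray();
    }

    public static void Main()
    {
        BinaryReader r = new BinaryReader(new MemoryStream(SampleHeader()));
        SectionHeader s = ReadSectionHeader(r);
        Console.WriteLine("{0} rva={1:x} rawsize={2:x} rawptr={3:x} flags={4:x8}",
            s.Name, s.VirtualAddress, s.SizeOfRawData, s.PointerToRawData,
            s.Characteristics);
    }
}
```

In the sample the Virtual Size (0x2d4) is smaller than the SizeOfRawData (0x400), which is what rounding the raw size up to the file alignment typically produces.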
To stress test our disassembler, we have looked at other languages also. Thus,
if you are conversant with only one language, you may find it a little
difficult to stress-test your program. Those who know the C++ programming
language can attempt the next program in sequence. About the software, fear
not, because if you have installed Visual Studio.net, the C++ compiler called
cl also gets installed.
a.cpp
int main()
{
}
a.cpp is a C++ program that simply contains one function called main. There are
two dissimilarities between this cpp program and the smallest C# program.
Firstly, in C++, functions need not be in a class and hence can be global. This
is one sensible thing in C++ that was changed in C# and Java, mainly due to
political reasons. The second difference is that the m of main is small and not
caps as in C#. This is done after consulting a dozen numerologists.
Compile the above cpp file to an exe file by running the following command.
>cl /clr a.cpp
The /clr option creates a .Net executable. If for some reason you cannot get
the above program compiled, worry not, as we have gone the C++ way only to get
some more output and some more executables.
Our disassembler finally will have beyond 10,000 lines of code. There is no way
in heaven or hell that the entire program can be explained in one go. Even God
and ourselves would find it difficult to understand what we are saying. So
please follow the instructions to the letter.
Program2.csc
public void abc(string [] args)
{
    ReadPEStructures(args);
    DisplayPEStructures();
    ReadandDisplayImportAdressTable();
}
public void ReadandDisplayImportAdressTable()
{
    // The second data directory (index 1) holds the rva of the import directory.
    long stratofimports = ConvertRVA(datadirectoryrva[1]);
    mfilestream.Position = stratofimports;
    Console.WriteLine("// Import Address Table");
    int outercount = 0;
    // Outer loop: one 20 byte Import Directory entry per dll.
    while (true)
    {
        int rvaimportlookuptable = mbinaryreader.ReadInt32();
        if (rvaimportlookuptable == 0)
            break;
        int datetimestamp = mbinaryreader.ReadInt32();
        int forwarderchain = mbinaryreader.ReadInt32();
        int name = mbinaryreader.ReadInt32();
        int rvaiat = mbinaryreader.ReadInt32();
        mfilestream.Position = ConvertRVA(name);
        Console.Write("// ");
        DisplayStringFromFile();
        Console.WriteLine("// {0} Import Address Table", rvaiat.ToString("x8"));
        Console.WriteLine("// {0} Import Name Table", name.ToString("x8"));
        Console.WriteLine("// {0} time date stamp", datetimestamp);
        Console.WriteLine("// {0} Index of first forwarder reference", forwarderchain);
        Console.WriteLine("//");
        long importtable = ConvertRVA(rvaimportlookuptable);
        mfilestream.Position = importtable;
        int nexttable = mbinaryreader.ReadInt32();
        // A negative value cannot be an rva; report it and go to the next dll.
        if (nexttable < 0)
        {
            Console.WriteLine("// Failed to read import data.");
            Console.WriteLine();
            outercount++;
            mfilestream.Position = stratofimports + outercount * 20;
            continue;
        }
        int innercount = 0;
        // Inner loop: one 4 byte Import Lookup Table entry per function.
        while (true)
        {
            long pos0 = ConvertRVA(rvaimportlookuptable) + innercount * 4;
            mfilestream.Position = pos0;
            int pos1 = mbinaryreader.ReadInt32();
            if (pos1 == 0)
                break;
            long pos2 = ConvertRVA(pos1);
            mfilestream.Position = pos2;
            short hint = mbinaryreader.ReadInt16();
            Console.Write("// ");
            // Right align the hex hint in a field three characters wide.
            if (hint.ToString("X").Length == 1)
                Console.Write("  {0}", hint.ToString("x"));
            if (hint.ToString("X").Length == 2)
                Console.Write(" {0}", hint.ToString("x"));
            if (hint.ToString("X").Length == 3)
                Console.Write("{0}", hint.ToString("x"));
            Console.Write(" ");
            DisplayStringFromFile();
            innercount++;
        }
        Console.WriteLine();
        outercount++;
        mfilestream.Position = stratofimports + outercount * 20;
    }
    Console.WriteLine("// Delay Load Import Address Table");
    if (datadirectoryrva[13] == 0)
        Console.WriteLine("// No data.");
}
public long ConvertRVA (long rva)
{
    int i;
    // Find the section whose range of rva's contains the one passed.
    for (i = 0; i < sections; i++)
    {
        if (rva >= SVirtualAddress[i] && (rva < SVirtualAddress[i] + SSizeOfRawData[i]))
            break;
    }
    return SPointerToRawData[i] + (rva - SVirtualAddress[i]);
}
public void DisplayStringFromFile()
{
    // Display bytes as characters until the terminating zero is read.
    while (true)
    {
        byte filebyte = (byte)mfilestream.ReadByte();
        if (filebyte == 0)
            break;
        Console.Write("{0}", (char)filebyte);
    }
    Console.WriteLine();
}
}
// Import Address Table
// KERNEL32.dll
// 00006000 Import Address Table
// 000079bc Import Name Table
// 0 time date stamp
// 0 Index of first forwarder reference
//
// 167 GetModuleHandleA
// fd GetCommandLineA
// 1a8 GetSystemInfo
// 35d VirtualQuery
// mscoree.dll
// 000060e4 Import Address Table
// 000079d8 Import Name Table
// 0 time date stamp
// 0 Index of first forwarder reference
//
// 5a _CorExeMain
// Delay Load Import Address Table
// No data.
The program program2.csc is not shown in full. Only those functions that are
new or changed are displayed. Any instance variables added will also be shown.
For example, since the abc function above now calls a new function,
ReadandDisplayImportAdressTable, the abc function is displayed again. The
ReadPEStructures function undergoes no change and hence is not shown at all.
Our disassembler also does not aim at winning any prizes in any
competition on speed and efficiency. The main objective of the program is to
help you understand the workings of a disassembler. Once this objective is
achieved, then modifications can be made for it to work faster. We have
sacrificed speed at the altar of understanding.
This program displays the import table. In the programming world
we share, share and share. Thus other programmers write code that is placed as
functions in dll’s and we mortals call those functions in our code.
Microsoft Windows comes with 100’s of such dll’s that contain code, and
programmers are expected to use these functions while coding. These dll’s have
names like user32.dll, kernel32.dll etc. Every C# program eventually calls code
in these dll’s.
Besides, Microsoft also allows programmers to create their own dll with their
set of functions and have other coders call them. When the linker creates an
exe file it lists out all the dll’s that the exe file is calling code from.
Alongside each of these dll’s, it also lists the functions that are being
called. Thus, before executing any program in memory, the operating system
needs to also load the dll’s mentioned in the import table in memory and match
each function the executable imports with its corresponding entry in the dll.
In order to display the contents of the import directory, its rva and size are
required. The second data directory gives the rva and size of the import
directory. The function ReadandDisplayImportAdressTable then figures out
the Dll names and displays them as prescribed by the ildasm program.
The RVA or relative virtual address is a number that locates some entity in
memory, relative to the start of the loaded image. This location is where the
runtime loader will place the entity in memory. The file addresses are not
significant because the PE file format is optimized for memory. Thus, given the
RVA, we must figure out where on disk the import directory begins.
Function ConvertRVA comes to our aid as it converts an RVA into a physical
address on disk. In the last program, three section header details were stored
in three different arrays along with the number of sections in a variable
called sections. The function ConvertRVA is passed a memory location as a long
that is to be converted into a disk based address. As arrays start from 0, the
for loop begins from 0 and ends when it reaches the number of sections minus 1.
In the loop, the parameter passed i.e. rva is checked to be greater than or
equal to the value of the array member SVirtualAddress and at the same time
less than the same value plus the corresponding member of the second array,
SSizeOfRawData.
The check is performed because the array SVirtualAddress stores the starting
rva of each section and SSizeOfRawData is the size of the data of that section.
Thus, the section headers report the range of memory occupied by each section.
The third member is SPointerToRawData, which is also the address of the start
of the section, but on disk. This approach helps in deciphering the section the
rva belongs to and once a match is found, the loop is terminated with a break.
The SPointerToRawData value alone cannot be the return value as it is the
starting position of the section on disk, therefore SVirtualAddress, the
starting rva in memory, is subtracted from the rva parameter. This offset is
then added to the SPointerToRawData value. Bear in mind that this works on the
assumption that a valid rva is given and hence no error checks are performed.
Thus in short, the above workings are as follows. The starting rva of each
section and the length of the data of the section are available. In the for
loop, the rva passed is checked to be in the range of each section. If so, its
offset from the start of the section is added to the disk location where this
section starts. In other words, an RVA is the address of an entity after the
loader loads it in memory, with the address where the image is loaded in
memory, or Image Base, subtracted from it. This is because the image can be
theoretically loaded anywhere in memory.
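The conversion described above can be tried out on its own with a made-up section table. The sketch below is not the ConvertRVA of our program: the class RvaDemo, the three hypothetical sections and the choice to return -1 when no section contains the rva are assumptions made purely for illustration.

```csharp
using System;

public static class RvaDemo
{
    // Hypothetical section table with three sections.
    static readonly int[] VirtualAddress   = { 0x2000, 0x4000, 0x6000 };
    static readonly int[] SizeOfRawData    = { 0x0400, 0x0400, 0x0200 };
    static readonly int[] PointerToRawData = { 0x0200, 0x0600, 0x0a00 };

    // Returns the file offset for an rva, or -1 when no section contains it.
    public static long ConvertRva(long rva)
    {
        for (int i = 0; i < VirtualAddress.Length; i++)
        {
            long start = VirtualAddress[i];
            if (rva >= start && rva < start + SizeOfRawData[i])
                return PointerToRawData[i] + (rva - start);
        }
        return -1;
    }

    public static void Main()
    {
        // 0x2050 lies 0x50 bytes into the first section, which starts at 0x200 on disk.
        Console.WriteLine(ConvertRva(0x2050).ToString("x"));   // prints 250
        // 0x4010 lies 0x10 bytes into the second section, which starts at 0x600 on disk.
        Console.WriteLine(ConvertRva(0x4010).ToString("x"));   // prints 610
        // 0x9000 falls in no section.
        Console.WriteLine(ConvertRva(0x9000));                 // prints -1
    }
}
```

Returning a sentinel instead of indexing past the end of the arrays is one way of adding the error check that our function leaves out.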
The method adopted for locating the physical file location, given an RVA, is
taken from the documentation that comes with the Tool Developers Guide.
The wrongly spelt variable stratofimports tells us where on disk the import
table begins. This value is given to the Position property. For this one
assignment the variable is not needed, but it is used again later and hey,
nobody is charging us for an extra variable.
Two loops have been given to display the import
table since there are two different entities to display. The outer loop applies
to selecting each and every dll, one at a time and the inner loop is to display
the names of the functions from the chosen dll. The variable outercount is
initialized to 0 and it will be used later in the program.
Every dll has an entry in the Import Directory Table that represents the
details of the functions that are being imported from it, and each entry is 20
bytes wide. The 20 bytes are divided into 5 fields. The first field is the
address of the Import Lookup Table that gives the names of the functions that
are being imported from the dll.
If this value is zero, then it signifies that the Import Directory table has
ended and the outer loop is to be terminated. The second field is for the date
time stamp and is zero in the file. When the exe file is loaded into memory,
the loader sets this field to the date time stamp of the Dll. The third field
is the index of the first forwarder reference and its value is also zero. The
fourth field is the name of the Dll. Its address is an RVA and hence the
ConvertRVA function is used to convert the address into a physical location on
disk. The Position property is set directly to this value and then the function
DisplayStringFromFile is used to display the DLL name which is stored in ASCII
format. The last field is an RVA of the Import Address Table. This table is
similar in content to the Import Lookup Table and only changes after
the image is loaded into memory or bound. This value may be the last field of
the structure but it is displayed first.
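The five fields of an entry and the zero that ends the table can be seen in a small stand-alone reader. This is a sketch only: the class ImportDirectoryDemo, its method names and the rva values in SampleTable are invented for the illustration, and the entries are read one after the other without the jumps around the file that our program performs in between.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class ImportDirectoryDemo
{
    // The five 4-byte fields of one 20-byte Import Directory entry.
    public class ImportEntry
    {
        public int ImportLookupTableRva;
        public int DateTimeStamp;
        public int ForwarderChain;
        public int NameRva;
        public int ImportAddressTableRva;
    }

    // Reads entries until one begins with a zero Import Lookup Table rva.
    public static List<ImportEntry> ReadEntries(BinaryReader r)
    {
        List<ImportEntry> entries = new List<ImportEntry>();
        while (true)
        {
            int lookup = r.ReadInt32();
            if (lookup == 0)
                break;
            ImportEntry e = new ImportEntry();
            e.ImportLookupTableRva = lookup;
            e.DateTimeStamp = r.ReadInt32();
            e.ForwarderChain = r.ReadInt32();
            e.NameRva = r.ReadInt32();
            e.ImportAddressTableRva = r.ReadInt32();
            entries.Add(e);
        }
        return entries;
    }

    // One made-up entry followed by the all-zero terminating entry.
    public static byte[] SampleTable()
    {
        MemoryStream ms = new MemoryStream();
        BinaryWriter w = new BinaryWriter(ms);
        int[] entry = { 0x2288, 0, 0, 0x22ae, 0x2000 };   // hypothetical rva's
        foreach (int field in entry)
            w.Write(field);
        for (int i = 0; i < 5; i++)
            w.Write(0);
        w.Flush();
        return ms.ToArray();
    }

    public static void Main()
    {
        BinaryReader r = new BinaryReader(new MemoryStream(SampleTable()));
        foreach (ImportEntry e in ReadEntries(r))
        {
            Console.WriteLine("// {0} Import Address Table", e.ImportAddressTableRva.ToString("x8"));
            Console.WriteLine("// {0} rva of the dll name", e.NameRva.ToString("x8"));
            Console.WriteLine("// {0} time date stamp", e.DateTimeStamp);
            Console.WriteLine("// {0} Index of first forwarder reference", e.ForwarderChain);
        }
    }
}
```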
Let’s first move to the function DisplayStringFromFile.
The function starts with an infinite while loop and simply fetches one byte
at a time from the file. This function assumes that the file pointer is placed
on the first byte of the string and does not attempt to save the file position.
If the byte picked up is zero, it means that the end of the string has been
reached and the loop is then terminated. Otherwise the Write function displays
the byte as a character by using the char cast.
Before we quit out of the function, we end with a new line. We could
have instead returned a string but chose not to for no particular reason.
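For those who would rather have the string returned, a minimal sketch follows. The class StringDemo and the method ReadNullTerminated are hypothetical names and not part of our program. Unlike DisplayStringFromFile, the sketch also stops at the end of the stream, since ReadByte returns -1 there.

```csharp
using System;
using System.IO;
using System.Text;

public static class StringDemo
{
    // Reads bytes up to the terminating zero and returns them as a string.
    // The caller must position the stream on the first byte of the string.
    public static string ReadNullTerminated(Stream s)
    {
        StringBuilder sb = new StringBuilder();
        while (true)
        {
            int b = s.ReadByte();   // -1 signals the end of the stream
            if (b <= 0)
                break;
            sb.Append((char)b);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Two ASCII strings stored back to back, each ending in a zero byte.
        byte[] data = Encoding.ASCII.GetBytes("mscoree.dll\0_CorExeMain\0");
        MemoryStream ms = new MemoryStream(data);
        Console.WriteLine(ReadNullTerminated(ms));   // prints mscoree.dll
        Console.WriteLine(ReadNullTerminated(ms));   // prints _CorExeMain
    }
}
```

Returning the string leaves the decision of how and where to print it to the caller.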
The values of the structure members like the Import
Address Table, the Import Name Table and the Date Time stamp and the forwarder
are then displayed. Then using the loop construct, the names of the functions
from the dll’s are displayed. The
variable rvaimportlookuptable gives the rva of the Import Lookup Table and
using the ConvertRVA function this rva is converted into a physical location on disk.
As the innercount variable value is 0, the multiplication
yields a zero. The position property is set to this value thereby having the
file pointer positioned at the start of the table. The Import Lookup Table is a
set of int’s, one for each function being imported. The value of the int being
zero is an indication that the table is over, thus we quit out of the inner loop.
The 31st bit is the most crucial and if it is set, i.e. has a value of 1, then
the importing is by ordinal value or number, and if it is not set then the
importing is by name. Our assumption is that we are importing by name and hence
the int read is taken to be an RVA into the Hint Name table; otherwise it would
be a simple ordinal number.
One more reason as to why the 31st bit is not checked is that till date we have
not encountered a single .Net executable that imports functions from a DLL by
ordinal value. Thus writing code that checks for imports by ordinal value is
pointless since it can never be verified for its accuracy. Also, the .Net
world, unlike the good old C/C++ world, detests playing around with the
internals of sections.
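Should the check ever be needed, testing the 31st bit is a matter of one mask. The sketch below assumes the 32 bit layout of an Import Lookup Table entry, where the low 16 bits hold the ordinal and the low 31 bits hold the rva. The class LookupEntryDemo and its method names are invented here and are not part of our program. Read as a signed int, an entry with the 31st bit set is negative, which is the condition the nexttable test in the program checks for.

```csharp
using System;

public static class LookupEntryDemo
{
    const uint OrdinalFlag = 0x80000000;   // the 31st bit

    // True when the function is imported by ordinal and not by name.
    public static bool IsOrdinal(uint entry)
    {
        return (entry & OrdinalFlag) != 0;
    }

    // The ordinal number lives in the low 16 bits.
    public static uint Ordinal(uint entry)
    {
        return entry & 0xFFFF;
    }

    // The rva into the Hint Name table lives in the low 31 bits.
    public static uint HintNameRva(uint entry)
    {
        return entry & 0x7FFFFFFF;
    }

    public static string Describe(uint entry)
    {
        if (IsOrdinal(entry))
            return "ordinal " + Ordinal(entry);
        return "hint/name rva " + HintNameRva(entry).ToString("x8");
    }

    public static void Main()
    {
        Console.WriteLine(Describe(0x000022a0));   // prints hint/name rva 000022a0
        Console.WriteLine(Describe(0x80000010));   // prints ordinal 16
    }
}
```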
The int picked up is converted into a physical
location of the Hint Name Table. This table is of variable length with the
first short as the Hint field. The second field is the name of the function
stored as an ASCII string.
After attaining the size of the hint, the spacing
is determined. We could have done the formatting using inbuilt functions but
chose the brute force method just to tell you that as long as it works, use it.
However if the options get too many, then the above method gets too tedious.
Then the string class and ToString function offer a more elegant solution than
our clumsy way.
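The more elegant route could look like the sketch below, which assumes that the hint is to be right aligned in a field three characters wide. The class HintFormatDemo and its two methods are hypothetical; PadLeft and the alignment component of a composite format string are standard parts of the string class.

```csharp
using System;

public static class HintFormatDemo
{
    // Pads the lower case hex form of the hint with spaces on the left.
    public static string FormatHint(short hint)
    {
        return hint.ToString("x").PadLeft(3);
    }

    // The same result using the alignment part of a composite format string.
    public static string FormatHintComposite(short hint)
    {
        return String.Format("{0,3:x}", hint);
    }

    public static void Main()
    {
        Console.WriteLine("[" + FormatHint(0xd) + "]");            // prints [  d]
        Console.WriteLine("[" + FormatHint(0x5a) + "]");           // prints [ 5a]
        Console.WriteLine("[" + FormatHintComposite(0x167) + "]"); // prints [167]
    }
}
```

Either call replaces the three if statements, and a hint of four hex digits simply widens the field instead of printing nothing.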
Once the name of the function is displayed, the
innercount variable is increased by 1.
On returning to the start of the loop, the next
task is to display the second function name and hint. But, the problem is that
the file pointer is currently positioned at the end of the name of the first
function, hence it is not on the right byte. Therefore, there is no alternate
approach but to jump to the start of the Import Lookup Table and then determine
the rva of second function.
The ConvertRVA function moves to the start of the table and as innercount is
one and the size of each entry is 4, the file pointer is now on the rva of the
second function. We are very much aware that the above is not an elegant way,
but it works. We could instead have stored the file pointer position before
moving the file pointer around.
A new line is emitted after moving out of the
inner while loop. Then the variable outercount is increased by 1. This variable
keeps a count of the dll’s that have been scrutinized.
Bear in mind that before looping back to the outer
while loop, the file pointer must be positioned at the start of the appropriate
Import Directory Table entry. Thus, the same procedure is adopted again where
using the variable stratofimports we move to the start of the Import Directory
Table structures. Since each structure is 20 bytes long, we multiply 20 by the
number of dll’s already completed, which is stored in the outercount variable,
and add the result to stratofimports. In this manner, the Import table is then
completely displayed.
The second table that ildasm displays is the Delay
Load Import Address Table. The RVA and size for it are stored in the 14th Data
Directory. After examining around 5000 .Net executables, we realized that not
one of them had this table. The reason being that in the .Net world it is not
acceptable to write and create our own sections and hence this table does not
get created. The linker is responsible for creating this table.
You may be a bit surprised to see the if statement
checking the value of the variable nexttable. There is one file
system.enterpriseservices.thunk.dll that gives an error with the dll
oleaut32.dll while displaying the imports. The error check is to print the
error and continue with the program.
After having determined the start of the import lookup table, its first entry,
the rva into the hint name table, is picked up. If this rva for some reason
is less than 0, we display an error message and then go on to the next dll. An
rva represents a memory location and therefore cannot be negative. This error
occurs with only one file and only one of the dll’s it imports.
For the import address table, the Date Time stamp
and the Forwarder Chain fields are zero as per the specification. Remember a
zero denotes the end of the table. The hint field should also be zero. A point
to be noted is that the name of the function is case sensitive. The names
_CorExeMain and _CorDllMain are decided by the specifications.
---