00000000: 0101 0501 0200 0201 0103 0001 0204 0601 ................ 0000 .... 0000 .=====================================================. .... 0000 | 0000: 01 01 05 setr r1, 0x05 /\ | .... 0000 | 0003: 01 02 00 setr r2, 0x00 || | .... 0000 | 0006: 02 01 01 subr r1, 0x01 || | .... 0000 | 0009: 03 00 01 02 cmpr NE, r1, r2 || | .... 0000 \ 000d: 04 06 jmpi 0x06 \/ / .... 0000 \_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_/ .... 0000 .... 000000a0: 0001 0204 0601 0105 0102 0002 0101 0300 ................ ===[ Quick and Dirty Disassembler in Python3 ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ If we are doing custom VM reverse engineering, we need to understand the VM code we are analyzing. Fortunately, creating a custom disassembler is not that hard to create, especially if the opcode width is fixed. In this article, we will look at a simple 8-bit VM bytecode and how to build a disassembler for it. Some terminology: * VM -- Virtual Machine. * Opcode -- The instruction code that tells the (v)CPU what operation to perform. * Operand -- An argument to an instruction (can be a register or a number). * Mnemonic -- The human readable name of an instruction (e.g., ''mov''). ===[ 8-bit VM Bytecode Example ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Reverse engineering is typically an iterative process, but for simplicity's sake, let's assume we already know the following: 1. How instructions are encoded, how registers are used, and so on. 2. What is the meaning of the opcodes (= what each instruction does). 3. Both opcode and operand widths are fixed at 1 byte (8-bits) each. The number of operands (arguments) depends on the instruction tyep. It can range from zero to five. We will demonstrate this using the following bytecode hexstring: 010105010200020101030001020406 It is a simple loop that terminates when the value in register ''r1'' is decremented to zero. The equivalent higher-level code looks like this (see [ref1] for the assembler): ---------------------------------[ loop.nasm ]--------------------------------- ; ASSEMBLY ; PSEUDO CODE setr (r1, 5) ; r1 = 5 setr (r2, 0) ; r2 = 0 loop: subr (r1, 1) ; r1 = r1 - 1 cmpr (NE, r1, r2) ; if r1 != r2: jmpi (loop) ; goto loop ------------------------------------------------------------------------------- And here is the annotated hexstring of the bytecode: -------------------------------[ loop.hex.txt ]-------------------------------- setr r1 5 01 01 05 ; r1 = 5 setr r2 0 01 02 00 ; r2 = 0 subr r1 1 02 01 01 ; r1 = r1 - 1 cmpr NE r1 r2 03 00 01 02 ; if r1 != r2: jmpi 6 04 06 ; goto loop ------------------------------------------------------------------------------- ===[ Parsing With Python ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Parsing bytecode using Python3 is pretty straightforward. We will open the file in binary mode, iterate over the bytes, and determine the instruction type for each: # Read the entire file into memory so we can easily index into it bytecode = open ('loop.bin', 'rb').read () idx = 0 while idx < len (bytecode): opcode = bytecode[idx] Now we need to implement: 1. Mnemonics for the opcodes -- We want to see ''setr'' instead of ''0x01''. 2. Operand handling -- determine the number of operands, their values, and/or their names. This can be done using a hash array (Python dictionary). From the annotated bytecode above, we know: - There are 4 instructions: ''setr, subr, cmpr, jmpi''. - There are 3 operand types: register, immediate value (number), and condition type. - There are 2 registers: ''r1, r2'', For example, we can create a dictionary indexed by opcode. Each value is another dictionary containing the instruction's name and a list of argument types. Each argument type is simply a key into another dictionary that maps type names to their decoding logic or metadata. instruction = { # setr -- Set a register to a value. 0x01: {'name': 'setr', 'args': [register, int8]}, # subr -- Substract a value from a register. 0x02: {'name': 'subr', 'args': [register, int8]}, # cmpr -- Compare two registers. 0x03: {'name': 'cmpr', 'args': [condition, register, register]}, # jmpi -- Jump to an address if condition is true. 0x04: {'name': 'jmpi', 'args': [int8]}, } Now we need to define the remaining types. All of them are straightforward (except for the immediate value ''int8'' -- see below): register = { 0x01: 'r1', # register 1 0x02: 'r2', # register 2 } condition = { 0x00: 'NE', # not equal } The immediate value ''int8'' must remain a number. To handle this, we can use an "indexable" class. One that defines a data type which returns a numeric value. This can be easily implemented using the ''__getitem__'' metafunction: class Int8 (): def __getitem__ (self, val): return f'0x{opcode:02x}' # from number return a string in hex format int8 = Int8 () We are almost done. We'll just finish the main code by referencing the previously defined dictionaries: while idx < len (bytecode): addr = idx opcode = bytecode[idx] # number name = instruction[opcode]['name'] # name of opcode (string) args = instruction[opcode]['args'] # pointer to list of types idx += 1 # opcode width is 1 byte tmp_arr = [] # this is used for operands for opcode_type in args: # iterate over all expected operands opcode = bytecode[idx] # each type is a pointer to specific dict tmp_arr.append (opcode_type[opcode]) idx += 1 # each operand has width of 1 args = ', '.join (tmp_arr) # convert list of strings into a string raw_bytes_str = ' '.join (raw_bytes) print (f'{addr:04x}: {raw_bytes_str:<20} {name} {args}') And that's it. Now, when we run the script, it will disassemble the VM's bytecode: $ ./disassm.py loop.bin 0000: 01 01 05 setr r1, 0x05 0003: 01 02 00 setr r2, 0x00 0006: 02 01 01 subr r1, 0x01 0009: 03 00 01 02 cmpr NE, r1, r2 000d: 04 06 jmpi 0x06 (The full code can be found at [ref2].) A similar approach can be used for more complex instructions. ===[ Parsing Bits ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Let's say an instruction with arguments is encoded across multiple bits. For example, as follows: Instruction encoding: 5 bits => 32 possible instructions | | v v 1 11111 111 1111111 ... ^ ^ Mode Registers / argument(s) To interpret such an opcode, we can read one byte, remove the two "register" bits by shifting right, and extract the mode by masking (= ANDing with ones) the higher bits. The mode can be obtained by shifting the first byte to the right. The register values are formed by combining the two lowest bits of the first byte with the highest bit of the second byte. opcode = (bytecode[idx] >> 2) & 0b00011111 mode = bytecode[idx] >> 7 regs = ((bytecode[idx] & 0b00000011) << 1) | (bytecode[idx+1] >> 7) The arguments can be obtained by removing the highest bit from the second byte and iterating over the remaining bits: args = bytecode[idx+1] & 0b11111111 ===[ Outro ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Some bytecodes can be quite complex to parse. A good example is the x86 instruction set, where the encoding is complicated as hell. In such cases, it is better to implement specialized decoding functions, but the logic can become messy very quickly. Don't worry too much about it. In reverse engineering, flexibility is important, but it is also acceptable to rely on some "ugly" solutions, like hardcoding all possible opcode+operand combinations. For example, in the case above, we could treat ''cmpr NE'' as a single two-byte instruction: ''03 00'' and parse it accordingly. Personally, I like this simple approach. I usually use it as a template, as it is flexible enough for most VMs I have worked with and can be quickly adapted as I uncover the VM's instructions. Don't forget: Hack the planet! ===[ References ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > [ref1] https://research.h4x.cz/html/2022/2022-04-27--howto_create_your_own_instruction_set_in_nasm-quick_and_dirty_vm_code.html > [ref2] https://github.com/fandauchytil/custom_python_dissasembler