00000000: 0101 0501 0200 0201 0103 0001 0204 0601  ................
0000                                                           ....
0000  .=====================================================.  ....
0000  |   0000:  01 01 05           setr  r1, 0x05       /\ |  ....
0000  |   0003:  01 02 00           setr  r2, 0x00       || |  ....
0000  |   0006:  02 01 01           subr  r1, 0x01       || |  ....
0000  |   0009:  03 00 01 02        cmpr  NE, r1, r2     || |  ....
0000  \   000d:  04 06              jmpi  0x06           \/ /  ....
0000   \_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_-_/   ....
0000                                                           ....
000000a0: 0001 0204 0601 0105 0102 0002 0101 0300  ................

===[ Quick and Dirty Disassembler in Python3 ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If we are doing custom VM reverse engineering, we need to understand the VM
code we are analyzing. Fortunately, creating a custom disassembler is not that
hard to create, especially if the opcode width is fixed. In this article, we
will look at a simple 8-bit VM bytecode and how to build a disassembler for it.

Some terminology:

  * VM       -- Virtual Machine.

  * Opcode   -- The instruction code that tells the (v)CPU what operation to
                perform.

  * Operand  -- An argument to an instruction (can be a register or a number).

  * Mnemonic -- The human readable name of an instruction (e.g., ''mov'').

===[ 8-bit VM Bytecode Example ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Reverse engineering is typically an iterative process, but for simplicity's
sake, let's assume we already know the following:

  1. How instructions are encoded, how registers are used, and so on.

  2. What is the meaning of the opcodes (= what each instruction does).

  3. Both opcode and operand widths are fixed at 1 byte (8-bits) each.

The number of operands (arguments) depends on the instruction tyep. It can
range from zero to five.

We will demonstrate this using the following bytecode hexstring:

  010105010200020101030001020406

It is a simple loop that terminates when the value in register ''r1'' is
decremented to zero. The equivalent higher-level code looks like this (see
[ref1] for the assembler):

---------------------------------[ loop.nasm ]---------------------------------
; ASSEMBLY              ; PSEUDO CODE
setr (r1, 5)            ; r1 = 5
setr (r2, 0)            ; r2 = 0

loop:
  subr (r1, 1)          ; r1 = r1 - 1
  cmpr (NE, r1, r2)     ; if r1 != r2:
  jmpi (loop)           ;    goto loop
-------------------------------------------------------------------------------

And here is the annotated hexstring of the bytecode:

-------------------------------[ loop.hex.txt ]--------------------------------
setr  r1  5
01    01  05        ; r1 = 5

setr  r2  0
01    02  00        ; r2 = 0

subr  r1  1
02    01  01        ; r1 = r1 - 1

cmpr  NE  r1  r2
03    00  01  02    ; if r1 != r2:

jmpi  6
04    06            ;    goto loop
-------------------------------------------------------------------------------

===[ Parsing With Python ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Parsing bytecode using Python3 is pretty straightforward. We will open the file
in binary mode, iterate over the bytes, and determine the instruction type for
each:

# Read the entire file into memory so we can easily index into it
bytecode = open ('loop.bin', 'rb').read ()

idx = 0

while idx < len (bytecode):
    opcode = bytecode[idx]

Now we need to implement:

  1. Mnemonics for the opcodes -- We want to see ''setr'' instead of ''0x01''.

  2. Operand handling -- determine the number of operands, their values, and/or
     their names.

This can be done using a hash array (Python dictionary). From the annotated
bytecode above, we know:

  - There are 4 instructions: ''setr, subr, cmpr, jmpi''.

  - There are 3 operand types: register, immediate value (number), and
    condition type.

  - There are 2 registers: ''r1, r2'',

For example, we can create a dictionary indexed by opcode. Each value is
another dictionary containing the instruction's name and a list of argument
types. Each argument type is simply a key into another dictionary that maps
type names to their decoding logic or metadata.

instruction = {
      # setr -- Set a register to a value.
      0x01: {'name': 'setr', 'args': [register, int8]},
      # subr -- Substract a value from a register.
      0x02: {'name': 'subr', 'args': [register, int8]},
      # cmpr -- Compare two registers.
      0x03: {'name': 'cmpr', 'args': [condition, register, register]},
      # jmpi -- Jump to an address if condition is true.
      0x04: {'name': 'jmpi', 'args': [int8]},
    }

Now we need to define the remaining types. All of them are straightforward
(except for the immediate value ''int8'' -- see below):

register = {
        0x01: 'r1',               # register 1
        0x02: 'r2',               # register 2
    }

condition = {
        0x00: 'NE',               # not equal
    }

The immediate value ''int8'' must remain a number. To handle this, we can use
an "indexable" class. One that defines a data type which returns a numeric
value. This can be easily implemented using the ''__getitem__'' metafunction:

class Int8 ():
    def __getitem__ (self, val):
        return f'0x{opcode:02x}'  # from number return a string in hex format

int8 = Int8 ()

We are almost done. We'll just finish the main code by referencing the
previously defined dictionaries:

while idx < len (bytecode):

  addr   = idx
  opcode = bytecode[idx]               # number
  name   = instruction[opcode]['name'] # name of opcode (string)
  args   = instruction[opcode]['args'] # pointer to list of types

  idx += 1                             # opcode width is 1 byte

  tmp_arr = []                         # this is used for operands

  for opcode_type in args:             # iterate over all expected operands
    opcode = bytecode[idx]             # each type is a pointer to specific dict
    tmp_arr.append (opcode_type[opcode])
    idx += 1                           # each operand has width of 1

  args = ', '.join (tmp_arr)           # convert list of strings into a string
  raw_bytes_str = ' '.join (raw_bytes)

  print (f'{addr:04x}:  {raw_bytes_str:<20} {name} {args}')

And that's it. Now, when we run the script, it will disassemble the VM's
bytecode:

$ ./disassm.py loop.bin

0000:  01 01 05             setr r1, 0x05
0003:  01 02 00             setr r2, 0x00
0006:  02 01 01             subr r1, 0x01
0009:  03 00 01 02          cmpr NE, r1, r2
000d:  04 06                jmpi 0x06

(The full code can be found at [ref2].)

A similar approach can be used for more complex instructions.

===[ Parsing Bits ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's say an instruction with arguments is encoded across multiple bits. For
example, as follows:

Instruction encoding: 5 bits => 32 possible instructions
    |   |
    v   v
  1 11111 111 1111111 ...
  ^       ^
Mode      Registers / argument(s)

To interpret such an opcode, we can read one byte, remove the two "register"
bits by shifting right, and extract the mode by masking (= ANDing with ones)
the higher bits. The mode can be obtained by shifting the first byte to the
right. The register values are formed by combining the two lowest bits of the
first byte with the highest bit of the second byte.

opcode = (bytecode[idx] >> 2) & 0b00011111
mode   = bytecode[idx] >> 7
regs   = ((bytecode[idx] & 0b00000011) << 1) | (bytecode[idx+1] >> 7)

The arguments can be obtained by removing the highest bit from the second byte
and iterating over the remaining bits:

args = bytecode[idx+1] & 0b11111111

===[ Outro ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Some bytecodes can be quite complex to parse. A good example is the x86
instruction set, where the encoding is complicated as hell. In such cases, it
is better to implement specialized decoding functions, but the logic can become
messy very quickly.

Don't worry too much about it. In reverse engineering, flexibility is
important, but it is also acceptable to rely on some "ugly" solutions, like
hardcoding all possible opcode+operand combinations. For example, in the case
above, we could treat ''cmpr NE'' as a single two-byte instruction: ''03 00''
and parse it accordingly.

Personally, I like this simple approach. I usually use it as a template, as it
is flexible enough for most VMs I have worked with and can be quickly adapted
as I uncover the VM's instructions.


Don't forget: Hack the planet!

===[ References ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

> [ref1] https://research.h4x.cz/html/2022/2022-04-27--howto_create_your_own_instruction_set_in_nasm-quick_and_dirty_vm_code.html

> [ref2] https://github.com/fandauchytil/custom_python_dissasembler