Problems with naive XOR cipher (RSS challenge)

[HOME] [TXT]

===[ XOR crash course ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

(In most programming languages operator for XOR is represented by caret '^'
and that symbol will be used in article below.)

XOR (eXclusive OR) is boolean operation that takes 2 bits as input and outputs
one bit. Operation is simple: if we XOR the same bit values together we get 0
and if we XOR the opposite bit values we get 1.

Truth table:

  a ^ b = c
  ---------
  0   0   0
  0   1   1
  1   0   1
  1   1   0


There are two important things from the table above:

1. Number XORed by zero is always the same number: 'a ^ 0 = a'

2. Number XORed by the same number is always zero: 'a ^ a = 0'

If we have equation 'a ^ b = c' we can play with it like that:

  a ^ c   =   a ^ (a ^ b)   =   a ^ a ^ b   =   0 ^ b   =   b

  b ^ c   =   b ^ (a ^ b)   =   b ^ b ^ a   =   0 ^ a   =   a

Why would we want to do that? Well, let's say we have some plain text message
'msg', that we want encrypt by the key 'key', but message is longer than
the key, what now? We will solve it by galaxy brain idea: let's divide the
message into blocks which have the same size as the key and XOR that:

------------------------------[ block_cipher.py ]------------------------------
i = 0
while i < len (msg):
    cipher = msg ^ key
    i += len (key)
-------------------------------------------------------------------------------

We are safe as long as nobody finds out our key, right? There is just one
teensy little problem, if we XOR the key with same data multiple times, we get
the same result:

  key = 0b1011
  plain_text_block_01 = 0b0111
  plain_text_block_02 = 0b0111

  output_01 = key ^ plain_text_block_01 = 0b1100
  output_02 = key ^ plain_text_block_02 = 0b1100

That means that repeating patterns in a message with combination with simple
XOR encryption are ... bad. With little information about type of message, we
can easily break it.

===[ Challenge example ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's find out repeating patterns. I this case file and the key are small
enough that we can see the pattern just by looking at hexdump:

-----------------------------------[ code ]------------------------------------
;,73;0*?*710;&.;,73;0*?*710;&.;,73;0*?*710
,73;0*?*710;&.;,73;0
-------------------------------------------------------------------------------

Based on how is the sequence of bytes repeated, we can assume a few things:

-----------------------------------[ code ]------------------------------------
;,73;0*?*710;&.
;,73;0*?*710;&.
;,73;0*?*710
 ,73;0*?*710;&.
;,73;0
-------------------------------------------------------------------------------

1. The key is encrypting the same bytes over and over or it is encrypting a
repeating sequence that has the same length as the key. (The first one is
easier to check, so we'll start with that.)

2. From that it looks like our key has length of 15 bytes.

3. We still don't know the correct order of the bytes.

4. From previous "challenges" we will suppose that the encrypted message makes
some sense, i.e. it is an ASCII printable text.

This knowledge can be enough for us to brute force it (if it fails we can do
frequency analysis, but we can get lucky). Let's do it!

At first we can easily determine the correct order of byte of the key just by
counting from start of the file to the sequence. From a hex editor we know that
the sequence ';,73;0*?*710;&.' starts at offset '0x4E' (= 78) [1] and we
know from how it is repeating, that the key has length of 15 bytes. So we can
divide 63 by 15 and get the reminder (for that we'll use modulo operator):

  78 % 15 = 3

[1] We want the second occurrence, because the first one can be XORed with
shifted key (that's how blocks work).

The offset is '78 - 3 = 75' and we read 15 bytes from offset 75 and we get a
new sequence:

  ;&.;,73;0*?*710

In the next step, we need to realize that all sequences in the message was
created like this:

  plain_text_block ^ key = cipher_text_block

So, of we take the first 15 bytes of the cipher text and XOR them with bytes of
the sequence above, we get:

  cipher_text_block_0 ^ cipher_text_block_4 =
    = (plain_text_block_0 ^ key) ^ (plain_text_block_4 ^ key) =
    = plain_text_block_0 ^ plain_text_block_4

If we are correct in our assumption and 'plain_text_block_4' is composed of
just one byte repeated 15 times, we can brute force it only in 256 interactions
(that is the total number of bits in one byte => 8 bits), because when we use
the equation above, key will get zero and the rest will be effectively XORed
with the same one byte. When find that byte, we will automatically get the
plain text.

Here is annotated code, that gives us the solution:

---------------------------[ breaking_naive_xor.py ]---------------------------
# We know the key length.
key_len = 15

# We assume that this sequence of bytes are the same bytes XORed by the key.
sequence = b';&.;,73;0*?*710'

# Read the binary into memory (file is small, so we can store all it into mem).
with open ('xor.bin', 'rb') as f:
    cipher_text = f.read ()

# Iterate over all bytes. (By this we are searching for a byte in the
# sequence.)
for n in range (0, 256):

    # We'll write every possible "solution" to a separate file.
    with open (f'/dev/shm/xor.out.{n:02x}', 'wb') as f:

        # Write the byte to the beginning of a file,
        f.write (bytes (f'key = {n:c} (0x{n:02x})\n', 'u8'))

        # This is our XOR trick (cipher_text2 = sequence):
        #
        #   cipher_text1 ^ cipher_text2 ^ n =
        #       = (plain_text1 ^ key) ^ (plain_text2 ^ key) ^ n =
        #       = plain_text1 ^ plain_text2 ^ n
        #
        # If 'n' is the same byte as 'plain_text2', XORing them together will
        # give us 0 and 0 XOR 'plain_text1', gives us 'plain_text1'.

        # Iterate over 'key_len' chunks and XOR corresponding bytes.
        for idx in range (0, len (cipher_text) - key_len, key_len):
            plain_text = []
            for i in range (0, key_len):
                # XOR trick
                pt = cipher_text[idx+i] ^ sequence[i] ^ n
                plain_text.append (pt)

            # Write plain text to a file.
            f.write (bytes (plain_text))

            ## For those who understand list comprehension all in this level
            ## of indentation can be replaces by:
            #f.write (bytes (cipher_text[idx+i] ^ sequence[i] ^ n for i in range (0, key_len)))
-------------------------------------------------------------------------------

And that's it. Code will write 256 files into '/dev/shm/', we just need to
find the correct one.

===[ Frequency analysis ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We were (kind of) lucky by finding such nice repeating sequence of bytes
encrypted by a short key. Actually it's not unusual to find e.g. firmware that
uses the same type of "encryption" (especially in older update patches). Is can
be even easier to get a key, because there are often lots of zeroes in a
payload and what happens when you XOR message with zero? As you have probably
guessed -- the key.

But I digress. Let's say we are not that lucky and we have just regular English
text or ZIPped files. If we know (or guess) something about an original
message, we ca use statistical technique called "frequency analysis". I will
not go into more details (look it up), but this generalize the method we had
used in this article. Only we are not looking for a single byte sequence, but
for the most common sequence of bytes (e.g. one of to most often bigram in a
English text is 'th').