[[/html/2025/2025-09-11--touching_small_elfs-p1-broken_tools.html|Part 1: Understanding Small ELFs and Fixing Broken Tools]] [[/html/2025/2025-10-06--touching_small_elfs-p2-segfaults_everywhere.html|Part 2: ELF Magic Gone Wrong: Debugging SEGFAULTs (Examples of ELF Failures)]] Part 3: Debugging Userspace ELF in the Kernel with QEMU Snapshots. Learning objectives: - Understanding a user space bug by debugging kernel. - Semi time-travel debugging in QEMU. - Watching accesses/writes into a physical memory address in QEMU. :@@@@@@@@@@@@@@@: @@@@@@@@@@@@@@@@@@@@@@@@# @@@@@@@@@@@@@@@%@@@@@@@@@@@@@@# @@@@@@@@@@@@@@@@@.@@@@@@@@@@@@@@@@@ @@@@@@%@@@@@@@@@@@.@=@@@@@@@@@@@@@@@@@@ @@@@@@%@@@@@@=@@@@@@.+.@@@@@@.@@@@@@@%@@@@. @@@@@@@@@@.@@=@@@%@@=...:@@@@@@.@%*@@@@@@@@@@ @@=@@@@@@@@@@=@@@%@@@..@..@@@%@@@=@@@@@@@@@@+@@ :@+@#+@@=.@@@..@@@@@@...=.%..=@@@@@@..@@@==@@+.@=@ @@@+=@%=@@#@=@@@.=.:@.#.:=:.:.%....@@@=@+@@=@@==@@@ %%==@@@@+%.=@@.....@@.@:*.+=.:%.@@....=@%=+@=*@@@==@= #=@=@===%@@@@.@@.@@@#.++.=*#.@:=+.@@@@+@@=@@@@#==+@=@#= .@=@@=@@@#%.%*..@@@=@++#:@=*=%:+++@.@@@==*#+%%@@@=@@+%* =@@@*=@@=#@=@@@@@@.@@....*:@.@:+....@@*@@@@@@=@==@@==@@. @@%@@@=.=.@@=@@@@@@*......=::=........@@@@@*=@#=..=@@@#@: @@@@=@@@@@@@@@@@@@..........+..........@@@@@@@@@@%@@=@@@@ @%@@@%@@@@@@@@@@@@@@.......**+......%@@@@@@@@@@@@@@@@@@%@ @@@@@%@@@@%@@@@@@@@@@@@.....*++.....@@@@@@@@@@@@@@@@@@@@@@: @@%@@@@@@@@@@@...............+...............@@@@%@@@@@@@@# %@@@@@@@@@@@...................................@@@@%@@@@@@ ==. +@@%@@@@@@.......................................@@@@@@@@@ .== =====..@@@@@@@.....@@@@@@@.................@@@@@@@.....@@@@@@@..===== =====...=.@@@@.#.@@@ @@@@@...............@@@@@ @@@.#=@@@@.......==+ ====++.....@@..@@: @@@@@@@.............@@@@@@@ @@@.=@@.....++===+ ..=+==+...@@...@ +@@@@@@@@...........@@@@@@@@+ @..=@@...+====== =..===+..@@...@ +@@@@@@@@............@@@@@@@+ :..*@@.++==+..+ +...=+=+@@.... @@@@@@@ ........... @@@@@@@ +...@@@+=+....+ =...+==@@@.... @@@@@ ............ 
@@@@@ .....@@@==+.... .@...+=@@@..=====: .......=+........ :=====..@@@==...@ #@@=...@@@.========............+........=======.#@@@...=@@ @@@@@.==@@@.=======........=..*.........=======.@@:...@@@@ @@@@@@@=@@@..=====.......................=====.%@@@=@@@@@@ @#@@@@@=@@@@............=.........=............@@@@.@@@@@@# @.@@@@.*@@%@@.............=.===.=.............@@@@.=@@@@@@@ =@:@@@@=*=@%@@@..............................@@@@@@.+@@@@@%@ @%@@%@@@=@@%@@@@@...........................@@@@@@@*=@@@@@.@ @.@@@@@@@@@%@@@@@@@.......................@@@@@@@@@@@@@%@@.@@ #@.@@@@@@@@@%@@@@@@@@@@.................@@@@@@@@@@@@@@@@@@@+=@ @..@@@@@@@@@%@@@@@@@@@@@@@++......=++@@@@@@@@@@@@@@@@@@@@@@@.@: *@.@@%@@@@@@@%@@@@@@%@@@@@@=++++++++==@@@@@@@@@@@@@@@@@@@@@@@..@ @..@@@@@@@@@@@@@%@@@@@@%%%%======+===+%%%%@@@@@@@@@%@@@@@@%@@@.@. @..@@%@@@@@@@@@@@@@@@@%%%%%%==========.%@@%%%@@@@@@@%@@%@@@@@@@.:@ @.:@@@@@@@@@@%@@@@@@@%%%%%@.............@%#%%%@@@@@@@@@@@@@@%@@@.@@ @..@@%@@@@@@@@@@@@###%%%%#%...............%%%%%#%%@@@@%@@@@@@@@@@..@ %*.%@@@@%@@@@@@@@@%%%#%#%%%#%.............#%#%%#%%#%@@@@@@@%@@@%@@@..@ @..@@%@%@@@@@@@%%%#%%%%#%#%%#%...........###%%#%%#%%##%@@@@@@@@@%@@%.#@ @..@@%@@@@@@@@%%%#%##%##%%#%%%%%%.......%%%%#%%#%%%#%%%#%#@@@@@@@@@@@..@= :%.@@@@@%@@@@@@%%%#%#%%%%%#%%#%%%%%.:%:.%%%##%%%%%%%#%###%%%@@@%@@@%@@@..@ @..@@@@%@@@@@@#%%%%%%%%%%%#%%#%%%%.%.:.%.%%%%%%%%#%%%%%%%%%#%@@@@@@@@@@%.:@ .#.@@%@@@@@@@@######%%%%##%%%%%%%%%:%:%.%:%%%%%%%%%#%%%##%%###@@@%@@@@@@@..@. @.@@@@@%@@@@@%%%%#%%%%%%%#%%%##%%%%.%::#%:%%%%%%%%%%%#%%%#%##%#@@@@@@@%@@@..@ @.@@@@@@@@@@@%#%#####%%%%###%##%%%%%+.:.#%%#%%%%##%%#%%#%#+#%#%%@@@@@@@@@@%.@ %@@@@@%@@@@@%#%%##%###%%#%%%%%%##%%%+++=+%%%#%%%#%%%#%%#%####%%%@@@@@@@%@@@.:@ .@@@@@@@%@@@###%%##%+#%%%#####%#%%%.=:=: .%%%#%%%%#%%%%%#%#%%#%%%@@@@%@@@@@@.@ @@@@@%@@@@@%%%%#%%#%###%###%%%#%%%.=:=:+: *%%%%%%###%%##%%#%##%%#@@@@%@@%@@@.@ @@@@@@@@@@@#%%#%#%##%###%%%%%##%%=:+: =+:+:+#%%%%###%+%#%#####%%%#@@@@@@%@@@@@ @@@@@@@@%@@#######%%##%#%#%%%%%%%. 
::+:* +:::.#%%%#%%%#%#%%%%#%###%@@@@@@@@@@@@ @@.@@%@@@@@%#%##%%%%###%%%#%%%#%. :::. *:.:: :.###%#%%%%#%###%%#%%##@@@@@@@@@@* .@@#@@%@@@@%#%%%%%###%###%#%%%#*. :::: . :::: :.%%#%##%###%%####%%%#@@@@@@@@.@@ @@@@@@%@@@@###%%%%%#####%%####+.: :::: :.::: : ::.%%%%#%###%#%#%%####%@@@@@@@ @@ @@%@@@@@@@@%###%%#%#%#%%#%=##+.: :: : . : :: : .#%%+%%#%%#%#%##%#%#@@@@@%@ @@ @@+@@@@@@@#%###%%##%%%###*###.: :: ::::.: :: ::::.%##=#%%#%###%#%%#%@@@@@@@ @@ @@=#@@%@@@##%%#%%##%%#%=#%#**: ::: :: ::.::: ::::::.*%%##%#%%#%#%%#####@@@@@@ @@ #@ :@@@@@%%%%#+%####%=*%###*+= :::::: ::.::::: :::+**###%=##%%##%##%%#@@@@@@ @@ @@ @@@@@%%%%%####%=##%%#%*+++::::: ::: . : :: ::++**%%%%#=%#%##%%##%@@@@@@ @@ @@ %@@@@#%#%##%..%#####%%*+++*: : : :. : ::.:::++++#@##%%##==#%%##%##@@@@ #@ @# @@@#%#%##=#%=%#####%%#++++:: .: :: .::::. : :++++%%%%%##%#%%+##%%##@@@@ @@ @ @@%%#%%%%*+%%%%##%%@%%%*++:::.::: .:: .::::+++#%%@@@#%%###=%#%##%#@@@ @@ ** @%%%=##%%#%%#%%@@@@@%%%%%%%%%%%%%%%%%%%%%%%%%%%%@@@*%#%#%#%##=%%%%@ @ ===%%####%####*@@@@@%%%#***#*****#*********#%%%%%@@**#%#%%%%#%%=== %%%%%%#%#####**@%%@@%%%%%#%%#%%%%%%%%##%%#%%%%%@@@%***%%%###%%#%%% .%%%%%%#%%%#%***%@%@@%%%%%##%%##%**%%##%%##%%%%%@@@@****%%#%#%#%%%%# %%%%%%%%####*++++@@%##%%%#%%%#%%%%%%%%%#%%##%%%#@%@++++*#%%#%%#%%%%% %%%%#%%#%%%*++++++#@%%%%******###%%%%####**#*%%%%%++++++*%#%%##%%%%% =%%%%%##%%%%*++++++++%%%%%%%%%%%%%%%%%%%%%%%%%%%%%++++++++*#%%%##%%%%. 
%%%%%######*++++++++++%%%%@%%@@%%%%%%%%%@%%%@%%%++++++++++*###%%%%%%%% %%%%###%#%**++++++++++++%%%%@%%@%+*+*%%@%%%%%@%+++++++++++**#%#%##%%%% .%%%%%#%#%#%++++++++++++++%%@%%@%%*+++=%%@%%@%=++++++++++++*@%%%#%%%%%% %%%%%%#%##%@@@+++++++++++++%%%@%%=+++++*%%%%%+++++++++++++@@%%%##%#%%%%% %%%%%###%#@%%%%++++++++++..+.%%%*.+++++.*%%+++=++++++++++%%%@%##%###%%%% :%%%%#####%%%%@@%@+++++++.++=++.++.++=++.*+=+=.+.+++++++#@%%%@@@##%%%%%%% %%%%%%%#%%%@%@%@@@#+++++=++.+......=+++=.....++.+=+++++#@%@%@%%%%#%#%#%%%* %%%%%#%##%%%@@%%@####+++.+=+=.................+.+.+++####@%%@%%@#%%%#%%%%% %%%%%#%%%#@@%%@%%######+.+=+.......=..........=+.++*#####%%%@%@@%##%%%%%%% %%%%%%%%%##@@@@@@########*+==.....+...+........++.*########%@@@@%%#%#%%%%%%. %%%%%#%#%#%%%@%%%##########++....+..+..........==####**####@@@@%%@%#%%#%%%%% %%%%%%%%%#%%%%%@#*###****####...=......+......:#*******+##*#%@@%%%%%##%%%%%% %%%%%%%%%#%@%%@@%***#*****#**#..+.+...+...+..+:::***#*+**#**#%%%%%@%###%%%%%%. %%%%%##%%##@%@@%#***#*********:.:+..+...+..++=.::***#****#*#**@@%%@%##%#%%%%%% %%%%%#%%#%#%%%%@*#*#*****#****::.:::+++=+=++..:: *********##**@@@@%%%#%%#%%%%% +%%%%%##%##%%@@@@#**#*********= :::::..:... :::****#****#***%@@@%%###%%%%%%% %%%%%%%%%##@%@%@#**##********* :::.::: .: :. ::****#**+*##**#%@@%@#%%%%%%%%%% %%%%%#%#%%%%@%%%***#**********:::::.: :: : ::: *********#*##%@%@@%%%%#%%%%%% %%%%%%%%##%%%%@%#*##**********: :::. : ::::: : :******+**###*@@%%@%##%#%%%%%% ===[ INTRO ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The delving intensifies! In this article, we'll look at kinda time-travel debugging and how to watch a physical memory address in QEMU. ===[ Time-travel Debugging ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Time-travel debugging is one of those techniques that can exponentially speed up debugging or reversing (especially when working with memory bugs). 
If you haven't heard about it, time-travel debugging is the process of stepping back in time through code to understand what's happening during a program's execution. [ref1]

Picture a binary that crashes with a segmentation fault. When we inspect the core dump, we might see something like SIGSEGV at address 0 and a broken stack. How did it get there when the backtrace is useless? What if we could step a few instructions backward? And not just that. What if we could set a watchpoint on an address, hit ''reverse-continue'', and instantly find the last instruction that touched that memory location? That's the power of time-travel debugging. Pretty nifty, right? But we won't use it here...

===[ Broken Time Machine ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

QEMU has built-in record/replay functionality [ref2] [ref3] [ref4], which can be used for deterministic replay of binary execution. Unfortunately, as of the time I'm writing this, there's a bug that makes it pretty much unusable: replay operations (like ''reverse-step'') load the nearest snapshot, but QEMU hangs if there's any snapshot other than the initial one [ref5]. That actually makes it slower than re-running the whole emulation (because of icount replay).

But there's still one way to time-travel with QEMU. Snapshots, as they are, are actually a very useful feature, since they save the full state of the running emulation (= vCPU, RAM, and devices) and, most importantly, preserve breakpoints when a snapshot is loaded. They're far from being a full replacement for record/replay functionality, and we can definitely say goodbye to determinism. But they're still an excellent tool for debugging, like we're doing here.

Since snapshotting is available in QEMU, we can run a kernel emulation until we hit ''load_elf_binary'', take a snapshot using ''savevm'', and then continue debugging.
If we hit a dead end or find an interesting address (or error), we can set a breakpoint or watchpoint, load the snapshot using ''loadvm'', and keep going until we hit that breakpoint or watchpoint. On top of that, we can call QEMU ''monitor'' commands, such as ''savevm'', directly from GDB [ref6] [ref7], as we'll see later in the article. (Also, see the notes on time-travel debuggers in [[#OUTRO]].) ===[ BUG #4: Chef's Kiss ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Let's continue the BUGs series from the previous article [ref8] and take a masterclass in debugging. And yes, it's RAW! [ref9] The source code is available either in [ref8] or here: [[/data/2025/elf64-fixme.nasm|elf64-fixme.nasm]]. $ ./elf64-fixme Segmentation fault (core dumped) This segfault is a real specialty. When we look at ''strace'', it looks similar to ''BUG #1'' [ref8], where the code jumped out of the mapped memory region: $ strace ./elf64-fixme execve("./elf64-fixme", ["./elf64-fixme"], 0x7ffe953f0e70 /* 42 vars */) = 0 read(0, NULL, 0) = 0 --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=0x1} --- +++ killed by SIGSEGV (core dumped) +++ Segmentation fault (core dumped) But this one is different. This time, there's no bug in our user space code causing the memory violation. 
Let's inspect it closely in GDB: $ gdb -ex 'file ./execve_wrapper' -ex 'catch exec' -ex 'run ./elf64-fixme' (gdb) x/20i $rip => 0x12eb00000001: rex.RB 0x12eb00000002: rex.WR 0x12eb00000003: rex.RX syscall 0x12eb00000006: mov cl,0x48 0x12eb00000008: push rcx 0x12eb00000009: nop 0x12eb0000000a: nop 0x12eb0000000b: nop 0x12eb0000000c: nop 0x12eb0000000d: nop 0x12eb0000000e: nop 0x12eb0000000f: add eax,0x3e0002 0x12eb00000014: pop rsi 0x12eb00000015: mov dl,0xe 0x12eb00000017: mov eax,0x1 0x12eb0000001c: add BYTE PTR [rax],al ; <-- 00 00 0x12eb0000001e: add BYTE PTR [rax],al ; <-- 00 00 0x12eb00000020: add BYTE PTR [rax],al ; <-- 00 00 0x12eb00000022: add BYTE PTR [rax],al ; <-- 00 00 0x12eb00000024: add BYTE PTR [rax],al ; <-- 00 00 Did you notice something? Our code is missing starting at ''0x12eb0000001c''. That line should contain ''jmp short 0x30'', but instead there are zeros (''add BYTE PTR [rax],al'' is ''00 00'' in hex). What's happening? It shouldn't be a broken mapping, since mappings work at page granularity. Did the kernel stop copying the full code for some reason? Or is something zeroing it out? Let's check the hexdump, because it looks cool: (gdb) x/80xb 0x12eb00000000 0x12eb00000000: 0x7f 0x45 0x4c 0x46 0x0f 0x05 0xb1 0x48 0x12eb00000008: 0x51 0x90 0x90 0x90 0x90 0x90 0x90 0x05 0x12eb00000010: 0x02 0x00 0x3e 0x00 0x5e 0xb2 0x0e 0xb8 0x12eb00000018: 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00 ^-- it might start here => offset 0x19 0x12eb00000020: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x12eb00000028: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x12eb00000030: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x12eb00000038: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x12eb00000040: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x12eb00000048: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 What the hexdump tells us is that the zeroing might start at ''0x12eb00000019'', but it's definitely happening at ''0x12eb0000001c''. Therefore, we want to watch the offset ''+0x1c'' to be sure. 
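The ambiguity between ''0x19'' and ''0x1c'' is easy to pin down mechanically: bytes at offsets ''0x19''--''0x1b'' are zero even in the intact binary, so they prove nothing; only ''0x1c'' (which should be ''0xeb'', the start of the ''jmp short'') is conclusive. A quick sketch, with the byte values transcribed from the intact binary and the broken dump above:

```python
# Bytes at file offsets 0x18..0x1f, transcribed from the dumps:
intact   = bytes([0x01, 0x00, 0x00, 0x00, 0xeb, 0x12, 0x00, 0x00])  # original file
observed = bytes([0x01, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00])  # seen in GDB

base = 0x18
# First offset where a byte that should be non-zero was zeroed -- the
# earliest point where the corruption is provable:
provable = next(base + i for i, (e, o) in enumerate(zip(intact, observed))
                if e != 0x00 and o == 0x00)
print(hex(provable))  # 0x1c
```

Any offset before that could have been zeroed too; we just can't tell zeros from zeros.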
Also, this isn't something we want to debug from user space, since we have no clear clue which function is responsible. We could theoretically find out using ''ftrace'' and ''kprobes'', but that would be unnecessarily daunting. Let's use QEMU and watch memory accesses instead.

===[ Physical vs. Linear Memory ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before we start QEMU, what's our strategy? What approach should we take? Stepping through instructions one by one would be too time-consuming. And if we find an interesting function to break on, it might be too late, and we'd have to restart the whole process.

We could set a watchpoint on memory access, but there's a catch! Watchpoints track a *virtual* (linear) memory address. That means we won't be able to catch all accesses, because the kernel changes virtual memory mappings based on context. For example, kernel space has a different virtual mapping than a user space program.

Here's an extremely simplified description of what happens when an executable binary is loaded into memory on x86-64:

1. After boot, the kernel enables long mode and paging (CR0 is set and the MMU is enabled). From that point on, the CPU operates on virtual (linear) memory addresses. This means addresses are translated from virtual to physical by the MMU.

2. When ''load_elf_binary'' loads parts of the binary into memory and parses them, it builds a new process image based on the ELF headers (= it maps the ''PT_LOAD'' segments of the binary into the process's virtual address space).

3. The binary is effectively stored in physical memory (and has a physical address), but that address differs from the virtual address used by the kernel, which in turn differs from the process's virtual address. So we now have at least three sets of memory addresses: physical memory, kernel working memory, and user space process memory. And all of them will almost certainly differ from each other.
(Also, there can be different addresses when memory is shared, but we won't go into that here.)

4. And this is where the fun part starts. Memory mappings are governed by the MMU. When the CPU works with an address, it uses a virtual address, but when the data needs to be stored to or fetched from memory, the CPU delegates the operation to the MMU, which translates the virtual address to the corresponding physical address. This means the x86 CPU debug registers ''DR0''--''DR3'' compare the virtual (linear) address, not the physical address.

In summary, all this means that when we set a watchpoint, it triggers only when the CPU reads or writes the specified *VIRTUAL* address, and that address can differ depending on the process context (= kernel/user). As a result, we won't be notified in all cases! (Also, we shouldn't ignore operations that can bypass the CPU entirely, such as DMA writes to memory.)

NOTE: Maybe you've heard of the ''maintenance packet Qqemu.PhyMemMode:1'', so why not use it? Because it only switches reads and writes to physical memory; watchpoints still use virtual (linear) addresses.

===[ Thanks for the Memory ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

That's grim. How can we watch the data of interest when the virtual address can change during execution? Well, actually, there's one dirty trick to watch a physical address in an emulator. QEMU has a command that translates a physical memory address of the emulated program to the virtual memory address of the running QEMU process on the host: ''gpa2hva'' (guest physical address to host virtual address). You can probably see where this is going -- we'll translate any physical addresses we're interested in to their corresponding host addresses and then set a watchpoint in QEMU's host virtual address space.

    LINUX KERNEL <----- VIRTUAL HW <----- QEMU <----- HOST HW
     (GDB STUB)                            ^
         ^                                 |
         |                                 |
     gdb-guest                          gdb-host

Don't worry about that for now.
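By the way, the virtual-to-physical split is easy to poke at from user space, too: Linux exposes each process's page translations in ''/proc/<pid>/pagemap'' as one 64-bit record per virtual page (bit 63 = page present, bits 0--54 = page frame number, per the kernel's pagemap documentation). A small illustrative sketch; the sample record value is made up:

```python
PAGE_SIZE = 4096

def translate(pagemap_entry, vaddr):
    """Decode one /proc/<pid>/pagemap record (64 bits per virtual page):
       bit 63    -- page present in RAM
       bits 0-54 -- page frame number (PFN)
    Returns the physical address backing vaddr, or None if not present."""
    if not (pagemap_entry >> 63) & 1:      # page not resident in RAM
        return None
    pfn = pagemap_entry & ((1 << 55) - 1)  # mask out the flag bits
    return pfn * PAGE_SIZE + (vaddr % PAGE_SIZE)

# Made-up example record: a present page backed by physical frame 0x3305
entry = (1 << 63) | 0x3305
print(hex(translate(entry, 0x12EB00000019)))  # 0x3305019
```

(On recent kernels, reading real PFNs from pagemap requires privileges; unprivileged reads get a zeroed PFN.)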
Let's start small by creating emulated physical memory that we can easily access. Start QEMU with ''memory-backend-file'', attach GDB (we'll call it gdb-guest), and set a breakpoint on ''load_elf_binary'' (see part 1 for the full QEMU + GDB setup [ref10]):

$ echo 'elf64-fixme' | cpio --quiet -H newc -o | gzip -5 -n > ./initrd.gz
$ qemu-img create -f qcow2 snap.qcow2 1G
$ qemu-system-x86_64 -accel tcg -smp 1 -S -monitor stdio \
    -object memory-backend-file,id=mem,size=512M,mem-path=/dev/shm/mem,share=on \
    -machine memory-backend=mem -gdb tcp:127.0.0.1:1234 \
    -hda snap.qcow2 \
    -kernel vmlinuz-6.1.0-35-amd64 -initrd initrd.gz \
    -append 'nopti nokaslr console=tty0 console=ttyS0,115200 rdinit=/elf64-fixme'
$ gdb -ex 'file /temp/elf/vmlinux-6.1.0-35-amd64' \
      -ex 'target remote 127.0.0.1:1234' \
      -ex 'hbreak load_elf_binary' \
      -ex c

NOTE: When using QEMU snapshots in a diskless virtual machine, we need to add a qcow2 storage device [ref2] to store the snapshots (unless we want to use file-based migration [ref11], which I don't). That's what ''snap.qcow2'' is for.

NOTE: When booting the Linux kernel, it can sometimes be useful to make it more deterministic (at least for initial debugging) by disabling randomization (like KASLR -- Kernel Address Space Layout Randomization [ref12]) and mitigations (like PTI -- Page Table Isolation [ref13]). In this case, we won't need it much, since we'll be debugging the kernel part close to the user space program. (But it won't hurt.)

Wait for the breakpoint to trigger in gdb-guest, then locate all instances of the binary in the emulated kernel's physical memory. We can use ''rafind2'' from the radare2 toolkit [ref14]:

$ i="$(xxd -ps elf64-fixme)"
$ rafind2 -x "$i" /dev/shm/mem
0x3305000
0x6000ca0

The binary appears at two physical locations, and we need to watch both, since we don't yet know which one is being copied or shredded.
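If ''rafind2'' isn't at hand, the same scan is a few lines of Python. Note the nice property we're relying on here: with ''memory-backend-file'', an offset into the backing file *is* the guest physical address. A sketch (paths are the ones from this setup):

```python
def find_all(haystack, needle):
    """Yield every offset of needle in haystack. Offsets into the
    memory-backend file are guest physical addresses."""
    pos = haystack.find(needle)
    while pos != -1:
        yield pos
        pos = haystack.find(needle, pos + 1)

# Usage against the files from this setup:
#   needle = open("elf64-fixme", "rb").read()
#   mem    = open("/dev/shm/mem", "rb").read()
#   for off in find_all(mem, needle):
#       print(hex(off))
```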
We could translate both addresses into virtual ones by adding ''page_offset_base'' (= the direct mapping of all physical memory [ref15]) and then watch those addresses. But this runs into the same problem with linear addresses we discussed in [[#Physical vs. Linear Memory]]. What we really want is to watch access to the physical addresses. ===[ From Physical Guest to Virtual Host ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ It's time for us to implement the "dirty trick" we talked about at the beginning of [[#Thanks for the Memory]]. First, we need a way to translate the physical addresses of the emulated kernel into the virtual addresses of the QEMU process. Luckily, we can do that with the QEMU monitor command ''gpa2hva'' [ref16]: (qemu) gpa2hva 0x3305000 Host virtual address for 0x3305000 (mem) is 0x7ff787304000 (qemu) gpa2hva 0x6000ca0 Host virtual address for 0x6000ca0 (mem) is 0x7ff789fffca0 Now we can start a new GDB instance and attach it to the QEMU process (we'll call it "gdb-host"). Then, set watchpoints on the addresses ''0x7ff787304000'' and ''0x7ff789fffca0''. Don't forget to add the offset ''0x1c'', since the "zeroing" starts roughly there. $ gdb -p "$(pgrep qemu)" -n -q (gdb-host) awatch -l *(0x1c + 0x7ff787304000) Hardware access (read/write) watchpoint 1: *0x7ff787304000 (gdb-host) awatch -l *(0x1c + 0x7ff789fffca0) Hardware access (read/write) watchpoint 2: *0x7ff789fffca0 (gdb-host) c NOTE: Watch out for any special configuration in gdb-host. For example, the ''dashboard'' plugin [ref17] can cause an infinite loop in GDB! ===[ We Have to Go Derper ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Remember: when doing weird things, don't be afraid to go all the way. What if I told you that QEMU has a "stop the emulation" function called ''vm_stop()'' [ref18], and that GDB can inject itself into the debuggee process to invoke its functions? What would you do? int vm_stop(RunState state) We won't go into the ''RunState'' data type here. 
Let's just say we can wake up gdb-guest by calling the function with the argument ''0'' (= ''RUN_STATE_DEBUG'') [ref19].

(gdb-host) Thread 3 "qemu-system-x86" hit Hardware access (read/write) watchpoint 1: -location *(0x1c + 0x7ff787304000)
(gdb-host) call (int) vm_stop(0)
(gdb-host) c

Calling ''vm_stop(0)'' notifies the debugger connected to QEMU's GDB stub (gdb-guest) with a ''SIGTRAP'' signal. When gdb-guest receives the signal, it pauses, allowing us to collect the stacktrace, registers, and other data:

(gdb-guest) Program received signal SIGTRAP, Trace/breakpoint trap.

===[ Innit, automate ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Noice! ''vm_stop(0)'' works as expected, but every access to those addresses (even simple memory inspections in gdb-guest) halts gdb-host. For each such access, we have to switch from gdb-guest to gdb-host and hit ''continue''. That's tedious, but fortunately, we can easily automate ''vm_stop'' in gdb-host:

(gdb-host) Thread 3 "qemu-system-x86" hit Hardware access (read/write) watchpoint 1: -location *(0x1c + 0x7ff787304000)
(gdb-host) commands 1 2
call (int) vm_stop(0)
c
end
(gdb-host) c

''commands 1 2'' means: execute the following commands for breakpoints or watchpoints numbered ''1'' and ''2'' (= the ones we defined earlier).

After hitting ''continue'' in gdb-host, we also need to run ''continue'' in gdb-guest. But before that, save a VM snapshot -- it will come in handy later:

(gdb-guest) monitor savevm load_elf_binary
(gdb-guest) c

The moment we run ''continue'' in gdb-guest, every access to those addresses still triggers the watchpoints in gdb-host. On its own, that wouldn't be very useful, since gdb-host only sees internal QEMU functions, not the guest kernel we're debugging. That's exactly what the automated ''vm_stop(0)'' fixes: gdb-host silently forwards the notification to gdb-guest, where the actual debugging happens, and continues. Now we can step through gdb-guest uninterrupted until we find a new address to watch.
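If you find yourself repeating this dance, the whole gdb-host side can live in a GDB command file. A sketch; the two host addresses come from ''gpa2hva'' and will differ on every run:

```
# watch-phys.gdb -- run as: gdb -p "$(pgrep qemu)" -n -q -x watch-phys.gdb
awatch -l *(0x1c + 0x7ff787304000)
awatch -l *(0x1c + 0x7ff789fffca0)
commands 1 2
  call (int) vm_stop(0)
  continue
end
continue
```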
===[ Red Herring ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Time to focus on what we should be focusing on. If we lose sight of our goal, we'll be burning time on unnecessary side quests. Let's see an example... Our first stop is the ''memcpy'' function:

(gdb-guest) bt 1
#0 memcpy () at arch/x86/lib/memcpy_64.S:40
(gdb-guest) disassemble
Dump of assembler code for function memcpy:
...
   0xffffffff81a36d82 <+18>: rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]
=> 0xffffffff81a36d85 <+21>: mov ecx,edx
...
(gdb-guest) info registers rdi rsi
rdi 0xffff888005fffeb8
rsi 0xffff888003305050

''REP'' repeats a string instruction (e.g., ''movs'') until ''RCX'' reaches ''0''. Each iteration decrements ''RCX'' and increments or decrements (depending on the direction flag ''DF'') the values in ''RDI'' and/or ''RSI'' by the element size (''QWORD'' -- 8 bytes) [ref20]. In this case, it copies 8-byte chunks from ''0xffff888003305050 - size'' to ''0xffff888005fffeb8 - size''.

The source address ''0xffff888003305050'' corresponds to the physical address ''0x3305000'', after subtracting the number of bytes already copied and the page offset:

src_phys = RSI - page_offset_base - size
         = 0xffff888003305050 - 0xffff888000000000 - 0x50
         = 0x3305000

So what is copied by the ''memcpy'' function?
(gdb-guest) x/80xb $rdi - 0x50 0xffff888005ffefe8: 0xf1 0x2c 0x00 0x00 0x00 0x00 0x00 0x00 0xffff888005ffeff0: 0x52 0x00 0x00 0x81 0x00 0x00 0x00 0x00 0xffff888005ffeff8: 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xffff888005fff000: 0x00 0x00 0x00 0x00 0xeb 0x12 0x00 0x00 <-- 0xffff888005fff008: 0x18 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xffff888005fff010: 0x18 0x00 0x00 0x00 0xeb 0x12 0x00 0x00 0xffff888005fff018: 0x0f 0x05 0xb0 0x3c 0x0f 0x05 0x38 0x00 0xffff888005fff020: 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xffff888005fff028: 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xffff888005fff030: 0x57 0x4f 0x52 0x4b 0x49 0x4e 0x47 0x0a (gdb-guest) x/80xb $rsi - 0x50 0xffff888003305000: 0x7f 0x45 0x4c 0x46 0x0f 0x05 0xb1 0x48 0xffff888003305008: 0x51 0x90 0x90 0x90 0x90 0x90 0x90 0x05 0xffff888003305010: 0x02 0x00 0x3e 0x00 0x5e 0xb2 0x0e 0xb8 0xffff888003305018: 0x00 0x00 0x00 0x00 0xeb 0x12 0x00 0x00 <-- 0xffff888003305020: 0x18 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xffff888003305028: 0x18 0x00 0x00 0x00 0xeb 0x12 0x00 0x00 0xffff888003305030: 0x0f 0x05 0xb0 0x3c 0x0f 0x05 0x38 0x00 0xffff888003305038: 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xffff888003305040: 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xffff888003305048: 0x57 0x4f 0x52 0x4b 0x49 0x4e 0x47 0x0a Even though it looks suspicious and seems worth investigating, it's a trap. Just look at it -- no zeros we're hunting for. It's simply copying the ELF program headers into a buffer. Here's the backtrace: (gdb-guest) bt #0 memcpy #1 _copy_to_iter #2 copy_page_to_iter #3 shmem_file_read_iter #4 __kernel_read #5 kernel_read #6 elf_read #7 load_elf_phdrs <--- #8 load_elf_binary ... It might be related, but first let's find the zeroing. We can return to this later since we have the snapshot. ===[ No BRK for You Tonight ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Leaving ''memcpy'' behind, what's next? (gdb-guest) c Program received signal SIGTRAP, Trace/breakpoint trap. 
set_brk (start=start@entry=20800526614553, end=end@entry=20800526614554, prot=prot@entry=6) at fs/binfmt_elf.c:113 Now it's getting interesting. A watchpoint triggered inside ''set_brk'' [ref21], and its arguments look a bit strange: (gdb-guest) info registers rdi rsi rdx rdi 0x12eb00000019 ; start rsi 0x12eb0000001a ; end rdx 0x6 ; prot Could the "program break" (brk [ref22]) be mangling our data? Frankly, I doubt it. In extreme cases, ''brk'' should only allocate or deallocate heap memory. I'd be surprised if it triggered zeroing in the kernel. Next please: (gdb-guest) c Program received signal SIGTRAP, Trace/breakpoint trap. copy_page () at arch/x86/lib/copy_page_64.S:20 ===[ Prepare Your Diddly Hole ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ''copy_page'' sounds like the kernel is preparing our binary for execution: (gdb-guest) disassemble Dump of assembler code for function copy_page: 0xffffffff819e4d60 <+0>: xchg ax,ax 0xffffffff819e4d62 <+2>: mov ecx,0x200 0xffffffff819e4d67 <+7>: rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi] (gdb-guest) x/80xb $rsi-(0x200*8) 0xffff888003305000: 0x7f 0x45 0x4c 0x46 0x0f 0x05 0xb1 0x48 0xffff888003305008: 0x51 0x90 0x90 0x90 0x90 0x90 0x90 0x05 0xffff888003305010: 0x02 0x00 0x3e 0x00 0x5e 0xb2 0x0e 0xb8 0xffff888003305018: 0x01 0x00 0x00 0x00 0xeb 0x12 0x00 0x00 0xffff888003305020: 0x18 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xffff888003305028: 0x18 0x00 0x00 0x00 0xeb 0x12 0x00 0x00 0xffff888003305030: 0x0f 0x05 0xb0 0x3c 0x0f 0x05 0x38 0x00 0xffff888003305038: 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xffff888003305040: 0x02 0x00 0x00 0x00 0x00 0x00 0x00 0x00 0xffff888003305048: 0x57 0x4f 0x52 0x4b 0x49 0x4e 0x47 0x0a (gdb-guest) x/80xb $rdi-(0x200*8) 0xffff8880029fe000: 0x7f 0x45 0x4c 0x46 0x0f 0x05 0xb1 0x48 0xffff8880029fe008: 0x51 0x90 0x90 0x90 0x90 0x90 0x90 0x05 0xffff8880029fe010: 0x02 0x00 0x3e 0x00 0x5e 0xb2 0x0e 0xb8 0xffff8880029fe018: 0x01 0x00 0x00 0x00 0xeb 0x12 0x00 0x00 0xffff8880029fe020: 0x18 0x00 0x00 
0x00 0x00 0x00 0x00 0x00
0xffff8880029fe028: 0x18 0x00 0x00 0x00 0xeb 0x12 0x00 0x00
0xffff8880029fe030: 0x0f 0x05 0xb0 0x3c 0x0f 0x05 0x38 0x00
0xffff8880029fe038: 0x01 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0xffff8880029fe040: 0x02 0x00 0x00 0x00 0x00 0x00 0x00 0x00
0xffff8880029fe048: 0x57 0x4f 0x52 0x4b 0x49 0x4e 0x47 0x0a

The binary is being copied correctly, one-to-one.

NOTE: ''copy_page_64()'' copies ''0x200'' QWORDs, which equals one 4 KiB page (''0x200 * 8 = 0x1000 = 4096''; who would have guessed). Execution stopped at the instruction following the ''rep movs'', so the copying is complete. Therefore, ''rsi'' and ''rdi'' now point one page past the region of interest. That means we need to subtract the page size (''0x1000'') from the addresses.

Do we have a new physical address?

(gdb-guest) p/x $rsi - (0x200 * 8) - page_offset_base
$1 = 0x3305000
(gdb-guest) p/x $rdi - (0x200 * 8) - page_offset_base
$2 = 0x29fe000

Yes, we do: ''0x29fe000''.

NOTE: We could also use the QEMU monitor command ''gva2gpa'':

(qemu) gva2gpa 0xffff8880029fe000
gpa: 0x29fe000

NOTE: Don't forget to subtract the correct page size, ''0x1000''. (I forgot several times and wondered why the offset didn't match.)

We can (and should) verify both addresses by searching for the data in the physical memory file, as we did earlier in [[#Thanks for the Memory]]:

$ rafind2 -x "$i" /dev/shm/mem
0x29fe000   <-- the new physical address <--.
0x3305000   --> copying the binary to ------'
0x6000ca0

===[ Watching a New Address on the Host ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now compute the virtual address of ''0x29fe000'' so that we can monitor it in gdb-host:

(qemu) gpa2hva 0x29fe000
Host virtual address for 0x29fe000 (mem) is 0x7ff7869fd000

What's left is to set a watchpoint for the new address in gdb-host. Again, don't forget the offset!

NOTE: gdb-host runs continuously (thanks to our "command" script). We can interrupt it with ''CTRL-C'', then set a new watchpoint and its corresponding "command" script.
(gdb-host) ^C
(gdb-host) awatch -l *(0x1c + 0x7ff7869fd000)
(gdb-host) commands
call (int) vm_stop(0)
c
end
(gdb-host) c

===[ From Heroes to Zeroes ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Back to gdb-guest to find out what's rewriting our data:

(gdb-guest) c
Program received signal SIGTRAP, Trace/breakpoint trap.
clear_user_rep_good () at arch/x86/lib/clear_page_64.S:142
(gdb-guest) disassemble
Dump of assembler code for function clear_user_rep_good:
...
   0xffffffff819e49ce <+14>: rep stos QWORD PTR es:[rdi],rax
=> 0xffffffff819e49d1 <+17>: and edx,0x7
(gdb-guest) info registers rdi rax
rdi 0x12eb00000ff9
rax 0x0

Ha! That looks promising. ''rep stos'' is effectively a memset implementation. It stores ''RCX'' QWORDs of the ''RAX'' value into the memory address at ''RDI'', while ''rep'' increments ''RDI'' and decrements ''RCX''. Not only does it write a lot of zeros, but it's also called from a function named ''clear_user'':

(gdb-guest) bt
#0 clear_user_rep_good
#1 __clear_user
#2 clear_user
#3 padzero <---
#4 load_elf_binary
...

Before we investigate it, let's make sure there are no other writes:

(gdb-guest) c
Program received signal SIGTRAP, Trace/breakpoint trap.
0x000012eb00000006 in ?? ()
(gdb-guest) x/20i $rip
=> 0x12eb00000006: mov cl,0x48
   0x12eb00000008: push rcx
   0x12eb00000009: nop
   0x12eb0000000a: nop
   0x12eb0000000b: nop
   0x12eb0000000c: nop
   0x12eb0000000d: nop
   0x12eb0000000e: nop
   0x12eb0000000f: add eax,0x3e0002
   0x12eb00000014: pop rsi
   0x12eb00000015: mov dl,0xe
   0x12eb00000017: mov eax,0x1
   0x12eb0000001c: add BYTE PTR [rax],al
   0x12eb0000001e: add BYTE PTR [rax],al
   0x12eb00000020: add BYTE PTR [rax],al
   0x12eb00000022: add BYTE PTR [rax],al
   0x12eb00000024: add BYTE PTR [rax],al
   0x12eb00000026: add BYTE PTR [rax],al
   0x12eb00000028: add BYTE PTR [rax],al
   0x12eb0000002a: add BYTE PTR [rax],al

Yup, that's our user space binary with zeroed data. It really looks like ''clear_user'' is the one we're after.
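To make the ''rep stos'' semantics concrete, here's a tiny model of what the instruction does with the registers we just inspected (a sketch for illustration, not the kernel's code):

```python
def rep_stos_qword(mem, rdi, rax, rcx):
    """Model of `rep stos QWORD PTR es:[rdi],rax` (DF=0): store RCX copies
    of the 8-byte RAX value at RDI, advancing RDI by 8 each iteration."""
    for _ in range(rcx):
        mem[rdi:rdi + 8] = rax.to_bytes(8, "little")
        rdi += 8
    return rdi  # RDI ends up just past the written region

buf = bytearray(b"\xff" * 24)
end = rep_stos_qword(buf, rdi=8, rax=0, rcx=2)  # zero 16 bytes at offset 8
print(buf.hex())  # ffffffffffffffff00000000000000000000000000000000
print(end)        # 24
```

This also explains the ''rdi'' value we saw (''0x12eb00000ff9''): it has already been advanced past the zeroed region by the time the watchpoint fires.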
===[ Going Back ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Let's set a breakpoint on the ''clear_user'' function and roll back using our snapshot from [[#From Physical Guest to Virtual Host]]: (gdb-guest) hbreak clear_user (gdb-guest) monitor loadvm load_elf_binary (gdb-guest) c NOTE: When we load the snapshot, QEMU immediately restores all its states (vCPUs, registers, memory, etc.) => the instruction pointer (''RIP'') points to the first instruction of ''load_elf_binary'', where we took the snapshot. HOWEVER! GDB isn't aware of this yet, so it still holds the old data because it caches values like registers. To view the correct registers right after ''loadvm'', flush the cache with: ''maintenance flush register-cache'' [ref23]. (gdb) info registers rip rip 0xfff0 (gdb) monitor loadvm snap (gdb) info registers rip rip 0xfff0 (gdb) monitor info registers ... RIP=ffffffff813e2c80 ... (gdb) maintenance flush register-cache Register cache flushed. (gdb) info registers rip rip 0xffffffff813e2c80 0xffffffff813e2c80 ===[ Inlining ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Now it's time for my favorite game: figuring out what arguments we're dealing with when a function has been inlined. The prototype for ''clear_user'' is [ref24]: static __always_inline unsigned long clear_user(void __user *to, unsigned long n) When we hit the breakpoint and follow the kernel calling convention, we'll see a rather odd ''n'' argument, such as: Breakpoint 2.7, clear_user (n=4071, to=0x12eb00000019) at arch/x86/include/asm/uaccess_64.h:123 (gdb-guest) info registers rdi rsi rdi 0x12eb00000019 20800526614553 ; to rsi 0x12eb0000001a 20800526614554 ; n The ''to'' argument looks legit, but ''n'' looks completely bonkers. That's because ''clear_user'' was inlined into its caller and is not a callable function. When a function is inlined, the compiler treats it more like a C macro -- injecting the code at each call site [ref25] [ref26]. 
So there is no calling convention, no prologue/epilogue, no return, and so on. Some registers like ''RSI'' may remain intact, but we cannot rely on that. We have to look at the disassembly and the corresponding registers to make sense of the arguments:

(gdb-guest) disassemble
Dump of assembler code for function padzero:
ffffffff813e17a5 <+5>:   mov    rax,rdi       ; rax = 0x12eb00000019
ffffffff813e17a8 <+8>:   and    eax,0xfff     ; rax = 0x12eb00000019 & 0xfff = 0x19
...
ffffffff813e17af <+15>:  mov    ecx,0x1000
ffffffff813e17b4 <+20>:  sub    rcx,rax       ; rcx = 0xfe7
...
ffffffff813e17d6 <+54>:  xor    eax,eax       ; rax = 0
ffffffff813e17d8 <+56>:  call   0xffffffff819e49c0 <clear_user_rep_good>
...

(gdb-guest) disassemble clear_user_rep_good
...
ffffffff819e49c8 <+8>:   shr    rcx,0x3       ; rcx = 0xfe7 >> 3 = 0x1fc
...
ffffffff819e49ce <+14>:  rep stos QWORD PTR es:[rdi],rax

The actual arguments to ''clear_user'' are in ''RDI'' (= ''to'') and ''RCX'' (= ''n''):

(gdb-guest) info registers rdi rcx
rdi            0x12eb00000019      20800526614553 ; to
rcx            0xfe7               4071           ; n

It's really zeroing the data in our binary, starting from offset ''0x19'' (= ''0x12eb00000019 - 0x12eb00000000'') up to the end of the page.

NOTE: We could get the same information by setting a breakpoint directly on ''rep stos'' and deducing the rest from the arguments it consumes -- it writes the value in ''RAX'' (= ''0'') to the memory at ''RDI'', repeated ''RCX'' times in QWORD units.

===[ Up to the Frame ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Even though we've successfully verified what is happening and where, we still don't know why. Since we're in the "zeroing" function, let's walk up the backtrace to see exactly where we are:

(gdb-guest) bt
#0 clear_user
#1 padzero
#2 load_elf_binary
...
(gdb-guest) frame 2
#2 0xffffffff813e3625 in load_elf_binary (bprm=0xffff888006000c00) at fs/binfmt_elf.c:1245

(gdb-guest) x/20i $rip-40
ffffffff813e3604:  call   0xffffffff813e1910 <set_brk>   <-.
ffffffff813e3609:  test   eax,eax                          |
ffffffff813e360b:  jne    ...                              |
ffffffff813e3611:  mov    rcx,QWORD PTR [rsp+0x18]         |
ffffffff813e3616:  cmp    rbx,rcx                          |
ffffffff813e3619:  je     ...                              |
ffffffff813e361b:  mov    rdi,QWORD PTR [rsp+0x30]         |
ffffffff813e3620:  call   0xffffffff813e17a0 <padzero>   <-'

We can pinpoint the exact location fairly easily, since there's only one place in ''load_elf_binary'' where ''set_brk'' is followed by ''padzero'' [ref27]:

	retval = set_brk(elf_bss, elf_brk, bss_prot);
	if (retval)
		goto out_free_dentry;
	if (likely(elf_bss != elf_brk) && unlikely(padzero(elf_bss))) {  // <----
		retval = -EFAULT; /* Nobody gets to see this, but.. */
		goto out_free_dentry;
	}

''padzero'' is called only if ''elf_bss'' (statically allocated variables) and ''elf_brk'' (heap) are not equal. We already know their values from our earlier inspection of ''set_brk'':

elf_bss = 0x12eb00000019
elf_brk = 0x12eb0000001a

From these two names and their values, we can already infer a lot. If you know what the ''.bss'' section is [ref28], it's an immediate red flag -- it holds statically allocated variables that should be initialized to 0 (at least for C programs on Linux). This not only explains the zeroing but also confirms that the starting offset is ''+0x19'', just as we suspected at the beginning. One mystery solved. Now the question is: why do ''elf_bss'' and ''elf_brk'' differ? We don't want that condition to trigger.

===[ Back to the Beginning ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We need to determine what ''elf_bss'' and ''elf_brk'' actually are -- whether they live in memory (for example, on the stack) or in registers. We also want to know what values are written to them, i.e., what logical data they hold => how they map to the ELF headers.
We know that ''set_brk'' takes both ''elf_bss'' and ''elf_brk'' as arguments, so let's find out where they come from by disassembling the instructions before ''call <set_brk>'':

(gdb-guest) x/30i $rip-100
ffffffff813e35cb:  mov    rax,QWORD PTR [rsp+0x40]
ffffffff813e35d0:  mov    rcx,QWORD PTR [rsp+0x18]
...
ffffffff813e35e6:  lea    rdi,[rax+r10*1]   ; elf_bss
ffffffff813e35ea:  lea    rsi,[rax+rcx*1]   ; elf_brk
...
ffffffff813e3604:  call   0xffffffff813e1910 <set_brk>

So both are on the stack:

elf_bss = [rsp+0x40] + r10
elf_brk = [rsp+0x40] + [rsp+0x18]

Reverse execution would be nice here, since we could simply ''reverse-step'' and ''reverse-continue'' until we found what we're looking for. But with snapshots, we can manage without it. Let's set a breakpoint at ''0xffffffff813e35cb: mov rax,QWORD PTR [rsp+0x40]'' and go back in time by loading the snapshot:

(gdb-guest) hbreak *0xffffffff813e35cb
(gdb-guest) monitor loadvm load_elf_binary
(gdb-guest) c
Breakpoint 3, 0xffffffff813e35cb in load_elf_binary (bprm=0xffff888006000c00) at fs/binfmt_elf.c:1230

(gdb-guest) x/1gx $rsp+0x40
0xffffc90000013e20: 0x0000000000000000
(gdb-guest) x/1gx $rsp+0x18
0xffffc90000013df8: 0x000012eb0000001a
(gdb-guest) p/x $rsp+0x18
$5 = 0xffffc90000013df8

(gdb-guest) watch -l *0xffffc90000013df8
Hardware watchpoint 4: -location *0xffffc90000013df8
(gdb-guest) monitor loadvm load_elf_binary
(gdb-guest) c

(gdb-guest) x/10i $rip-30
  ffffffff813e31fc:  test   esi,esi
  ffffffff813e31fe:  cmove  rdx,rdi
  ffffffff813e3202:  add    rcx,rax
  ffffffff813e3205:  mov    QWORD PTR [rsp+0x38],rdx
  ffffffff813e320a:  cmp    QWORD PTR [rsp+0x18],rcx
  ffffffff813e320f:  jae    ...
  ffffffff813e3211:  mov    DWORD PTR [rsp+0x58],ebx
  ffffffff813e3215:  mov    QWORD PTR [rsp+0x18],rcx   <---
=>ffffffff813e321a:  movzx  eax,WORD PTR [rbp+0xd8]

(gdb-guest) info registers rcx
rcx            0x12eb0000001a      20800526614554

I don't wanna cause any panic, but we're lost.

===[ Where the Hell Are We?!
]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is one of those cases where having the kernel source code around is a very good idea:

cd /usr/src
apt-get source linux=6.1.137-1

NOTE: We need the exact version with all patches applied (see [ref10]).

(gdb-guest) directory /usr/src/linux-6.1.137
(gdb-guest) list *0xffffffff813e3215-20
1213    k = elf_ppnt->p_vaddr + elf_ppnt->p_filesz;      // 0x12eb00000019
1214
1215    if (k > elf_bss)                                 // if (0x12eb00000019 > 0)
1216        elf_bss = k;                                 // elf_bss = 0x12eb00000019
1217    if ((elf_ppnt->p_flags & PF_X) && end_code < k)
1218        end_code = k;
1219    if (end_data < k)
1220        end_data = k;
1221    k = elf_ppnt->p_vaddr + elf_ppnt->p_memsz;       // 0x12eb0000001a
1222    if (k > elf_brk) {                               // if (0x12eb0000001a > 0)
1223        bss_prot = elf_prot;
1224        elf_brk = k;                                 // elf_brk = 0x12eb0000001a
1225    }

When we step through the code and inspect the values, we get:

elf_bss = elf_ppnt->p_vaddr + elf_ppnt->p_filesz;  // 0x12eb00000019
elf_brk = elf_ppnt->p_vaddr + elf_ppnt->p_memsz;   // 0x12eb0000001a

Now we know that ''elf_bss'' is a combination of two values from the program header: ''p_vaddr + p_filesz'' (and likewise, ''elf_brk'' is ''p_vaddr + p_memsz''). This is another bug that could have been caught by using our readelf tool [ref10], where the mismatch would have been clearly visible:

$ ./read_elf $f
...
p_filesz = 0000000000000001 (1)
p_memsz  = 0000000000000002 (2)   <--- p_memsz != p_filesz

But where's the fun in that?

===[ OUTRO ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Both time-travel debugging (or at least VM snapshots) and physical memory watchpoints are extremely useful and can eliminate a lot of trouble. Moreover, if record/replay worked in QEMU as intended, it would be ideal: deterministic execution combined with time travel. Until then, VM snapshots are still an excellent capability. Nonetheless, QEMU is not only a great virtual machine, it is also a great tool for reverse engineering and system analysis.
It has quirks and bugs, but because it is feature-rich and open source, there is usually a workaround.

Up to this point, we've talked about vanilla QEMU, but there are also QEMU forks such as QIRA [ref29] and PANDA [ref30]. QIRA is a GUI for timeless debugging, though it's kinda dead (geohot works on it when he feels like it [ref31]). PANDA, on the other hand, is actively developed and focused on software analysis.

While I'm on the topic of time-travel debuggers, I'll mention ''rr'' (record/replay) [ref32] for low-level Linux x86 user space. It's really good for real debugging (it was originally developed for debugging Mozilla Firefox [ref33]), but be careful -- ''rr'' is NOT sandboxed!

And don't forget: Hack The Planet! They're trashing the flow of data!

[[/html/2025/2025-09-11--touching_small_elfs-p1-broken_tools.html|Part 1: Understanding Small ELFs and Fixing Broken Tools]]
[[/html/2025/2025-10-06--touching_small_elfs-p2-segfaults_everywhere.html|Part 2: ELF Magic Gone Wrong: Debugging SEGFAULTs (Examples of ELF Failures)]]

===[ References ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

> [ref1]
https://en.wikipedia.org/wiki/Time_travel_debugging
> [ref2]
https://www.qemu.org/docs/master/system/replay.html
> [ref3]
https://www.qemu.org/docs/master/devel/replay.html
> [ref4]
https://wiki.qemu.org/Features/record-replay
> [ref5]
https://gitlab.com/qemu-project/qemu/-/issues/2634
> [ref6]
https://sourceware.org/gdb/current/onlinedocs/gdb.html/Connecting.html#index-add-new-commands-for-external-monitor
> [ref7]
https://sourceware.org/pipermail/gdb-patches/1999-August/000778.html
> [ref8]
https://research.h4x.cz/html/2025/2025-10-06--touching_small_elfs-p2-segfaults_everywhere.html
> [ref9]
https://www.youtube.com/watch?v=3Q25sogi-xo
  * It’s RAAAAAAW Supercut (2 Million Subscribers Special) | Hell’s Kitchen
> [ref10]
https://research.h4x.cz/html/2025/2025-09-11--touching_small_elfs-p1-broken_tools.html
> [ref11]
https://qemu.readthedocs.io/en/v10.0.3/devel/migration/index.html
> [ref12]
https://www.kernel.org/doc/html/v6.4/security/self-protection.html#kernel-address-space-layout-randomization-kaslr
> [ref13]
https://www.kernel.org/doc/html/next/x86/pti.html
> [ref14]
https://book.rada.re/tools/rafind2/intro.html
> [ref15]
https://www.kernel.org/doc/html/v6.4/arch/x86/x86_64/mm.html
> [ref16]
https://github.com/qemu/qemu/commit/e9628441df3a7aa0ee83601a0cc9111b91e2319a
> [ref17]
https://github.com/cyrus-and/gdb-dashboard
> [ref18]
https://github.com/qemu/qemu/blob/v10.1.1/system/cpus.c#L724
> [ref19]
https://github.com/qemu/qemu/blob/v10.1.1/qapi/run-state.json
> [ref20]
https://www.felixcloutier.com/x86/rep%3Arepe%3Arepz%3Arepne%3Arepnz
> [ref21]
https://elixir.bootlin.com/linux/v6.1.137/source/fs/binfmt_elf.c#L112
> [ref22]
https://www.man7.org/linux/man-pages/man2/brk.2.html
> [ref23]
https://sourceware.org/gdb/current/onlinedocs/gdb.html/Maintenance-Commands.html
> [ref24]
https://elixir.bootlin.com/linux/v6.1.137/source/arch/x86/include/asm/uaccess_64.h#L121
> [ref25]
https://www.kernel.org/doc/local/inline.html
> [ref26]
https://gcc.gnu.org/onlinedocs/gcc/Inline.html
> [ref27]
https://elixir.bootlin.com/linux/v6.1.137/source/fs/binfmt_elf.c#L1242
> [ref28]
https://en.wikipedia.org/wiki/.bss
> [ref29]
https://github.com/geohot/qira
> [ref30]
https://github.com/panda-re/panda/blob/dev/panda/docs/manual.md
> [ref31]
https://www.youtube.com/watch?v=QleTEw0hKXQ
  * George Hotz | Programming | Improving and running QIRA from scratch! | Part3
> [ref32]
https://rr-project.org/
> [ref33]
https://github.com/rr-debugger/rr/wiki/Recording-Firefox