Quest for the Smallest Possible ELF x86-64 Executable on Linux

~ Fanda Uchytil

===[ INTRO ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The smallest possible executable x86-64 ELF on Linux is 57 bytes.

This is quite a claim! How will we get out of this one? Will we be able to
defend our honour? Let's put on our delving suit and delve into the depths to
find out.

                                                        ▄▀████▀▄
                                                       ██▄▀██▀▄██
                                                       █▀▀▀▀▀▀▀▀█
          Those are pretty strong words,             ▀▄█ █▀██▀█ █▄▀
                                                      ▀█ █▄██▄█ █▀
                                                       ██▄▄▀▀▄▄██
            for a 7-year-old girl!                      █▀▄▀▀▄▀█
                                                       █▀▄▀▀▀▀▄▀█
                                                      ▀ ██▀██▀██ ▀
                                                      ██▀▄█▀▀█▄▀██
                                                       ▀██▄██▄██▀

===[ Officer I Swear It's a Working Example ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's start with a "working" example and build our understanding from there.

The following hexdump is the annotated xxd output of a 57-byte x86-64 binary
that's a legit executable for the Linux ELF loader:

--------------------------[ 57-byte_elf_x86-64.xxd ]---------------------------
00000000: 7f45 4c46 0000 0000 0000 0000 0000 0000  .ELF............
;         ^-------^                e_entry
;      e_ident[EI_MAG]        v-----------------v
00000010: 0200 3e00 0000 0000 0000 0000 0000 0000  ..>.............
;         ^.-^ ^-.^
;          |     '--- e_machine = x86-64
;      e_type = ET_EXEC
00000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000030: 0000 0000 0000 3800 01                   ......8..
;                        ^.-^ ^^
;          e_phentsize ---'    '--- e_phnum
-------------------------------------------------------------------------------

Convert it into a real binary:

grep -v '^;' 57-byte_elf_x86-64.xxd | xxd -r > 57-byte_elf_x86-64
chmod 755 57-byte_elf_x86-64

And test it by running it (in a virtual machine, of course, we are not
animals):

$ ./57-byte_elf_x86-64
Segmentation fault (core dumped)

Pretty good, right?! I'm actually proud of it.

It's not just an ordinary segfault, it's a good one (unlike those in [ref1]),
because this one didn't crash the ''execve(2)'' syscall... Don't look at me
like that! Fully working examples are for normies who are slaves to the system!
So, let's see how it works...

===[ ELF Structure ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

How can a 57-byte ELF64 work when the ELF64 header itself is 64 bytes? That
should be the minimum, correct? Well, evil always finds a way, kids.

When shrinking anything down, the main question is always: which fields are
needed and which can be removed? In the case of an ELF executable, we need to
look at two places:

  - the ELF64 structure [ref7], and
  - the Linux ELF loader [ref4].

Let's start with the ELF headers, which define the structure of a program. All
ELF binaries must have exactly one ELF header (= ''Elf64_Ehdr''). It's the only
header that must start at offset 0, and it defines the most basic
characteristics of an ELF binary (e.g., ELF type, architecture, entry point,
offset of the first program header, ...).

Here is the ELF64 header; the commented fields are the crucial ones that the
Linux ELF loader requires (more on that later):

-------------------------[ ELF64 header (from elf.h) ]-------------------------
typedef struct {
    unsigned char e_ident[16];      // the first 4 bytes = ELF magic
    uint16_t      e_type;           // are we an executable?
    uint16_t      e_machine;        // the architecture
    uint32_t      e_version;
    Elf64_Addr    e_entry;          // the start address
    Elf64_Off     e_phoff;          // the program header offset
    Elf64_Off     e_shoff;
    uint32_t      e_flags;
    uint16_t      e_ehsize;
    uint16_t      e_phentsize;      // size of the program header structure
    uint16_t      e_phnum;          // number of entries in the program header
    uint16_t      e_shentsize;
    uint16_t      e_shnum;
    uint16_t      e_shstrndx;
} Elf64_Ehdr;
-------------------------------------------------------------------------------

Only the fields with a comment are necessary for the Linux loader to run an ELF
binary (more on that in [[#Trim it! Trim it harder!]]). ELF is not just for
executable code; it's a container for data that describe executable code. Its
structure (apart from instructions) can also contain symbols, relocations,
memory, and so on. In this article, we'll focus only on the executable part of
ELF64 => ''e_type'' is either ''ET_EXEC'' (= fixed-address executable) or
''ET_DYN'' (= shared object or position-independent executable).

When we set ''e_type'' to one of the executable types, the kernel ELF loader
requires one more data structure -- the program header:

------------------------[ Program Header (from elf.h) ]------------------------
typedef struct {
    uint32_t   p_type;
    uint32_t   p_flags;
    Elf64_Off  p_offset;
    Elf64_Addr p_vaddr;
    Elf64_Addr p_paddr;
    uint64_t   p_filesz;
    uint64_t   p_memsz;
    uint64_t   p_align;
} Elf64_Phdr;
-------------------------------------------------------------------------------
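Both sizes (64 and 56 bytes) come up again and again below, so here is a quick
way to double-check them without a C compiler. This is only a sketch: the two
format strings are a hand transcription of the structures above (they are not
taken from elf.h), and the names ''EHDR_FMT'' and ''PHDR_FMT'' are made up for
this example.

```python
import struct

# "<" = little-endian with no implicit padding.
# 16s = e_ident[16], H = uint16_t, I = uint32_t, Q = 64-bit address/offset.
EHDR_FMT = "<16sHHIQQQIHHHHHH"   # Elf64_Ehdr (hand-transcribed)
PHDR_FMT = "<IIQQQQQQ"           # Elf64_Phdr (hand-transcribed)

EHDR_SIZE = struct.calcsize(EHDR_FMT)
PHDR_SIZE = struct.calcsize(PHDR_FMT)

if __name__ == "__main__":
    print("sizeof (Elf64_Ehdr) =", EHDR_SIZE)
    print("sizeof (Elf64_Phdr) =", PHDR_SIZE)
```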

The program header tells the kernel how to load a binary into memory. For
example, the ELF loader in Linux kernel 6.12 [ref4] recognizes the following
types:

PHDR TYPE               DESCRIPTION
--------------------------------------------------------------------------
PT_LOAD                 how to map a binary into memory
PT_GNU_STACK            controls whether the userspace stack is executable
PT_GNU_PROPERTY         special GNU properties (hardening, ISA specs, ...)
PT_INTERP               indicates the dynamic linker (the ELF interpreter)
PT_LOPROC...PT_HIPROC   range for processor-specific segment types

NOTE: To execute code from our binary, it must be loaded in working memory. The
only two types that do this are ''PT_LOAD'', which describes how the binary
should be mapped into memory, and ''PT_INTERP'', which tells the kernel that
the binary needs a dynamic linker loaded to resolve all dependent libraries
before execution.

Now that we have some idea of how an executable ELF is structured, we want to
know what we can remove for it to stay a valid executable.

===[ Trim it! Trim it harder! ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As mentioned in the previous section, the minimum size of an ELF64 binary
should hypothetically be 64 bytes, since that's the size of the ELF header
structure. But what about the program header, which adds another 56 bytes? We
also said that an executable needs it; in fact, the ELF loader enforces it.
Which fields does the loader actually require, and in which structure?

The ELF loader is mostly defined in the ''load_elf_binary'' function [ref5],
and the first thing it does is check whether the ''e_ident'' field has the
4-byte ELF magic (= ''\x7fELF''):

if (memcmp(elf_ex->e_ident, ELFMAG, SELFMAG) != 0)    // e_ident == "\177ELF"
    goto out;

''e_ident'' is an array of 16 bytes (= ''unsigned char e_ident[16]''), but only
the first 4 bytes are important. The remaining 12 bytes can be used for
whatever (wink wink, nudge nudge, say no more).

Then the loader checks whether the binary is executable, and if not, it GTFOs:

if (elf_ex->e_type != ET_EXEC && elf_ex->e_type != ET_DYN)
    goto out;

That means the ''e_type'' field is also mandatory. Let's do a quick check of
where we are in the ELF header structure and how many bytes we currently need:

-----------------------------------[ elf.h ]-----------------------------------
typedef struct {
    unsigned char e_ident[16];
    uint16_t      e_type;           <-- ET_EXEC || ET_DYN
    uint16_t      e_machine;
    uint32_t      e_version;
    Elf64_Addr    e_entry;
    Elf64_Off     e_phoff;
    Elf64_Off     e_shoff;
    uint32_t      e_flags;
    uint16_t      e_ehsize;
    uint16_t      e_phentsize;
    uint16_t      e_phnum;
    uint16_t      e_shentsize;
    uint16_t      e_shnum;
    uint16_t      e_shstrndx;
} Elf64_Ehdr;
-------------------------------------------------------------------------------

We are now at 18 bytes (we need the full 16 bytes of ''e_ident'' and 2 bytes of
''e_type'').

Next, the loader checks the architecture (''e_machine''), which in our case
must be equal to ''0x003e'' (= ''EM_X86_64''), defining the x86-64
architecture:

if (!elf_check_arch(elf_ex))      // e_machine == EM_X86_64
    goto out;

It's another 2 bytes => 20 bytes so far.
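The three checks so far can be replayed on a raw byte string. The following is
a minimal sketch, not the kernel's logic verbatim: ''passes_first_checks'' is
a hypothetical helper, it only models the magic, ''e_type'' and ''e_machine''
comparisons quoted above, and it assumes an x86-64 target.

```python
import struct

ELFMAG = b"\x7fELF"
ET_EXEC, ET_DYN = 2, 3
EM_X86_64 = 0x3E

def passes_first_checks(image: bytes) -> bool:
    """Replay the magic, e_type and e_machine checks on a raw file image."""
    if len(image) < 20:               # sketch only: we need offsets 0..19
        return False
    if image[:4] != ELFMAG:           # e_ident[EI_MAG]
        return False
    # e_type and e_machine are two uint16_t right after the 16-byte e_ident
    e_type, e_machine = struct.unpack_from("<HH", image, 16)
    if e_type not in (ET_EXEC, ET_DYN):
        return False
    return e_machine == EM_X86_64

if __name__ == "__main__":
    image = ELFMAG + bytes(12) + struct.pack("<HH", ET_EXEC, EM_X86_64)
    print(len(image), passes_first_checks(image))
```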

NOTE: We can ignore some checks, such as ''elf_check_fdpic'', as they don't
apply to the x86 architecture (but they may on other architectures, where
''e_ident[EI_OSABI]'' can be checked, so be cautious).

And """finally""" (for the ELF header), the loader calls ''load_elf_phdrs'',
which reads the whole program header table (all records, not only ''PT_LOAD'')
from the file:

elf_phdata = load_elf_phdrs(elf_ex, bprm->file);
if (!elf_phdata)
    goto out;

We need to be extra careful here: if this call fails, the whole loading process
fails! When we look into the function, it's obvious that we need at least one
program header record:

if (elf_ex->e_phentsize != sizeof(struct elf_phdr))
    goto out;

/* Sanity check the number of program headers and their total size. */
size = sizeof(struct elf_phdr) * elf_ex->e_phnum;
if (size == 0 || size > 65536 || size > ELF_MIN_ALIGN)
    goto out;

It then reads all records in the program header from the file (which might be a
problem):

/* Read in the program headers */
retval = elf_read(elf_file, elf_phdata, size, elf_ex->e_phoff);
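To get a feeling for what the sanity check accepts, here is a small sketch of
the same arithmetic. ''phdr_table_size'' is a hypothetical helper, and the
value of ''ELF_MIN_ALIGN'' assumes 4 KiB pages (the usual case on x86-64).

```python
PHDR_SIZE = 56          # sizeof (struct elf_phdr) for a 64-bit ELF
ELF_MIN_ALIGN = 4096    # assumption: 4 KiB pages

def phdr_table_size(e_phentsize: int, e_phnum: int):
    """Bytes the loader would read for the table, or None if it bails out."""
    if e_phentsize != PHDR_SIZE:
        return None
    size = PHDR_SIZE * e_phnum
    if size == 0 or size > 65536 or size > ELF_MIN_ALIGN:
        return None
    return size

if __name__ == "__main__":
    for phnum in (0, 1, 73, 74):
        print(phnum, phdr_table_size(PHDR_SIZE, phnum))
```

Under these assumptions a single entry means a 56-byte read from ''e_phoff'',
and anything above 73 entries is rejected.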

Let's go back to the ELF header and assess where we are now. From the
conditions above, we can see that the loader requires a correct ''e_phentsize''
(which must be exactly ''sizeof (struct elf_phdr)'' => 56 bytes) and a nonzero
''e_phnum'' (it must also be below the relevant limits, but we can ignore that
since we want the bare minimum):

-----------------------------------[ elf.h ]-----------------------------------
typedef struct {
    unsigned char e_ident[16];      // 16
    uint16_t      e_type;           // 2
    uint16_t      e_machine;        // 2
    uint32_t      e_version;        // 4
    Elf64_Addr    e_entry;          // 8
    Elf64_Off     e_phoff;          // 8
    Elf64_Off     e_shoff;          // 8
    uint32_t      e_flags;          // 4
    uint16_t      e_ehsize;         // 2
    uint16_t      e_phentsize;      // 2
    uint16_t      e_phnum;          // 2    <--- WE ARE HERE
    uint16_t      e_shentsize;
    uint16_t      e_shnum;
    uint16_t      e_shstrndx;
} Elf64_Ehdr;
-------------------------------------------------------------------------------

None of the following fields are read by the kernel loader, so we're at 58
bytes for the ELF header. That's without the program header.

That said, what about the program header? We know that at least one ''sizeof
(struct elf_phdr)'' entry is required, which is 56 bytes for a 64-bit
executable. Does that bring the total to ''58 + 56 = 114''? Well, nobody said
that we cannot overlay the ELF and program headers [ref3]. Moreover, we don't
even need a valid program header entry, because the kernel fails only on
invalid ''PT_LOAD'' records. Therefore, we can set ''e_phoff'' to zero, which
points straight to the beginning of the ELF header and reduces the total number
of bytes back to 58.
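The byte counting from this section fits into a few lines. This is a sketch
with hypothetical helper names (''end_offset'', ''min_file_size''); the field
sizes are the ones annotated in the listing above.

```python
# (field name, size in bytes), in Elf64_Ehdr order
EHDR_FIELDS = [
    ("e_ident", 16), ("e_type", 2), ("e_machine", 2), ("e_version", 4),
    ("e_entry", 8), ("e_phoff", 8), ("e_shoff", 8), ("e_flags", 4),
    ("e_ehsize", 2), ("e_phentsize", 2), ("e_phnum", 2),
    ("e_shentsize", 2), ("e_shnum", 2), ("e_shstrndx", 2),
]
PHDR_SIZE = 56

def end_offset(last_field: str) -> int:
    """Offset just past ``last_field`` in the ELF header."""
    total = 0
    for name, size in EHDR_FIELDS:
        total += size
        if name == last_field:
            return total
    raise KeyError(last_field)

def min_file_size(e_phoff: int) -> int:
    """Bytes needed for the header up to e_phnum plus one phdr at e_phoff."""
    return max(end_offset("e_phnum"), e_phoff + PHDR_SIZE)

if __name__ == "__main__":
    print(end_offset("e_type"), end_offset("e_machine"), end_offset("e_phnum"))
    print(min_file_size(58), min_file_size(0))
```

Placing the program header right behind the trimmed ELF header (''e_phoff =
58'') costs 114 bytes, while overlaying it at offset 0 costs nothing extra.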

===[ One Byte Less ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

So far, we have a 58-byte binary and no less, but the binary at the beginning
is only 57 bytes long. What gives?

We can trim one more byte. ''e_phnum'' is the last field in the ELF header that
is needed. It's a two-byte integer, and we can remove one of those bytes. Two
properties of x86 allow this: little-endian encoding and page-granularity
allocation.

Little-endian stores the least significant byte at the smallest address [ref6].
What that means is that if we have a two-byte number with the value 1 (=
''0x0001''), its bytes will be stored in memory in reverse order as ''01 00''.
This is one of the reasons why little-endian exists -- it makes type casting
straightforward: ''(char) 01'', ''(short) 01 00'', ''(int) 01 00 00 00'', ...
All of these represent the same value, 1, but with different type widths.
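A tiny sketch of the same idea (''widen'' is a made-up helper name): the
little-endian encoding of 1 always starts with ''01'', and wider types only
append zero bytes.

```python
def widen(value: int, width: int) -> bytes:
    """Little-endian encoding of ``value`` in ``width`` bytes."""
    return value.to_bytes(width, "little")

if __name__ == "__main__":
    for width in (1, 2, 4):
        print(width, widen(1, width).hex(" "))
```

Read the other way around: a lone ''01'' followed by a zero byte decodes to
the same two-byte value as ''01 00''.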

That's nice, but ''e_phnum'' is a two-byte data type (= ''uint16_t''), so that
means two bytes are read, right? Correct. And this is where
page-size-granularity allocation comes into play. For the sake of efficiency,
allocations are done on a page-size basis. So, when anything is read into
memory, even if it's just one byte, one full page is allocated (most of the
time, 4096 bytes on x86). The data are copied there, and the rest (e.g., 4095
bytes) is zeroed out. That means that when we read ''(uint16_t) e_phnum'' from
process memory, it reads an implicit zero from the page.

Here, have a picture:

          FILE                  MMAP()                  MEMORY
00000000: 4861 636b 2074        ----->        ffff0000: 4861 636b 2074
00000006: 6865 2070 6c61                      ffff0006: 6865 2070 6c61
0000000c: 6e65 7421 0a                        ffff000c: 6e65 7421 0a00
                                              ffff0012: 0000 0000 0000
                                              ffff0018: 0000 0000 0000
                                                        ...
                                              ffff0ffa: 0000 0000 0000
                                              ffff1000:
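The effect can be simulated in user space. The sketch below only models the
"copy into a zero-filled buffer, then read a uint16_t" idea described above;
the page-sized ''BUF_SIZE'' and the helper name ''read_e_phnum'' are
assumptions for illustration, and the kernel's actual buffer is not modelled.

```python
import struct

E_PHNUM_OFFSET = 56     # offset of e_phnum in Elf64_Ehdr
BUF_SIZE = 4096         # assumption: one zero-filled, page-sized buffer

def read_e_phnum(file_bytes: bytes) -> int:
    """Copy the file into a zeroed buffer, then read e_phnum as uint16_t."""
    buf = bytearray(BUF_SIZE)                 # all zeros
    chunk = file_bytes[:BUF_SIZE]
    buf[:len(chunk)] = chunk                  # everything behind it stays zero
    (e_phnum,) = struct.unpack_from("<H", buf, E_PHNUM_OFFSET)
    return e_phnum

if __name__ == "__main__":
    full = bytes(56) + b"\x01\x00"            # 58 bytes, both e_phnum bytes
    trimmed = bytes(56) + b"\x01"             # 57 bytes, high byte missing
    print(read_e_phnum(full), read_e_phnum(trimmed))
```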

NOTE: Unfortunately, we cannot use the same trick for the program header,
because the loader reads it directly from the file (as we saw in
[[#Trim it! Trim it harder!]]). Therefore, if the structure is trimmed in the
file, the loader will fail to read the required size. There may be some trick
to this, but I didn't find one.

When we trim one byte from ''e_phnum'', we still get a valid ELF64 executable,
and the final size is 57 bytes. Except that it still segfaults!

===[ Good, bad... I'm the guy with the segfault ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is more than one type of segmentation fault [ref1][ref2]. More precisely,
some segfaults are recoverable. Therefore, there is such a thing as a "good" or
"bad" segfault. In this case, a "bad" segfault is one that causes the
''execve(2)'' syscall to fail completely (see [ref2]). On the other hand, a
"good" segfault, for our purposes, is one that occurs in user space => on an
already loaded binary that tries to execute code but fails for some reason.

The following strace output is such a "good" segfault, which occurs in the
57-byte binary from the beginning:

-------------------------------[ good segfault ]-------------------------------
$ strace -i ./57-byte_elf_x86-64
[00007ffff7e7fad7] execve("./57-byte_elf_x86-64", ["./57-byte_elf_x86-64"], 0x7fffffffea58 /* 1 var */) = 0
[0000000000000000] --- SIGSEGV {si_signo=SIGSEGV, si_code=SEGV_MAPERR, si_addr=NULL} ---
[????????????????] +++ killed by SIGSEGV (core dumped) +++
Segmentation fault (core dumped)
-------------------------------------------------------------------------------

The ''execve()'' succeeds (= returns 0), and then the process faults at address
0 because ''e_entry'' is exactly that -- zero. Why is that good? Because the
binary is successfully running as a process and then tries to execute user
space instructions, but that is where it fails -- it's unable to execute
anything because nothing is mapped there.

Why is this useful? Well, now we at least have the theoretical possibility to
jump somewhere else in process memory that might be executable -- and if we
wish hard enough, there will be such a place...

===[ Summary of 57-byte ELF64 ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We've traveled a long way. Let's summarize and finally look at some assembly
code:

NOTE: The following is ''nasm'' code [ref25], and it's very simple to read:
''db = 1 byte, dw = 2 bytes, dd = 4 bytes, dq = 8 bytes''; the meanings are
explained in the comments. ''REQUIRED'' tells you whether that field is
required by the Linux ELF loader.

--------------------------[ 57-byte_elf_x86-64.nasm ]--------------------------
BITS 64                         ;                                  REQUIRED
    db  0x7F, "ELF"             ;   e_ident[EI_MAG]                yes
    times 12 db 0x00            ;   e_ident[...]                   -
    dw  0x0002                  ;   e_type      = ET_EXEC          yes
    dw  0x003e                  ;   e_machine   = x86-64           yes
    dd  0x00000001              ;   e_version                      -
    dq  0x0000000000000000      ;   e_entry     = 0                yes
    dq  0x0000000000000000      ;   e_phoff     = 0                yes
    dq  0x0000000000000000      ;   e_shoff                        -
    dd  0x00000000              ;   e_flags                        -
    dw  0x0040                  ;   e_ehsize                       -
    dw  0x0038                  ;   e_phentsize = sizeof (phdr)    yes
    db  0x01                    ;   e_phnum     = 1 entry          yes
-------------------------------------------------------------------------------
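For those without nasm at hand, the same bytes can be packed field by field.
This is a sketch that mirrors the values in the listing above (the function
name ''build_57_byte_elf'' is made up). Note that the xxd dump at the start
of the article has zeros in ''e_version'' and ''e_ehsize'', while the listing
uses 1 and 0x40; since the loader does not look at either field, that
difference should not matter.

```python
import struct

def build_57_byte_elf() -> bytes:
    """Assemble the bytes of the nasm listing, one field at a time."""
    header = struct.pack(
        "<4s12sHHIQQQIHH",
        b"\x7fELF",     # e_ident[EI_MAG]
        bytes(12),      # rest of e_ident
        0x0002,         # e_type      = ET_EXEC
        0x003E,         # e_machine   = x86-64
        0x00000001,     # e_version
        0,              # e_entry
        0,              # e_phoff
        0,              # e_shoff
        0,              # e_flags
        0x0040,         # e_ehsize
        0x0038,         # e_phentsize = sizeof (phdr)
    )
    return header + b"\x01"   # low byte of e_phnum; the high byte is trimmed

if __name__ == "__main__":
    image = build_57_byte_elf()
    print(len(image))
    print(image.hex())
```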

If we stopped here, we would have a 57-byte ELF64 that the Linux kernel
executes correctly, but unfortunately it still fails to execute any user code.
And we would really like it to execute something! Otherwise, the executable is
more or less useless. So, what can we do with it?

First, what is at our disposal (= what does the kernel map for every process by
default)?

$ gdb -q -ex 'file ./execve_wrapper' -ex 'catch exec' -ex "run ./57-byte_elf_x86-64"
...
(gdb) info proc map
          Start Addr           End Addr       Size     Offset  Perms  objfile
      0x7ffff7ff9000     0x7ffff7ffd000     0x4000        0x0  r--p   [vvar]
      0x7ffff7ffd000     0x7ffff7fff000     0x2000        0x0  r-xp   [vdso]
      0x7ffffffde000     0x7ffffffff000    0x21000        0x0  rw-p   [stack]

Eh, not much. Not surprising, since we didn't load anything into memory. Look
where ''e_phoff'' points -- right at the beginning of the binary. That means
there is no ''PT_LOAD'' record to load anything into memory. So when the binary
is executed, no code from it is actually loaded.
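To make that concrete: with ''e_phoff = 0'', the first four bytes of the file
(the ELF magic) double as ''p_type''. A minimal sketch (''phdr_type_at'' is a
hypothetical helper, and ''PHDR_FMT'' is a hand transcription of
''Elf64_Phdr''):

```python
import struct

PT_LOAD = 1
# p_type, p_flags, p_offset, p_vaddr, p_paddr, p_filesz, p_memsz, p_align
PHDR_FMT = "<IIQQQQQQ"

def phdr_type_at(image: bytes, e_phoff: int) -> int:
    """p_type of the program header entry that starts at e_phoff."""
    return struct.unpack_from(PHDR_FMT, image, e_phoff)[0]

if __name__ == "__main__":
    image = b"\x7fELF" + bytes(52)     # only the first 4 bytes matter here
    p_type = phdr_type_at(image, 0)
    print(hex(p_type), p_type == PT_LOAD)
```

The magic read as a little-endian ''uint32_t'' is ''0x464c457f'', which is
not ''PT_LOAD''.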

We either need to somehow make a ''PT_LOAD'' record that will load our code, or
we need to use regions that are mapped by the kernel. (There might be some
possibilities in ''PT_INTERP'', but I didn't go there.)

===[ Cursed PT_LOAD ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This section is super frustrating for me. I have a long-standing battle with
the ''PT_LOAD'' type, and I'm systematically losing. So far, I have been unable
to make the smallest ELF64 with a valid ''PT_LOAD''. The smallest size "I" was
able to get is a 73-byte, self-contained, fully valid ELF64, and that's only
thanks to the tmp.out article by lm978 [ref8] (btw, this is superb work; lm978
was able to make a fully valid 73-byte ELF64 "Hello, world!"; Respect+!).
Sadly, when I apply the technique, it adds 16 bytes to the base 57-byte binary,
and that version already exists. (It also needs either more than 4 GiB of
memory or enabled memory overcommit, which I don't see as a problem.)

Well, here it is for completeness' sake, and if you want to know more (as you
should), read the OG article [ref8]:

BITS 64                         ;                                  USABLE?
    db  0x7F, "ELF"             ;   e_ident[EI_MAG]                -
_code:
    mov dil, 66                     ; 40 B7 42  ; return value
    mov al, 0x3c                    ; B0 3C     ; sys_exit
    syscall                         ; 0F 05
    db  0x00                    ;   e_ident[EI_PAD]                yes
_phdr:
    db  0x01                    ;   e_ident[EI_PAD]                yes
    db  0x00                    ;   e_ident[EI_PAD]                yes
    db  0x00                    ;   e_ident[EI_PAD]                yes
    db  0x00                    ;   e_ident[EI_PAD]                yes
    dw  0x0003                  ;   e_type        = executable     -
    dw  0x003e                  ;   e_machine     = x86-64         -
    dd  0x0000000c              ;   e_version     = ELF version    yes
    dq  0x0000000c00000000      ;   e_entry                        -
    dq  _phdr                   ;   e_phoff                        -
    dq  0x0038000000000000      ;   e_shoff                        yes
    dd  0x00000001              ;   e_flags                        yes
    dw  0x0000                  ;   e_ehsize                       yes
    dw  0x0038                  ;   e_phentsize                    -
    db  0x01                    ;   e_phnum                        -

    ; phdr padding -- explained in the "One Byte Less" section
    times 11 db 0x00

    ; jump padding
    times 3 db 0x00

    jmp short _code
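To see what the overlap produces, it helps to decode the bytes at ''_phdr'' as
an ''Elf64_Phdr''. The sketch below is a hand transcription of the listing
into Python; the function names are hypothetical, and the ''jmp short''
encoding (''EB rel8'') is filled in by hand. Treat it as an illustration, not
as a replacement for assembling the original or reading [ref8].

```python
import struct

def build_73_byte_elf() -> bytes:
    """Hand transcription of the listing above (not assembled by nasm)."""
    code = bytes.fromhex("40b742" "b03c" "0f05")  # mov dil,66; mov al,0x3c; syscall
    image = b"\x7fELF" + code + b"\x00"           # offsets 0..11
    phdr_off = len(image)                         # _phdr = 12
    image += struct.pack(
        "<IHHIQQQIHHB",
        0x00000001,             # last 4 bytes of e_ident
        0x0003,                 # e_type
        0x003E,                 # e_machine
        0x0000000C,             # e_version
        0x0000000C00000000,     # e_entry
        phdr_off,               # e_phoff = _phdr
        0x0038000000000000,     # e_shoff
        0x00000001,             # e_flags
        0x0000,                 # e_ehsize
        0x0038,                 # e_phentsize
        0x01,                   # e_phnum (low byte only)
    )
    image += bytes(11)          # phdr padding
    image += bytes(3)           # jump padding
    rel8 = (4 - (len(image) + 2)) & 0xFF          # jmp short _code (_code = 4)
    return image + bytes([0xEB, rel8])

def decode_phdr(image: bytes, off: int) -> dict:
    """Interpret 56 bytes at ``off`` as an Elf64_Phdr."""
    names = ("p_type", "p_flags", "p_offset", "p_vaddr",
             "p_paddr", "p_filesz", "p_memsz", "p_align")
    return dict(zip(names, struct.unpack_from("<IIQQQQQQ", image, off)))

if __name__ == "__main__":
    image = build_73_byte_elf()
    print(len(image))
    for name, value in decode_phdr(image, 12).items():
        print(f"{name:9s} = {value:#x}")
```

Under these assumptions the overlapped record decodes to ''p_type = 1'' (=
''PT_LOAD'') with ''p_filesz'' and ''p_memsz'' slightly above 4 GiB, which
would be consistent with the memory requirement mentioned above.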

So what other options are there?

We already know that ''PT_LOAD'' is needed when we want to load any part of the
binary into memory (which is usually the case). But we are under no obligation
to load anything from the program. Actually, let's double down! We're edgy kids
-- we rebel against such oppressive doctrines as ''PT_LOAD''! It's just so...so
crypto-fascist.

At the beginning, I said that we need at least one program header record.
That's right, *a* record. It doesn't need to be ''PT_LOAD''. It doesn't even
need to be one of those the kernel recognizes. (In some cases, it might be
better if it's unrecognizable by the kernel, as in the 57-byte example. If it's
skipped, it doesn't have side effects, right?)

In the previous section, we looked at the memory mapping of the 57-byte
program. There was one executable region: ''vdso''. What if we point
''e_entry'' there? There could be some interesting instructions to execute...

===[ vDSO and ROP ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

No! Just NO! There are too many problems with it. (I like vDSO, but not in this
case. If you've ever played with it, you probably know what I mean.)

Virtual Dynamically-linked Shared Object (a.k.a. vDSO) is a small subset of
kernel space exported to user space for faster access to some code (like
''gettimeofday'') [ref29]. It's a virtual ELF file (that's good), and unlike
its predecessor ''vsyscall'', it was built with ASLR in mind, so its base
address changes every run (that's bad).

And I don't just mean it changes when ASLR is enabled; that's today's standard.
What I mean is that it doesn't even have the same base address on different
kernels when ASLR is disabled! That's a very big problem if we want to have at
least the illusion of portability.

-------[ With ASLR disabled: vDSO changes on different kernel versions ]-------
$ uname -r ; setarch "$(uname -m)" -R grep vdso /proc/self/maps
6.1.0-37-amd64
7ffff7fc8000-7ffff7fca000 r-xp 00000000 00:00 0                          [vdso]

$ uname -r ; setarch "$(uname -m)" -R grep vdso /proc/self/maps
6.12.73+deb13-amd64
7ffff7fc5000-7ffff7fc7000 r-xp 00000000 00:00 0                          [vdso]
-------------------------------------------------------------------------------

On top of that, the vDSO structure and code may also (and "often" do) change
between major kernel versions. This makes it difficult to use it as a reliable
entry point across kernel versions.

Well, that sucks! What other techniques can we use?

===[ Sacrificial Alter ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

There is one technique that would solve our problem, but it requires a lot of
outsourcing. I wouldn't care much, except for one thing that bothers me. But
let's start from the beginning.

What other regions are mapped? Well, well, well. Isn't that the legendary
stack?! It is! But it's not for free -- there are two problems. The first one
is that the stack has not been executable by default since 2020 (kernel >=
5.8). Fortunately, this is not a big issue, as we can define ''PT_GNU_STACK''
in the program header. It wants a 3-byte sacrifice, but it's doable. The second
problem, which is more serious, is ASLR once again. We'll deal with it, but in
a cowardly way.

Let's start with ''PT_GNU_STACK''. On kernels < 5.8, the stack is executable by
default, and we don't need to do anything to execute code from the stack. We
can point the first program header to the beginning of the file and be done
with it (exactly like we did in [[#Summary of 57-byte ELF64]]).

The first program header record points to offset 0, which is the same as the
ELF magic. The ''phdr->p_type'' is nonsense, so the kernel ignores it, and the
stack remains executable. That's nice, as it gives us a 57-byte binary (as
shown at the beginning).

(Un)fortunately, Kees Cook said "no fun allowed" in 2020 [ref10], and since
kernel version 5.8 [ref11], the stack is non-executable by default. Therefore,
we need to set ''PT_GNU_STACK'' with the correct permissions (=
''phdr->p_flags'') explicitly. This will add 3 bytes to our binary (but it's
still just a 60-byte binary, which is fine by me).

First, we want to set ''phdr->p_type'' to ''PT_GNU_STACK''. It's defined as
''0x6474e551'' in the kernel source code [ref12]:

#define PT_LOOS         0x60000000                  /* OS-specific */
#define PT_GNU_STACK    (PT_LOOS + 0x474e551)       // => 0x6474e551

NOTE: ''PT_GNU_STACK'' has existed since Linux 2.6.6 [ref9].

Second, we need to set the executable bit (= ''PF_X'') in ''phdr->p_flags'';
the loader checks if it's set:

------------------------[ load_elf_binary: stack mode ]------------------------
for (i = 0; i < elf_ex->e_phnum; i++, elf_ppnt++)
    switch (elf_ppnt->p_type) {
    case PT_GNU_STACK:
        if (elf_ppnt->p_flags & PF_X)
            executable_stack = EXSTACK_ENABLE_X;
        else
            executable_stack = EXSTACK_DISABLE_X;
        break;
-------------------------------------------------------------------------------

NOTE: From the condition above, the loader checks only if ''PF_X'' is set, and
the ''PF_X'' flag is defined as 1 [ref26], which means it's the lowest bit =>
''p_flags'' can be any odd number, and the stack will be read-write-executable
[ref13].

No more fields are required for ''PT_GNU_STACK''.
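The "any odd number" remark boils down to a single bit test. A sketch
(''stack_is_executable'' is a made-up name; the flag values are the standard
ELF ''PF_*'' bits):

```python
PF_X, PF_W, PF_R = 0x1, 0x2, 0x4

def stack_is_executable(p_flags: int) -> bool:
    """Same test as the loader's ``p_flags & PF_X``."""
    return bool(p_flags & PF_X)

if __name__ == "__main__":
    for flags in (0, PF_X, PF_R | PF_W, PF_R | PF_W | PF_X, 0x003E0003):
        print(hex(flags), stack_is_executable(flags))
```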

Finally, we have to place it somewhere in the binary. The first eligible place
is at the 5th position, right after the ELF magic ''\x7fELF''. There is enough
space for both ''phdr->p_type'' (4 bytes) and ''phdr->p_flags'' (4 bytes), and
those ELF header fields are not read by the kernel (see the ''REQUIRED''
column):

-------------------[ prototype_of_60-byte_elf_x86-64.nasm ]--------------------
BITS 64                         ;                                    REQUIRED
    db  0x7F, "ELF"             ;   e_ident[EI_MAG]                  yes
phdr:
    dd  0x6474e551              ;     phdr->p_type = PT_GNU_STACK    -
    dd  0x00000001              ;     phdr->p_flags = RWX            -
    db  0x00                    ;   e_ident[EI_PAD]                  -
    db  0x00                    ;   e_ident[EI_PAD]                  -
    db  0x00                    ;   e_ident[EI_PAD]                  -
    db  0x00                    ;   e_ident[EI_PAD]                  -
    dw  0x0002                  ;   e_type      = ET_EXEC            yes
    dw  0x003e                  ;   e_machine   = x86-64             yes
    dd  0x00000001              ;   e_version                        -
    dq  0x0000000000000000      ;   e_entry     = stack              yes
    dq  phdr - $$               ;   e_phoff     = 4                  yes
    dq  0x0000000000000000      ;   e_shoff                          -
    dd  0x00000000              ;   e_flags                          -
    dw  0x0040                  ;   e_ehsize                         -
    dw  0x0038                  ;   e_phentsize = sizeof (phdr)      yes
    dw  0x0001                  ;   e_phnum     = 1 entry            yes
    dw  0x0000                  ;   e_shentsize => padding           -
-------------------------------------------------------------------------------
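As a cross-check of the prototype, the same 60 bytes can be packed in Python
and the overlapped program header read back from ''e_phoff''. This is a sketch
that follows the listing above; the helper names are hypothetical.

```python
import struct

PT_GNU_STACK = 0x60000000 + 0x474E551    # PT_LOOS + 0x474e551
PF_X = 0x1
PHDR_SIZE = 56

def build_60_byte_prototype() -> bytes:
    """Pack the prototype listing field by field."""
    return struct.pack(
        "<4sII4sHHIQQQIHHHH",
        b"\x7fELF",         # e_ident[EI_MAG]
        PT_GNU_STACK,       # phdr->p_type  (inside e_ident)
        PF_X,               # phdr->p_flags (inside e_ident)
        bytes(4),           # remaining e_ident padding
        0x0002,             # e_type = ET_EXEC
        0x003E,             # e_machine = x86-64
        0x00000001,         # e_version
        0,                  # e_entry (still a placeholder here)
        4,                  # e_phoff = phdr - $$
        0,                  # e_shoff
        0,                  # e_flags
        0x0040,             # e_ehsize
        0x0038,             # e_phentsize
        0x0001,             # e_phnum
        0x0000,             # e_shentsize => padding
    )

def stack_phdr(image: bytes):
    """Return (e_phoff, p_type, p_flags) as seen through the overlay."""
    (e_phoff,) = struct.unpack_from("<Q", image, 32)
    p_type, p_flags = struct.unpack_from("<II", image, e_phoff)
    return e_phoff, p_type, p_flags

if __name__ == "__main__":
    image = build_60_byte_prototype()
    e_phoff, p_type, p_flags = stack_phdr(image)
    print(len(image), e_phoff, hex(p_type), p_flags)
```

It also shows where the 3-byte sacrifice comes from: a 56-byte record that
starts at offset 4 ends at offset 60, and the program header has to be present
in the file in full.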

Right! But it's still not working, because ASLR randomizes the address of the
stack, so the loader jumps into an unmapped memory region and segfaults.

===[ Problem with Personality ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

ASLR is a pain (and I didn't figure out how to solve it). When we look at the
kernel source code, we can see that there are two ways to disable ASLR on the
fly [ref16].

One is global for the whole system through the variable ''randomize_va_space''
(this variable is set when we write to ''/proc/sys/kernel/randomize_va_space''
[ref14] or pass ''norandmaps'' as a kernel boot parameter [ref15]). And the
other one is through the personality [ref17] of a process:

const int snapshot_randomize_va_space = READ_ONCE(randomize_va_space);
if (!(current->personality & ADDR_NO_RANDOMIZE) && snapshot_randomize_va_space)
    current->flags |= PF_RANDOMIZE;

setup_new_exec(bprm);

/* Do this so that we can load the interpreter, if need be.  We will
   change some of these later */
retval = setup_arg_pages(bprm, randomize_stack_top(STACK_TOP),
                         executable_stack);

Both are problematic, because they require an action outside of the binary, and
that's lame! Regrettably, I have no choice but to invoke the rule that ASLR
must be disabled either via ''personality(2)'' or globally. (Shame? Such a
mundane thought never crossed my mind!)
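For completeness, here is a sketch of the per-process route. Everything in it
is an assumption layered on top of the kernel snippet above: ''will_randomize''
just mirrors the quoted condition, and ''exec_without_aslr'' is a hypothetical
wrapper that calls ''personality(2)'' through ctypes (glibc on Linux assumed)
before replacing itself with the target program. As far as I can tell, this is
also the route ''setarch -R'' takes.

```python
import ctypes
import os
import sys

ADDR_NO_RANDOMIZE = 0x0040000   # from <sys/personality.h>

def will_randomize(personality: int, randomize_va_space: int) -> bool:
    """Mirror of the loader's condition for setting PF_RANDOMIZE."""
    return not (personality & ADDR_NO_RANDOMIZE) and bool(randomize_va_space)

def exec_without_aslr(argv):
    """Set ADDR_NO_RANDOMIZE for this process, then execv() the target."""
    libc = ctypes.CDLL(None, use_errno=True)
    libc.personality.argtypes = [ctypes.c_ulong]
    current = libc.personality(0xFFFFFFFF)      # 0xffffffff = query only
    if libc.personality(current | ADDR_NO_RANDOMIZE) == -1:
        raise OSError(ctypes.get_errno(), "personality() failed")
    os.execv(argv[0], argv)                     # does not return on success

if __name__ == "__main__" and len(sys.argv) > 1:
    exec_without_aslr(sys.argv[1:])
```

Usage would look like ''python3 noaslr.py ./some-binary'' (the script name is
made up).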

===[ Outsourcing Code to the Stack ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Let's continue our stack adventure. We still have one unsolved problem. We
already know that, even though we don't map anything from the binary directly
into memory (as we don't have any ''PT_LOAD'' in the binary), we can use the
user stack. We can already make it executable, but how can we store data there
when we have no code? (The chicken-and-egg problem.)

We actually need two things:
  - a way to push our code onto the stack, and
  - the address of that code, so we can execute it.

There are several options we can use to get our data onto the user stack:

  1. number of arguments (= ''argc''),
  2. program name and its arguments (= ''argv''),
  3. environment variables (= ''env''),
  4. filename (= ''auxv[AT_EXECFN]'').

Most of them require outside effort, like running the program with special
environment variables or special arguments, etc. I don't want any more
unnecessary "user interaction" (I still have trauma caused by ASLR).
Fortunately, the kernel is kind enough and grants my wish.

Number 4, ''filename'', differs from ''argv[0]''. It is created at program
execution.

It's one of the first records put on the user stack of a process and is derived
from ''bprm->filename'' (=> ''bprm->exec'' [ref20]), so it's the path by which
the program was executed. The kernel puts it on the stack for
''auxv[AT_EXECFN]'' [ref24], which points to it [ref19]. Unlike argc, argv, and
envp, it's not a stable ABI and might disappear in the future, but I doubt it:
I have never seen it missing or in a different position on x86-64.

Here is a simple layout of the x86-64 stack which is set up by the kernel:

----------------------[ Stack layout (x86-64 ; no ASLR) ]----------------------
<STACK-TOP>                 (lower memory address than <STACK-BOTTOM>)
...
argc
argv[]                      pointers to the argument strings
envp[]                      pointers to the environment strings
auxv                        AT_EXECFN = ptr @ ----.       [ref19]
argv strings                                      |
env strings                                       |
filename\0                  bprm->exec  <---------'       [ref20]
0x0000000000000000          8 bytes (sizeof (void *))     [ref21]
<STACK-BOTTOM>              0x7FFFFFFFF000                [ref22] [ref23]
-------------------------------------------------------------------------------

From the layout, we can see that the most control we have is through the
filename, because it ends at a deterministic position -- 8 bytes from the
bottom of the stack. This is perfect, as it's a known fixed address, and the
bonus is that when we use a negative offset (= from a higher address to a lower
one), it automatically trims out any path prefix => only the last part of the
filename is relevant. E.g., it doesn't matter if we run ''/bin/cat'' or
''./cat'', the ''cat'' string will always be at position: ''STACK-BOTTOM - 8 -
strlen ("cat") -1'' (-1 because of ''\0'' filename string termination).
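
The prefix-trimming claim is easy to check with a bit of arithmetic. Below is a
minimal Python sketch (not part of the original tooling); the helper name
''basename_address'' is made up for illustration, and the constants are the
no-ASLR values from the layout above:

```python
import os

STACK_BOTTOM = 0x7FFFFFFFF000   # x86-64 stack bottom without ASLR (see layout)
PTR_SIZE = 8                    # the 8 NUL bytes below the filename


def basename_address(path: str) -> int:
    """Address of the first byte of the last path component on the stack."""
    name = os.path.basename(path).encode()
    # -1 accounts for the '\0' that terminates the filename string
    return STACK_BOTTOM - PTR_SIZE - len(name) - 1


for path in ("/bin/cat", "./cat"):
    # both paths print 0x7fffffffeff4
    print(f"{path:10} -> {basename_address(path):#x}")
```

Whatever directory prefix is used, only the length of the last component
changes the result.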

OK, the remaining stack ingredient is the address of STACK-BOTTOM without ASLR.
The user stack (re)location is "finalized" in the ELF loader [ref22], which
uses ''STACK_TOP'' as the reference [ref23] for the stack bottom :). That means
the bottom of the user stack without ASLR is at ''0x7FFFFFFFF000'' on x86-64.

===[ The Printable Code ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

We need one more piece -- runnable code. Let's (not) overcomplicate our lives;
we'll create a simple ''exit(2)'' program that exits with value ''66''.

We are writing shellcode that will be stored in a file name. On typical Unix
filesystems (e.g., ext4, xfs, zfs, tmpfs, ...), a file name must obey only two
rules:

  1. no NUL character (''\0''), and
  2. no forward slash (''/'').

Those are reserved for paths (''/'') and string termination (''\0'').
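
To see whether a given blob of machine code survives those two rules, a tiny
Python check is enough. This is only a sketch: the function name
''is_valid_filename'' is hypothetical, and it tests nothing beyond the two
rules listed above (no length limits, no special names like ''.'' or ''..''):

```python
def is_valid_filename(name: bytes) -> bool:
    """True if the bytes contain neither NUL nor '/' (and are not empty)."""
    return len(name) > 0 and b"\x00" not in name and b"/" not in name


# mov dil,0x42 ; mov al,0x3c ; syscall
poc = bytes.fromhex("40b742 b03c 0f05")

print(is_valid_filename(poc))           # True
print(is_valid_filename(b"\x6a\x00"))   # False: contains a NUL byte
print(is_valid_filename(b"a/b"))        # False: contains a slash
```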

A proof of concept can be something like this code:

----------------------------[ poc_shellcode.nasm ]-----------------------------
BITS 64       ;   MEANING           C hex string
mov dil,0x42  ;   arg = 66          \x40\xb7\x42
mov al, 0x3c  ;   syscall exit      \xB0\x3C
syscall       ;   exit (66)         \x0F\x05
-------------------------------------------------------------------------------

We can use the binary output as the name of our binary, or better yet, we can
create a symlink with this name pointing to our binary:

ln -s 60-byte_elf_x86-64  $'\x40\xb7\x42\xB0\x3C\x0F\x05'

The size is 7, so we need to appropriately set ''ehdr->e_entry'' to
''0x7FFFFFFFF000 - 8 - 7 - 1'', and we can run it:

./$'\x40\xb7\x42\xB0\x3C\x0F\x05'
echo $?                             # => 66

It correctly exits with 66.

But that's lame, and we can do better! Well, sir, would you care for a small
portion of printable characters in your filename, with a hint of self-modifying
code before the main course?

Well, I do, my good sir -- I indeed do:

-------------------------[ printable_shellcode.nasm ]--------------------------
; constructing instruction 'syscall' (= 0F 05)
sub ax, 0x7270                  ; 0x0000 - 0x7270 = 0x8d90
sub ax, 0x474f                  ; 0x8d90 - 0x474f = 0x4641
sub ax, 0x4132                  ; 0x4641 - 0x4132 = 0x050f  => 0F 05
push rax
pop rsi                         ; rsi = 0F 05

; constructing syscall value for 'exit'
push byte 0x54
pop rax
xor al, 0x68                    ; eax = 0x3c => syscall exit

push byte 66                    ; return value
pop rdi

; modify the next instruction so it becomes 'syscall'
xor word [rel _syscall], si

_syscall: dd 0                  ; a placeholder
-------------------------------------------------------------------------------

How does it work? We are very constrained by the number of instructions we can
use. Printable ASCII codes range from 0x20 (space) to 0x7e (tilde), and
anything outside this range is considered a non-printable/special character. We
want only instruction opcodes that are in the printable range. The main goal is
being able to type the file name on a keyboard without problems.

The number of eligible instructions on x86-64 is very small, but the set of
fully usable instructions with arguments and registers is even smaller. I took
a known alphanumeric opcode table [ref27] and the x86-64 opcode table [ref28]
and iterated from that.

For example, there is no ''mov'', so we have to be creative and use multiple
''sub ax'' (it results in printable opcodes ''f-'') to get the value we want.
In the code above, we want to get 0x050f. Both bytes are unprintable. Also,
when we use ''sub'', we go backward from zero and underflow to our desired
value, and we must use values whose bytes result in printable characters:

sub ax, 0x7270      ; f-pr
sub ax, 0x474f      ; f-OG
sub ax, 0x4132      ; f-2A
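
The underflow chain can be replayed in a few lines of Python. This is a sketch
under one assumption taken from the text: ''ax'' is zero when our code starts.
The helper ''sub16'' is made up for illustration and simply models 16-bit
wrap-around; the ''66 2D'' prefix/opcode pair is what shows up as ''f-'':

```python
def sub16(value: int, imm: int) -> int:
    """16-bit subtraction with wrap-around, like 'sub ax, imm16'."""
    return (value - imm) & 0xFFFF


ax = 0x0000          # assumption: ax is zero at entry
steps = []
for imm in (0x7270, 0x474F, 0x4132):
    ax = sub16(ax, imm)
    steps.append(ax)
    # encoding of 'sub ax, imm16': 66 2D <imm16, little-endian>
    encoding = bytes([0x66, 0x2D]) + imm.to_bytes(2, "little")
    print(f"sub ax, {imm:#06x} -> ax = {ax:#06x}   ({encoding.decode('ascii')})")

# sub ax, 0x7270 -> ax = 0x8d90   (f-pr)
# sub ax, 0x474f -> ax = 0x4641   (f-OG)
# sub ax, 0x4132 -> ax = 0x050f   (f-2A)
```

Stored little-endian, ''0x050f'' is the byte pair ''0F 05'', i.e. ''syscall''.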

Then we load it into ''rsi'', because we need it there for the last
instruction: ''xor word [rel _syscall], si''. This instruction is a real treat;
it modifies the code after it, but it's not that simple. Look at how the
instruction is actually encoded:

xor [RIP + 0x7], si        ; 66 31 35 00 00 00 00

Yeah, correct! Those are the infamous NUL bytes. The exact ones that are
forbidden. And moreover, there are four of them. How do we get out of this
predicament? Well, do you remember the stack layout from
[[#Outsourcing Code to the Stack]]? It looks like this:

...
filename\0
0x0000000000000000          8 bytes (sizeof (void *))
<STACK-BOTTOM>              0x7FFFFFFFF000

We have 8 NUL bytes from ''sizeof (void *)'', plus 1 NUL byte from the filename
termination at our disposal. It would be a waste not to use them... We can take
the ''xor'' instruction and trim its tail of NUL bytes, and when executed,
they'll be there.
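
To convince ourselves that the borrowed NUL bytes line up, here is a small
Python emulation of the last few bytes of the stack. It is not a CPU emulator,
just byte shuffling; the variable names (''tail'', ''xor_at'', ''target'') are
made up, and the displacement is counted from the end of the 7-byte
instruction (i.e. instruction start + 7):

```python
code = b"f-prf-OGf-2AP^jTX4hjB_f15"     # trimmed shellcode = the file name
# what sits on the stack: name, its '\0' terminator, then 8 NUL bytes
tail = bytearray(code + b"\x00" + b"\x00" * 8)

xor_at = len(code) - 3                  # offset of '66 31 35'
insn = bytes(tail[xor_at:xor_at + 7])   # opcode bytes + 4 borrowed NUL bytes
disp = int.from_bytes(insn[3:], "little", signed=True)
target = xor_at + 7 + disp              # RIP-relative to the next instruction

si = 0x050F                             # value prepared by the 'sub ax' chain
word = int.from_bytes(tail[target:target + 2], "little") ^ si
tail[target:target + 2] = word.to_bytes(2, "little")

print(insn.hex(" "))                        # 66 31 35 00 00 00 00
print(bytes(tail[target:target + 2]).hex(" "))  # 0f 05
```

The displacement eats the filename terminator plus three of the eight NUL
bytes, and the next two NUL bytes get XORed into ''0F 05''.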

You may ask why I didn't simply use ''push 0x3c'', as it's also a printable and
valid file name character. I could, but ''0x3c'' is the ''<'' character, which
the shell uses for redirecting file descriptor 0. We could run it in quotes or
escape it (''\<''), and it would run fine, but why not make it sexier when we
can? Am I right, boiz?!

When we build the shellcode, we get these 25 printable characters:

f-prf-OGf-2AP^jTX4hjB_f15

This can be used as a filename without any worries (it could be made prettier,
but that's homework for the good reader).
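
For the sceptics, the string can be put back together by hand. The Python
sketch below lists each instruction next to its encoding (the encodings are
written out manually here, not produced by an assembler) and checks the result
against the printable range and the file name rules:

```python
parts = [
    ("sub ax, 0x7270",              "66 2d 70 72"),
    ("sub ax, 0x474f",              "66 2d 4f 47"),
    ("sub ax, 0x4132",              "66 2d 32 41"),
    ("push rax",                    "50"),
    ("pop rsi",                     "5e"),
    ("push byte 0x54",              "6a 54"),
    ("pop rax",                     "58"),
    ("xor al, 0x68",                "34 68"),
    ("push byte 66",                "6a 42"),
    ("pop rdi",                     "5f"),
    ("xor word [rel _syscall], si", "66 31 35"),   # disp32 NUL bytes trimmed
]

name = b"".join(bytes.fromhex(hexbytes) for _, hexbytes in parts)
printable = all(0x20 <= byte <= 0x7E for byte in name)

print(name.decode("ascii"))     # f-prf-OGf-2AP^jTX4hjB_f15
print(len(name), printable)     # 25 True
```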

===[ 60-byte Frankenstein's Monster ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now we've got every part we need:

  - the trimmed ELF structure,
  - the executable stack for kernel versions >= 5.8,
  - the address of the stack,
  - the code we want to run,
  - the place to store the code,
  - the printable filename.

At last, let's put it all together. Prepare one big duct tape.

We'll take the prototype from [[#Too Much to Sacrifice]] and fill in the entry
point (''ehdr->e_entry''). Our code is stored in the file name, and that is
stored on the stack near its bottom. The code length is 25 bytes, and when we
plug it into the equation from [[#Outsourcing Code to the Stack]], we get:

e_entry = STACK-BOTTOM - 8 - code_length - 1
        = 0x7FFFFFFFF000 - 8 - 1 - 25
        = 0x7FFFFFFFEFDE

NOTE: nasm evaluates simple constant expressions, so I just rearranged the
formula and kept it like that in the source; then it can be easily edited
without recalculating the address by hand.
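
The same arithmetic as a quick Python sanity check (a sketch only; the no-ASLR
stack bottom and the 25-byte file name are the values used in this article).
It also prints the little-endian byte order in which the value lands in the
ELF header:

```python
import struct

STACK_BOTTOM = 0x7FFFFFFFF000               # no-ASLR stack bottom on x86-64
code = b"f-prf-OGf-2AP^jTX4hjB_f15"         # the printable file name

e_entry = STACK_BOTTOM - 8 - 1 - len(code)
entry_bytes = struct.pack("<Q", e_entry)    # 64-bit little-endian

print(hex(e_entry))             # 0x7fffffffefde
print(entry_bytes.hex(" "))     # de ef ff ff ff 7f 00 00
```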

The final code will look like this:

--------------------------[ 60-byte_elf_x86-64.nasm ]--------------------------
BITS 64                         ;                                    REQUIRED
    db  0x7F, "ELF"             ;   e_ident[EI_MAG]                  yes
phdr:
    dd  0x6474e551              ;     phdr->p_type = PT_GNU_STACK    -
    dd  0x00000001              ;     phdr->p_flags = RWX            -
    db  0x00                    ;   e_ident[EI_PAD]                  -
    db  0x00                    ;   e_ident[EI_PAD]                  -
    db  0x00                    ;   e_ident[EI_PAD]                  -
    db  0x00                    ;   e_ident[EI_PAD]                  -
    dw  0x0002                  ;   e_type      = ET_EXEC            yes
    dw  0x003e                  ;   e_machine   = x86-64             yes
    dd  0x00000001              ;   e_version                        -
    dq  0x7FFFFFFFF000-8-1 -25  ;   e_entry     = stack              yes
    dq  phdr - $$               ;   e_phoff     = 4                  yes
    dq  0x0000000000000000      ;   e_shoff                          -
    dd  0x00000000              ;   e_flags                          -
    dw  0x0040                  ;   e_ehsize                         -
    dw  0x0038                  ;   e_phentsize = sizeof (phdr)      yes
    dw  0x0001                  ;   e_phnum     = 1 entry            yes
    dw  0x0000                  ;   e_shentsize => padding           -
-------------------------------------------------------------------------------

Build it, make it executable, and marvel at our glorious hexdump:

nasm -f bin 60-byte_elf_x86-64.nasm -o 60-byte_elf_x86-64
chmod 755 60-byte_elf_x86-64
xxd 60-byte_elf_x86-64

--------------------------[ 60-byte_elf_x86-64.xxd ]---------------------------
00000000: 7f45 4c46 51e5 7464 0100 0000 0000 0000  .ELFQ.td........
00000010: 0200 3e00 0100 0000 deef ffff ff7f 0000  ..>.............
00000020: 0400 0000 0000 0000 0000 0000 0000 0000  ................
00000030: 0000 0000 4000 3800 0100 0000            ....@.8.....
-------------------------------------------------------------------------------
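
If you'd rather not trust the annotations, the hexdump can be decoded with a
few lines of Python. This is a sketch: the format strings follow the
''Elf64_Ehdr'' and ''Elf64_Phdr'' layouts, with the ELF header cut after
''e_shentsize'' because the file ends there, and the variable names are just
the usual field names:

```python
import struct

image = bytes.fromhex("""
    7f45 4c46 51e5 7464 0100 0000 0000 0000
    0200 3e00 0100 0000 deef ffff ff7f 0000
    0400 0000 0000 0000 0000 0000 0000 0000
    0000 0000 4000 3800 0100 0000
""")

EHDR = "<16sHHIQQQIHHHH"    # Elf64_Ehdr up to e_shentsize (60 bytes)
PHDR = "<IIQQQQQQ"          # Elf64_Phdr (56 bytes)

(e_ident, e_type, e_machine, e_version, e_entry, e_phoff, e_shoff,
 e_flags, e_ehsize, e_phentsize, e_phnum, e_shentsize) = struct.unpack(EHDR, image)

# the program header overlaps the ELF header, starting at e_phoff
p_type, p_flags, *_rest = struct.unpack_from(PHDR, image, e_phoff)

print(f"size        = {len(image)}")        # 60
print(f"e_entry     = {e_entry:#x}")        # 0x7fffffffefde
print(f"e_phoff     = {e_phoff}")           # 4
print(f"p_type      = {p_type:#x}")         # 0x6474e551 (PT_GNU_STACK)
print(f"p_flags     = {p_flags:#x}")        # 0x1
print(f"phdr ends at {e_phoff + struct.calcsize(PHDR)}")   # 60
```

The last line shows why the file cannot be shorter: the single program header
record ends exactly at the end of the file.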

NOTE: Conversion from xxd hexdump to binary:

xxd -r 60-byte_elf_x86-64.xxd > 60-byte_elf_x86-64

Now, let's implant the code into the file name. It's a file name, so we'll use
a symlink:

ln -s 60-byte_elf_x86-64  f-prf-OGf-2AP^jTX4hjB_f15

Disable ASLR (this needs root) and run it:

echo 0 > /proc/sys/kernel/randomize_va_space
./f-prf-OGf-2AP^jTX4hjB_f15
echo $?                                       # => exits with 66

NOTE: If you don't want to disable ASLR for the whole system, it can be
disabled per program using a tool from util-linux [ref18]:
''setarch "$(uname -m)" -R -- ./f-prf-OGf-2AP^jTX4hjB_f15'' (it sets the
''ADDR_NO_RANDOMIZE'' personality flag).

NOTE: If we run a kernel older than 5.8, we could instead use the
''57-byte_elf_x86-64'' binary ([[#Summary of 57-byte ELF64]]), fill in the
correct ''e_entry'', and then it would run the same as the 60-byte binary.

===[ Is This a Happy Ending or a Sad Ending? ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

So, kids, what have we learned?

We found out that a Linux executable program needs two headers: the ELF header
and the program header. The program header needs to exist, but it doesn't need
to contain a valid record.

The ELF header can be trimmed down to 57 bytes, as it seems to be the limit of
what the kernel is willing to tolerate for an x86-64 binary to be executed.
It's "limited" by the non-zero value in the number of program header records
(''ehdr->e_phnum'').

The kernel does not need ''PT_LOAD'' to correctly load and execute a program,
but it wants at least one program header record. This limits the file size to
''ehdr->e_phoff + sizeof (Elf64_Phdr)'', as those records are read from the
file and the loader checks the size of every record it reads (= it must match
''sizeof (Elf64_Phdr)'').

If we have an executable user stack, we can store the program instructions in a
filename, which is then stored on the stack by the kernel and can be used as
the entry point (''ehdr->e_entry'').

And that's all I have, guys. Be kind, and HACK THE PLANET! as they're trashing
our rights, man! They're trashing the flow of data!

                .---------------------------------.                     ,,-.
               /                                   \                ..(     \
              |    Marge, I'm confused. Is this     |              (        /
              |   a happy ending or a sad ending?   |             (        )
              /                                    /             (         )
             //''---------------------------------'             (         /
   _//\_    /                                                  (         )
  /     \             .------------------------------.         (        (
 |       |           /                                \        ( ~ ~ .. )
 |  (.)(.)          |  It's an ending, that's enough.  |       (.)(.) (  )
 C   _---_)          \                                 \      _-       C)
  | |  __|            '------------------------------''\\    (__       |
  |  \__/                                                \      '---   |
  /___ |                                                           \   |
 /____\ /                                                          OoooO
|       \                                                         /     \

===[ References ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

> [ref1] https://research.h4x.cz/html/2025/2025-10-06--touching_small_elfs-p2-segfaults_everywhere.html
> [ref2] https://research.h4x.cz/html/2025/2025-10-24--touching_small_elfs-p3-broken_time-machine.html#bug_4_chefs_kiss
> [ref3] https://research.h4x.cz/html/2025/2025-09-11--touching_small_elfs-p1-broken_tools.html
> [ref4] https://elixir.bootlin.com/linux/v6.12.57/source/fs/binfmt_elf.c
> [ref5] https://elixir.bootlin.com/linux/v6.12.57/source/fs/binfmt_elf.c#L819
> [ref6] https://en.wikipedia.org/wiki/Endianness
> [ref7] https://www.man7.org/linux/man-pages/man5/elf.5.html
> [ref8] https://tmpout.sh/3/22.html
> [ref9] https://www.kernel.org/doc/Documentation/userspace-api/ELF.rst
> [ref10] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/arch/x86/include/asm/elf.h?h=v5.8&id=122306117afe4ba202b5e57c61dfbeffc5c41387
> [ref11] https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/tree/arch/x86/include/asm/elf.h?h=v5.8#n282
> [ref12] https://elixir.bootlin.com/linux/v6.12.57/source/include/uapi/linux/elf.h#L39
> [ref13] https://elixir.bootlin.com/linux/v6.12.57/source/fs/binfmt_elf.c#L926
> [ref14] https://elixir.bootlin.com/linux/v6.12.57/source/kernel/sysctl.c#L1904
> [ref15] https://elixir.bootlin.com/linux/v6.12.57/source/mm/memory.c#L160
> [ref16] https://elixir.bootlin.com/linux/v6.12.57/source/fs/binfmt_elf.c#L1007
> [ref17] https://www.man7.org/linux/man-pages/man2/personality.2.html
> [ref18] https://git.kernel.org/pub/scm/utils/util-linux/util-linux.git/tree/sys-utils/setarch.c
> [ref19] https://elixir.bootlin.com/linux/v6.12.57/source/fs/binfmt_elf.c#L261
> [ref20] https://elixir.bootlin.com/linux/v6.12.57/source/fs/exec.c#L1959
> [ref21] https://elixir.bootlin.com/linux/v6.12.57/source/fs/exec.c#L295
> [ref22] https://elixir.bootlin.com/linux/v6.12.57/source/fs/binfmt_elf.c#L1015
> [ref23] https://elixir.bootlin.com/linux/v6.12.57/source/arch/x86/include/asm/page_64_types.h#L80
> [ref24] https://www.man7.org/linux/man-pages/man3/getauxval.3.html
> [ref25] https://www.nasm.us/doc/nasmdoci.html
> [ref26] https://elixir.bootlin.com/linux/v6.12.57/source/include/uapi/linux/elf.h#L247
> [ref27] https://dl.packetstormsecurity.net/papers/shellcode/alpha.pdf
> [ref28] http://ref.x86asm.net/geek64.html
> [ref29] https://www.man7.org/linux/man-pages/man7/vdso.7.html