AppArmor Signal Filtering and TTY Signals: Killing the Unkillable

===[ TL;DR ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

AppArmor provides a mechanism to ignore signals sent to a process, even the
'SIGKILL' and 'SIGSTOP' signals that should not be blocked.

The most common methods to send signals from userspace are the 'kill(2)'
syscall and TTY key bindings like 'Ctrl-C' (which sends the 'SIGINT'
signal). However, AppArmor does not block TTY signals since the TTY driver does
not use the security module.

If an AppArmor signal filter is applied, sending the 'SIGTSTP' signal
('Ctrl-Z') through TTY can permanently suspend a process group. This happens
because the TTY driver does not support the 'SIGCONT' signal, and AppArmor
will filter any attempts to send the 'SIGCONT' signal with the '*kill(2)'
syscalls to the process group. As a result, the processes will remain in the
stopped state indefinitely.

Fortunately, there are some workarounds to this problem. Here are some possible
solutions:

1. Attaching GDB to the stopped process and killing the process with the
'kill' command (HANS! Kill it with PTRACE_KILL):

--------------------------------[ PTRACE_KILL ]--------------------------------
gdb -q --nx --batch -p "$(pgrep -f app_with_AppArmor_signal_filter)" -ex kill
-------------------------------------------------------------------------------

2. Sending the 'SIGINT' signal to the process group through TTY by pressing
'Ctrl-C', and then making a single step in GDB so that the pending signal is
delivered (HANS! Kill it with SS TTY signal).

3. Calling the 'exit_group' syscall (HANS! Kill it with RIP):

--------------------------------[ exit_group ]---------------------------------
gdb -q --nx --batch -p "$(pgrep -f app_with_AppArmor_signal_filter)" \
  -ex 'set {short} $rip = 0x050f' \
  -ex 'set $rax = 0xe7' \
  -ex 'set $rdi = 0' \
  -ex 'si'
-------------------------------------------------------------------------------

4. Invoking the OOM killer in a sub cgroup (v2) (OOM Slaughterhouse):

-----------------------------------[ code ]------------------------------------
mkdir /sys/fs/cgroup/oom_kill_me
cd /sys/fs/cgroup/oom_kill_me
pgrep -f 'app_with_AppArmor_signal_filter' > cgroup.procs
echo 0 | tee memory.high memory.max memory.swap.max memory.swap.high
echo 1 > memory.oom.group
bash -c 'echo $$ > cgroup.procs ; ls'
-------------------------------------------------------------------------------


In the article below, we will explore why this behavior is occurring, what it
means, and what actions we can take to address it (without kernel hacking). We
will briefly look at the kernel signal delivery, process states, signals from
TTY, and along the way, we will utilize some powerful kernel features such as
ptrace, ftrace, and kprobes. So, buckle up, fun awaits us!



===[ AppArmor's Signal Blockade ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

AppArmor is one of the Linux security modules (LSM) that provides functionality
to restrict an application's capabilities. It is similar to SELinux, but it is
simpler and lacks some of SELinux's abilities (we will not tackle SELinux in
this article, but it will probably behave the same way).

When creating an AppArmor profile for an application, it starts at the most
restrictive level, and then the user can explicitly allow access to files,
capabilities, signals, mounts, etc. This is great for quickly sandboxing an
untrusted application, such as firefox. However, it can have some side effects
when we forget about signals.

There are two ways an application can interact with signals: sending a signal
and receiving a signal. By default, both are disallowed by AppArmor, which is
very good at signal filtering, to the extent that it can even block the
'SIGKILL' signal sent by a user with full privileges (root).

===[ Blocking Unblockable Signals ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Historically on Linux, there are two signals that cannot be ignored:
'SIGKILL' and 'SIGSTOP'. The 'signal(7)' man page states that [ref1]:

  The signals SIGKILL and SIGSTOP cannot be caught, blocked, or ignored.

Those are pretty strong words :), because we already know that this is not
entirely true. It is possible to prevent an application from receiving signals,
including 'SIGKILL' and 'SIGSTOP', through the use of AppArmor.

But wait, there's more! Signals can be sent from multiple places in the kernel.
For our purposes, we can divide them into two groups:

  1. Signals sent explicitly to a process by userspace syscalls, such as
     'kill(2)' or 'tkill(2)'.

  2. Signals sent to a process by the kernel, typically due to some kind of
     violation like a CPU exception or trap.

(SIDE NOTE: We can determine to some extent what type of signal we are dealing
with by checking the 'siginfo_t.si_code', which indicates why the signal was
sent [ref2]. This information can be obtained using tools such as 'strace'.)

While AppArmor can block userspace signals, kernel signals are typically not
filtered (for reasons like OOM killing). This can have some side effects, but
in order to understand them, we need to know something about the states a
process can be in.

===[ Process is RUNNING away! STOP it! ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Every executed process goes through a life cycle in which it passes through
different stages [ref3]. For example:

  NEW -> RUNNING -> INTERRUPTIBLE -> RUNNING -> STOPPED -> RUNNING -> ZOMBIE

We are particularly interested in the stopped state. A process can enter the
stopped state by receiving the 'SIGSTOP' or 'SIGTSTP' signal or through a
ptrace attach [ref4].

The 'SIGSTOP' and 'SIGTSTP' signals have the same effect, but 'SIGTSTP'
can be caught by the program signal handler while 'SIGSTOP' cannot. A process
can be removed from the stopped state when we send it the 'SIGCONT' signal
(the process state will be set to RUNNING [ref4]).

From userspace, we can send signals using the 'kill(2)' syscall. On Linux,
there are some convenient tools such as 'kill(1)' and 'pkill(1)', which are
wrappers for the syscall. Here is an example of how to send the 'SIGTSTP'
signal to a process:

-----------------[ Process control with SIGTSTP and SIGCONT ]------------------
# sleep 1337 &
[1] 245791

# pid="$(pgrep -f 'sleep 1337')"

# grep ^State /proc/$pid/status
State:  S (sleeping)

# kill -TSTP "$pid"

# grep ^State /proc/$pid/status
State:  T (stopped)

# kill -CONT "$pid"

# grep ^State /proc/$pid/status
State:  S (sleeping)
-------------------------------------------------------------------------------

In this example, we stopped the 'sleep' program by sending it the 'SIGTSTP'
signal, checked its state by reading its status file, and then made it run
again by sending it the 'SIGCONT' signal.
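
The difference between the two stop signals can be checked directly. Below is
a minimal sketch, assuming a Linux '/proc' and a 'sleep' that accepts
fractional seconds; the helpers 'proc_state' and 'wait_state' are names made
up for this illustration. It starts a 'sleep' with 'SIGTSTP' ignored (an
ignored disposition survives 'exec') and then sends both signals:

```shell
#!/bin/sh
# proc_state PID: print the one-letter state from /proc/PID/status.
proc_state() {
    awk '/^State:/ { print $2 }' "/proc/$1/status" 2>/dev/null
}

# wait_state PID LETTER: poll (up to ~5 s) until PID reaches that state.
wait_state() {
    i=0
    while [ "$i" -lt 50 ]; do
        [ "$(proc_state "$1")" = "$2" ] && return 0
        sleep 0.1
        i=$((i + 1))
    done
    return 1
}

# Start a sleep that inherits an *ignored* SIGTSTP from its parent shell.
sh -c 'trap "" TSTP; exec sleep 60' &
pid=$!

# Wait until the exec has happened and sleep is blocked in its syscall.
i=0
while [ "$(cat "/proc/$pid/comm" 2>/dev/null)" != "sleep" ] &&
      [ "$i" -lt 50 ]; do
    sleep 0.1
    i=$((i + 1))
done
wait_state "$pid" S

kill -TSTP "$pid"       # ignored, so the process keeps sleeping
sleep 0.3
after_tstp="$(proc_state "$pid")"

kill -STOP "$pid"       # cannot be ignored, so the process stops
wait_state "$pid" T
after_stop="$(proc_state "$pid")"

echo "after SIGTSTP: $after_tstp, after SIGSTOP: $after_stop"

kill -KILL "$pid"       # clean up; no AppArmor filter is involved here
wait "$pid" 2>/dev/null || true
```

If everything behaves as described above, the state stays 'S' after 'SIGTSTP'
and becomes 'T' after 'SIGSTOP'.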

But, wait for it... there is one other way to send signals as a user, but
with more privileges than the user has...

===[ Signals from the Depths of the VT ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The virtual console (terminal emulator, TTY, pseudo terminal, pty, etc.) is an
ancient Egyptian technology of a user interface that allows users to interact
with the system through a keyboard and display. As with all ancient magic,
pseudoterminals are an absolute mess with lots of arcane code. The main problem
with TTY is that it is implemented directly in the kernel as a driver and
therefore has very privileged access to the system's internals. Furthermore, it
implements zero functionality from the security module, which means it is
unaffected by AppArmor [ref5].

One of the features of TTY is the ability to send predefined signals by
pressing 'CTRL' and a special character. Probably the most (in)famous is
'Ctrl-C', which sends the 'SIGINT' signal to a process group running in the
foreground.

Every process has a process group ID, most often called PGRP or PGID [ref6],
which it can share with other processes.
processes share the same PGRP, they are in the same process group, and the
whole group can be killed (i.e., can receive a signal) just by one 'kill(2)'
call. The process group concept exists for grouping all parts of a pipe command
[ref7]. If we use a pipe command chain in a terminal and then terminate it, for
example, with 'Ctrl-C', we want to send the signal to all programs in the
chain (i.e., we do not want to send a signal to just one command).
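
To look at a process group from the outside, here is a small sketch. It
assumes 'setsid(1)' from util-linux, a Linux '/proc', and a script running
without job control; the helper 'members' is a hypothetical name. Since a
script has no job control, 'setsid' is used to put the pipeline into its own
session and process group, similar to what an interactive shell does for each
pipeline:

```shell
#!/bin/sh
# members PGID: print the PIDs whose process group ID equals PGID.
# In /proc/PID/stat the PGRP is the third field after the "(comm)" part.
members() {
    for f in /proc/[0-9]*/stat; do
        awk -v g="$1" '{
            pid = $1
            sub(/^.*\) /, "")           # drop "PID (comm) "
            split($0, fld, " ")         # fld: state, ppid, pgrp, ...
            if (fld[3] == g) print pid
        }' "$f" 2>/dev/null
    done
}

# The child shell becomes the leader of a new group (PGRP == its PID) and
# then starts a two-command pipeline inside that group.
setsid sh -c 'sleep 60 | cat' &
leader=$!

# Give the pipeline a moment (up to ~5 s) to start.
i=0
while [ "$(members "$leader" | wc -l)" -lt 3 ] && [ "$i" -lt 50 ]; do
    sleep 0.1
    i=$((i + 1))
done

group="$(members "$leader" | sort -n | tr '\n' ' ')"
echo "PGRP $leader has members: $group"

# A negative PID addresses the whole group with a single kill(2) call.
kill -TERM "-$leader"
status=0
wait "$leader" || status=$?
echo "leader exit status: $status"      # 128 + 15 when killed by SIGTERM
```

The single 'kill' with a negative PID is the userspace counterpart of what
the TTY driver does for the foreground process group.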

We will look at what happens when we press 'Ctrl-C' in a terminal.

----------------------------[ Dummy pipe command ]-----------------------------
bash -c 'sleep 200 ; echo brm' | cat | cat -
-------------------------------------------------------------------------------

The pipe command mentioned above will create a process group with PGRP 99456
and four members. The group leader is the first 'bash' process (its PID
corresponds to the PGRP):

--------------------[ Processes and their PIDs and PGRPs ]---------------------
# ps axwww -o pid,ppid,pgrp,command --forest
    PID    PPID    PGRP COMMAND
      1       0       1 /sbin/init
  97373       1   97373  \_ /bin/bash
  99456   97373   99456      \_ bash -c sleep 200 ; echo brm
  99459   99456   99456      |   \_ sleep 200
  99457   97373   99456      \_ cat
  99458   97373   99456      \_ cat -
-------------------------------------------------------------------------------

We will attach 'strace' to all PIDs in the process group and monitor which
signals are delivered and where they come from:

----------------------------[ Strace-ing the PGRP ]----------------------------
strace -p '99456 99459 99457 99458'
-------------------------------------------------------------------------------

When we press 'Ctrl-C' in the terminal where the pipe command was executed,
the TTY driver in the kernel will send the 'SIGINT' signal to the process
group 99456:

------------------[ strace output of the TTY SIGINT signal ]-------------------
[pid 99457] --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
[pid 99456] --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
[pid 99459] --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
[pid 99458] --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---
-------------------------------------------------------------------------------

(For more details, see: APPENDIX C: kill_pgrp.)

===[ Suspending and Resuming Processes from a Terminal ]~~~~~~~~~~~~~~~~~~~~~~~

There are three signals implemented in the TTY driver: 'SIGINT', 'SIGQUIT',
and 'SIGTSTP' [ref9]. Shortcuts for these signals can be set by the 'stty'
command [ref10] and their default key bindings are:

---------------------------[ stty signal shortcuts ]---------------------------
# stty -a
...
intr = ^C       # SIGINT  -> Ctrl-C
quit = ^\       # SIGQUIT -> Ctrl-\
susp = ^Z       # SIGTSTP -> Ctrl-Z
...
-------------------------------------------------------------------------------

However, there is no implementation of the 'SIGCONT' signal to wake up
stopped processes. This functionality is typically provided by the shell
through commands named 'fg' or 'bg', which send the process the 'SIGCONT'
signal to wake it up.

Let's create an example where we suspend a process and then wake it up:

--------------------[ Suspending and awakening a process ]---------------------
# strace -o ./tty_c-z.strace -- sleep 100
^Z
[1]+  Stopped                 sleep 100

# fg
sleep 100
-------------------------------------------------------------------------------

The strace output will help us understand the origin of the signals:

-----------------------[ Strace log: ./tty_c-z.strace ]------------------------
restart_syscall(<... resuming interrupted read ...>) = ? ERESTART_RESTARTBLOCK
(Interrupted by signal)
--- SIGTSTP {si_signo=SIGTSTP, si_code=SI_KERNEL} ---
--- stopped by SIGTSTP ---

--- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=88336, si_uid=0} ---
restart_syscall(<... resuming interrupted restart_syscall ...>
-------------------------------------------------------------------------------

There is a clear difference between the two signals. The 'SIGTSTP' signal was
sent by the kernel (from the TTY driver), while the 'SIGCONT' signal was sent
by userspace. Specifically, the 'SIGCONT' signal was sent by the parent shell
when we invoked the 'fg' command. (For more strace details of the parent
process, see: APPENDIX B: Bash fg (SIGCONT) with and without AppArmor.)

===[ Why We're Here: The Journey to the Big Picture ]~~~~~~~~~~~~~~~~~~~~~~~~~~

We have finally reached the main reason why this article exists! Hooray! Let's
recap:

  1. We have an AppArmor profile in place for an application that filters **ALL
     received** signals from userspace, including 'SIGKILL'.

  2. AppArmor restrictions typically do not apply to signals coming from the
     kernel.

  3. We can send the 'SIGTSTP' signal using the TTY driver, which is very
     privileged kernel code.

  4. In normal circumstances, we would send the 'SIGCONT' signal using the
     'kill(2)' syscall to wake up a stopped process. However, it will be
     filtered by the AppArmor profile.

With this in mind, the 'SIGTSTP' signal ('Ctrl-Z') sent by TTY has
interesting consequences. Let's create a proof of concept:

The following AppArmor profile grants '/tmp/sleep' file access only. It has no
signal rules, so both receiving and sending signals are denied:

-------------------------[ /etc/apparmor.d/tmp.sleep ]-------------------------
/tmp/sleep {
  /** rm,
}

-------------------------------------------------------------------------------

-------------------------[ Preparing the environment ]-------------------------
# id
uid=0(root) gid=0(root) groups=0(root)

# cp -a /bin/sleep /tmp/sleep

# apparmor_parser -r /etc/apparmor.d/tmp.sleep

# /tmp/sleep 1337 &
-------------------------------------------------------------------------------

-----------------------[ Testing the AppArmor profile ]------------------------
# pkill -f '/tmp/sleep 1337'
pkill: killing pid 2947998 failed: Permission denied

# dmesg | tail -1
audit: type=1400 audit(1677699941.805:127): apparmor="DENIED"
operation="signal" profile="/tmp/sleep" pid=2948318 comm="pkill"
requested_mask="receive" denied_mask="receive" signal=term peer="unconfined"

# pkill -KILL -f '/tmp/sleep 1337'
pkill: killing pid 2947998 failed: Permission denied

# dmesg | tail -1
audit: type=1400 audit(1677700015.679:128): apparmor="DENIED"
operation="signal" profile="/tmp/sleep" pid=2951143 comm="pkill"
requested_mask="receive" denied_mask="receive" signal=kill peer="unconfined"
-------------------------------------------------------------------------------
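
Reading raw audit lines gets tedious once there are more than a handful. As a
small convenience sketch (the function name 'aa_signal_denials' is made up
here, and it assumes the key=value layout of the log lines above, on a single
line and with no spaces inside the values), the denials can be condensed to
one short record each:

```shell
#!/bin/sh
# aa_signal_denials [FILE...]: condense AppArmor signal denials into
# "profile sender-comm denied-mask signal" records, one per line.
aa_signal_denials() {
    awk '/apparmor="DENIED"/ && /operation="signal"/ {
        out = ""
        for (i = 1; i <= NF; i++)
            if ($i ~ /^(profile|comm|denied_mask|signal)=/) {
                v = $i
                sub(/^[^=]*=/, "", v)   # drop the key
                gsub(/"/, "", v)        # drop the quotes
                out = out (out == "" ? "" : " ") v
            }
        print out
    }' "$@"
}

# Usage (needs permission to read the kernel log):
#   dmesg | aa_signal_denials | sort | uniq -c
```

Fields are printed in the order they appear in the audit line, so for the
messages above each record reads: profile, sender command, denied mask,
signal.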

Now that we know our setup is working correctly, how about a little TTY black
magic? In the next example, we will execute our special 'sleep' command in
the foreground, suspend it (push it into the background) by pressing 'Ctrl-Z',
and then bring it back to the foreground using the 'fg' command:

---------------------[ Stopping and resuming the process ]---------------------
# /tmp/sleep 1337
^Z
[1]+  Stopped                 /tmp/sleep 1337

# fg
/tmp/sleep 1337
-------------------------------------------------------------------------------

... and nothing happens! The program is not running. Why? Because the delivery
of the 'SIGCONT' signal failed! It was sent by the parent bash process by the
'kill(2)' syscall and it was filtered by the AppArmor signal filter (see
APPENDIX B: Bash fg (SIGCONT) with and without AppArmor).

This should not be a big problem, since we can send 'SIGINT' ('Ctrl-C') or
'SIGQUIT' ('Ctrl-\') from the TTY session to terminate the process, right?
RIGHT?!

------------------------[ Trying to kill the process ]-------------------------
/tmp/sleep 1337
^C^C^\^\^C^C^\^\^Z^Z
-------------------------------------------------------------------------------

Sadly this does not work.

===[ The Signal Queue: One at a Time Please ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

From the earlier sections, we found out that TTY enables us to send three
signals: 'SIGINT', 'SIGQUIT', and 'SIGTSTP'. Can we use 'SIGINT' or
'SIGQUIT' to cancel 'SIGTSTP'? Unfortunately, no. This is because signals are
queued, meaning that if we send 'SIGTSTP' followed by 'SIGINT' to a
process, there will be two signals in the "signal queue": an unfinished
'SIGTSTP' and a not-yet-processed 'SIGINT'.

Here is a simple demonstration:

---------------------------[ terminal-01 -- victim ]---------------------------
cat             # `cat` doesn't have an AppArmor profile
-------------------------------------------------------------------------------

-------------------------[ terminal-02 -- aggressor ]--------------------------
pid="$(pidof cat)"
kill -TSTP "$pid"   # 1.
kill -INT  "$pid"   # 2.
kill -QUIT "$pid"   # 3.
kill -CONT "$pid"   # 4.
-------------------------------------------------------------------------------

In this scenario, the first 'kill' command will stop the 'cat' command. The
second and third 'kill' commands will seem to do nothing, but in reality, the
signals are delivered and added to the signal queue as pending [ref11]. When
the fourth 'kill' command is sent, the process will start running again and
begin processing the pending signals. At that point, the 'cat' command is
terminated by the pending 'SIGINT' signal, as it installs no handler for it.
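
The pending signal can be observed from a script as well. The following is a
minimal sketch, assuming a Linux '/proc'; the helpers 'field' and
'wait_field' are made-up names. It uses 'SIGSTOP' and 'SIGTERM' instead of
'SIGTSTP' and 'SIGINT', because a non-interactive shell starts background
commands with 'SIGINT' and 'SIGQUIT' ignored, and an ignored signal is never
queued. The 'ShdPnd' field of '/proc/PID/status' is a bitmask in which bit
N-1 stands for signal N:

```shell
#!/bin/sh
# field PID NAME: print one field of /proc/PID/status (e.g. State, ShdPnd).
field() {
    awk -v k="$2:" '$1 == k { print $2 }' "/proc/$1/status" 2>/dev/null
}

# wait_field PID NAME VALUE: poll (up to ~5 s) until the field has that value.
wait_field() {
    i=0
    while [ "$i" -lt 50 ]; do
        [ "$(field "$1" "$2")" = "$3" ] && return 0
        sleep 0.1
        i=$((i + 1))
    done
    return 1
}

sleep 60 &
pid=$!
wait_field "$pid" State S

kill -STOP "$pid"                   # 1. stop the victim
wait_field "$pid" State T

kill -TERM "$pid"                   # 2. queued, but not acted upon
wait_field "$pid" ShdPnd 0000000000004000
pending="$(field "$pid" ShdPnd)"    # bit 14 set = signal 15 (SIGTERM)
state="$(field "$pid" State)"
echo "state=$state pending=$pending"

kill -CONT "$pid"                   # 3. resume; the pending SIGTERM fires
status=0
wait "$pid" || status=$?
echo "exit status: $status"         # 128 + 15 when killed by SIGTERM
```

With an AppArmor signal filter in place, step 3 is exactly the part that never
happens, so the pending bit stays set and the process stays in 'T'.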

Great! We have created a forever sleeping process that cannot be killed off.
Besides rebooting the system, what other options do we have?

===[ Kernel Signals? But from where? ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Why don't we take a detour straight into the depths of the kernel and explore
how signals are sent, specifically to understand the difference between
calling the 'kill(2)' syscall and pressing 'Ctrl-C' in a terminal?

To start, we need to locate the entry point for each signal sender. For the
'kill(2)' syscall, this is straightforward as we have control over sending
it. We can create a small C program that sends a signal and use 'ftrace'
[ref8] to observe what is called in the kernel. The program will take two
arguments, just like the 'kill(2)' syscall: the PID number (where a negative
value represents a process group) and a signal number (the 'SIGINT' signal
corresponds to number 2 -- see 'kill -l').

------------------------------[ ftracing_kill.c ]------------------------------
// gcc ftracing_kill.c -o ftracing_kill
#include <stdio.h>
#include <stdlib.h>
#include <signal.h>
#include <unistd.h>

int main (int argc, char *argv[])
{
    char c;

    if (argc != 3) {
        fprintf (stderr, "Usage: %s <pid|-pgrp> <signum>\n", argv[0]);
        return 1;
    }

    fprintf (stderr, "Waiting for <ENTER> (PID=%d) ...", getpid ());
    read (0, &c, 1);      // We will start ftrace here

    if (kill (atoi (argv[1]), atoi (argv[2])) == -1)
        perror ("kill");  // e.g. EPERM when a security module denies it

    read (0, &c, 1);      // Wait here, we don't want exit garbage.
    return 0;
}
-------------------------------------------------------------------------------

When we run the program (which is going to kill the process group of the sleep
command), we end up waiting for '<ENTER>':

------------------------------[ Command console ]------------------------------
# sleep 1337 &
[1] 220258

# ./ftracing_kill -220258 2
Waiting for <ENTER> (PID=220268) ...
-------------------------------------------------------------------------------

At this moment, we will create an 'ftrace' tracer that will record all kernel
function calls associated with the PID of our program:

---------------------[ Enable ftrace for the PID 220268 ]----------------------
cd /sys/kernel/debug/tracing
echo 220268 > set_ftrace_pid
echo function_graph > current_tracer
echo 1 > tracing_on
pv trace_pipe > /dev/shm/ftrace.sys_kill  # cat is ok, but pv gives us progress
-------------------------------------------------------------------------------

After we press ENTER in the command console, the program calls the 'kill(2)'
syscall, which sends the 'SIGINT' signal to the process group. The ftrace
tracer records this event, and we terminate the ftrace tracing, capturing only
a small number of kernel calls:

------------------------------[ Disable ftrace ]-------------------------------
^C                          # Ctrl-C for terminating pv
echo nop > current_tracer
echo 0 > tracing_on
echo > trace
echo > set_ftrace_pid
-------------------------------------------------------------------------------

When we examine (and filter) the output of the ftrace tracer, it tells us
exactly what was called (although not how). Here is the trimmed call tree:

-------------------------[ /dev/shm/ftrace.sys_kill ]--------------------------
__x64_sys_kill
  kill_something_info
    __kill_pgrp_info
      check_kill_permission
        security_task_kill
          apparmor_task_kill
      do_send_sig_info
        send_signal
          __send_signal
            complete_signal
-------------------------------------------------------------------------------
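
The exact filter used to produce the trimmed tree is not shown; one possible
sketch is below. It assumes the usual 'function_graph' layout, where each line
carries CPU and duration columns, a '|' separator followed by two spaces, and
then an indented 'name() {' or 'name();'. The function name 'trim_graph' is
hypothetical:

```shell
#!/bin/sh
# trim_graph [FILE...]: reduce function_graph output to an indented call tree.
# Keeps only function-entry lines ("name() {" and leaf calls "name();") and
# drops the CPU/duration columns, closing braces, and comment lines.
trim_graph() {
    sed -n \
        -e 's/^[^|]*|  //' \
        -e 's/^\( *[A-Za-z_][A-Za-z0-9_.]*\)().*$/\1/p' \
        "$@"
}

# Usage, with the capture file from above:
#   trim_graph /dev/shm/ftrace.sys_kill | less
```

The result still contains every traced function, so picking out the
signal-related branch (as in the tree above) remains a manual step.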

The output clearly shows an AppArmor check. If an AppArmor signal filter were
active at that moment, it would have prevented our 'kill(2)' call.
Interestingly, the check is performed on the sending side, rather than the
receiving side. This is because it would be complicated to handle higher
priority signals sent by the kernel that must be delivered, such as when the
OOM killer needs to kill processes. While it is interesting to see the AppArmor
check in action, we want to compare the call tree with the calls made when
sending signals from the TTY.

The entry point from the TTY driver is harder to find because we do not know
the function that is called when we send a signal, for example by pressing
'Ctrl-\'. There are multiple ways to approach this, and it depends heavily on
what we are trying to find. Therefore, I will describe the approach I used in
this particular case and the reasoning behind it:

1. I know that 'Ctrl-\' is sending the 'SIGQUIT' signal, so I will use its
name to search for any occurrences in the kernel source code. I typically use
'cscope' on the locally cloned kernel source tree (elixir.bootlin.com is
pretty good, but it lacks features that I use in 'cscope'). In this case, I
searched for 'Find this C symbol: SIGQUIT' and it was fairly easy to spot a
function named 'n_tty_receive_char_special'.

2. To confirm whether the function 'n_tty_receive_char_special' is called
when pressing 'Ctrl-\', I used another powerful instrumentation called
kprobes, which can be accessed from the same ftrace interface. How cool is
that? The setup process is slightly different:

-------------[ Inserting a kprobe on n_tty_receive_char_special ]--------------
echo 1 > options/stacktrace
echo 'p:k/f1 n_tty_receive_char_special' > kprobe_events
echo 1 > tracing_on
echo 1 > events/k/enable
cat -v trace_pipe
-------------------------------------------------------------------------------

After pressing 'Ctrl-\' on another terminal (not the one where the 'cat -v
trace_pipe' command is running :)), I checked the 'cat -v trace_pipe' output,
and it showed the call event and the stack trace:

----------------------------[ The kprobes output ]-----------------------------
   kworker/u4:2-220626  [001]  3252523.867037: f1: (n_tty_receive_char_special+0x0/0xa70)
   kworker/u4:2-220626  [001]  3252523.867082: <stack trace>
=> n_tty_receive_char_special
=> n_tty_receive_buf_common
=> tty_port_default_receive_buf
=> flush_to_ldisc
=> process_one_work
=> worker_thread
=> kthread
=> ret_from_fork
-------------------------------------------------------------------------------

Then, I terminated the 'cat -v trace_pipe' command with 'Ctrl-C' (a little
ironic, huh? :)) and stopped the kprobe ftrace tracer:

---------------------------[ Disabling the kprobes ]---------------------------
^C                      # Terminate the cat command
echo 0 > tracing_on
echo 0 > events/k/enable
echo 0 > options/stacktrace
echo > kprobe_events
-------------------------------------------------------------------------------

3. This confirms that the function 'n_tty_receive_char_special' is triggered
when sending a TTY signal by pressing 'Ctrl-\', and the stack trace provided
the actual entry point for me to ftrace. Note that it is not practical to trace
functions that belong to a kworker (also, some functions are untraceable). In
the end, I chose the 'n_tty_receive_char_special' function because every
other function from the stack trace is called by any activity in a terminal
(which I confirmed by using kprobes on them).

-------------[ Enable ftrace only for n_tty_receive_char_special ]-------------
cd /sys/kernel/debug/tracing
echo > set_ftrace_pid
echo n_tty_receive_char_special > set_graph_function
echo function_graph > current_tracer
echo 1 > tracing_on
pv trace_pipe > /dev/shm/ftrace.n_tty_receive_char_special
-------------------------------------------------------------------------------

Once again, I pressed 'Ctrl-\' on another terminal to trigger the function,
and then stopped the tracer:

-----------------------------------[ code ]------------------------------------
^C                          # Ctrl-C for terminating pv
echo nop > current_tracer
echo 0 > tracing_on
echo > trace
echo > set_graph_function
echo > set_ftrace_pid
-------------------------------------------------------------------------------

The trimmed output of the call tree looks like:

----------------[ /dev/shm/ftrace.n_tty_receive_char_special ]-----------------
n_tty_receive_char_special
  n_tty_receive_signal_char
    isig
      kill_pgrp
        do_send_sig_info
          send_signal
            __send_signal
              complete_signal
-------------------------------------------------------------------------------

Up until the point of the 'do_send_sig_info' function call, the call trees
look quite different. The TTY call tree lacks security checks and is initiated
by a kernel worker thread (kworker), while the rest of the calls are more or
less the same. This provides valuable insight into the difference between
sending signals through the 'kill(2)' syscall versus using special keys on a
console.

===[ Ptrace to the Rescue: Enters GDB ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Ptrace provides mechanisms for tracing and manipulating a process, such as
stopping it, setting its registers, reading or writing its memory, sending and
receiving signals, and more. For our purposes, ptrace roughly works as follows:

  1. A traced thread (tracee) is connected to a tracer process (the one calling
     ptrace()).

  2. The tracee will be stopped whenever some event occurs (e.g. signal
     delivery).

GDB (GNU Debugger) is a powerful yet quirky tool. We will be using GDB as a
convenient frontend for the 'ptrace(2)' syscall.

The connection between a tracer and tracee can be seen in the process state:

--------------------[ Attaching GDB to a stopped process ]---------------------
# /bin/sleep 1337
^Z
[1]+  Stopped                 /bin/sleep 1337

# pid="$(pgrep -f 'sleep 1337')"

# grep ^State: /proc/$pid/status
State:  T (stopped)

# gdb -q --nx -p "$pid"
...
Program received signal SIGTSTP, Stopped (user).
__GI___clock_nanosleep (clock_id=clock_id@entry=0, flags=flags@entry=0,
  req=<optimized out>, rem=<optimized out>) at
  ../sysdeps/unix/sysv/linux/clock_nanosleep.c:79
(gdb) 

# grep ^State: /proc/$pid/status
State:  t (tracing stop)
-------------------------------------------------------------------------------

That is very interesting! Attaching GDB changed the state from stopped (caused
by the 'SIGTSTP' signal) to tracing stop (caused by ptrace). It seems that
ptrace took over the stopped process (unsurprisingly, there is a lot of code
for that [ref12]). Can we kill it with ptrace? Yes, we can!

-------------------[ Killing the stopped process with GDB ]--------------------
(gdb) kill
Kill the program being debugged? (y or n) y
[Inferior 1 (process 192761) killed]

(gdb) bt
No stack.

(gdb) quit

[1]+  Killed                  /bin/sleep 1337
-------------------------------------------------------------------------------

We just sent the 'SIGKILL' signal by the 'ptrace(2)' syscall (in the
background, GDB calls 'ptrace(PTRACE_KILL, 192761)'; try to strace gdb :)).
Will it work on a program that has an AppArmor signal filter? Let's find out...

===[ HANS! Kill it with PTRACE_KILL ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the following example, we will send 'PTRACE_KILL' to the stopped process
that has an AppArmor profile filtering all received signals. If our assumptions
are correct, it should kill the process.

----------------[ Killing a stopped process with a gdb script ]----------------
# /tmp/sleep 1337
^Z
[1]+  Stopped                 /tmp/sleep 1337

# fg
/tmp/sleep 1337
^C^C^\^\^C^C^\^\^Z^Z

# gdb -q --nx --batch -p "$(pgrep -f '/tmp/sleep 1337')" -ex 'kill'
0x00007f446eb2750a in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
Kill the program being debugged? (y or n) [answered Y; input not from terminal]
[Inferior 1 (process 3048689) killed]
-------------------------------------------------------------------------------

It works! However, ptrace's 'PTRACE_KILL' uses a different mechanism for
killing a process than sending the 'SIGKILL' signal through the 'kill(2)'
syscall and, unfortunately, it has some problems. Here is a quote from the
'ptrace(2)' man page [ref13]:

  PTRACE_KILL
      Send the tracee a SIGKILL to terminate it.  (addr and data are ignored.)

      This  operation  is  deprecated;  do  not  use  it!  Instead, send a
      SIGKILL directly using kill(2) or tgkill(2).  The problem with
      PTRACE_KILL is that it requires the tracee to be in signal-delivery-stop,
      otherwise  it  may not work (i.e., may complete successfully but won't
      kill the tracee).  By contrast, sending a SIGKILL directly has no such
      limitation.

Well, what are other ways to terminate a stopped process?

===[ HANS! Kill it with SS TTY signal ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Interestingly enough, when we attach GDB, we are able to access its command
line and perform single stepping. This suggests that the program is no longer
in the stopped state and its execution can be manipulated. Let's test it by
examining the signal delivery from the signal queue, mentioned in the earlier
section 'The Signal Queue: One at a Time Please'.

Our steps will be:

  1. Start a program with an active AppArmor signal filter.

  2. Suspend the program with 'SIGTSTP' by pressing 'Ctrl-Z' in a TTY.

  3. Try to wake up the process with the shell's 'fg' command.

  4. Send the 'SIGQUIT' signal to the program by pressing 'Ctrl-\' in
     the TTY.

  5. Attach GDB to the program.

  6. Perform a single step using GDB.

---------------------[ Single stepping a stopped program ]---------------------
# /tmp/sleep 1337
^Z
[1]+  Stopped                 /tmp/sleep 1337

# fg
/tmp/sleep 1337
^\

# gdb -q --nx -p "$(pgrep -f '/tmp/sleep 1337')"

Program received signal SIGTSTP, Stopped (user).
0x00007f34c787c50a in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) si

Program received signal SIGQUIT, Quit.
0x00007f34c787c50a in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) si

Program terminated with signal SIGQUIT, Quit.
The program no longer exists.
(gdb) quit
-------------------------------------------------------------------------------

The signal from the TTY was delivered after a single step. This indicates that
ptrace disrupts the ongoing stopped state and enables the process to handle
signals as they arrive.

===[ HANS! Kill it with RIP ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Another way to kill a program is by manipulating its registers, specifically by
modifying the instruction pointer (in x86-64 known as the 'RIP' register).

We can set it to '0' (or any other value that is not mapped) and cause a
program crash with the infamous segmentation fault (the signal 'SIGSEGV').
Alternatively, we can do some nifty stuff when we set it to some legit memory
area and start single stepping. This opens up a range of possibilities. For
example, we can "ROP" (Return Oriented Programming) our way to the end of a
program. We just need to find the function that does the 'exit_group' syscall
or the 'syscall' instruction and we can call any syscall we want.

In the example below, we will create the 'syscall' instruction, and through
register manipulation, we will call the 'exit_group' syscall:

(SIDE NOTE: The 'exit_group' syscall is necessary for terminating the entire
program because modern programs often have multiple threads, and the
'exit(2)' syscall terminates only the calling thread. [ref19])

1. Attach GDB to the program (with active AppArmor signal filter):

---------------------[ Attaching to the stopped process ]----------------------
# gdb -q --nx -p "$(pgrep -f '/tmp/sleep 1337')"

Program received signal SIGTSTP, Stopped (user).
0x00007f2d43ff550a in clock_nanosleep () from /lib/x86_64-linux-gnu/libc.so.6
-------------------------------------------------------------------------------

SIDE NOTE: Since we are currently in glibc, it is likely that a 'syscall'
instruction is located nearby. The backtrace shows that we are in the
'clock_nanosleep' syscall, and when we disassemble around the current frame,
we can see that the 'syscall' instruction is located two bytes before the
current value of 'RIP':

-----------------[ Finding the syscall instruction in glibc ]------------------
(gdb) x/2i $rip-2
   0x7f2d43ff5508 <clock_nanosleep+40>: syscall 
=> 0x7f2d43ff550a <clock_nanosleep+42>: mov    edx,eax
-------------------------------------------------------------------------------

(SIDE NOTE: Be cautious when subtracting from the 'RIP' register and
disassembling from the result. x86 has variable length instructions that are
not aligned, and there is no reliable way to find the beginning of an
instruction when walking backwards [ref14]. There is a high chance that we
will not hit the beginning of an instruction and end up disassembling
completely different, but still valid, instructions! Fortunately, x86
disassembly is self-healing: after a few bogus instructions it usually
resynchronizes with the real instruction stream. So if we intelligently tweak
the offset, we can locate the actual instructions. I typically move 'RIP' by
multiples of 16 and search for plausible instructions.)


2. We could reuse the 'syscall' instruction lying in glibc; however, we will
create our own 'syscall' instruction. In GDB, it is fairly simple to inject
raw data. We just need to know the opcode of the 'syscall' instruction, which
is '0f 05' on x86-64 [ref15]. To inject it, we will enter it as a number
(because injecting byte/char arrays in GDB is an unforgivable sin against
nature). However, since the x86 architecture uses little endian byte ordering,
the two opcode bytes have to be swapped when they are written as a single
16-bit number. So, the opcode '0f 05' becomes the value '0x050f'.

----------[ Rewriting a memory address with the syscall instruction ]----------
(gdb) set {short} $rip = 0x050f

(gdb) x/1i $rip
=> 0x7f2d43ff550a <clock_nanosleep+42>: syscall
-------------------------------------------------------------------------------
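
The byte swap can be double-checked with plain shell arithmetic. This is only a
sketch, and the helper name 'le16' is made up for this example. It combines two
bytes the way a little endian CPU reads a 16-bit value from memory: the byte at
the lower address is the least significant one.

```shell
# Hypothetical helper: take two bytes in memory order (hex, no prefix) and
# print the 16-bit number a little endian machine reads from them.
le16() {
    printf '0x%04x\n' $(( (0x$2 << 8) | 0x$1 ))
}

# The SYSCALL opcode is stored in memory as 0f, then 05.
le16 0f 05
```

Inside GDB, 'x/2xb $rip' prints the two bytes in memory order, which is a quick
way to verify that the write landed as intended.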

(SIDE NOTE: Rewriting part of a shared library will cause a page fault, and the
kernel will copy the page and make it private for the 'sleep' process.)


3. To call the 'exit_group' syscall in assembly, we need to set the 'RAX'
register to the syscall number for 'exit_group', which is 231 (0xe7) on
x86-64 (this number can be found by grepping '/usr/include'). We also want to
set the 'RDI' register to the exit status code that the program will return
to the parent process.

(SIDE NOTE: In general, the syscall number for each syscall is specified in the
ABI for the architecture. On Linux x86-64, these syscall numbers are defined in
the '/usr/include/x86_64-linux-gnu/asm/unistd_64.h' header file. The ABI
calling convention can be found in the 'syscall(2)' manpage [ref16].)

----------------[ Finding the syscall number for 'exit_group' ]----------------
# grep -r exit_group /usr/include
/usr/include/x86_64-linux-gnu/asm/unistd_64.h:#define __NR_exit_group 231
-------------------------------------------------------------------------------
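
The lookup and the decimal to hex conversion can be wrapped in a small helper.
A minimal sketch: the function name 'syscall_nr' is hypothetical, and the
default header path is the Debian amd64 one mentioned above, which may differ
on other distributions (pass another path as the second argument).

```shell
# Hypothetical helper: print a syscall number in decimal and hex.
#   $1  syscall name without the __NR_ prefix
#   $2  optional path to the header with the #define lines
syscall_nr() (
    header=${2:-/usr/include/x86_64-linux-gnu/asm/unistd_64.h}
    # Match lines of the form: #define __NR_<name> <number>
    nr=$(awk -v name="__NR_$1" \
        '$1 == "#define" && $2 == name { print $3; exit }' "$header")
    # Fail if the header is missing or the syscall is not defined in it.
    [ -n "$nr" ] || exit 1
    printf '%s = %d (0x%x)\n' "$1" "$nr" "$nr"
)

syscall_nr exit_group 2>/dev/null ||
    echo "header not found, pass its path as the second argument"
```

The hex form is the value that goes into 'RAX' in the next step.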

Preparing the registers for the 'exit_group' syscall on x86-64 will look like
this:

----------------------[ Preparing the 'exit_group' call ]----------------------
(gdb) set $rax = 0xe7
(gdb) set $rdi = 0
-------------------------------------------------------------------------------


4. Finally, we just need to make a single step:

----------------------------[ Calling the syscall ]----------------------------
(gdb) si
[Inferior 1 (process 3530764) exited with code 00]
-------------------------------------------------------------------------------

And we are done. We successfully terminated the stopped program.

===[ OOM Slaughterhouse ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Cool, cool. But what if ptracing is completely forbidden, for example by Yama
[ref17]?
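
Whether Yama is in the way can be checked before reaching for GDB. A small
sketch, assuming the 'ptrace_scope' sysctl and its four values as described in
the Yama documentation [ref17]; the helper name 'describe_ptrace_scope' is
invented for this example.

```shell
# Hypothetical helper: translate a Yama ptrace_scope value into words.
describe_ptrace_scope() {
    case "$1" in
        0) echo "classic ptrace permissions" ;;
        1) echo "restricted ptrace (descendants or declared tracers only)" ;;
        2) echo "admin-only attach (CAP_SYS_PTRACE required)" ;;
        3) echo "no attach (cannot be changed once set)" ;;
        *) echo "unknown value: $1" ;;
    esac
}

scope_file=/proc/sys/kernel/yama/ptrace_scope
if [ -r "$scope_file" ]; then
    describe_ptrace_scope "$(cat "$scope_file")"
else
    echo "Yama does not seem to be enabled on this kernel"
fi
```

With the value 3, none of the GDB based tricks above can be used, not even by
root.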

Although ptrace may work in most cases, there are other solutions to this
problem. We will ignore nuclear options such as SysRq kill or loading a kernel
module, which could be used to kill the process group (btw, I won't lie,
loading a module was my first thought when I encountered this problem). There
is another hack I want to cover here -- abusing the infamous OOM (Out of
Memory) killer. However, we cannot (ab)use the OOM killer directly on the
system, as it would likely kill everything except for the stopped program. We
need to create a safe environment where the OOM killer can run its killing
spree.

Cgroups is a kernel feature that allows users to have better control over what
and how many resources processes can consume. This is particularly useful when
we want to constrain the maximum amount of memory that processes in a group can
use. The good news is that the OOM killer is aware of memory cgroups, so if
memory usage exceeds the allowed limit, it will only kill processes in that
specific cgroup.
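
One prerequisite is easy to miss: the 'memory.*' files appear in a child cgroup
only if the memory controller is enabled in the parent's
'cgroup.subtree_control' [ref18]. Below is a minimal sketch of such a check,
assuming cgroup v2 is mounted at '/sys/fs/cgroup'; the helper name
'has_controller' is made up for this example.

```shell
# Hypothetical helper: succeed if controller $2 is present in the
# space-separated controller list $1.
has_controller() {
    case " $1 " in
        *" $2 "*) return 0 ;;
        *) return 1 ;;
    esac
}

root=/sys/fs/cgroup
if [ -r "$root/cgroup.controllers" ]; then
    if has_controller "$(cat "$root/cgroup.subtree_control")" memory; then
        echo "memory controller is available in child cgroups"
    else
        echo "enable it first: echo +memory > $root/cgroup.subtree_control"
    fi
else
    echo "$root does not look like a cgroup v2 mount"
fi
```

On systemd based distributions the memory controller is usually enabled
already, so the check mostly matters on minimal or custom setups.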

Let's create an example:

  1. We will prepare the unkillable process.

  2. We will create a new cgroup v2 and move the process to it.

  3. We will set all memory limits to zero.

  4. We will enable 'memory.oom.group', because we want the OOM killer to
     kill every process in the group.

  5. We will spawn a new shell, move it to the cgroup, and force it to make a
     memory allocation by executing a command. (In newer kernels, there is the
     option 'memory.reclaim' [ref18], which I did not test, but it could work
     instead of making an allocation by another process.)

--------------------[ Invoking the OOM killer in a cgroup ]--------------------
# /tmp/sleep 1337
^Z
[1]+  Stopped                 /tmp/sleep 1337

# mkdir /sys/fs/cgroup/oom_kill_me

# cd /sys/fs/cgroup/oom_kill_me

# pgrep -f '/tmp/sleep 1337' > cgroup.procs

# echo 0 | tee memory.high memory.max memory.swap.max memory.swap.high

# echo 1 > memory.oom.group

# bash -c 'echo $$ > cgroup.procs ; ls'
[1]+  Killed                  /tmp/sleep 1337  (wd: /sys/fs/cgroup)
(wd now: /sys/fs/cgroup/oom_kill_me)
Killed

# dmesg | tail -2
Memory cgroup out of memory: Killed process 215843 (sleep) ...
Memory cgroup out of memory: Killed process 215936 (bash) ...
-------------------------------------------------------------------------------

And that's it.
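
The kill can also be confirmed without 'dmesg'. A minimal sketch, assuming the
flat-keyed 'memory.events' file from the cgroup v2 documentation [ref18], whose
'oom_kill' key counts the processes of the cgroup killed by the OOM killer. The
helper name 'cgroup_event' is made up for this example, and the cgroup path is
the one created above.

```shell
# Hypothetical helper: print the value of one key from a flat-keyed cgroup
# file such as memory.events ("key value" on every line).
#   $1  path to the file
#   $2  key to look up
cgroup_event() {
    awk -v key="$2" '$1 == key { print $2 }' "$1"
}

cg=/sys/fs/cgroup/oom_kill_me
if [ -r "$cg/memory.events" ]; then
    echo "killed by the OOM killer: $(cgroup_event "$cg/memory.events" oom_kill)"
fi
```

Once 'cgroup.procs' is empty, the now useless cgroup can be removed with
'rmdir /sys/fs/cgroup/oom_kill_me'.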

===[ Conclusion ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

I do not know the exact reasons why the TTY driver does not use the security
module. My guess is that users expect the terminal to work (for some reason
:)). But seriously, what would have to be done to implement AppArmor signal
filtering in the TTY driver? It would not be as simple as the plain call to the
'kill_pgrp' function that is there now, and we definitely cannot block all
processes from receiving a signal. We know that the TTY sends a signal to the
whole process group, therefore we would have to iterate over all PIDs and check
if there is a security context for each of them (and we must not forget about
locking). Yeah, I understand why nobody wants to do it.

OK, that is all for today. We have learned some edgy debugging techniques and
some inner workings of the kernel. Now, flee my children! You are free!



===[ References ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

[ref1] https://manpages.debian.org/bullseye/manpages/signal.7.en.html

[ref2] https://manpages.debian.org/bullseye/manpages-dev/sigaction.2.en.html

[ref3] https://elixir.bootlin.com/linux/v5.10.170/source/include/linux/sched.h#L68
  * https://elixir.bootlin.com/linux/v5.10.170/source/fs/proc/array.c#L129

[ref4]
  * https://elixir.bootlin.com/linux/v5.10.170/source/kernel/ptrace.c#L480
  * https://elixir.bootlin.com/linux/v5.10.170/source/kernel/signal.c#L2366
  * https://elixir.bootlin.com/linux/v5.10.170/source/kernel/signal.c#L931

[ref5] https://elixir.bootlin.com/linux/v5.10.170/source/drivers/tty/n_tty.c
  * 'grep' for 'security' in '/drivers/tty'.

[ref6] https://manpages.debian.org/bullseye/manpages/proc.5.en.html

[ref7] Michael Kerrisk -- The Linux Programming Interface: A Linux and UNIX System Programming Handbook
  * Chapter 28: Process Creation and Program Execution in More Detail
  * ISBN-10: 1-59327-220-0 ; ISBN-13: 978-1-59327-220-3

[ref8] https://docs.kernel.org/trace/ftrace.html

[ref9] https://elixir.bootlin.com/linux/v5.10.170/source/drivers/tty/n_tty.c#L1266

[ref10] https://www.man7.org/linux/man-pages/man1/stty.1.html

[ref11] https://elixir.bootlin.com/linux/v5.10.170/source/kernel/signal.c#L2295
  * https://elixir.bootlin.com/linux/v5.10.170/source/kernel/signal.c#L412

[ref12]
  * https://elixir.bootlin.com/linux/v5.10.170/source/kernel/signal.c
    * Search for 'ptrace'.
  * https://elixir.bootlin.com/linux/v5.10.170/source/kernel/ptrace.c
    * Search for 'STOPPED'.

[ref13] https://manpages.debian.org/bullseye/manpages-dev/ptrace.2.en.html

[ref14] https://www.righto.com/2023/02/how-8086-processor-determines-length-of.html

[ref15] Intel 64 and IA-32 Architectures Software Developer's Manual Combined Volumes: 1, 2A, 2B, 2C, 2D, 3A, 3B, 3C, 3D, and 4
  * Vol. 2B 4-695 ; SYSCALL-Fast System Call
  * Order Number: 325462-078US ; December 2022
  * https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html

[ref16] https://manpages.debian.org/bullseye/manpages-dev/syscall.2.en.html#Architecture_calling_conventions

[ref17] https://www.kernel.org/doc/html/latest/admin-guide/LSM/Yama.html

[ref18] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

[ref19] https://manpages.debian.org/bullseye/manpages-dev/exit.2.en.html
  * https://manpages.debian.org/bullseye/manpages-dev/exit_group.2.en.html
  * https://manpages.debian.org/bullseye/manpages-dev/exit.3.en.html

===[ APPENDIX A: Versions ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-----------------------------------[ code ]------------------------------------
Debian GNU/Linux 11 (bullseye)
Linux 5.10.0-21-amd64 #1 SMP Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux

bash         5.1-2+deb11u1
strace       5.10-1
gdb          10.1-1.7
gcc          4:10.2.1-1
-------------------------------------------------------------------------------

===[ APPENDIX B: Bash fg (SIGCONT) with and without AppArmor ]~~~~~~~~~~~~~~~~~

These are strace outputs of the parent shell after invoking 'fg'.

Without AppArmor in place:

-----------------------------------[ code ]------------------------------------
[pid 617677] 20:18:17.396896 kill(-1132628, SIGCONT) = 0
[pid 617677] 20:18:17.396958 rt_sigprocmask(SIG_SETMASK, [CHLD],  <unfinished ...>
[pid 1132628] 20:18:17.396985 --- SIGCONT {si_signo=SIGCONT, si_code=SI_USER, si_pid=617677, si_uid=1001} ---
[pid 617677] 20:18:17.397011 <... rt_sigprocmask resumed>NULL, 8) = 0
[pid 1132628] 20:18:17.397035 restart_syscall(<... resuming interrupted clock_nanosleep ...> <unfinished ...>
[pid 617677] 20:18:17.397063 rt_sigprocmask(SIG_BLOCK, [CHLD], [CHLD], 8) = 0
[pid 617677] 20:18:17.397122 wait4(-1, [{WIFCONTINUED(s)}], WSTOPPED|WCONTINUED, NULL) = 1132628
-------------------------------------------------------------------------------

With AppArmor enforcement:

-----------------------------------[ code ]------------------------------------
[pid 617677] 20:22:46.563918 kill(-1142643, SIGCONT) = -1 EACCES (Permission denied)
[pid 617677] 20:22:46.563993 rt_sigprocmask(SIG_SETMASK, [CHLD], NULL, 8) = 0
[pid 617677] 20:22:46.564050 rt_sigprocmask(SIG_BLOCK, [CHLD], [CHLD], 8) = 0
[pid 617677] 20:22:46.564094 wait4(-1, 
-------------------------------------------------------------------------------

===[ APPENDIX C: kill_pgrp ]~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When 'Ctrl-C' is pressed on a terminal, the kernel function 'kill_pgrp' sends
the signal to the process group that runs in the foreground.

An example:

-------------------------------[ pipe commands ]-------------------------------
(sleep 100 ; echo brm ) | cat | cat -
-------------------------------------------------------------------------------

-------------------------------[ process tree ]--------------------------------
# ps axwww -o pid,pgrp,command --forest

PID     PGRP    COMMAND
 97373   97373  \_ /bin/bash
105017  105017      \_ /bin/bash
105024  105017      |   \_ sleep 100
105022  105017      \_ cat
105023  105017      \_ cat -
-------------------------------------------------------------------------------
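
The members of a foreground process group can also be listed straight from
procfs. A minimal sketch, assuming the field layout of '/proc/PID/stat' from
proc(5) [ref6], where field 5 is the process group ('pgrp') and field 8 is the
foreground process group of the controlling terminal ('tpgid'). The helper name
'stat_pgrp_tpgid' is made up for this example.

```shell
# Hypothetical helper: print "pgrp tpgid" from one line of /proc/PID/stat.
stat_pgrp_tpgid() {
    # Drop the leading "PID (comm) ". The command name may contain spaces and
    # parentheses, so cut after the LAST ") " in the line. What remains starts
    # with: state ppid pgrp session tty_nr tpgid ...
    set -- ${1##*") "}
    echo "$3 $6"
}

# Print every process whose group is the foreground group of its terminal.
for f in /proc/[0-9]*/stat; do
    # A process may exit between the glob and the read, so ignore failures.
    line=$(cat "$f" 2>/dev/null) || continue
    ids=$(stat_pgrp_tpgid "$line")
    if [ "${ids% *}" = "${ids#* }" ]; then
        echo "${f%/stat}: foreground, pgrp=${ids% *}"
    fi
done
```

For the pipeline above, all four processes with 'PGRP' 105017 would be
expected in the output while the pipeline runs in the foreground.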

NOTE: The kprobe command is bound to the exact kernel version (in this case it
was 'Debian 5.10.162-1 (2023-01-21) x86_64 GNU/Linux')!

------------------------------[ kprobe command ]-------------------------------
kprobe 'do_send_sig_info+0 pid=+2232(%dx):s32'
-------------------------------------------------------------------------------

--------------------------------[ kprobe log ]---------------------------------
 kworker/u4:2-104947    (do_send_sig_info+0x0/0xc0) pid=105024
 kworker/u4:2-104947    (do_send_sig_info+0x0/0xc0) pid=105023
 kworker/u4:2-104947    (do_send_sig_info+0x0/0xc0) pid=105022
 kworker/u4:2-104947    (do_send_sig_info+0x0/0xc0) pid=105017
         bash-105017    (do_send_sig_info+0x0/0xc0) pid=105017   # <-- [1]
-------------------------------------------------------------------------------

[1] The process group leader is 'bash' with PID 105017. When it gets the
signal from the TTY, it catches the signal and sends it to the group leader,
which is bash itself.