A Deep Dive into the Boundaries of Android Risk Control: syscall

bdwms

2023-10-07

Android

The essence of risk control is to weed out the bad actors and keep the good ones.

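In the previous article we discussed the boundary battle at the Java layer of Android risk control, but in practice the more critical confrontation takes place in the native layer. There, if we call libc functions directly to probe the environment and extract information, those calls are very easy to hook and intercept. A common countermeasure is to issue system calls directly via syscall. This article analyzes attack and defense at the syscall boundary and the principles behind it.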

SYSCALL Invocation

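The previous article mentioned user mode and kernel mode. For a user-mode program to reach restricted resources such as hardware devices, it issues a request from user space and the kernel carries it out on its behalf, which protects the stability and security of the kernel. The syscall is the bridge between user space and kernel space: on Linux, user space issues a syscall to the kernel, which raises a software interrupt (a trap) so that the program enters kernel mode and the requested operation is performed. For syscall invocation on Android, we can refer directly to the code in bionic: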

arm64-v8a:

/* Move syscall No. from x0 to x8 */
mov     x8, x0
/* Move syscall parameters from x1 thru x6 to x0 thru x5 */
mov     x0, x1
mov     x1, x2
mov     x2, x3
mov     x3, x4
mov     x4, x5
mov     x5, x6
svc     #0

/* check if syscall returned successfully */
cmn     x0, #(MAX_ERRNO + 1)
cneg    x0, x0, hi
b.hi    __set_errno_internal

ret

armeabi-v7a:

mov     ip, sp
stmfd   sp!, {r4, r5, r6, r7}
.cfi_def_cfa_offset 16
.cfi_rel_offset r4, 0
.cfi_rel_offset r5, 4
.cfi_rel_offset r6, 8
.cfi_rel_offset r7, 12
mov     r7, r0
mov     r0, r1
mov     r1, r2
mov     r2, r3
ldmfd   ip, {r3, r4, r5, r6}
swi     #0
ldmfd   sp!, {r4, r5, r6, r7}
.cfi_def_cfa_offset 0
cmn     r0, #(MAX_ERRNO + 1)
bxls    lr
neg     r0, r0
b       __set_errno_internal

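With this, we can pass in the syscall number and the remaining arguments and make the system call from assembly. As long as the svc sequence lives in our own code, an attacker's hooks on library functions, such as inline hooks, no longer take effect.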
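A minimal sketch of the same idea in C, assuming a GCC or Clang toolchain. The helper name raw_getpid is invented for this illustration. On arm64 it places the syscall number in x8 and executes svc #0 inline, mirroring the bionic stub above; on other architectures it falls back to libc's syscall() only so that the snippet still builds, and that fallback path remains hookable.

```c
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: fetch the PID without calling libc's getpid(). */
pid_t raw_getpid(void) {
#if defined(__aarch64__)
    register long x8 __asm__("x8") = __NR_getpid; /* syscall number */
    register long x0 __asm__("x0");               /* return value   */
    __asm__ volatile("svc #0" : "=r"(x0) : "r"(x8) : "memory", "cc");
    return (pid_t)x0;
#else
    /* Fallback for other ABIs: goes through libc, so it can be hooked. */
    return (pid_t)syscall(SYS_getpid);
#endif
}
```

A detection routine could compare raw_getpid() with getpid(); a mismatch would hint that one of the two paths has been tampered with.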

How Ptrace Works

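svc is the last gate before kernel mode, so can direct system calls still be intercepted? ptrace is a powerful debugging facility provided by Linux, and tools such as gdb are built on it. Without root, a process can use ptrace to trace the processes it launches itself (its children): it can stop the tracee, read and modify its registers and memory, and stop it at the entry and exit of every system call, right at the boundary with kernel mode. Here is a brief look at ptrace: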

int main() {
    pid_t child_pid = fork();
    if (child_pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, 0, 0) == -1) {
            perror("ptrace");
            exit(1);
        }
        kill(getpid(), SIGSTOP);  // Stop until the parent is ready to continue
        child();
    } else if (child_pid > 0) {
        int status;
        waitpid(child_pid, &status, 0);  // Wait for child to stop
        if (ptrace(PTRACE_SETOPTIONS, child_pid, 0, PTRACE_O_TRACESECCOMP | PTRACE_O_TRACESYSGOOD) == -1) {
            perror("ptrace");
            exit(1);
        }
        ptrace(PTRACE_SYSCALL, child_pid, 0, 0);  // Continue the child
        parent(child_pid);
    } else {
        perror("fork");
        exit(1);
    }
    return 0;
}

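After the parent fork()s the child, the child calls ptrace(PTRACE_TRACEME, 0, 0, 0) to declare that it may be traced by its parent, then stops itself and does not continue until the parent issues a ptrace request to resume it. The parent can set its options in the meantime and then resume the child with ptrace(PTRACE_SYSCALL, child_pid, 0, 0).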

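A brief note on the difference between PTRACE_SYSCALL and PTRACE_CONT: both resume a stopped tracee. PTRACE_SYSCALL can be thought of as stepping one system call at a time, since the tracee stops again at the next syscall_enter or syscall_exit of an svc, whereas PTRACE_CONT lets it run on without stopping at system calls.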
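To make the difference concrete, here is a small sketch, not taken from the original project, that traces a short-lived child and counts its syscall stops. The function name count_syscall_stops is hypothetical; it takes the resume request as a parameter so the same code can be run with either PTRACE_SYSCALL or PTRACE_CONT, and it returns -1 when tracing cannot be set up (for example in a sandbox that forbids ptrace).

```c
#include <signal.h>
#include <stddef.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: resume a traced child with `request` and count
 * how many syscall stops (SIGTRAP | 0x80) the tracer observes. */
int count_syscall_stops(int request) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, NULL, NULL) == -1)
            _exit(1);
        raise(SIGSTOP);       /* wait until the tracer is ready */
        syscall(SYS_getpid);  /* one syscall for the tracer to observe */
        _exit(0);
    }

    int status, count = 0;
    /* The first stop is the child's own SIGSTOP. */
    if (waitpid(pid, &status, 0) != pid || !WIFSTOPPED(status))
        return -1;
    if (ptrace(PTRACE_SETOPTIONS, pid, NULL,
               (void *)(long)PTRACE_O_TRACESYSGOOD) == -1) {
        kill(pid, SIGKILL);
        waitpid(pid, &status, 0);
        return -1;
    }
    for (;;) {
        if (ptrace(request, pid, NULL, NULL) == -1) {
            kill(pid, SIGKILL);
            waitpid(pid, &status, 0);
            return -1;
        }
        if (waitpid(pid, &status, 0) != pid)
            return -1;
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;
        if (WIFSTOPPED(status) && WSTOPSIG(status) == (SIGTRAP | 0x80))
            count++;
    }
    return count;
}
```

With PTRACE_CONT the child runs to completion without a single syscall stop, while PTRACE_SYSCALL reports at least the entry and exit of getpid; the exact total depends on what the C library does around raise() and _exit().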

How Seccomp-BPF Works

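As shown above, we can already use ptrace with PTRACE_SYSCALL to step through system calls and modify the registers at the entry and exit stages, thereby changing a call's arguments and return value. There is one serious problem, though: if every syscall is intercepted, the resulting flood of stops and inter-process communication makes the program sluggish. Is there a way around this? There is: seccomp and BPF, introduced next.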

Seccomp

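seccomp (short for secure computing mode) is a security mechanism supported by the Linux kernel. On Linux, a large number of system calls are exposed directly to user-mode programs, but not all of them are needed, and unsafe code that abuses them can threaten the system. With seccomp we can restrict which system calls a program may use, which reduces the system's attack surface and puts the program into a "safe" state. This is the embryonic form of a sandbox.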

seccomp has two modes. In strict mode only the following system calls are permitted, and any other system call terminates the process with SIGKILL:

read, write, _exit, sigreturn

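Linux 3.5 introduced the second mode, SECCOMP_MODE_FILTER, which supports user-defined rules expressed as a BPF filter. In this mode we can specify which system calls are allowed; seccomp-BPF matches its rules using the Berkeley Packet Filter (BPF) and filters system calls accordingly.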
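As a small illustration of strict mode, the sketch below forks a child, switches it into SECCOMP_MODE_STRICT, and lets it call getpid, which is not on the allow-list. The function name strict_mode_kills_getpid is made up for this example, and the result -1 covers environments where strict mode cannot be enabled.

```c
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Returns 1 if the strict-mode child was killed by SIGKILL for calling
 * getpid, 0 if it survived, -1 if strict mode could not be enabled. */
int strict_mode_kills_getpid(void) {
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0) != 0)
            _exit(2);          /* strict mode unavailable */
        syscall(SYS_getpid);   /* not allowed in strict mode */
        syscall(SYS_exit, 0);  /* only reached if the child was not killed */
    }

    int status;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 2)
        return -1;
    return WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL;
}
```

Note that the child uses the raw exit system call on its normal path: the C library's _exit() typically issues exit_group, which strict mode does not allow either.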

BPF

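BPF was first proposed in 1992 for the tcpdump program. tcpdump is a network packet monitoring tool, but the volume of packets is large, and copying every packet captured in kernel space to user space incurs a lot of unnecessary overhead. Packets therefore need to be filtered so that only the interesting ones are kept, and filtering in the kernel is more efficient than filtering in user space. BPF provides a way to do this in-kernel filtering, so user space only has to process the packets of interest that survive the kernel filter.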

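BPF defines a virtual machine (VM) that can be implemented inside the kernel. The VM has the following characteristics:

Simple instruction set
- Small instruction set
- All instructions are the same size
- Implementation is simple and fast
Only forward branch instructions
- Programs are directed acyclic graphs (DAGs), with no loops
Easy to verify the validity/safety of programs
- Simple instruction set ⇒ opcodes and arguments can be verified
- Dead code can be detected
- Programs must end with a Return
- BPF filter programs are limited to 4096 instructions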
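Several of these properties are easy to check mechanically. The sketch below is a deliberately simplified checker, not the kernel's actual verifier: the name cbpf_sketch_check is hypothetical, and it only confirms the length limit, that every jump lands inside the program, and that the last instruction is a return. Because jump offsets are unsigned, a jump can only move forward, which is why no loop can be expressed.

```c
#include <linux/filter.h>
#include <stddef.h>

/* Simplified validity check for a classic BPF program.
 * Returns 1 if the program passes, 0 otherwise. */
int cbpf_sketch_check(const struct sock_filter *prog, size_t len) {
    if (len == 0 || len > BPF_MAXINSNS)   /* at most 4096 instructions */
        return 0;

    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *insn = &prog[pc];

        if (BPF_CLASS(insn->code) != BPF_JMP)
            continue;

        if (BPF_OP(insn->code) == BPF_JA) {
            /* Unconditional jump: target is pc + 1 + k. */
            if (insn->k >= len - pc - 1)
                return 0;
        } else {
            /* Conditional jump: targets are pc + 1 + jt and pc + 1 + jf. */
            if (pc + 1 + insn->jt >= len || pc + 1 + insn->jf >= len)
                return 0;
        }
    }

    /* The program must end with a return instruction. */
    return BPF_CLASS(prog[len - 1].code) == BPF_RET;
}
```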

struct sock_filter filter[] = {
    BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, nr)),
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_write, 0, 1),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRACE),
    BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
};
struct sock_fprog prog = {
    .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])), /* number of BPF instructions */
    .filter = filter, /* pointer to the array of BPF instructions */
};

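Let's look at what this BPF rule means in detail:

This BPF (Berkeley Packet Filter) code defines a seccomp filter that decides how each system call is handled. This particular filter singles out the write system call. Each line is explained below: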

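BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, nr))

This instruction loads the system call number into the BPF accumulator. BPF_LD is the load instruction, BPF_W means a word (32 bits) is loaded, BPF_ABS means the load uses an absolute offset, and offsetof(struct seccomp_data, nr) is the offset of the nr field, which holds the system call number, within the seccomp_data structure.

BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_write, 0, 1)

This instruction checks whether the loaded system call number is that of write (__NR_write). BPF_JMP is the jump class, BPF_JEQ is an equality comparison, and BPF_K means the comparison is against a constant. The last two arguments are the jump offsets for the true and false outcomes: if the number is write, the offset is 0 and execution continues with the next instruction (BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRACE)); otherwise the offset is 1, which skips that instruction and lands on BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW).

BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRACE)

Reaching this instruction means the system call is write. It tells seccomp to report the system call to the ptrace tracer so that our code can handle it.

BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW)

Reaching this instruction means the system call is not write. It tells seccomp to let the system call proceed without notifying ptrace.

In this way, the filter sends only the write system call to ptrace for our code to handle, while every other system call runs directly.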
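One way to double-check how such a filter behaves is to evaluate it by hand. The sketch below is a tiny user-space interpreter that understands only the three instruction forms used in this filter; the names eval_seccomp_sketch and demo_filter_action are made up for the illustration, and the kernel's real interpreter is far more complete. Evaluated this way, write yields SECCOMP_RET_TRACE and any other system call yields SECCOMP_RET_ALLOW.

```c
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>

/* Evaluate a filter built only from LD|W|ABS, JMP|JEQ|K and RET|K. */
uint32_t eval_seccomp_sketch(const struct sock_filter *prog, size_t len,
                             const struct seccomp_data *data) {
    uint32_t acc = 0;  /* the BPF accumulator */

    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *insn = &prog[pc];

        switch (insn->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            if (insn->k > sizeof(*data) - sizeof(acc))
                return SECCOMP_RET_KILL;
            memcpy(&acc, (const char *)data + insn->k, sizeof(acc));
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            /* jt/jf count instructions to skip after this one. */
            pc += (acc == insn->k) ? insn->jt : insn->jf;
            break;
        case BPF_RET | BPF_K:
            return insn->k;
        default:
            return SECCOMP_RET_KILL;  /* unsupported in this sketch */
        }
    }
    return SECCOMP_RET_KILL;
}

/* Run the article's four-instruction filter against a syscall number. */
uint32_t demo_filter_action(int nr) {
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_write, 0, 1),
        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRACE),
        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
    };
    struct seccomp_data data = { .nr = nr };

    return eval_seccomp_sketch(filter, sizeof(filter) / sizeof(filter[0]),
                               &data);
}
```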

Intercepting SVC Calls in Practice

For the full project, see https://github.com/birdmanwings/seccomp-test. Here is a simple example that tries to intercept a syscall:

#include <stddef.h>  // for offsetof
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>
#include <sys/prctl.h>
#include <linux/filter.h>
#include <linux/seccomp.h>
#include <elf.h>
#include <sys/uio.h>  // for struct iovec

#include "syscall_arm64.h"

void child() {
    struct sock_filter filter[] = {
        BPF_STMT(BPF_LD + BPF_W + BPF_ABS, offsetof(struct seccomp_data, nr)),
        BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, __NR_write, 0, 1),
        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_TRACE),
        BPF_STMT(BPF_RET + BPF_K, SECCOMP_RET_ALLOW),
    };
    struct sock_fprog prog = {
        .len = (unsigned short)(sizeof(filter)/sizeof(filter[0])),
        .filter = filter,
    };

    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
        perror("prctl(NO_NEW_PRIVS)");
        exit(1);
    }

    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &prog)) {
        perror("prctl(SECCOMP)");
        exit(1);
    }

    write(STDOUT_FILENO, "Hello, World!\n", 14);  // Traced: write matches the SECCOMP_RET_TRACE rule
    exit(0);
}

void parent(pid_t child_pid) {
    int status;
    struct iovec iov;
    struct user_pt_regs regs;
    int entryexit = 0;  // Tracks whether the next syscall stop is an entry or an exit

    iov.iov_base = &regs;
    iov.iov_len = sizeof(regs);

    while (waitpid(child_pid, &status, 0)) {
        if (WIFEXITED(status) || WIFSIGNALED(status))
            break;

        if (WIFSTOPPED(status)) {
            // The stop was caused by a system call
            if (status >> 8 == (SIGTRAP | 0x80)) {
                if (ptrace(PTRACE_GETREGSET, child_pid, NT_PRSTATUS, &iov) == -1) {
                    perror("ptrace");
                    exit(1);
                }
                // Use entryexit to tell syscall entry from syscall exit
                if (entryexit == 0) {
                    printf("System Call Entry: Number %llu\n", regs.regs[8]);
                    entryexit = 1;
                } else {
                    printf("System Call Exit: Number %llu\n", regs.regs[8]);
                    entryexit = 0;
                }
            }
            // Check if the stop was caused by a seccomp event
            else if ((status >> 8) == (SIGTRAP | (PTRACE_EVENT_SECCOMP << 8))) {
                printf("Parent: PTRACE_EVENT_SECCOMP event received.\n");
            }

            if (ptrace(PTRACE_SYSCALL, child_pid, 0, 0) == -1) {
                perror("ptrace");
                exit(1);
            }
        }
    }
}

int main() {
    pid_t child_pid = fork();
    if (child_pid == 0) {
        if (ptrace(PTRACE_TRACEME, 0, 0, 0) == -1) {
            perror("ptrace");
            exit(1);
        }
        kill(getpid(), SIGSTOP);  // Stop until the parent is ready to continue
        child();
    } else if (child_pid > 0) {
        int status;
        waitpid(child_pid, &status, 0);  // Wait for child to stop
        if (ptrace(PTRACE_SETOPTIONS, child_pid, 0, PTRACE_O_TRACESECCOMP | PTRACE_O_TRACESYSGOOD) == -1) {
            perror("ptrace");
            exit(1);
        }
        ptrace(PTRACE_SYSCALL, child_pid, 0, 0);  // Continue the child
        parent(child_pid);
    } else {
        perror("fork");
        exit(1);
    }
    return 0;
}

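In main, the parent forks the child.
The parent calls waitpid(child_pid, &status, 0); to wait for the child to stop.
The child marks itself with PTRACE_TRACEME and then stops itself with kill(getpid(), SIGSTOP);.
Once the parent sees that the child has stopped, it calls PTRACE_SETOPTIONS: PTRACE_O_TRACESECCOMP makes seccomp SECCOMP_RET_TRACE results stop the child with a PTRACE_EVENT_SECCOMP event, and PTRACE_O_TRACESYSGOOD marks syscall stops as SIGTRAP | 0x80 so they can be told apart from ordinary SIGTRAPs.
The parent resumes the child with ptrace(PTRACE_SYSCALL, child_pid, 0, 0);.
The child installs the BPF rule, which enables filtering on the write system call, and then calls write.
The parent loops, waiting for the child's stops and watching for seccomp events and syscall stops. I keep an entryexit variable here to observe the order in which syscall stops and the seccomp event arrive.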
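The status checks in the parent loop can be pulled out into a small pure function, which makes the bit layout easier to see. This is only a sketch: the enum and the function name classify_wait_status are invented here, and the comparisons are the same ones the example above performs on the value returned by waitpid.

```c
#include <signal.h>
#include <sys/ptrace.h>
#include <sys/wait.h>

enum stop_kind { STOP_EXITED, STOP_SYSCALL, STOP_SECCOMP, STOP_OTHER };

/* Classify a waitpid() status the way the tracer loop above does. */
enum stop_kind classify_wait_status(int status) {
    if (WIFEXITED(status) || WIFSIGNALED(status))
        return STOP_EXITED;
    if (!WIFSTOPPED(status))
        return STOP_OTHER;

    /* With PTRACE_O_TRACESYSGOOD, syscall stops carry SIGTRAP | 0x80. */
    if ((status >> 8) == (SIGTRAP | 0x80))
        return STOP_SYSCALL;

    /* ptrace events put the event code above the signal number. */
    if ((status >> 8) == (SIGTRAP | (PTRACE_EVENT_SECCOMP << 8)))
        return STOP_SECCOMP;

    return STOP_OTHER;
}
```

Keeping this logic in one place also avoids mixing it up with WSTOPSIG(), which masks off the event bits.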

One additional point: a system call has two ptrace stages, syscall-entry-stop and syscall-exit-stop. The PTRACE_EVENT_SECCOMP stop raised when a seccomp rule is triggered falls between syscall-entry-stop and syscall-exit-stop from Linux 4.8 onward, whereas from Linux 3.5 through 4.7 it occurs before syscall-entry-stop. See the ptrace manual page for details.
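A tracer that wants to support both orderings has to know which side of 4.8 it is running on. A minimal sketch of such a check, assuming the release string has the usual "major.minor..." form; the function name seccomp_stop_follows_entry_stop is hypothetical.

```c
#include <stdio.h>

/* Returns 1 for kernels 4.8 and later (PTRACE_EVENT_SECCOMP comes after
 * syscall-entry-stop), 0 for older kernels, -1 if the string cannot be
 * parsed. */
int seccomp_stop_follows_entry_stop(const char *release) {
    int major = 0, minor = 0;

    if (sscanf(release, "%d.%d", &major, &minor) != 2)
        return -1;
    return major > 4 || (major == 4 && minor >= 8);
}
```

At run time the release string would typically come from the release field filled in by uname(2).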

Next we compile the program, push it to an Android phone, and run it. First, here is the kernel version of my Xiaomi test device:

# cat /proc/version
Linux version 4.19.157-perf-gf8cdf943b2b3 (builder@pangu-build-component-vendor-96687-7t4pv-tshlp-g02mt) (clang version 10.0.7 for Android NDK, GNU ld (binutils-2.27-bd24d23f) 2.27.0.20170315) #1 SMP PREEMPT Wed Jun 7 08:25:17 UTC 2023

The kernel is Linux 4.19. In the run output, write shows up with syscall number 64: first comes the SIGTRAP | 0x80 stop for the syscall enter stage, then PTRACE_EVENT_SECCOMP, and finally the SIGTRAP | 0x80 stop for the syscall exit stage.

How PRoot Works

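Based on the theory above, we can intercept syscall invocations: ptrace can modify the registers at different points in time, which lets us change arguments and return values. Many details remain, but fortunately there is a fairly mature project to learn from: PRoot. PRoot is a user-space implementation of chroot, mount --bind, and binfmt_misc. Users need no system privileges to set up a new root file system in an arbitrary directory and then do anything inside it. With QEMU user-mode, it can even run programs built for other CPU architectures.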

At its core, PRoot is built on the ptrace mechanism. Let's walk through its main flow:

main

Start with the main function in cli.c:

int main(int argc, char *const argv[])
{
	Tracee *tracee;
	int status;

	/* Configure the memory allocator.  */
	talloc_enable_leak_report();

#if defined(TALLOC_VERSION_MAJOR) && TALLOC_VERSION_MAJOR >= 2
	talloc_set_log_stderr();
#endif

	/* Pre-create the first tracee (pid == 0).  */
	tracee = get_tracee(NULL, 0, true);
	if (tracee == NULL)
		goto error;
	tracee->pid = getpid();

	/* Pre-configure the first tracee.  */
	status = parse_config(tracee, argc, argv);
	if (status < 0)
		goto error;

	/* Start the first tracee.  */
	status = launch_process(tracee, &argv[status]);
	if (status < 0) {
		print_execve_help(tracee, tracee->exe, status);
		goto error;
	}

	/* Start tracing the first tracee and all its children.  */
	exit(event_loop());

error:
	TALLOC_FREE(tracee);

	if (exit_failure) {
		fprintf(stderr, "fatal error: see `%s --help`.\n", basename(argv[0]));
		exit(EXIT_FAILURE);
	}
	else
		exit(EXIT_SUCCESS);
}

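The parent creates the first tracee and binds it to the launched process in launch_process. Every later child process is likewise bound to a tracee, which hangs under the first tracee for management. Tracee is the structure used to manage a process; here is a quick look at it: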


/* Information related to a tracee process. */
typedef struct tracee {
	/**********************************************************************
	 * Private resources                                                  *
	 **********************************************************************/
	/* Link for the doubly-linked list of tracees.  */
	LIST_ENTRY(tracee) link;

	/* Process identifier.  */
	pid_t pid;

	/* Unique tracee identifier.  */
	uint64_t vpid;

	/* Is it currently running?  */
	bool running;

	/* Is this tracee ready to be freed?  */
	bool terminated;

        /* Does terminating this tracee imply the immediate termination of all tracees?  */
        bool killall_on_exit;

	/* Parent of this tracee.  */
	struct tracee *parent;

	/* Is it a "clone", i.e. does it have the same parent as its creator?  */
	bool clone;

	/* Support for ptrace emulation inside the sandbox (tracer side).  */
	struct {
		size_t nb_ptracees;
		LIST_HEAD(zombies, tracee) zombies;

		pid_t wait_pid;
		word_t wait_options;

		enum {
			DOESNT_WAIT = 0,
			WAITS_IN_KERNEL,
			WAITS_IN_PROOT
		} waits_in;
	} as_ptracer;

	/* Support for ptrace emulation inside the sandbox (tracee side).  */
	struct {
		struct tracee *ptracer;

		struct {
			#define STRUCT_EVENT struct { int value; bool pending; }

			STRUCT_EVENT proot;
			STRUCT_EVENT ptracer;
		} event4;

		bool tracing_started;
		bool ignore_loader_syscalls;
		bool ignore_syscalls;
		word_t options;
		bool is_zombie;
	} as_ptracee;

	/* Current status:
	 *        0: enter syscall
	 *        1: exit syscall no error 
	 *   -errno: exit syscall with error.  */
	int status;

#define IS_IN_SYSENTER(tracee) ((tracee)->status == 0)
#define IS_IN_SYSEXIT(tracee) (!IS_IN_SYSENTER(tracee))
#define IS_IN_SYSEXIT2(tracee, sysnum) (IS_IN_SYSEXIT(tracee) \
				     && get_sysnum((tracee), ORIGINAL) == sysnum)

	/* How this tracee is restarted.  */
	PTRACE_REQUEST_TYPE restart_how;

	/* Values of the tracee's general-purpose registers.  */
	struct user_regs_struct _regs[NB_REG_VERSION];
	bool _regs_were_changed;
	bool restore_original_regs;

	/* State for the special handling of SIGSTOP.  */
	enum {
		SIGSTOP_IGNORED = 0,  /* Ignore SIGSTOP (once the parent is known).  */
		SIGSTOP_ALLOWED,      /* Allow SIGSTOP (once the parent is known).   */
		SIGSTOP_PENDING,      /* Block SIGSTOP until the parent is unknown.  */
	} sigstop;

	/* Context used to collect all the temporary dynamic memory allocations.  */
	TALLOC_CTX *ctx;

	/* Context used to collect all dynamic memory allocations that should be released once this tracee is freed.  */
	TALLOC_CTX *life_context;

	/* Specify the type of the final component during the
	 * initialization of a binding.  This variable is first
	 * defined in bind_path() then used in build_glue().  */
	mode_t glue_type;

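	/* During a sub-reconfiguration, the new setup is relative to @tracee's file-system name-space. Also, @paths holds its $PATH environment variable in order to emulate the execvp(3) behavior.  */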
	struct {
		struct tracee *tracee;
		const char *paths;
	} reconf;

	/* Unrequested syscalls inserted by PRoot after an actual syscall; together they form a chain of syscalls.  */
	struct {
		struct chained_syscalls *syscalls;
		bool force_final_result;
		word_t final_result;
	} chain;

	/* Load information needed while the program is loaded and run.  */
	struct load_info *load_info;
	/**********************************************************************
	 * Private but inherited resources                                    *
	 **********************************************************************/

	/* Verbosity level of debug messages.  */
	int verbose;

	/* State of the seccomp acceleration for this tracee.  */
	enum { DISABLED = 0, DISABLING, ENABLED } seccomp;

	/* Ensure the sysexit stage is always hit under seccomp.  */
	bool sysexit_pending;


	/**********************************************************************
	 * Shared or private resources, depending on the CLONE_FS/VM flags.   *
	 **********************************************************************/

	/* Information related to the file-system name-space.  */
	FileSystemNameSpace *fs;

	/* Virtual heap, emulated with a regular memory mapping.  */
	Heap *heap;


	/**********************************************************************
	 * Shared resources until the tracee makes a call to execve().        *
	 **********************************************************************/

	/* Path to the executable.  */
	char *exe;
	char *new_exe;


	/**********************************************************************
	 * Shared or private resources, depending on the (re-)configuration   *
	 **********************************************************************/

	/* Runner command-line.  */
	char **qemu;

	/* Path used to glue the guest rootfs and the host rootfs together.  */
	const char *glue;

	/* List of extensions enabled for this tracee.  */
	struct extensions *extensions;


	/**********************************************************************
	 * Shared but read-only resources                                     *
	 **********************************************************************/

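	/* For the mixed mode, the guest LD_LIBRARY_PATH is saved during the "guest -> host" transition, in order to be restored during the "host -> guest" transition (only if the host LD_LIBRARY_PATH hasn't changed).  */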
	const char *host_ldso_paths;
	const char *guest_ldso_paths;

	/* For diagnostic purposes.  */
	const char *tool_name;

} Tracee;

Child process

Let's first look at what the child process does in launch_process:

int launch_process(Tracee *tracee, char *const argv[])
{
	char *const default_argv[] = { "-sh", NULL };
	long status;
	pid_t pid;

	/* Warn about open file descriptors. They won't be
	 * translated until they are closed. */
	list_open_fd(tracee);

	pid = fork();
	switch(pid) {
	case -1:
		note(tracee, ERROR, SYSTEM, "fork()");
		return -errno;

	case 0: /* child */
		/* Declare myself as ptraceable before executing the
		 * requested program. */
		status = ptrace(PTRACE_TRACEME, 0, NULL, NULL);
		if (status < 0) {
			note(tracee, ERROR, SYSTEM, "ptrace(TRACEME)");
			return -errno;
		}

		/* Synchronize with the tracer's event loop.  Without
		 * this trick the tracer only sees the "return" from
		 * the next execve(2) so PRoot wouldn't handle the
		 * interpreter/runner.  I also verified that strace
		 * does the same thing. */
		kill(getpid(), SIGSTOP);

		/* Improve performance by using seccomp mode 2, unless
		 * this support is explicitly disabled.  */
		if (getenv("PROOT_NO_SECCOMP") == NULL)
			(void) enable_syscall_filtering(tracee);

		/* Now process is ptraced, so the current rootfs is already the
		 * guest rootfs.  Note: Valgrind can't handle execve(2) on
		 * "foreign" binaries (ENOEXEC) but can handle execvp(3) on such
		 * binaries.  */
		execvp(tracee->exe, argv[0] != NULL ? argv : default_argv);
		return -errno;

	default: /* parent */
		/* We know the pid of the first tracee now.  */
		tracee->pid = pid;
		return 0;
	}

	/* Never reached.  */
	return -ENOSYS;
}

The child calls status = ptrace(PTRACE_TRACEME, 0, NULL, NULL); and then sends itself SIGSTOP with kill(getpid(), SIGSTOP); to tell the parent that it can be traced. It then configures and enables seccomp-BPF in enable_syscall_filtering:

int enable_syscall_filtering(const Tracee *tracee)
{
	FilteredSysnum *filtered_sysnums = NULL;
	Extension *extension;
	int status;

	assert(tracee != NULL && tracee->ctx != NULL);

	/* Add the sysnums required by PRoot to the list of filtered
	 * sysnums.  TODO: only if path translation is required.  */
	status = merge_filtered_sysnums(tracee->ctx, &filtered_sysnums, proot_sysnums);
	if (status < 0)
		return status;

	/* Merge the sysnums required by the extensions to the list
	 * of filtered sysnums.  */
	if (tracee->extensions != NULL) {
		LIST_FOREACH(extension, tracee->extensions, link) {
			if (extension->filtered_sysnums == NULL)
				continue;

			status = merge_filtered_sysnums(tracee->ctx, &filtered_sysnums,
							extension->filtered_sysnums);
			if (status < 0)
				return status;
		}
	}

	status = set_seccomp_filters(filtered_sysnums);
	if (status < 0)
		return status;

	return 0;
}

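set_seccomp_filters installs different BPF rules for different architectures. Note entries such as { PR_accept, FILTER_SYSEXIT }: the FILTER_SYSEXIT flag is used later to mark whether the syscall_exit stage needs to be handled, which is the point at which a return value can be modified.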

status = prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0);
if (status < 0)
  goto end;

/* To output this BPF program for debug purpose:
 *
 *     write(2, program.filter, program.len * sizeof(struct sock_filter));
 */

status = prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &program);
if (status < 0)
  goto end;

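The first prctl call uses PR_SET_NO_NEW_PRIVS to ensure that the process cannot gain new privileges once the filter is applied.
The second prctl call uses PR_SET_SECCOMP with SECCOMP_MODE_FILTER to install the constructed BPF program as the seccomp filter.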
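A per-syscall flag such as FILTER_SYSEXIT can reach the tracer because a SECCOMP_RET_TRACE return value carries 16 bits of data (SECCOMP_RET_DATA), which the tracer reads back with PTRACE_GETEVENTMSG. The sketch below shows only that encoding; the helper names and the value of DEMO_FILTER_SYSEXIT are assumptions made for this illustration, not PRoot's definitions.

```c
#include <linux/seccomp.h>
#include <stdint.h>

/* Hypothetical flag value, chosen for this sketch only. */
#define DEMO_FILTER_SYSEXIT 0x1u

/* Build the value a BPF_RET instruction would return for a traced syscall:
 * the action in the upper bits, the flags in the low 16 data bits. */
uint32_t make_trace_action(uint16_t flags) {
    return SECCOMP_RET_TRACE | (flags & SECCOMP_RET_DATA);
}

/* What the tracer would later obtain through PTRACE_GETEVENTMSG. */
unsigned long event_msg_of(uint32_t action) {
    return action & SECCOMP_RET_DATA;
}
```

In a real filter the result of make_trace_action would be the constant of a BPF_STMT(BPF_RET + BPF_K, ...) instruction placed behind the comparison for the corresponding syscall number.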

Once this is set up, execvp(tracee->exe, argv[0] != NULL ? argv : default_argv); runs the requested command.

Parent process

Back in main, the parent's logic lives mainly in the event_loop function. Following it:

while (1) {
  int tracee_status;
  Tracee *tracee;
  int signal;
  pid_t pid;

  /* This is the only safe place to free tracees.  */
  free_terminated_tracees();

  /* Wait for the next tracee's stop. */
  pid = waitpid(-1, &tracee_status, __WALL);
  if (pid < 0) {
    if (errno != ECHILD) {
      note(NULL, ERROR, SYSTEM, "waitpid()");
      return EXIT_FAILURE;
    }
    break;
  }

  /* Get information about this tracee. */
  tracee = get_tracee(NULL, pid, true);
  assert(tracee != NULL);

  tracee->running = false;

  VERBOSE(tracee, 6, "vpid %" PRIu64 ": got event %x",
    tracee->vpid, tracee_status);

  status = notify_extensions(tracee, NEW_STATUS, tracee_status, 0);
  if (status != 0)
    continue;

  if (tracee->as_ptracee.ptracer != NULL) {
    bool keep_stopped = handle_ptracee_event(tracee, tracee_status);
    if (keep_stopped)
      continue;
  }

  signal = handle_tracee_event(tracee, tracee_status);
  (void) restart_tracee(tracee, signal);
}

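Inside the infinite while loop, waitpid waits for an event from a child, handle_tracee_event processes the tracee's state, and restart_tracee resumes the child. We ignore PRoot's extension mechanism here. Let's continue with how the parent, that is, the tracer, handles events: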

if (is_kernel_4_8())
  return handle_tracee_event_kernel_4_8(tracee, tracee_status);

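As we can see, kernels before and after 4.8 are handled differently, because, as mentioned above, the order of syscall_enter, syscall_exit, and PTRACE_EVENT_SECCOMP changed in 4.8. Let's first look at the logic for kernels older than 4.8, focusing on how stop signals are handled: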

else if (WIFSTOPPED(tracee_status)) {
/* Don't use WSTOPSIG() to extract the signal
 * since it clears the PTRACE_EVENT_* bits. */
signal = (tracee_status & 0xfff00) >> 8;

switch (signal) {
  static bool deliver_sigtrap = false;

case SIGTRAP: {
  const unsigned long default_ptrace_options = (
    PTRACE_O_TRACESYSGOOD	|
    PTRACE_O_TRACEFORK	|
    PTRACE_O_TRACEVFORK	|
    PTRACE_O_TRACEVFORKDONE	|
    PTRACE_O_TRACEEXEC	|
    PTRACE_O_TRACECLONE	|
    PTRACE_O_TRACEEXIT);

  /* Distinguish some events from others and
   * automatically trace each new process with
   * the same options.
   *
   * Note that only the first bare SIGTRAP is
   * related to the tracing loop, others SIGTRAP
   * carry tracing information because of
   * TRACE*FORK/CLONE/EXEC.  */
  if (deliver_sigtrap)
    break;  /* Deliver this signal as-is.  */

  deliver_sigtrap = true;

  /* Try to enable seccomp mode 2...  */
  status = ptrace(PTRACE_SETOPTIONS, tracee->pid, NULL,
      default_ptrace_options | PTRACE_O_TRACESECCOMP);
  if (status < 0) {
    /* ... otherwise use default options only.  */
    status = ptrace(PTRACE_SETOPTIONS, tracee->pid, NULL,
        default_ptrace_options);
    if (status < 0) {
      note(tracee, ERROR, SYSTEM, "ptrace(PTRACE_SETOPTIONS)");
      exit(EXIT_FAILURE);
    }
  }
}

  /* Fall through. */
case SIGTRAP | 0x80:
  signal = 0;

  /* This tracee got signaled then freed during the
     sysenter stage but the kernel reports the sysexit
     stage; just discard this spurious tracee/event.  */
  if (tracee->exe == NULL) {
    tracee->restart_how = PTRACE_CONT; /* SYSCALL OR CONT */
    return 0;
  }

  switch (tracee->seccomp) {
  case ENABLED:
    if (IS_IN_SYSENTER(tracee)) {
      /* sysenter: ensure the sysexit
       * stage will be hit under seccomp.  */
      tracee->restart_how = PTRACE_SYSCALL;
      tracee->sysexit_pending = true;
    }
    else {
      /* sysexit: the next sysenter
       * will be notified by seccomp.  */
      tracee->restart_how = PTRACE_CONT;
      tracee->sysexit_pending = false;
    }
    /* Fall through.  */
  case DISABLED:
    translate_syscall(tracee);

    /* This syscall has disabled seccomp.  */
    if (tracee->seccomp == DISABLING) {
      tracee->restart_how = PTRACE_SYSCALL;
      tracee->seccomp = DISABLED;
    }

    break;

  case DISABLING:
    /* Seccomp was disabled by the
     * previous syscall, but its sysenter
     * stage was already handled.  */
    tracee->seccomp = DISABLED;
    if (IS_IN_SYSENTER(tracee))
      tracee->status = 1;
    break;
  }
  break;

case SIGTRAP | PTRACE_EVENT_SECCOMP2 << 8:
case SIGTRAP | PTRACE_EVENT_SECCOMP << 8: {
  unsigned long flags = 0;

  signal = 0;

  if (!seccomp_detected) {
    VERBOSE(tracee, 1, "ptrace acceleration (seccomp mode 2) enabled");
    tracee->seccomp = ENABLED;
    seccomp_detected = true;
  }

  /* Use the common ptrace flow if seccomp was
   * explicitely disabled for this tracee.  */
  if (tracee->seccomp != ENABLED)
    break;

  status = ptrace(PTRACE_GETEVENTMSG, tracee->pid, NULL, &flags);
  if (status < 0)
    break;

  /* Use the common ptrace flow when
   * sysexit has to be handled.  */
  if ((flags & FILTER_SYSEXIT) != 0) {
    tracee->restart_how = PTRACE_SYSCALL;
    break;
  }

  /* Otherwise, handle the sysenter
   * stage right now.  */
  tracee->restart_how = PTRACE_CONT;
  translate_syscall(tracee);

  /* This syscall has disabled seccomp, so move
   * the ptrace flow back to the common path to
   * ensure its sysexit will be handled.  */
  if (tracee->seccomp == DISABLING)
    tracee->restart_how = PTRACE_SYSCALL;
  break;
}

The first case SIGTRAP: sets PTRACE_O_TRACESECCOMP the first time a signal is received. When a BPF rule actually fires, the stops occur in this order:

PTRACE_EVENT_SECCOMP -> syscall_enter_stop -> syscall executes -> syscall_exit_stop

So the case that is hit first is case SIGTRAP | PTRACE_EVENT_SECCOMP2 << 8: case SIGTRAP | PTRACE_EVENT_SECCOMP << 8:, whose core logic is:

status = ptrace(PTRACE_GETEVENTMSG, tracee->pid, NULL, &flags);
if (status < 0)
  break;

/* Use the common ptrace flow when
 * sysexit has to be handled.  */
if ((flags & FILTER_SYSEXIT) != 0) {
  tracee->restart_how = PTRACE_SYSCALL;
  break;
}

/* Otherwise, handle the sysenter
 * stage right now.  */
tracee->restart_how = PTRACE_CONT;
translate_syscall(tracee);

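The code fetches flags, which tells us whether FILTER_SYSEXIT was set as described above. If it was not set, the return value does not need handling, so restart_how is set to PTRACE_CONT and the enter-stage work, typically argument rewriting, is done right away in translate_syscall. When restart_tracee then resumes the tracee, the syscall_exit stage is skipped entirely and the tracee runs until the next seccomp rule is hit, so case SIGTRAP | 0x80 is never reached.

In the second case FILTER_SYSEXIT must be handled, so restart_how is set to PTRACE_SYSCALL and the handler returns immediately; the resumed child then stops at each syscall stage. At case SIGTRAP | 0x80, IS_IN_SYSENTER(tracee) is true, so restart_how stays PTRACE_SYSCALL and control falls through to case DISABLED:, where translate_syscall handles the syscall_enter_stop stage. After the restart the 0x80 case fires again; this time IS_IN_SYSENTER is false, so restart_how becomes PTRACE_CONT, letting the child run until the next BPF rule fires, and control again falls through to DISABLED to call translate_syscall for the exit stage: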

switch (tracee->seccomp) {
			case ENABLED:
				if (IS_IN_SYSENTER(tracee)) {
					/* sysenter: ensure the sysexit
					 * stage will be hit under seccomp.  */
					tracee->restart_how = PTRACE_SYSCALL;
					tracee->sysexit_pending = true;
				}
				else {
					/* sysexit: the next sysenter
					 * will be notified by seccomp.  */
					tracee->restart_how = PTRACE_CONT;
					tracee->sysexit_pending = false;
				}
				/* Fall through.  */
			case DISABLED:
				translate_syscall(tracee);

				/* This syscall has disabled seccomp.  */
				if (tracee->seccomp == DISABLING) {
					tracee->restart_how = PTRACE_SYSCALL;
					tracee->seccomp = DISABLED;
				}

				break;

			case DISABLING:
				/* Seccomp was disabled by the
				 * previous syscall, but its sysenter
				 * stage was already handled.  */
				tracee->seccomp = DISABLED;
				if (IS_IN_SYSENTER(tracee))
					tracee->status = 1;
				break;
			}
			break;
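The restart decisions described above can be condensed into a small model. This is a simplified sketch of the pre-4.8 flow, not PRoot's code: it ignores the DISABLING state and the actual work done in translate_syscall, and every name in it, including the value of DEMO_FILTER_SYSEXIT, is made up for the illustration.

```c
enum resume_how { RESUME_CONT, RESUME_SYSCALL };

/* Hypothetical flag value, chosen for this sketch only. */
#define DEMO_FILTER_SYSEXIT 0x1ul

struct mini_tracee {
    int in_sysenter;  /* 1 while the next syscall stop is the enter stage */
};

/* PTRACE_EVENT_SECCOMP: single-step through the syscall only when its
 * exit stage has to be handled, otherwise run to the next seccomp event. */
enum resume_how on_seccomp_event(unsigned long flags) {
    return (flags & DEMO_FILTER_SYSEXIT) ? RESUME_SYSCALL : RESUME_CONT;
}

/* SIGTRAP | 0x80: the enter stage asks for one more stop (the exit stage),
 * the exit stage hands control back to seccomp. */
enum resume_how on_syscall_stop(struct mini_tracee *t) {
    if (t->in_sysenter) {
        t->in_sysenter = 0;  /* the next stop is the exit stage */
        return RESUME_SYSCALL;
    }
    t->in_sysenter = 1;      /* back to waiting for an enter stage */
    return RESUME_CONT;
}
```

For a syscall flagged with DEMO_FILTER_SYSEXIT the sequence of decisions is RESUME_SYSCALL (seccomp event), RESUME_SYSCALL (enter stop), RESUME_CONT (exit stop); for an unflagged one it is a single RESUME_CONT.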

Now let's look at how kernels 4.8 and later are handled:

switch (signal) {
  static bool deliver_sigtrap = false;

case SIGTRAP: {
  const unsigned long default_ptrace_options = (
    PTRACE_O_TRACESYSGOOD	|
    PTRACE_O_TRACEFORK	|
    PTRACE_O_TRACEVFORK	|
    PTRACE_O_TRACEVFORKDONE	|
    PTRACE_O_TRACEEXEC	|
    PTRACE_O_TRACECLONE	|
    PTRACE_O_TRACEEXIT);

  /* Distinguish some events from others and
   * automatically trace each new process with
   * the same options.
   *
   * Note that only the first bare SIGTRAP is
   * related to the tracing loop, others SIGTRAP
   * carry tracing information because of
   * TRACE*FORK/CLONE/EXEC.  */
  if (deliver_sigtrap)
    break;  /* Deliver this signal as-is.  */

  deliver_sigtrap = true;

  /* Try to enable seccomp mode 2...  */
  status = ptrace(PTRACE_SETOPTIONS, tracee->pid, NULL,
      default_ptrace_options | PTRACE_O_TRACESECCOMP);
  if (status < 0) {
    seccomp_enabled = false;
    /* ... otherwise use default options only.  */
    status = ptrace(PTRACE_SETOPTIONS, tracee->pid, NULL,
        default_ptrace_options);
    if (status < 0) {
      note(tracee, ERROR, SYSTEM, "ptrace(PTRACE_SETOPTIONS)");
      exit(EXIT_FAILURE);
    }
  }
  else {
    if (getenv("PROOT_NO_SECCOMP") == NULL)
      seccomp_enabled = true;
  }
}
  /* Fall through. */
case SIGTRAP | PTRACE_EVENT_SECCOMP2 << 8:
case SIGTRAP | PTRACE_EVENT_SECCOMP << 8:

  if (!seccomp_detected && seccomp_enabled) {
    VERBOSE(tracee, 1, "ptrace acceleration (seccomp mode 2) enabled");
    tracee->seccomp = ENABLED;
    seccomp_detected = true;
  }

  if (signal == (SIGTRAP | PTRACE_EVENT_SECCOMP2 << 8) ||
      signal == (SIGTRAP | PTRACE_EVENT_SECCOMP << 8)) {

    unsigned long flags = 0;
    signal = 0;

    /* Use the common ptrace flow if seccomp was
     * explicitly disabled for this tracee.  */
    if (tracee->seccomp != ENABLED)
      break;

    status = ptrace(PTRACE_GETEVENTMSG, tracee->pid, NULL, &flags);
    if (status < 0)
      break;

    if ((flags & FILTER_SYSEXIT) == 0) {
      tracee->restart_how = PTRACE_CONT;
      translate_syscall(tracee);

      if (tracee->seccomp == DISABLING)
        tracee->restart_how = PTRACE_SYSCALL;
      break;
    }
  }

  /* Fall through. */
case SIGTRAP | 0x80:

  signal = 0;

  /* This tracee got signaled then freed during the
     sysenter stage but the kernel reports the sysexit
     stage; just discard this spurious tracee/event.  */

  if (tracee->exe == NULL) {
    tracee->restart_how = PTRACE_CONT; /* SYSCALL OR CONT */
    return 0;
  }

  switch (tracee->seccomp) {
  case ENABLED:
    if (IS_IN_SYSENTER(tracee)) {
      /* sysenter: ensure the sysexit
       * stage will be hit under seccomp.  */
      tracee->restart_how = PTRACE_SYSCALL;
      tracee->sysexit_pending = true;
    }
    else {
      /* sysexit: the next sysenter
       * will be notified by seccomp.  */
      tracee->restart_how = PTRACE_CONT;
      tracee->sysexit_pending = false;
    }
    /* Fall through.  */
  case DISABLED:
    translate_syscall(tracee);

    /* This syscall has disabled seccomp.  */
    if (tracee->seccomp == DISABLING) {
      tracee->restart_how = PTRACE_SYSCALL;
      tracee->seccomp = DISABLED;
    }

    break;

  case DISABLING:
    /* Seccomp was disabled by the
     * previous syscall, but its sysenter
     * stage was already handled.  */
    tracee->seccomp = DISABLED;
    if (IS_IN_SYSENTER(tracee))
      tracee->status = 1;
    break;
  }
  break;

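As before, the first SIGTRAP is handled first. The stops are ordered as: syscall_enter -> PTRACE_EVENT_SECCOMP -> syscall_exit

Because the tracee runs under seccomp-BPF, the first case hit is case SIGTRAP | PTRACE_EVENT_SECCOMP2 << 8: case SIGTRAP | PTRACE_EVENT_SECCOMP << 8:. When FILTER_SYSEXIT is not required, restart_how is set to PTRACE_CONT and translate_syscall(tracee); performs the enter-stage work; after the restart the tracee runs until the next BPF rule fires.

If FILTER_SYSEXIT is required, control falls through to case SIGTRAP | 0x80: and executes: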

case ENABLED:
				if (IS_IN_SYSENTER(tracee)) {
					/* sysenter: ensure the sysexit
					 * stage will be hit under seccomp.  */
					tracee->restart_how = PTRACE_SYSCALL;
					tracee->sysexit_pending = true;
				}

Control then falls through to the following case, where translate_syscall processes the arguments before the call:

case DISABLED:
				translate_syscall(tracee);

				/* This syscall has disabled seccomp.  */
				if (tracee->seccomp == DISABLING) {
					tracee->restart_how = PTRACE_SYSCALL;
					tracee->seccomp = DISABLED;
				}

				break;

After the restart, PTRACE_SYSCALL triggers case SIGTRAP | 0x80: again, and this time the else branch is taken:

else {
					/* sysexit: the next sysenter
					 * will be notified by seccomp.  */
					tracee->restart_how = PTRACE_CONT;
					tracee->sysexit_pending = false;
				}

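This goes on to handle the syscall_exit stage. The handling of each individual syscall can be found in the source and is not repeated here; the core logic of the whole flow is now clear.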

Summary and Reflections

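The PRoot code base is already quite robust, and with minor changes it can be ported to Android; I may develop this further when time permits. As Android versions move forward, seccomp-BPF becomes increasingly available, and the kernels shipped with mainstream Android versions already support it: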

Android version	API level	Kernel versions in AOSP	Kernel header version
8.1 Oreo	27	3.18.70 4.4.88 4.9.56	4.10.0
9.0 Pie	28	4.4.146 4.9.118 4.14.61	4.15.0
10.0 Q	29	4.9.191 4.14.142 4.19.71	5.0.3

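However, since Chinese phone vendors may heavily customize their kernels, real production use still requires careful consideration. For an attacker, though, the technique is already mature enough to exploit. Across these two articles I have tried to lay out the core of the boundary confrontation, and I hope it prompts deeper thinking about the essence of security and risk control.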