Whether you need to implement a kernel rootkit or inspect syscalls for intrusion detection, in a lot of cases, you might need to hijack syscall in a kernel module. This post summorizes detailed procedures and provides a working example for both x86_64 and aarch64 architectures on recent kernel versions. All the code can be found at . Happy hacking~
1. Syscall hijacking
There are different ways to hijack syscall as summerized by . The essense is to modify the sys_call_table within the kernel to overwrite the original address of certain syscall to be the one implemented by yourself. Here we use kallsyms_lookup_name to find the location of sys_call_table. However, 2 more things (or maybe 3 depending on the architecture and we will talk about that later) need to be considered. First, is the page of sys_call_table writable? Recent kernels have enforced read-only (RO) on text pages. So we need to make the page writable again (RW) in our kernel module. Second, SMP environment require us to synchronize the sys_call_table modification with all cores. This can be achieved by disabling preemption.
2. Hijacking read syscall
Once we hijack a certain syscall, we are able to see all the parameters from the user space. For example, we are able to see the file discripter (FD), user buffer, and number of bytes (count) within the read syscall. The real meat of syscall hijacking comes from what we could do using these parameters. As a proof-of-concept (PoC), we trace back the file name from FD and prevent users from reading the specific file by returning something else. In our implementations, we stop users from reading the README.md file (yup) and return bunch of 7s. The good news is we limit our target process to be the testing procedure instead of any process. Since syscall happens within the process context, “current” is always available. Accordingly, intrusion detection, system profiling, and etc are made possible thanks to different syscall parameters.
3. Architecture difference
Architecture makes a difference. Intel has a control bit within CR0 to write-protect the read-only memory on x86_64. As a result, besides adding the W permission to the sys_call_table page, we also need to disable the write protection within CR0. ARM, on the other hand, does not have this constraint. On the aarch64 board with kernel 4.4 that I used, the text page also allows for write.
Nevertheless, in case of page write protection, we will have to need to implemement set_memory_rw and set_memory_ro (for recovery) by ourselves, because none of these functions is exported to kernel modules . Essentially, we need to call apply_to_page_range and implement flash_tlb_kernel_range within our kernel module. This also reminds me a potential bug within the current x86_64 implementation, where a TLB flush should be needed after we update the PTE to synchronize other CPU cores by triggering IPIs.