Ubuntu Kernel Build Again

I wrote two blog posts about Linux kernel build on Ubuntu [1,2]. There is also an official wiki page talking about the same thing [3]. Still, things are broken when I try to create a homework assignment for my class. This post is about how to correctly, easily and quickly build Linux kernel on Ubuntu 16.04 and 18.04. This whole process is used and verified by myself. Happy hacking.

1. Install building dependencies

sudo apt-get build-dep linux linux-image-$(uname -r)
sudo apt-get install libncurses-dev flex bison openssl libssl-dev dkms libelf-dev libudev-dev libpci-dev libiberty-dev autoconf

2. Fetch the Linux kernel source

Make sure /etc/apt/sources.list contains at least these 2 source entries (e.g., uncommenting). Note that the CODENAME is “xenial” for 16.04 and “bionic” for 18.04. Once the source entries are there, update the list file and fetch the kernel source.

deb-src http://us.archive.ubuntu.com/ubuntu CODENAME main
deb-src http://us.archive.ubuntu.com/ubuntu CODENAME-updates main
sudo apt update
sudo apt-get source linux-image-unsigned-$(uname -r)

Note that unlike the typical “linux-image-$(uname -r)”, there is “unsigned” added to fetch the kernel source to work around the bug caused by the kernel signing [4]. You might also wonder why I am not doing git directly. The reason is simple: the apt-get approach gives the exact kernel source files that your system is running now, while git usually provides the most recent version (e.g., master) for certain distro versions. Unless you need to play with cutting-edge features from the most recent kernels, it is wise to stick with the current kernel that proves working on your system. The other benefit of this approach is about generating the config – we could simply reuse the existing config without any trouble.

3. Build the kernel

I do not use the official Ubuntu kernel build procedure [3], which is tedious and moreover does not support incremental build. Note that X should be the kernel version you have downloaded (apt-get), and Y could be the number of cores available in your system for a parallel build.

cd linux-hwe-X
sudo make oldconfig
sudo make -jY bindeb-pkg

4. Install the new kernel

cd ..
sudo dpkg -i linux*X*.deb
sudo reboot

5. Sign kernel modules (optional)

If you are working with Ubuntu 16.04, it is likely that you do not need to deal with kernel module signing. But in case you need to, take a look at [6]. On Ubuntu 18.04, Lockdown is enabled by default to prevent the kernel from loading unsigned kernel modules. A quick and dirty fix is to disable Lockdown to load modules [7]:

sudo bash -c 'echo 1 > /proc/sys/kernel/sysrq'
sudo bash -c 'echo x > /proc/sysrq-trigger'


6. Resize VM disk (optional)

To build a kernel, a VM might need at least 32GB space (tested on Ubuntu 16.04, ). qemu-img is a convenient command to resize your VM image size.

sudo qemu-img resize /PATH_TO_YOUR_VM_IMG_FILE +20G
sudo qemu-img resize --shrink /PATH_TO_YOUR_VM_IMG_FILE -20G
sudo qemu-img info /PATH_TO_YOUR_VM_IMG_FILE

Note that the actual image size change cannot be seen from the host machine using e.g., fdisk. Instead, use qemu-img info to confirm the difference between “virtual size” and “disk size”. The former is changed by the resizing and the later has to be changed as well. We need to boot into the VM and grow the partition.

Once we are inside the VM, use lsblk to confirm the size that we have just grown. In the figure down below, the whole /dev/vda size is 40G but all the partitions only take 20G. We can grow the root partition (/) using parted. Unfortunately, it did not work as you can find in the figure.


Why? Because we could not grow a partition which is not the last. Instead, we need to delete vda5 and vda2 if we wanna grow vda1. Again, parted is your friend here [5]. After we are able to append the extra 20G to the root partition (vda1), we need to fix the partition and resize the filesystem accordingly:

sudo apt-get install cloud-guest-utils
sudo growpart /dev/vda 1
sudo resize2fs /dev/vda1   


[1] https://davejingtian.org/2013/08/20/official-ubuntu-linux-kernel-build-with-ima-enabled/
[2] https://davejingtian.org/2018/03/15/make-deb-pkg-broken/
[3] https://wiki.ubuntu.com/Kernel/BuildYourOwnKernel
[4] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1790880
[5] https://unix.stackexchange.com/questions/196512/how-to-extend-filesystem-partition-on-ubuntu-vm
[6] https://www.kernel.org/doc/html/v4.15/admin-guide/module-signing.html
[7] https://bugzilla.redhat.com/show_bug.cgi?id=1599197

Posted in IDE_Make, Linux Distro, OS | Tagged , , , , , , , , , , , , | 1 Comment

USB Fuzzing: A USB Perspective

Syzkaller [1] starts to support USB fuzzing recently and has already found over 80 bugs within the Linux kernel [2]. Almost every fuzzing expert whom I talked to has started to apply their fuzzing techniques to USB because of the high-security impact and potential volume of vulnerabilities due to the complexity of USB itself. While this post is NOT about fuzzing or USB security in general, I hope to provide some insights for USB fuzzing in general as someone who has been doing research on USB security for a while. Happy fuzzing!

1. Understand USB Stacks

USB is split into two worlds due to the master-slave nature of the protocol: USB host and USB device/gadget. When we talk about USB, it usually refers to the USB host, e.g., a laptop with a standard USB port. The figure below is the Linux USB host stack. From bottom to up, we have the hardware, kernel space, and user space.

From the Syzkaller USB fuzzing slides by Andrey Konovalov [3].

The USB host controller device (aka, HCD) is a PCI device attached to the system PCI bus and provides USB connection supports via USB ports. Depending on the generation of the USB technology, it is also called UHCI/OHCI for USB 1.x, EHCI for USB 2.x, and XHCI for USB 3.x controllers. For kernel to use this controller, we need a USB host controller driver, which sets up the PCI configuration and DMAs. Above it is the USB core, implementing the underlying USB protocol stack (e.g., Chapter 9) and abstracting ways to send/recv USB packets with generic kernel APIs (submit/recv URB). Above it are different USB device drivers, such as USB HID drivers and USB mass storage drivers. These drivers implement different USB class protocols (e.g., HID, Mass Storage), provide glue layers with other subsystems within the kernel (e.g., input and block), facilitate user spaces (e.g., creating /dev node).

Since Linux is also widely used in embedded systems, e.g., some USB dongles, USB device/gadget refers to both the USB dongle hardware and the USB mode within Linux. “Surprisingly”, it is totally different from the USB host mode. The figure down below demonstrates the USB gadget stack within the Linux kernel.

From the Syzkaller USB fuzzing slides by Andrey Konovalov [3].

At the bottom, we have the USB device controller (aka, UDC). Like HCDs, UDCs also implement specific version of the USB standards within the PHY layer. However, unlike the most common HCDs made by Intel, UDC IPs are found from different hardware vendors [8], such as DWC2/3, OMAP, TUSB, and FUSB. These controllers usually have their own design specifications, and might follow the HCD specification (e.g., XHCI specification) as well when they support USB On-The-Go (aka, OTG) mode. OTG allows a UDC to switch between USB host and USB device/gadget modes. For example, when an Android device connects with a laptop as MTP, the Android USB device controller is in the USB device/gadget mode. If a USB flash drive is plugged into an Android device, the UDC works in the USB host mode. A UDC supporting OTG is also replaced by a Dual-Role Device (DRD) controller in USB 3.x standards [11]. As a result, an OTG cable is not needed to switch the role of the UDC, since the role switching is done in software for a DRD controller.

To use a UDC, you need a UDC driver within the kernel, providing connection and configuration over industry-standard buses including both the AMBA™ AHB and AXI interfaces, and setting up DMAs for the higher layer. Like the USB core within USB host stack, the USB gadget core within USB gadget stack provides APIs to register and implement a USB gadget function via callbacks and configfs. For instance, we can pass USB descriptors to the USB gadget core and achieve a typical USB mass storage device by requesting the existing mass storage function (f_mass_storage). For more complicated protocols such as MTP, a user-space daemon or library provides the protocol logic and communicates with the gadget function via e.g., configfs or usbfs.

2. Where We Are

USB fuzzing started to attract more attention thanks to the FaceDancer [4], a programmable USB hardware fuzzer. It supports both USB host and device/gadget mode emulation and allows sending out pre-formed or mal-formed USB requests and response. Umap/Umap2 [5] provides a fuzzing framework written in Python with the different USB device and response templates for the FaceDancer. The TTWE framework [9] enables MitM between a USB host and a USB device by using 2 FaceDancers emulating the USB host and device/gadget, respectively. This MitM allows USB packet mutations for both directions, thus enables fuzzing on both sides.

All these solutions focus on the USB host stack due to the facts that people assume a malicious USB device rather than a malicious USB host, e.g., a laptop, and that most USB device firmware is closed source and thus hard to analyze. Accordingly, most of the bugs/vulnerabilities are found within the USB core (for parsing USB response) and some common USB drivers (e.g., keyboard). The pros of these solutions are their ability to faithfully emulate a USB device. However, the problems, in my opinions, are:

a. Hardware dependency.
b. Limited feedback from the target.

FaceDancer is slow, which makes any solution built upon it not scale. The fact that we need both a FaceDancer and a target machine as the minimum to start fuzzing also imposes more challenges for scalability. Feedback is the other big issue here. Mutations of the fuzzing input are based on templates and randomizations without real-time feedback from the target (e.g., code coverage) except system logging. Thus, fuzzing efficiency is questionable. As a result, these solutions are “best-effort” to find some bugs with a minimum setup effort.

To get rid of the hardware dependency, virtualization (e.g., QEMU) comes to save. vUSBf [6] uses QEMU/KVM to run a kernel image and leverages the USB redirection protocol within QEMU to redirect the access to USB devices to a USB emulator controlled by the fuzzer, as shown below:

From the vUSBf paper [6].

While vUSBf provides a nice orchestration architecture to run multiple QEMU instances in parallel solve scalability issues, the fuzzer itself is essentially based on templates (or test cases according to the paper. The feedback still relies on system logging. POTUS [7] moves a step forward to leverage symbolic execution, e.g., S2E, to inject faults from the USB HCD layer, as shown below:

From the POTUS paper [7].

SystemTap is used to instrument the kernel to inject fault and annotations to save the number of faults. A path prioritization algorithm based on the number of faults within different states is used to control the number of “forks”. The number of faults of a given path represents the code coverage. Thus, a significant number of faults represents a high code coverage. POTUS also implements a generic USB virtual device within QEMU to emulate different USB devices using configurable device descriptors and data transfer. The Driver Exerciser within the VM uses syscalls to play with different device node exposed to the VM. Comparing to vUSBf, POTUS includes a fuzzing feedback mechanism (by counting the number of faults within a path) and support more USB device emulations. However, the manual effort to emulate operations on certain USB devices within the Driver Exerciser, fundamental limitations of symbolic executions – path explosion, and the unknown effectiveness and limitations of relying on the number of faults of a path for path scheduling, make POTUS hard to evaluate in the real world.

Syzkaller [1] USB fuzzing support [12] was added recently by Andrey Konovalov at Google., and has demonstrated its ability to find more bugs. Andrey solved two main problems of using Syzkaller to fuzz USB:

a. Code coverage for kernel tasks.
b. Device emulation within the same kernel image.

Since USB events and operations happen within IRQ or kernel context rather than a process context (e.g., USB plugging detection within khub kernel task in older kernels), syscall-based tracing and code coverage [10] simply won’t work. We need the ability to report the code coverage anywhere within the kernel. To do that, we need to annotate the USB-related kernel source (e.g., hub.c) with the extended KCOV kernel APIs to report the code coverage [13]. Instead of relying on QEMU, Syzkaller uses the gadgetfs to expose a fuzzer kernel driver to the user space [14], which can then manipulate the input for fuzzing. By enabling both the USB host stack and the USB gadget stack in the kernel configuration and connecting them using the dummy HCD and UDC drivers together as shown below, Syakaller is able to fuzz USB host device drivers, such as USB HID, Mass Storage, etc., from the user-space fuzzing the USB fuzzer kernel driver.

From the Syzkaller USB fuzzing slides by Andrey Konovalov [3].

Syzkaller USB fuzzer might be the first real coverage-based USB host device driver fuzzer thanks to the existing Syzkaller infrastructure and nice hackings to bridge the USB host and gadget the same time. While it has found tons of bugs and vulnerabilities, the limitation of the fuzzer starts to reveal – most of the issues found were in the initialization phase of a driver (e.g., probing). In the user space, the fuzzer is able to configure the fuzzer kernel driver to represent any USB device/gadget by exploring different VID/PID combinations within the USB device descriptor. On the one hand, Syzkaller is able to trigger almost every USB host device driver to be loaded, thus having a prominent code coverage horizontally. On the other hand, since no real emulation code for a certain device is provided within the user-space fuzzer or the fuzzer kernel driver, most fuzzing stops after the initialization of a driver, thus covering only a small portion of the driver vertically.

3. What To Be Done

Cautions readers might have found it already – all these fuzzing solutions focus on the USB host stack, especially on the USB host device drivers. Again, this is due to the facts that people often refer USB to the USB host stack, and that these device drivers are famous for containing more vulnerabilities than other components within the kernel (e.g., device drivers on Windows). However, at this point, I believe you have realized that what has been covered on USB fuzzing so far is the tip of the iceberg, either horizontally or vertically. Let’s enumerate what to be done next.

a. HCD drivers fuzzing

If we limit ourselves within the USB host stack, it is interesting to find HCD drivers are ignored. Unlike device drivers, they are not accessible from the user space via syscall (except tunning some parameters using sysfs). Instead, they receive inputs from the USB core (e.g., usb_submit_urb) in the upper layer (internal) and DMAs of the HCD layer (external). From a security perspective, external inputs should impose more threats than the internal ones.

To directly fuzz the internal inputs of HCD drivers, we need the ability to mutates the parameters of kernel APIs exposed to the USB core and get the code coverage from the HCD driver. To directly fuzz the external inputs of HCD drivers, we need to mutate on DMA buffers and event queues, as well as the code coverage from the HCD driver. Note that the code coverage is often different in these two cases because of different code paths for TX and RX. Thus we need a fine-granularity code coverage reporting to reflect this. Mutating on DMA buffers and event queues is essentially building an HCD emulator with fuzzing capabilities. For common HCD drivers such as Intel XHCI, QEMU provides the corresponding HCD emulation already (e.g., qemu/hw/usb/hcd-xhci.c), and one can try to add fuzzing functionality there. For other HCD drivers that QEMU does not provide the HCD emulation, one needs to build the HCD emulation from scratch.

b. USB device/gadget stack fuzzing

Yes, we do not have systematic fuzzing on the USB device/gadget stack. It used to be OK since we often assume a malicious USB device rather than a malicious host. However, the broad adoption of USB OTG and DRD controllers in embedded systems (e.g., Android devices) extends the threat model to include USB host as well. For example, no one wants their phones to be hacked during USB charging. Architecturally, Syzkaller USB fuzzer has imagined a way to fuzz the USB device/gadget stack, as shown below.

From the Syzkaller USB fuzzing slides by Andrey Konovalov [3].

Instead of having the user-space fuzzer communicating with the USB fuzzer kernel driver, another user-space fuzzer will manipulate USB host device drivers. The fuzzer activities will be propagated via the USB host stack to the USB device/gadget stack, hopefully. Accordingly, we need to configure the kernel to enable all different gadget functions within the same kernel image, as well as code coverage reporting. We are then able to fuzz the USB gadget core and USB gadget function drivers, except UDC drivers.

Note that Syzkaller imagines one way to fuzz the USB device/gadget stack, a natural result due to the architecture and limitations of Syzkaller. Syzkaller is a syscall fuzzer, meaning that the input mutations happen from syscall parameters. But this does not mean we have to fuzz at the syscall layer (e.g., user-space). If we look at the figure above again, we can find a long path from the fuzzer to the fuzzing target (e.g., USB host device drivers or USB device/gadget drivers). How could we know if all the fuzzing inputs are successfully propagated to the target instead of being filtered by middle layers in between? One core question is if syscall-based fuzzing is suitable for USB fuzzing within the kernel. Again, the Syzkaller USB fuzzer accommodates itself to the constraints of the Syzkaller itself instead of thinking about building a USB fuzzer (e.g., the USB host or gadget stacks) from scratch.

To shorten the fuzzing path is to push the fuzzer (inputs) close to the target. For example, we could get rid of the whole USB host stack by building a USB UDC emulator/fuzzer within QEMU, directly enabling UDC driver fuzzing. However, this does not mean any DMA write can be translated into a valid USB request to the USB gadget drivers in the upper layer. As a result, the fuzzing path is indeed shorter, but we still hope that the mutation algorithm and the code coverage granularity will save us. In the end, we might need different fuzzers for different layers within the stack, making sure all fuzzing inputs applied to the target without being filtered. E.g., we may need to build a USB host emulator/fuzzer sending out USB requests to different USB gadget drivers directly.

c. Android USB fuzzing

Android might be the heaviest USB device/gadget stack user via maintaining its own branch of the kernel and implementing extra USB gadget function drivers (e.g., MTP). The OTG/DRD support within Android devices also doubles the attacking surface comparing to a typical USB host machine. The fundamental challenge is to run an Android kernel image with the corresponding UDC/DRD drivers used by real-world Android devices using QEMU. Running non-AOSP kernels in QEMU imposes extra difficulties due to SoC customizations and variations. That’s why a lot of Android fuzzing still requires a physical device.

d. Protocol-guided/Stateful fuzzing

In section b, we talked about why we might wanna shorten the fuzzing path because we wanna avoid fuzzing inputs being filtered before hitting the target. Turns out it is much more complicated here. If we look at the figure above on USB device/gadget fuzzing imagined in Syzkaller again, the fuzzer inputs start with syscalls, pass different USB host device drivers before finally delivered to the USB gadget stack. Yes, the fuzzing path is long, and the fuzzing input could be filtered along the way. Meanwhile, these extra layers in between guarantee that whatever fuzzing inputs sent out is legitimate USB request carrying the corresponding protocol payload triggered by the right driver state. For example, the final USB request generated by the fuzzer via the USB mass storage driver might contain a legitimate SCSI command (e.g., read) triggered by the core logic of the USB host device driver rather than the initialization part.

This is what I call the “protocol-guided/stateful” fuzzing. As you can tell, it is essential to go “deeper” vertically within a layer, e.g., exploring other parts of a kernel driver beyond the initialization/probing phase. Simply put, to fuzz either USB host or device/gadget drivers, we need to establish a virtual connection with the target (e.g., making sure the kernel driver initialized and ready to process inputs) – stateful, and teach the fuzzer to learn the structure of the input (e.g., SCSI protocol in USB Mass Storage) – protocol-guided. In the end, it is a trade-off between including other layers to reuse the existing protocol and state controls thus increasing the fuzzing path and complexity and implementing a light-weight protocol-aware/stateful fuzzing within the fuzzer directly to reduce the fuzzing path. Both have their pros and cons.

e. Type-C/USBIP/WUSB fuzzing

There are more things in USB other than the USB host and USB device/gadget, including USB Type-C, USBIP, WUSB, etc. While we could reuse some of the lessons learns in USB fuzzing, these technologies introduce different software stacks and may require different attentions to solve their quirks.

4. Summary

This post looks into USB fuzzing, a recent hot topic from both software security and operating system security. Instead of treating USB as another piece of software, we start with understanding what USB stacks are and why USB refers to a bigger picture than what people often imagined. We survey some previous works on USB fuzzing, from using specialized hardware to running QEMU. We conclude with what is missing and foresee the future.

P.S. This blog post was long overdue. I promised to have it a month ago but never made it. I also underestimated how much time needed to finish it. A lesson learns (again, for myself) is to start early and focus. Anyway, better delay than nothing:)


[1] https://github.com/google/syzkaller
[2] https://github.com/google/syzkaller/blob/e90d7ed8d240b182d65b1796563d887a4f9dc2f6/docs/linux/found_bugs_usb.md
[3] https://docs.google.com/presentation/d/1z-giB9kom17Lk21YEjmceiNUVYeI6yIaG5_gZ3vKC-M/edit?usp=sharing
[4] http://goodfet.sourceforge.net/hardware/facedancer21/
[5] https://github.com/nccgroup/umap2
[6] https://github.com/schumilo/vUSBf
[7] https://www.usenix.org/conference/woot17/workshop-program/presentation/patrick-evans
[8] https://elinux.org/Tims_USB_Notes
[9] https://www.usenix.org/conference/woot14/workshop-program/presentation/van-tonder
[10] https://davejingtian.org/2017/06/01/understanding-kcov-play-with-fsanitize-coveragetrace-pc-from-the-user-space/
[11] https://blogs.synopsys.com/tousbornottousb/2018/05/03/usb-dual-role-replaces-usb-on-the-go/
[12] https://github.com/google/syzkaller/commit/e90d7ed8d240b182d65b1796563d887a4f9dc2f6
[13] https://github.com/xairy/linux/commit/ff543afbf78902acea566fa4c635240ede651f77
[14] https://github.com/xairy/linux/commit/700fb65580efc049133628e7b9f65453bb686231

Posted in Security | Tagged , , , , , , , , , , | 3 Comments

Speculations on Intel SGX Card

One of the exciting things Intel has brought to RSA 2019 is Intel SGX Card [2]. Yet there is not much information about this coming hardware. This post collects some related documentation from Intel and speculates what could happen within Intel SGX Card with a focus on software architecture, cloud deployment, and security analysis. NOTE: all the figures come from public Intel blog posts and documentation, and there is no warranty for my speculations on Intel SGX Card! Read with caution!

1. Intel SGX Card

According to [2], “Though Intel SGX technology will be available on future multi-socket Intel® Xeon® Scalable processors, there is pressing demand for its security benefits in this space today. Intel is accelerating deployment of Intel SGX technology for the vast majority of cloud servers deployed today with the Intel SGX Card. Additional benefits offer access to larger, non-enclave memory spaces, and some additional side-channel protections when compartmentalizing sensitive data to a separate processor and associated cache.

Simply put, Intel SGX Card is introduced to address 3 problems on SGX usage within cloud:

  1. Older servers/CPUs that do not support SGX
  2. Small EPC memory pool
  3. Side-channel attacks

Accordingly, Intel SGX Card is designed as a PCIe card, which can be plugged into old servers. This solves the first problem. But what about the second and the third problems? How could Intel SGX Card have larger EPC memory pool and defend against side-channel attacks? To answer these questions, we need to look into the internals of Intel SGX Card.

2. Intel VCA

According to [1], Intel SGX Card is actually built upon Intel VCA, the Intel® Visual Compute Accelerator (Intel® VCA) card [3]. Moreover, “Intel VCA is a purpose-built accelerator designed to boost performance of visual computing workloads like media transcoding, object recognition and tracking, and cloud gaming, originally developed as a way to improve video creation and delivery. In the Intel® SGX Card, the graphics accelerator has been disabled and the system re-optimized specifically for security purposes. In order to take advantage of Intel SGX technology, three Intel Xeon E processors are hosted in the card, which can fit inside existing, multi-socket server platforms being used in data centers today.

Alright, so Intel SGX Card is Intel VCA with graphics accelerator disabled essentially. Now it is time to learn what Intel VCA is. After some digging online, I found 2 precious documentations describing hardware specification [4] and software guide [5] respectively. Readers are encouraged to give a careful read on these documentations. Below is the TL;DR version.


The Intel VCA (or VCA 2) is a PCIe card with 3 Xeon CPUs. As shown in the figure above, each CPU has its own DRAM, instead of sharing RAMs. The internal architecture below shows better the nature of this card: 3 computers within a PCIe card.


These 3 CPUs do not only have their own DRAMs but also PCH chipsets and Flashes. They are connected and multiplexed by a PCIe bridge connecting with the host machine. Note that VCA 2 also supports optional NVM storage M2, as shown in the figure above. Let’s take a look at the software stack.


Did I say “3 computers within a PCIe card”? I actually mean it. Each CPU within the VCA card runs its own software stack, including UEFI/BIOS, operating system, drivers, SDKs, and applications. These operating systems could be Linux or Windows. Hypervisors are also supported including KVM and Xen. Even “better”, each CPU is also equipped with Intel SPS and ME. If you count ME as a microcomputer as well, now we have 3 microcomputers running inside 3 computers within 1 PCIe card.


Each computer within VCA is also called a node. Therefore, there are 3 nodes within 1 VCA card. Unlike typical PCIe cards, VCA exposes itself as virtual network interfaces to the host machine. For example, 2 VCA cards (6 nodes) add 6 different Virt eth interfaces to the host machine, as shown in the figure above. These Virt eth interfaces are implemented as MMIO over PCIe. Given that each node is indeed an independent computer system with full software stacks, this virtual network interface concept might be a reasonable abstraction. I was worried about the overhead of going through TCP/IP stack. Then I realize that Intel could provide dedicated drivers on both the host and the node side to bypass the TCP/IP stack, which is very possible, as suggested by those VCA drivers. It would be interesting to see what “packet” is sent and received from these virtual NICs. To support high bandwidth and throughput, the MMIO region is 4GB minimum. This means each node takes a 4GB memory space from the main system memory, as well as its internal memory.

3. Speculations on Intel SGX Card

Once we have some basic understanding of Intel VCA, we can now speculate what Intel SGX Card could be. Depending on what Intel meant by “disabling graphics accelerators“, it could be removing those VCA drivers and SDK within each node. Once we did that, we would have a prototype Intel SGX Card, where 3 SGX nodes run a typical operating system connecting with the host machine via PCIe. Now, what could we do?

To reuse most of the software stack developed for VCA already, I probably would keep the virtual network interface instead of creating a different device within the host machine. As such, the host still talks with the SGX card in virt eth. Within each node of the SGX card, we could install the typical Intel SGX PSW and SDK without any trouble since each node is an SGX machine. Then each node has all the necessary runtime to support SGX applications. On the host side, we could still install Intel SGX SDK to support compilation “locally”, although we might not be able to install PSW assuming an old Xeon processor. But this is not a problem because we will relay the compiled SGX application to the SGX card. To achieve this, a new SGX kernel driver is needed on the host machine to send the SGX application to one node within the SGX card via the virt eth interface.

So far we have speculated how to use Intel SGX card within a host (or server). It is time to review the design goals of Intel SGX card again:

  1. Enable older servers to support SGX
  2. Enlarge EPC memory pool
  3. Protect from side-channel attacks

The first problem can be achieved easily with the PCIe design and the fact that each node within the Intel SGX card is a self-contained SGX-enabled computer. However, the scalability of this solution is still limited by the number of PCIe (x16) slots available within a server and the number of CPU nodes within an Intel SGX Card. The number of PCIe slots is also limited by the power supply within the system. Unless we are talking about some crazy GPU-in-favor motherboard [6], 4 PCIe x16 slots seem to be a reasonable estimation. Multiplied by 3 (number of nodes within an Intel SGX card), we would have 12 SGX-enabled CPU nodes available within a server.

The second goal is a byproduct of the independent DRAM of each node within the Intel SGX card. Recall that each node has a maximum 32GB memory available. If Intel SGX card is based upon Intel VCA 2, each node then has maximum 64GB memory available. Because this 32GB (or 64GB) memory is dedicated to the node for SGX computation instead of a portion from the main system memory within the server, we can anticipate the EPC to be large for each node. For instance, a typical EPC memory size within an SGX-enable machine is 128MB. Because of the Merkle Tree used to maintain the integrity of each page and other housekeeping metadata, only around 90MB is for real enclave allocations. This means the overhead of EPC is 1/4 in general. If we assume 32GB for each node within an Intel SGX card, we could easily have 16GB for EPC, among of which 4GB is used for EPC management and 12GB for enclave allocations. Why 16GB? You might ask. Well, remember that each node is a running system. We need some memory both for OS and applications, including the non-enclave part of SGX applications. Moreover, due to the MMIO requirement, a 4GB memory space is reserved on both the main system memory and node’s memory for each node. As a result, we have roughly 12GB left for OS and applications for each node. Of course, we could push more but you get the point. We will see the EPC size once Intel SGX card is available.

The third goal is described as “additional benefit” of using Intel SGX card. Because all the 3 nodes within an Intel SGX card have its independent RAM and cache (which are also separated from the main system if the host supports SGX as well), we definitely could have better security guarantees for SGX applications. First, SGX applications can run within a node, thus isolating themselves from other processes running on the main system. Second, different SGX applications can run on different nodes, thus reducing the impact of enclave-based malware or side-channel attacks. Everything sounds good! What could possibly go wrong?

4. Speculations on security

First of all, SGX applications running within Intel SGX card is still vulnerable to whatever attacks as before, because each node within the card is still a computer system with a full software stack. Unless this whole software stack is within the TCB, an SGX application is still vulnerable to attacks from all other processes and even the OS or hypervisors running within the same node. From SGX application point of view, nothing is changed, really.

The other question is how a cloud service provider (CSP) could distribute SGX workload? A straightforward solution would be based on load balancing, where a CSP distributes different SGX applications to different nodes for performance considerations regardless of security levels of different end users. Again, this is no different with an SGX-enabled host machine running different SGX applications from different users. Another solution would be mapping a node with one user, meaning that SGX applications from the same user will run within the same node. While this solution reduces attacks from other end users, we can easily run into scalability issues given the limited number of nodes available within a system and a possibly large number of end users. The other problem of this solution would be load unbalancing. User A might only have 1 SGX applications running on node N-A while user B might have 100 SGX applications running on node N-B. I am not surprised if user B yells at the cloud.

That is being said I do not think Intel would take either approach. Instead, a VM-based approach might be used, where SGX applications from the same user run within the same VM and different users have different VMs. We can then achieve load balancing easily by assigning a similar number of VMs to each node. This approach is technically doable since we have seen SGX support for KVM [7] and nodes within Intel SGX card support KVM too. It is also possible that Clear Linux [8] will be used to reduce the overhead of VM by using KVM-based containters. The only question is if VM or container is enough to isolate potential attacks from other cloud tenants, e.g., cache-based attacks, and defend against attacks from OS and hypervisors, e.g., control-channel attacks.

5. Conclusion

This post tries to speculate what Intel SGX card would look like and how it would be used within a cloud environment. I have no doubt that some of the speculations could be totally wrong once we are able to see the real product. Nevertheless, I hope this post could shed some light on this new security product and what could/should be done and what is still missing. All opinions are my own.


[1] https://itpeernetwork.intel.com/sgx-data-protection-cloud-platforms/
[2] https://newsroom.intel.com/news/rsa-2019-intel-partner-ecosystem-offer-new-silicon-enabled-security-solutions/
[3] https://www.intel.com/content/www/us/en/products/servers/accelerators.html
[4] https://www.intel.com/content/dam/support/us/en/documents/server-products/server-accessories/VCA_Spec_HW_Users_Guide.pdf
[5] https://www.intel.com/content/dam/support/us/en/documents/server-products/server-accessories/VCA_SoftwareUserGuide.pdf
[6] https://www.pcgamer.com/asus-has-a-motherboard-that-supports-up-to-19-gpus/
[7] https://github.com/intel/kvm-sgx
[8] https://clearlinux.org/

Posted in Security | Tagged , , , , , | Leave a comment

Syscall hijacking in 2019

Whether you need to implement a kernel rootkit or inspect syscalls for intrusion detection, in a lot of cases, you might need to hijack syscall in a kernel module. This post summorizes detailed procedures and provides a working example for both x86_64 and aarch64 architectures on recent kernel versions. All the code can be found at [1]. Happy hacking~

1. Syscall hijacking

There are different ways to hijack syscall as summerized by [3]. The essense is to modify the sys_call_table within the kernel to overwrite the original address of certain syscall to be the one implemented by yourself. Here we use kallsyms_lookup_name to find the location of sys_call_table. However, 2 more things (or maybe 3 depending on the architecture and we will talk about that later) need to be considered. First, is the page of sys_call_table writable? Recent kernels have enforced read-only (RO) on text pages. So we need to make the page writable again (RW) in our kernel module. Second, SMP environment require us to synchronize the sys_call_table modification with all cores. This can be achieved by disabling preemption.

2. Hijacking read syscall

Once we hijack a certain syscall, we are able to see all the parameters from the user space. For example, we are able to see the file discripter (FD), user buffer, and number of bytes (count) within the read syscall. The real meat of syscall hijacking comes from what we could do using these parameters. As a proof-of-concept (PoC), we trace back the file name from FD and prevent users from reading the specific file by returning something else. In our implementations, we stop users from reading the README.md file (yup) and return bunch of 7s. The good news is we limit our target process to be the testing procedure instead of any process. Since syscall happens within the process context, “current” is always available. Accordingly, intrusion detection, system profiling, and etc are made possible thanks to different syscall parameters.

3. Architecture difference

Architecture makes a difference. Intel has a control bit within CR0 to write-protect the read-only memory on x86_64. As a result, besides adding the W permission to the sys_call_table page, we also need to disable the write protection within CR0. ARM, on the other hand, does not have this constraint. On the aarch64 board with kernel 4.4 that I used, the text page also allows for write.

Nevertheless, in case of page write protection, we will have to need to implemement set_memory_rw and set_memory_ro (for recovery) by ourselves, because none of these functions is exported to kernel modules [3]. Essentially, we need to call apply_to_page_range and implement flash_tlb_kernel_range within our kernel module. This also reminds me a potential bug within the current x86_64 implementation, where a TLB flush should be needed after we update the PTE to synchronize other CPU cores by triggering IPIs.


[1] https://github.com/daveti/syscallh
[2] https://blog.trailofbits.com/2019/01/17/how-to-write-a-rootkit-without-really-trying/
[3] https://lxr.missinglinkelectronics.com/linux/arch/arm64/mm/pageattr.c

Posted in OS, Security | Tagged , , , , | 1 Comment

Kernel build on Nvidia Jetson TX1

This post introduces native Linux kernel built on the Nvidia Jetson TX1 dev board. The scripts are based on the jetsonhacks/buildJetsonTX1Kernel tools. Our target is JetPack 3.3 (the latest SDK supporting TX1 by the time of writing). All the scripts are available at [2]. Have fun~

1. Kernel build on TX1

Nvidia devtalk has some general information about kernel build for TX1 [3], including both native build and cross compile (e.g., from a TFTP server). Here we focus on the native build. The procedure roughly follows a) installing dependencies, b) downloading the kernel src, c) generating config, d) making build, and e) installing the new kernel image.

Unlike a typical kernel build on x86-64 architecture, the most confusing part would be to figure out the right kernel version supported by the board. TX1 uses Nvidia L4T [5], which is a customized kernel for the Tegra SoC. Depending on the JetPack version running on your TX1 board, different L4T version is needed. As you can tell, there are a lot of prepairations needed to be done before we could kick off the build.

2. buildJetsonTX1Kernel

JetsonHacks provides a bunch of scripts to ease and automate differen steps mentioned above, called buildJetsonTX1Kernel [1]. By detecting the tegra chip id (sysfs) and tegra release note (/etc), these scripts can figure out the model of the board (e.g., TX1) and the version of JetPack installed (e.g., 3.2), thus download the right version of L4T kernel source. Please refer to [4] for a detailed usage of these scripts.

3. One-click build

The buildJetsonTX1Kernel scripts are great and useful, but somehow I realized that my TX1 setup was different and I needed some customizations to make my life (hopefully yours too) easier [2]. The first issue was the usage of JetPack 3.3. I have submitted a patch to JetsonHacks for JetsonUtilities to correctly detect this latest JetPack version supported by TX1. Unfortunately, buildJetsonTX1Kernel scripts still only support up to JetPack 3.2. Things get more complicated when both JetPack 3.2 and 3.3 use the same L4T kernel version.

The original scripts assume the usage of eMMC to hold all the kernel build artifacts, which does not hold in my TX1 environment where a 64G SD card is mounted. Accordingly, I have updated all the scripts to use my SD card instead of the default /usr/src/ directory.

I have also created a one-click build script (kbuild.sh) to automate the whole process within one script. Simply running ./kbuild.sh would generate a new kernel image ready to reboot. I have also replaced xconfig with menuconfig since I use SSH to connect with TX1. A simple hello world kernel module is also included as a starting point for module development.


[1] https://github.com/jetsonhacks/buildJetsonTX1Kernel
[2] https://github.com/daveti/buildJetsonTX1Kernel
[3] https://devtalk.nvidia.com/default/topic/762653/-howto-build-own-kernel-for-jetson-tk1/
[4] https://www.jetsonhacks.com/2018/04/21/build-kernel-and-modules-nvidia-jetson-tx1/
[5] https://developer.nvidia.com/embedded/linux-tegra

Posted in Embedded System, gpu, OS | Tagged , , , , , , , | Leave a comment

Setting up Nvidia Jetson TX1

Starting from this post, I will share my learning and hacking experience on Nvidia Jetson TX1 dev board. This post discusses the installation issue of JetPack [4] and post-installation configurations for TX1. We assume users follow the JetPack 3.3 installation guide to setup the TX1.

1. DHCP Issue

One of the two possible configurations to setup JetPack on TX1 is to use DHCP, where the host machine is the DHCP server and the TX1 is the client. This connecting model is needed when there is no switch available and only the host machine has the Internet connection. In my case, the host machine connects with the Internet with Wifi and the Eithernet port is used to connect with TX1. Everything looks fine until TX1 tries to get an IP address from the host. “can’t determine the target IP” will be returned from the terminal and all the following JetPack installation on the TX1 will fail (although we have already flashed the L4T to the TX1 successfully). Turns out this is a known bug due to the argument changes within nmcli between Ubuntu 14.04 and 16.04 [1]. A detailed workaround is also provided there:


Although the issue was reported on JetPack 3.2 for TX2, JetPack 3.3 still has this issue on TX1. JetPack 4.0 hopefully would fix this configuration bug.

2. Mount SD Card

TX1 comes with 16G eMMC storage. After full installation of JetPack, only 5.3G is left. As a result, we need extra storage to do something useful, e.g., compiling the Linux kernel on TX1. Again, the devtalk forum has a good discussion [2]. I used gparted to partition and format a 64G SD card with EXT4. Then find the UUID using blkid. Once we have the UUID for the new partition, we can put it into /etc/fstab for auto mounting.


3. Setup An Account

After full deployment of JetPack on TX1, we have 2 accounts ready for use “ubuntu/ubuntu” and “nvidia/nvidia”. We can use the later one to do CUDA development. However, to support multiple users on the board, we need to create new users using adduser. The first thing after logging as a new user on TX1 might be “nvcc not found” – Duh! Since “nvidia” has CUDA environment setup already, let’s copy its .bashrc and .profile into the new account. We can then compile CUDA program using nvcc. But when we run the CUDA program, it is seg fault – “unhandled level 3 permission fault (11)”:


Turns out all the GPU device files under /dev (/dev/nvhost-*) belong to group “video”, and id on “nvidia” shows this group as well. Adding the new user into “video” group (sudo usermod -aG video newuser) solves this permission issue.


[1] https://devtalk.nvidia.com/default/topic/1023680/jetson-tx2/dhcp-not-working-and-no-internet-connection-from-tx2-after-installing-/1
[2] https://devtalk.nvidia.com/default/topic/1009267/jetson-tx2/mount-sd-card-into-jetson/
[3] https://devblogs.nvidia.com/even-easier-introduction-cuda/
[4] https://developer.nvidia.com/embedded/jetpack

Posted in Embedded System, gpu | Tagged , , , , , , , | Leave a comment

Hacking Valgrind

This post talks about 3 commits I have recently added into my own valgrind tree [1], including the support for fsgsbase instructions, rdrand/rdseed instructions, and adding a new trapdoor (client request) to support gdb-like add-symbol-file command. Note that all these new features are not available in the mainstream valgrind by the time of writing, and I am not planning to work on mainstreaming anyway. Nevertheless, feel free the patch your own valgrind if needed. My work is supported by Fortanix [5].

1. Support for fsgsbase

fsgsbase instructions allow user space to read [6] or write [7] the FS or GS register base on x86_64 architecture, enabling indirect addressing mode using FS/GS, such as “mov %GS:0x10, %rax”. Surprisingly, the most challenging part (for me) was the decoding of amd64/x86_64 instructions. I am not interested in repeating how fucked-up this encoding mechanism is but only remind readers that opcode is USELESS on this architecture. Anyway, once we figure how to decode fsgsbase instructions in valgrind, we are able to generate the corresponding VEX IRs.

Although FS/GS base update from the user space is not supported, valgrind has FS/GS base registers built inside the guest VM state. Valgrind even hooks arch_prctl() syscall to update those guest registers. For us, we need to remove all those constrains assuming a constant FS/GS, and allow fsgsbase instructions to update FS/GS base registers in the guest. Because valgrind is emulating FS/GS in the guest, there is no need to check for the real hardware support for these instructions on the host. For details of the patch, please check [2].

2. Support for rdrand/rdseed

rdrand call the TRNG available inside the CPU to generate a random number [8]. rdseed is similar although focusing on providing random seed for PRNG [9]. The difference between them can be found at [10]. Unlike the fsgsbase instructions, valgrind needs to check whether or not the host CPU supports rdrand/rdseed when encountering these instructions in the client program, and delegate the acutal execution to the real CPU on the host. (Although we could emulate these instructions in valgrind as well, faithfully executing them is more preferred especially when the CPU supports these instructions.)

Once we have extended CPUID to detect these instructions on the host CPU, we can start to write down “dirtyhelpers” for rdrand/rdseed, which are the actual rdrand/rdseed instructions running on the real CPU. Because these instructions may fail (non-block, carry flag not set), we need to do a loop on the carry flag, making sure we return the right rand/seed to the guest. Similarly, a sane implementation of rdrand/rdseed within the client program should also do a loop on the carry flag. This means we need set the carry flag in the rflags of the guest VM state to help the client program move forward. Turns out it is not easy to do this in valgrind, because the rflags is not listed as other registers of the guest VM state explicitly. Instead, all these flags need to be computed based on the operation of the current instruction.

BTW, rdrand/rdseed is also a good example of the pathologicial design of x86_64 instruction encoding. They have the same opcode as cmpxchg8b and cmpxchg16b. For details of this patch, please check [3].

3. A new trapdoor: add-symbol-file

GDB supports loading symbols manually using add-symbol-file command. It is useful when GDB could not figure out what was loaded at certain VA range (thus ??? in the backtrace). Unfortunately, valgrind does not have such a mechanism. As a result, valgrind could not recognize any memory mapping not directly triggered by mmap() syscall, e.g., memcpy from VA1 to VA2. It also means valgrind could not recognize a binary doing a reloation by itself after the first mmap(), such as loader. Based on these considerations, we add a new valgrind trapdoor (client request) — VALGRIND_ADD_SYMBOL_FILE, allowing a client program to pass the memory mapping information to valgrind. It accepts 3 arguments, the file name of the mapping, e.g., a shared object, the starting mapping address (page aligned), and the length of the mapping. For details of this this patch, please check [4].


[1] https://github.com/daveti/valgrind
[2] https://github.com/daveti/valgrind/commit/16ccd1974ce2ca13e10adac9906de5bc689c509d
[3] https://github.com/daveti/valgrind/commit/5986cc4a0c6bf2d41822df15e8f074437c32e391
[4] https://github.com/daveti/valgrind/commit/baa7d6b344a539b8842d7c157ab67af990213500
[5] https://fortanix.com/
[6] https://www.felixcloutier.com/x86/RDFSBASE:RDGSBASE.html
[7] https://www.felixcloutier.com/x86/WRFSBASE:WRGSBASE.html
[8] https://www.felixcloutier.com/x86/RDRAND.html
[9] https://www.felixcloutier.com/x86/RDSEED.html
[10] https://software.intel.com/en-us/blogs/2012/11/17/the-difference-between-rdrand-and-rdseed

Posted in Dave's Tools, Programming | Tagged , , , , , , , , , | Leave a comment

Valgrind trapdoor and fun

Valgrind has a client request mechanism, which allows a client to pass some information back to valgrind. This includes asks valgrind to do a logging in its own environment, tells valgrind a range of VA being used as a new stack, and etc [1]. This mechansim is essentially a trapdoor built into VEX during the binary translation. We starts with a typical usage of valgrind trapdoor to add a logging into valgrind from the client. We then remove the dependency of valgrind header files and manually craft the trapdoor by ourselves. Note that this post is NOT about how to use the client request mechanism (read the damn manual:), nor on how to add a new client request (which I will talk in another post on valgrind hacking in general). Last, the code can be found at [2], and have fun:)

1. A typical usage

To use the valgrind trapdoor, we need to include the header: <valgrind/valgrind.h>. Let’s take VALGRIND_PRINTF as an example, which asks valgrind to add a logging for the client. The code below prints out the magic number in the valgrind logging:

#define valgrind_printf_fmt_str "daveti: trapdoor, magic [%d]\n"
int magic = 777;
/* Normal valgrind trapdoor */
ret = VALGRIND_PRINTF(valgrind_printf_fmt_str, magic);
printf("daveti: ret [%d]\n", ret);

When running with valgrind, the output looks like below:
**7382** daveti: trapdoor, magic [777]
daveti: ret [30]
The return value is the total length of the string printed out. Nothing fancy here but do remember that this logging is done by valgrind rather than the client program.

2. A better understanding

Alright, so what is that damn VALGRIND_PRINTF thing? Let’s have a deeper view using objdump (-S):

0000000000400527 :
  400527:	55                   	push   %rbp
  400528:	48 89 e5             	mov    %rsp,%rbp
  40052b:	48 81 ec a8 00 00 00 	sub    $0xa8,%rsp
  400532:	48 89 bd e8 fe ff ff 	mov    %rdi,-0x118(%rbp)
  400539:	48 89 b5 58 ff ff ff 	mov    %rsi,-0xa8(%rbp)
  400540:	48 89 95 60 ff ff ff 	mov    %rdx,-0xa0(%rbp)
  400547:	48 89 8d 68 ff ff ff 	mov    %rcx,-0x98(%rbp)
  40054e:	4c 89 85 70 ff ff ff 	mov    %r8,-0x90(%rbp)
  400555:	4c 89 8d 78 ff ff ff 	mov    %r9,-0x88(%rbp)
  40055c:	84 c0                	test   %al,%al
  40055e:	74 20                	je     400580
  400560:	0f 29 45 80          	movaps %xmm0,-0x80(%rbp)
  400564:	0f 29 4d 90          	movaps %xmm1,-0x70(%rbp)
  400568:	0f 29 55 a0          	movaps %xmm2,-0x60(%rbp)
  40056c:	0f 29 5d b0          	movaps %xmm3,-0x50(%rbp)
  400570:	0f 29 65 c0          	movaps %xmm4,-0x40(%rbp)
  400574:	0f 29 6d d0          	movaps %xmm5,-0x30(%rbp)
  400578:	0f 29 75 e0          	movaps %xmm6,-0x20(%rbp)
  40057c:	0f 29 7d f0          	movaps %xmm7,-0x10(%rbp)
  400580:	c7 85 30 ff ff ff 08 	movl   $0x8,-0xd0(%rbp)
  400587:	00 00 00
  40058a:	c7 85 34 ff ff ff 30 	movl   $0x30,-0xcc(%rbp)
  400591:	00 00 00
  400594:	48 8d 45 10          	lea    0x10(%rbp),%rax
  400598:	48 89 85 38 ff ff ff 	mov    %rax,-0xc8(%rbp)
  40059f:	48 8d 85 50 ff ff ff 	lea    -0xb0(%rbp),%rax
  4005a6:	48 89 85 40 ff ff ff 	mov    %rax,-0xc0(%rbp)
  4005ad:	48 c7 85 f0 fe ff ff 	movq   $0x1403,-0x110(%rbp)
  4005b4:	03 14 00 00
  4005b8:	48 8b 85 e8 fe ff ff 	mov    -0x118(%rbp),%rax
  4005bf:	48 89 85 f8 fe ff ff 	mov    %rax,-0x108(%rbp)
  4005c6:	48 8d 85 30 ff ff ff 	lea    -0xd0(%rbp),%rax
  4005cd:	48 89 85 00 ff ff ff 	mov    %rax,-0x100(%rbp)
  4005d4:	48 c7 85 08 ff ff ff 	movq   $0x0,-0xf8(%rbp)
  4005db:	00 00 00 00
  4005df:	48 c7 85 10 ff ff ff 	movq   $0x0,-0xf0(%rbp)
  4005e6:	00 00 00 00
  4005ea:	48 c7 85 18 ff ff ff 	movq   $0x0,-0xe8(%rbp)
  4005f1:	00 00 00 00
  4005f5:	48 8d 85 f0 fe ff ff 	lea    -0x110(%rbp),%rax
  4005fc:	b9 00 00 00 00       	mov    $0x0,%ecx
  400601:	89 ca                	mov    %ecx,%edx
  400603:	48 c1 c7 03          	rol    $0x3,%rdi
  400607:	48 c1 c7 0d          	rol    $0xd,%rdi
  40060b:	48 c1 c7 3d          	rol    $0x3d,%rdi
  40060f:	48 c1 c7 33          	rol    $0x33,%rdi
  400613:	48 87 db             	xchg   %rbx,%rbx
  400616:	48 89 d0             	mov    %rdx,%rax
  400619:	48 89 85 28 ff ff ff 	mov    %rax,-0xd8(%rbp)
  400620:	48 8b 85 28 ff ff ff 	mov    -0xd8(%rbp),%rax
  400627:	48 89 85 48 ff ff ff 	mov    %rax,-0xb8(%rbp)
  40062e:	48 8b 85 48 ff ff ff 	mov    -0xb8(%rbp),%rax
  400635:	c9                   	leaveq
  400636:	c3                   	retq

A quick code go-thru shows that this function does “NOTHING”, except saving some registers on the stack before updating them. This is actually the design of valgrind trapdoor – it should not change any registers or memory when the client program does not run with valgrind. In other words, only valgrind is able to interpret this trapdoor and do something with side effect. Let’s dive into this function.

The first around 20 lines are a typical usage of va_list, because VALGRIND_PRINTF accepts variable-length arguments like printf. Then we see bunch of values pushed into the stack, including this magic value 0x1403:

movq $0x1403, -0x110(%rbp)

And then more “useless” code near the end of the function:

rol $0x3, %rdi
rol $0xd, %rdi
rol $0x3d, %rdi
rol $0x33, %rdi
xchg %rbx, %rbx

After all those rotations, rdi is unchanged, as well as rbx. Now it is time to look at the valgrind.h file [3] to sort things out, and here it goes:

#define __SPECIAL_INSTRUCTION_PREAMBLE                            \
                     "rolq $3,  %%rdi ; rolq $13, %%rdi\n\t"      \
                     "rolq $61, %%rdi ; rolq $51, %%rdi\n\t"

#define VALGRIND_DO_CLIENT_REQUEST_EXPR(                          \
        _zzq_default, _zzq_request,                               \
        _zzq_arg1, _zzq_arg2, _zzq_arg3, _zzq_arg4, _zzq_arg5)    \
    __extension__                                                 \
    ({ volatile unsigned long int _zzq_args[6];                   \
    volatile unsigned long int _zzq_result;                       \
    _zzq_args[0] = (unsigned long int)(_zzq_request);             \
    _zzq_args[1] = (unsigned long int)(_zzq_arg1);                \
    _zzq_args[2] = (unsigned long int)(_zzq_arg2);                \
    _zzq_args[3] = (unsigned long int)(_zzq_arg3);                \
    _zzq_args[4] = (unsigned long int)(_zzq_arg4);                \
    _zzq_args[5] = (unsigned long int)(_zzq_arg5);                \
    __asm__ volatile(__SPECIAL_INSTRUCTION_PREAMBLE               \
                     /* %RDX = client_request ( %RAX ) */         \
                     "xchgq %%rbx,%%rbx"                          \
                     : "=d" (_zzq_result)                         \
                     : "a" (&_zzq_args[0]), "0" (_zzq_default)    \
                     : "cc", "memory"                             \
                    );                                            \
    _zzq_result;                                                  \

Turns out those rotations instructions are the essential trapdoor for x86_64. The xchg is used to ask valgrind to do a client request where the request number is _zzq_args[0] and the return value is saved into rdx. As you might have guessed, the request number is 0x1403 for VALGRIND_PRINTF.

In summary, when valgrind sees those rol instructions followed by an xchg, it recognizes this trapdoor, and passes arguments from the stack. The first argument, request number, determines which valgrind function will be called internally. The return value will be hold in rdx and then futher propagated to the client program via rax.

3. Fun

Once we know what the trapdoor looks like, we can get rid of the dependency on valgrind.h header file, and craft our own trapdoor. Say we wanna make our own VALGRIND_PRINTF. Then what we need are a va_list, filled with format strings and variable-length arguments, and the trapdoor instructions followed by xchg:

#define valgrind_printf_code		0x1403
#define valgrind_printf_fmt_str		"daveti: trapdoor, magic [%d]\n"
#define valgrind_trapdoor_code		\
	"rol $0x3, %%rdi\n\t"		\
	"rol $0xd, %%rdi\n\t"		\
	"rol $0x3d, %%rdi\n\t"		\
	"rol $0x33, %%rdi\n\t"

static unsigned long valgrind_printf_manual(char *fmt, ...)
	unsigned long args[6] = {0};
	unsigned long ret = 0;
	va_list vargs;

	/* Follow valgrind ABI */
	va_start(vargs, fmt);
	args[0] = (unsigned long)valgrind_printf_code;
	args[1] = (unsigned long)fmt;
	args[2] = (unsigned long)&vargs;

	/* rdx = client_req(rax); */
	asm volatile ("mov $0x0, %%rdx\n\t"	\
			valgrind_trapdoor_code	\
			"xchg %%rbx, %%rbx\n\t"	\
			: "=d"(ret)		\
			: "a"(&args[0])		\
			: "cc", "memory");
	return ret;

That’s it. Now we have a homemade VALGRIND_PRINT – valgrind_printf_manual, which behaves exactly what the former does, and we do not need to include valgrind.h header file at all.

NOTE 1: For other client requests (other than VALGRIND_PRINTF), the arguments building should be more straight-forward. VALGRIND_PRINTF is tricky due to the usage of variable-length arguments. And we decided it to use it because of its obvisous side effect (log printing in valgrind).

NOTE 2: While the trapdoor mechanism is the same across architectures, the trapdoor instructions are different among different architectures. We limit out focus on x86_64. Nevertheless, all these trapdoor instructions should follow the same deign goal – no changes on registers or memory when invoked without valgrind.

4. Security vs. Obfuscation

The design of valgrind trapdoor is delicate and useful. It gives a client program and opportunity to pass some useful information to valgrind, e.g., to suppress false positives in memcheck. Meanwhile, because we could craft the trapdoor manually, leaving no trace of valgrind in the binary, a client program is able to detect if it is running under valgrind essentially. Based on the detection result, the client program may do something totally different (for PoC, please check out [2]).

From security perspectives, a client program can detect the valgrind running environment, thus skip malicious behaviors which might be found by certain valgrind plugin, similar as VM detection techniques used by malware. From obfuscation points, a client program can also hide critical functionality from being analyzed by valgrind during runtime. Although I have not seen a strong motivation to detect valgrind as VM, the trapdoor mechanism has already provided a neat technique to achieve this.


[1] http://valgrind.org/docs/manual/manual-core-adv.html#manual-core-adv.clientreq
[2] https://github.com/daveti/valtrap
[3] https://github.com/daveti/valgrind/blob/zircon/include/valgrind.h

Posted in Programming, Security | Tagged , , , | Leave a comment

Some notes on SGX OwnerEpoch and Sealing

Intel SGX has been there in the market for while. Yet there are still a lot of misundrestandings and mysteries about this technology. This post provides an introduction to Intel SGX OwnerEpoch and Sealing, discusses their security impacts, and speculates future usages. Note that this post assumes a general understanding of Intel SGX and its key hierarchy.

1. Intro

SGX OwnerEpoch is a 128-bit value used in key derivation, as shown in the figure below [1]:


According to [1], this value is “loaded into the SGXOWNEREPOCH0 and SGXOWNEREPOCH1 MSRs when Intel SGX is booted”. The whole purpose of this value is to “provide a user with the ability to add personal entropy into the key derivation process”. As a result, it is included in all key derivations by egetkey leaf instruction based on [1], such as the Sealing key.

While an enclave provides runtime integrity and confidentiality, it cannot persist the secret across reboots. In such a case, sealing is to help. Intel SGX Sealing uses the egetkey leaf instruction to derive the sealing key on the platform. This sealing key is then used to encryt the secret within the enclave before it is written into the disk. Depending on the sealing policy, either public key of the enclave signer (MRSIGNER) or the measurement of the enclave (MRENCLAVE) can be used to derive the key, meaning that only the enclaves from the same signer or the ones with the exact measuremement can unseal (decrypt) the secret. Note that both sealing and unsealing should happen inside an enclave.

2. Security Impacts

Since OwnerEpoch is also included to derive the sealing key, changes of this regsiter would cause unsealing failure on the same platform. Consequently, a malicious cloud provider can launch DoS attacks against all SGX sealed secrets easily by updating the OwnerEpoch. Or in a more realistic case when there is a contract between the cloud provider and user, the cloud provider needs to guarantee that no code outside the TCB can update the OwnerEpoch (which is usually the case since wrmsr is a privilaged instruction, and hyperviosrs can trap it), and that no code outside the TCB can trick the TCB to update the OwnerEpoch (e.g., confused deputy attack and kernel exploitation). In the worst case, the current in-use OwnerEpoch should always have a backup to help restore the value for unsealing.

Although we could have 2 platforms with the exact same model of SGX CPU and exact same value for OwnerEpoch (we also assume the same CPUSVN and etc.), sealing on one platform cannot be unsealed on the other due to the unique device key per CPU package. This means SGX sealing does not support offline cross-platform data migration. As a workaround for this case, SGX remote attestation is needed to establish a shared secret as the sealing key rather than using the egetkey leaf instruction.

3. Speculations

A question comes naturally for the OwnerEpoch – why do we need it and what can we do with it? By definition, it is used to provide “user” entropy to the key derivation process. It also implies that the “user” should be the “owner” of the platform (CPU), since both rdmsr and wrmsr are privileged instructions. In a cloud environment, however, this “user==owner” relationship breaks. Cloud users are the “user” while cloud providers being the “owner”.

In a physical environment where the user “own” the infrastructure (IasS), the user should be able to set the OwnerEpoch whatever he wants. It is the same case as people running SGX applications on their own laptops. In this case, rather than providing entropy, the OwnerEpoch might be used as a peronal secret to pretect sealing data. For example, Alice saves the current OwnerEpoch value after sealing, and resets it to a random value. Eve cannot unseal the data even with root permission on Alice’s machine without the right OwnerEpoch.

In a container environment where different users running different containers on the same platform, none of the users would have the permission to update the OwnerEpoch. Instead, the cloud provider sets the value, and all users share the same OwnerEpoch during their key derivation. In this case, the OwnerEpoch seems meaningless for both cloud providers and users except adding more entropy.

In a hypervisor environment where different users running different guest OSes managed by the hypervior that has the sole control of the hardware (e.g., Xen and KVM), it is possible to virtualize the OwnerEpoch per guest (e.g., adding the OwnerEpoch into VMCS). Each user can provide his own OwnerEpoch for SGX key derivation. Note that this per-guest OwnerEpoch is only known to the guest and the cloud provider. As long as the cloud provider is trusted, this per-guest OwnerEpoch can be used as a personal secret as well. Note that this secret usage might be really useful when different users running the same enclave signed by the same ISV. In this case, similar as the physical environment, data sealed by Alice cannot be unsealed by Eve even they are running on the same platform.

4. Reality

While SGXv1 has introduced OwnerEpoch, it is not activated – we cannot write into it. SGXv2 claims the support for updating OwnerEpoch based on [2], my testing on a SGXv2 CPU said no. In fact, it behaves just like SGXv1 – The first OwnerEpoch read throws an unchecked MSR access error; the following write enables the read opertaion, although the value is always 0, no matter what value is written. To test the OnwerEpoch on your platform, please git clone [4]. My general feeling is that this OwnerEpoch is still not activated for some reason (at least on the SGXv2 CPU I tested). One comment from coreboot also suggests that the OwnerEpoch update mechanism is not determined yet [3]. Another update from [2] also shows that Provisoning and Provisioning Sealing keys do not rely on OwnerEpoch anymore.

5. Conclusion

We look into the OwnerEpoch and its connection with key derivations, e.g., SGX sealing. As we discussed above, the introducing of OwnerEpoch as extra entropy seems really vague. Nevertheless, we speculated its usage as a personal secret in the cloud environments. Our trial on SGXv2 seems to suggest that its usage is still unclear.


[1] https://software.intel.com/sites/default/files/managed/48/88/329298-002.pdf
[2] https://software.intel.com/en-us/articles/intel-sdm
[3] https://github.com/coreboot/coreboot/blob/master/src/soc/intel/common/block/sgx/sgx.c
[4] https://github.com/daveti/soe

Posted in Security | Tagged , , , , , | Leave a comment

Kernel Code Execution Time Measurement (kcetm)

This post mainly talks about the correct usage of tsc counters provided by Intel x86/x86-64 architectures to measure the Linux kernel code execution time. Most of the content here is borrowed/inspired from [1]. Note that this is NOT a post introduing tsc but for people who need to use tsc anyway. A kernel module with us/ns/tsc measurement framework ready is also provided at the end of the post [4]. Happy measurement!

1. Why measuring time is hard?

Human beings have a long history of trying to measure time [3], although this is not our focus. We are talking about measuring a piece of code execution running at ring 0 on a modern x86/x86-64 Intel CPU. The measurement framework usually looks like below:


The most common measurement functions inside the kernel are do_gettimeofday(), which returns in macroseconds, and getnstimeofday64(), which returns in nanoseconds. Depending on the complexity of the code needs to be measured, different scaling measurement should be applied. E.g., measuring a sub macrosecond function using do_gettimeofday() is inappropreiate. Similarly, measuring a minute level function using getnstimeofday64() is also an overkill.

In most cases, when the piece of code measured is non-trivial and takes a substantial amount of time (e.g., miliseconds), these gettimeofday functions do their job. We usually ignore the delay caused by interrupt handlings, preemptions, and schedulings. However, when the code to be measured is really small (e.g., tens of assemblies), the delay caused by interrupt handlings, preemptions, and schedulings cannot be ignored anymore, since we are measuring a nanosecond level execution time. In this case, we need to disable (local) interrupt, and preemption, making sure the measurement is “atomic” without external interferences. The other thing we need is tsc (timestamp counter).

2. What is the right way to use tsc?

Here we ignore the mov instructions needed to read/merge the tsc values. Instead, we focus on different usage patterns of rdtsc and rdtscp. We explore these patterns and point out the potential issues.

Pattern A:

...       A
...       |
rdtsc;    V
rdtsc;    A
...       |
...       V

This might be the most straightforward way of using tsc. We call rdtsc to read the counter before and after the code_to_measure. Unfortunately, because of out of order execution of CPU, it is hard to guarantee that the code_to_measure within the rdtsc blocks is exactly the code we wanna measure. It turns out either the code belong to code_to_measure could be executed by the CPU ahead of the first rdtsc or after the second rdtsc, or whatever code before the first rdtsc or after the second rdtsc could be measured as part of the code_to_measure. In short, this pattern gives the measurement of timing(some_portion(code_to_measure) + some_portion(code_to_measure_before) + some_portion(code_to_measure_after)).

Pattern B:

...       A
...       X
...       V
rdtsc;    | A
...       X |
...       | |
...       V |

To surpress the effect of out of order execution, we need a processor barrier to guarantee that code_to_measure is exactly the code we wanna measure when we setup the tsc counters – no more or less (hopefully). CPUID is one such instruction that serializes the instruction flow. The first CPUID guarantees that all instructions before it have done; the second CPUID guarantees that code_to_measure is also done before we execute the second rdtsc. However, it is still possible that the code after the second rdtsc get executed after the second CPUID but before we hit the second rdtsc (we mark this as delta). Similarly, the first rdtsc, should be execuated as the first instruction once the CPUID is done. However, out of order execution within the CPUID block is still possible, making few instructions of code_to_measure execute before the first rdtsc (we mark this as the other delta). Fortunately, because rdtsc uses RAX and RDX the same, it is possible that the CPU would get stalled when it tries to excute instructions followed by rdtsc due to data hazard. Nevertheless, the biggest problem comes from the extra CPUID instruction measured. Per [1], the exeuction time of CPUID itself is NOT stable, ranging from tens of CPU cycles to hundreds. This poses issues when the code_to_measure has similar timing scale. In short, this pattern gives the measurement of timing(code_to_measure + cpuid – delta_1 + delta_2).

Pattern C:

...       A
...       X
...       V
...       A
...       X
...       V

Comparing with the previous pattern, we add CPUID after each rdtsc. The first added CPUID gurantees that rdtsc is executed before any instructions from code_to_measure; the last CPUID added guarantees that the last rdtsc is executed after code_to_measure is done and before whatever code followed. This solution elimites the measurement deltas in the price of the other extra CPUID in the measurement block. In short, this pattern gives the measurement of timing(code_to_measure + 2*cpuid).

Pattern D:

...       A
...       X
...       V
...       A
...       X
...       V

At this point, I hope it is clear that the first CPUID has to happen before the first rdtsc (otherwise we may include the code above into the measurement block since the first rdtsc is not guaranteed to get executed right before the CPUID). Moving the second rdtsc before the second CPUID gurantees that the code after the second CPUID will not get measured. There is also no extra CPUID added into the measurement block. Unfortunately, this does not work due to the possible out of order execution insided the CPUID block. In short, this pattern gives the measurement of timing(some_portion(code_to_measure)).

Pattern E:

...       A
...       X
...       V
rdtscp;   | A
...       X |
...       | |
...       V |

To deal with the measurement variance introduced by the usage of CPUID, rdtscp is introduced in later CPUs, which is essentially a combination of CPUID+rdtsc. This instruction is also able to guarantee that by the execution of itself, all the instructions before it have been done. Unfortunately, the instructions after rdtscp may still get executed before rdtscp itself. In short, this pattern gives the measurement of timing(code_to_measure – delta_1 + some_portion(code_after_rdtscp)).

Pattern F:

...       A
...       X
...       V
...       A
...       X
...       V

By appending another CPUID after rdtscp, now the code after rdtscp cannot get executed before rdtscp. In short, this pattern gives the measurement of timing(code_to_measure – delta_1). If we ignore delta here, this pattern provides the most “accurate” measurement so far.

Pattern G:

...       X
...       | A
...       V |
rdtscp;     |
...       A
...       X
...       V

Comparing to the previous pattern, we replaced the first CPUID+rdtsc with rdtscp. While the first rdtscp guarantees that all the instructions before itself have been done, it cannot stop the CPU from executing some code from code_to_measure before rdtscp. To fix this problem, we can add a CPUID right after the first rdtscp, which unfortunately defeats the purpose of avoiding CPUID inside the measurement block. In short, this pattern gives the measurement of timing(some_portion(code_to_measure)).

3. Discussion

There are other instructions like CPUID, which can serialize the instruction flow, such as lfense and mfense (which are also used to defend Meldown and Spectre attacks). For user space measurement using tsc, [2] provides examples. Note, however, the measurement from user space includes the noise from interrupt handling, preemption, and scheduling.

4. Conclusion

It is easy to use tsc, but hard to use it right. Even if we use it right (e.g., Pattern F), we still could not guarantee the measurement is 100% right, but we are getting pretty close. The kcetm [5] implements Pattern F in a kernel module and is free to use.


[1] https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf
[2] https://github.com/andyphillips/time_stamp_counters/blob/master/tsc.c
[3] https://nrich.maths.org/6070
[4] https://github.com/daveti/kcetm

Posted in Dave's Tools, OS, Programming | Tagged , , , , , , , , | 1 Comment