Ubuntu Kernel Build Again

I have written two blog posts about building the Linux kernel on Ubuntu [1,2], and there is also an official wiki page covering the same topic [3]. Still, things were broken when I tried to create a homework assignment for my class. This post describes how to correctly, easily, and quickly build the Linux kernel on Ubuntu 16.04 and 18.04. I have used and verified the whole process myself. Happy hacking.

1. Install building dependencies

sudo apt-get build-dep linux linux-image-$(uname -r)
sudo apt-get install libncurses-dev flex bison openssl libssl-dev dkms libelf-dev libudev-dev libpci-dev libiberty-dev autoconf

2. Fetch the Linux kernel source

Make sure /etc/apt/sources.list contains at least these 2 source entries (uncomment them if necessary). Note that CODENAME is “xenial” for 16.04 and “bionic” for 18.04. Once the source entries are there, update the package list and fetch the kernel source.

deb-src http://us.archive.ubuntu.com/ubuntu CODENAME main
deb-src http://us.archive.ubuntu.com/ubuntu CODENAME-updates main
sudo apt update
sudo apt-get source linux-image-unsigned-$(uname -r)

Note that unlike the typical “linux-image-$(uname -r)”, “unsigned” is added when fetching the kernel source to work around the bug caused by kernel signing [4]. You might also wonder why I am not using git directly. The reason is simple: the apt-get approach gives you the exact kernel source files that your system is running now, while git usually provides the most recent version (e.g., master) for a given distro release. Unless you need to play with cutting-edge features from the most recent kernels, it is wise to stick with the current kernel that is proven to work on your system. The other benefit of this approach is config generation: we can simply reuse the existing config without any trouble.

3. Build the kernel

I do not use the official Ubuntu kernel build procedure [3], which is tedious and, moreover, does not support incremental builds. Note that X should be the kernel version you have downloaded (via apt-get), and Y should be the number of cores available on your system for a parallel build.

cd linux-hwe-X
sudo make oldconfig
sudo make -jY bindeb-pkg

4. Install the new kernel

cd ..
sudo dpkg -i linux*X*.deb
sudo reboot

5. Sign kernel modules (optional)

If you are working with Ubuntu 16.04, you likely do not need to deal with kernel module signing; but in case you do, take a look at [6]. On Ubuntu 18.04, Lockdown is enabled by default to prevent the kernel from loading unsigned kernel modules. A quick and dirty fix is to disable Lockdown so that modules can be loaded [7]:

sudo bash -c 'echo 1 > /proc/sys/kernel/sysrq'
sudo bash -c 'echo x > /proc/sysrq-trigger'

[Figure: disabling Lockdown via SysRq (disable_lockdown)]

6. Resize VM disk (optional)

To build a kernel, a VM might need at least 32GB of disk space (tested on Ubuntu 16.04). qemu-img is a convenient command to resize your VM image.

sudo qemu-img resize /PATH_TO_YOUR_VM_IMG_FILE +20G
sudo qemu-img resize --shrink /PATH_TO_YOUR_VM_IMG_FILE -20G
sudo qemu-img info /PATH_TO_YOUR_VM_IMG_FILE

Note that the actual image size change cannot be seen from the host machine using, e.g., fdisk. Instead, use qemu-img info to check the difference between “virtual size” and “disk size”. The former is changed by the resizing; the latter has to be changed as well. We need to boot into the VM and grow the partition.

Once we are inside the VM, use lsblk to confirm the size we have just grown. In the figure below, the whole /dev/vda is 40G but the partitions only take 20G. We can try to grow the root partition (/) using parted. Unfortunately, it did not work, as shown in the figure.

[Figure: grow_root – lsblk output and the failed parted resize]

Why? Because we cannot grow a partition that is not the last one. Instead, we need to delete vda5 and vda2 if we want to grow vda1. Again, parted is your friend here [5]. After we have appended the extra 20G to the root partition (vda1), we need to fix the partition and resize the filesystem accordingly:

sudo apt-get install cloud-guest-utils
sudo growpart /dev/vda 1
sudo resize2fs /dev/vda1   

References:

[1] https://davejingtian.org/2013/08/20/official-ubuntu-linux-kernel-build-with-ima-enabled/
[2] https://davejingtian.org/2018/03/15/make-deb-pkg-broken/
[3] https://wiki.ubuntu.com/Kernel/BuildYourOwnKernel
[4] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1790880
[5] https://unix.stackexchange.com/questions/196512/how-to-extend-filesystem-partition-on-ubuntu-vm
[6] https://www.kernel.org/doc/html/v4.15/admin-guide/module-signing.html
[7] https://bugzilla.redhat.com/show_bug.cgi?id=1599197


USB Fuzzing: A USB Perspective

Syzkaller [1] recently started to support USB fuzzing and has already found over 80 bugs within the Linux kernel [2]. Almost every fuzzing expert I have talked to has started to apply their fuzzing techniques to USB because of the high security impact and the potential volume of vulnerabilities due to the complexity of USB itself. While this post is NOT about fuzzing or USB security in general, I hope to provide some insights into USB fuzzing, as someone who has been doing research on USB security for a while. Happy fuzzing!

1. Understand USB Stacks

USB is split into two worlds due to the master-slave nature of the protocol: USB host and USB device/gadget. When we talk about USB, it usually refers to the USB host, e.g., a laptop with a standard USB port. The figure below shows the Linux USB host stack. From bottom to top, we have the hardware, kernel space, and user space.

[Figure: the Linux USB host stack (usbfuzz-host-arch)]
From the Syzkaller USB fuzzing slides by Andrey Konovalov [3].

The USB host controller device (aka HCD) is a PCI device attached to the system PCI bus that provides USB connectivity via USB ports. Depending on the generation of the USB technology, it is also called UHCI/OHCI for USB 1.x, EHCI for USB 2.x, and XHCI for USB 3.x controllers. For the kernel to use this controller, we need a USB host controller driver, which sets up the PCI configuration and DMA. Above it is the USB core, which implements the underlying USB protocol stack (e.g., Chapter 9 of the USB specification) and abstracts ways to send/receive USB packets behind generic kernel APIs (submit/receive URBs). Above that sit the different USB device drivers, such as USB HID drivers and USB mass storage drivers. These drivers implement the different USB class protocols (e.g., HID, Mass Storage), provide glue layers to other subsystems within the kernel (e.g., input and block), and serve user space (e.g., by creating /dev nodes).
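
To make the “submit/recv URB” abstraction a bit more concrete, here is a minimal sketch of how a device driver typically queues a bulk transfer through the USB core; the endpoint number and buffer handling are simplified assumptions.

#include <linux/usb.h>

static void my_bulk_complete(struct urb *urb)
{
	/* called by the USB core when the transfer finishes (interrupt context) */
	pr_info("bulk urb done, status=%d, actual=%u\n",
		urb->status, urb->actual_length);
	usb_free_urb(urb);
}

static int my_send_bulk(struct usb_device *udev, void *buf, int len)
{
	/* buf must be DMA-capable (e.g., kmalloc'd); endpoint 0x01 (OUT) is just for illustration */
	struct urb *urb = usb_alloc_urb(0, GFP_KERNEL);

	if (!urb)
		return -ENOMEM;

	usb_fill_bulk_urb(urb, udev, usb_sndbulkpipe(udev, 0x01),
			  buf, len, my_bulk_complete, NULL);

	return usb_submit_urb(urb, GFP_KERNEL);		/* asynchronous submission */
}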

Since Linux is also widely used in embedded systems, e.g., some USB dongles, USB device/gadget refers to both the USB dongle hardware and the USB gadget mode within Linux. “Surprisingly,” it is totally different from the USB host mode. The figure below shows the USB gadget stack within the Linux kernel.

[Figure: the Linux USB gadget stack (usbfuzz-gadget-arch)]
From the Syzkaller USB fuzzing slides by Andrey Konovalov [3].

At the bottom, we have the USB device controller (aka UDC). Like HCDs, UDCs implement a specific version of the USB standard at the PHY layer. However, unlike the most common HCDs made by Intel, UDC IPs come from different hardware vendors [8], such as DWC2/3, OMAP, TUSB, and FUSB. These controllers usually have their own design specifications, and might follow the HCD specification (e.g., the XHCI specification) as well when they support USB On-The-Go (aka OTG) mode. OTG allows a UDC to switch between USB host and USB device/gadget modes. For example, when an Android device connects to a laptop as MTP, the Android USB device controller is in USB device/gadget mode. If a USB flash drive is plugged into an Android device, the UDC works in USB host mode. In the USB 3.x standards, an OTG-capable UDC is replaced by a Dual-Role Device (DRD) controller [11]. As a result, an OTG cable is not needed to switch the role of the UDC, since the role switching is done in software for a DRD controller.

To use a UDC, you need a UDC driver within the kernel, which provides connection and configuration over industry-standard buses (including the AMBA AHB and AXI interfaces) and sets up DMA for the higher layers. Like the USB core within the USB host stack, the USB gadget core within the USB gadget stack provides APIs to register and implement a USB gadget function via callbacks and configfs. For instance, we can pass USB descriptors to the USB gadget core and build a typical USB mass storage device by requesting the existing mass storage function (f_mass_storage). For more complicated protocols such as MTP, a user-space daemon or library provides the protocol logic and communicates with the gadget function via, e.g., configfs or functionfs.
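
As an illustration of the configfs path, the sketch below assembles a mass storage gadget from user space; the gadget name, VID/PID, backing image, and UDC name (“dummy_udc.0”) are assumptions, and configfs is assumed to be mounted at /sys/kernel/config.

#include <stdio.h>
#include <unistd.h>
#include <sys/stat.h>
#include <sys/types.h>

static void put(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (f) {
		fputs(val, f);
		fclose(f);
	}
}

int main(void)
{
	const char *g = "/sys/kernel/config/usb_gadget/g1";
	char p[256];

	mkdir(g, 0755);
	snprintf(p, sizeof(p), "%s/idVendor", g);  put(p, "0x1d6b");
	snprintf(p, sizeof(p), "%s/idProduct", g); put(p, "0x0104");
	snprintf(p, sizeof(p), "%s/configs/c.1", g); mkdir(p, 0755);
	snprintf(p, sizeof(p), "%s/functions/mass_storage.0", g); mkdir(p, 0755);
	snprintf(p, sizeof(p), "%s/functions/mass_storage.0/lun.0/file", g);
	put(p, "/root/backing.img");		/* hypothetical backing image */
	/* link the function into the configuration, then bind the gadget to a UDC */
	snprintf(p, sizeof(p), "%s/configs/c.1/mass_storage.0", g);
	symlink("/sys/kernel/config/usb_gadget/g1/functions/mass_storage.0", p);
	snprintf(p, sizeof(p), "%s/UDC", g); put(p, "dummy_udc.0");
	return 0;
}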

2. Where We Are

USB fuzzing started to attract more attention thanks to the FaceDancer [4], a programmable USB hardware fuzzer. It supports both USB host and device/gadget mode emulation and allows sending pre-formed or malformed USB requests and responses. Umap/Umap2 [5] provides a fuzzing framework written in Python with different USB device and response templates for the FaceDancer. The TTWE framework [9] enables MitM between a USB host and a USB device by using two FaceDancers emulating the USB host and device/gadget, respectively. This MitM allows USB packet mutation in both directions, thus enabling fuzzing on both sides.

All these solutions focus on the USB host stack, due to the facts that people assume a malicious USB device rather than a malicious USB host (e.g., a laptop), and that most USB device firmware is closed source and thus hard to analyze. Accordingly, most of the bugs/vulnerabilities are found within the USB core (for parsing USB responses) and some common USB drivers (e.g., keyboard). The strength of these solutions is their ability to faithfully emulate a USB device. However, the problems, in my opinion, are:

a. Hardware dependency.
b. Limited feedback from the target.

FaceDancer is slow, which means any solution built upon it does not scale. The fact that we need both a FaceDancer and a target machine as the minimum setup to start fuzzing imposes further challenges for scalability. Feedback is the other big issue here. Mutations of the fuzzing input are based on templates and randomization, without real-time feedback from the target (e.g., code coverage) beyond system logging. Thus, fuzzing efficiency is questionable. As a result, these solutions are “best-effort” attempts to find some bugs with a minimum setup effort.

To get rid of the hardware dependency, virtualization (e.g., QEMU) comes to the rescue. vUSBf [6] uses QEMU/KVM to run a kernel image and leverages the USB redirection protocol within QEMU to redirect access to USB devices to a USB emulator controlled by the fuzzer, as shown below:

[Figure: vUSBf architecture]
From the vUSBf paper [6].

While vUSBf provides a nice orchestration architecture to run multiple QEMU instances in parallel and solve scalability issues, the fuzzer itself is essentially template-based (or test-case-based, according to the paper). The feedback still relies on system logging. POTUS [7] moves a step forward by leveraging symbolic execution, e.g., S2E, to inject faults at the USB HCD layer, as shown below:

[Figure: POTUS architecture]
From the POTUS paper [7].

SystemTap is used to instrument the kernel to inject faults and add annotations that record the number of faults. A path prioritization algorithm based on the number of faults within different states is used to control the number of “forks”. The number of faults along a given path is treated as a proxy for code coverage; thus, a large number of faults represents high code coverage. POTUS also implements a generic USB virtual device within QEMU to emulate different USB devices using configurable device descriptors and data transfers. The Driver Exerciser within the VM uses syscalls to exercise the different device nodes exposed to the VM. Compared to vUSBf, POTUS includes a fuzzing feedback mechanism (by counting the number of faults within a path) and supports more USB device emulations. However, the manual effort to emulate operations on certain USB devices within the Driver Exerciser, the fundamental limitation of symbolic execution (path explosion), and the unknown effectiveness of relying on the number of faults along a path for path scheduling make POTUS hard to evaluate in the real world.

Syzkaller [1] USB fuzzing support [12] was added recently by Andrey Konovalov at Google, and has demonstrated its ability to find many more bugs. Andrey solved two main problems in using Syzkaller to fuzz USB:

a. Code coverage for kernel tasks.
b. Device emulation within the same kernel image.

Since USB events and operations happen in IRQ or kernel-thread context rather than a process context (e.g., USB plug detection happens within the khub kernel task in older kernels), syscall-based tracing and code coverage [10] simply won’t work. We need the ability to report code coverage from anywhere within the kernel. To do that, we need to annotate the USB-related kernel source (e.g., hub.c) with the extended KCOV kernel APIs to report code coverage [13]. Instead of relying on QEMU, Syzkaller uses gadgetfs to expose a fuzzer kernel driver to user space [14], which can then manipulate the input for fuzzing. By enabling both the USB host stack and the USB gadget stack in the kernel configuration and connecting them using the dummy HCD and UDC drivers, as shown below, Syzkaller is able to fuzz USB host device drivers, such as USB HID, Mass Storage, etc., from user space by fuzzing the USB fuzzer kernel driver.
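
The annotation idea, roughly, is to bracket the kernel-context code that handles USB events with remote-coverage calls so the collected coverage can be attributed back to the fuzzing process. The sketch below uses the kcov_remote_start_usb()/kcov_remote_stop() helpers that eventually landed in mainline; the exact API in the patch referenced by [13] may differ.

/* drivers/usb/core/hub.c (sketch) */
#include <linux/kcov.h>

static void hub_event(struct work_struct *work)
{
	struct usb_hub *hub = container_of(work, struct usb_hub, events);

	/* attribute coverage collected in this kernel worker to the remote fuzzer */
	kcov_remote_start_usb((u64)hub->hdev->bus->busnum);

	/* ... existing hub event handling ... */

	kcov_remote_stop();
}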

[Figure: Syzkaller USB fuzzing setup bridging the host and gadget stacks with dummy HCD/UDC]
From the Syzkaller USB fuzzing slides by Andrey Konovalov [3].

The Syzkaller USB fuzzer might be the first real coverage-based USB host device driver fuzzer, thanks to the existing Syzkaller infrastructure and some nice hacking to bridge the USB host and gadget stacks at the same time. While it has found tons of bugs and vulnerabilities, the limitations of the fuzzer start to show: most of the issues found were in the initialization phase of a driver (e.g., probing). In user space, the fuzzer is able to configure the fuzzer kernel driver to present itself as any USB device/gadget by exploring different VID/PID combinations within the USB device descriptor. On the one hand, Syzkaller is able to trigger almost every USB host device driver to be loaded, thus having prominent code coverage horizontally. On the other hand, since no real emulation code for a given device is provided within the user-space fuzzer or the fuzzer kernel driver, most fuzzing stops after the initialization of a driver, thus covering only a small portion of each driver vertically.

3. What To Be Done

Cautious readers might have noticed it already: all these fuzzing solutions focus on the USB host stack, especially on USB host device drivers. Again, this is due to the facts that people often take USB to mean the USB host stack, and that these device drivers are famous for containing more vulnerabilities than other components within the kernel (e.g., device drivers on Windows). However, at this point, I believe you have realized that what has been covered by USB fuzzing so far is the tip of the iceberg, both horizontally and vertically. Let’s enumerate what needs to be done next.

a. HCD drivers fuzzing

If we limit ourselves to the USB host stack, it is interesting to find that HCD drivers are ignored. Unlike device drivers, they are not accessible from user space via syscalls (except for tuning some parameters using sysfs). Instead, they receive inputs from the USB core (e.g., usb_submit_urb) in the upper layer (internal) and from the DMA of the HCD layer (external). From a security perspective, external inputs should pose more of a threat than internal ones.

To directly fuzz the internal inputs of HCD drivers, we need the ability to mutate the parameters of the kernel APIs exposed to the USB core and to collect code coverage from the HCD driver. To directly fuzz the external inputs of HCD drivers, we need to mutate DMA buffers and event queues, again with code coverage from the HCD driver. Note that the code coverage is often different in these two cases because of the different code paths for TX and RX; thus we need fine-grained code coverage reporting to reflect this. Mutating DMA buffers and event queues is essentially building an HCD emulator with fuzzing capabilities. For common HCD drivers such as Intel XHCI, QEMU already provides the corresponding HCD emulation (e.g., qemu/hw/usb/hcd-xhci.c), and one can try to add fuzzing functionality there. For other HCD drivers that QEMU does not emulate, one needs to build the HCD emulation from scratch.

b. USB device/gadget stack fuzzing

Yes, we do not have systematic fuzzing of the USB device/gadget stack. It used to be OK, since we often assume a malicious USB device rather than a malicious host. However, the broad adoption of USB OTG and DRD controllers in embedded systems (e.g., Android devices) extends the threat model to include the USB host as well. For example, no one wants their phone to be hacked during USB charging. Architecturally, the Syzkaller USB fuzzer has already imagined a way to fuzz the USB device/gadget stack, as shown below.

[Figure: proposed Syzkaller USB device/gadget fuzzing setup]
From the Syzkaller USB fuzzing slides by Andrey Konovalov [3].

Instead of having the user-space fuzzer communicate with the USB fuzzer kernel driver, another user-space fuzzer will manipulate USB host device drivers. The fuzzer activity will then, hopefully, be propagated via the USB host stack down to the USB device/gadget stack. Accordingly, we need to configure the kernel to enable all the different gadget functions within the same kernel image, as well as code coverage reporting. We are then able to fuzz the USB gadget core and the USB gadget function drivers, but not the UDC drivers.

Note that Syzkaller imagines one particular way to fuzz the USB device/gadget stack, a natural result of the architecture and limitations of Syzkaller. Syzkaller is a syscall fuzzer, meaning that input mutations happen at the level of syscall parameters. But this does not mean we have to fuzz at the syscall layer (i.e., from user space). If we look at the figure above again, we find a long path from the fuzzer to the fuzzing target (e.g., USB host device drivers or USB device/gadget drivers). How could we know whether all the fuzzing inputs are successfully propagated to the target instead of being filtered by the layers in between? One core question is whether syscall-based fuzzing is suitable for USB fuzzing within the kernel. Again, the Syzkaller USB fuzzer accommodates itself to the constraints of Syzkaller itself, instead of building a USB fuzzer for the USB host or gadget stacks from scratch.

Shortening the fuzzing path means pushing the fuzzer (inputs) closer to the target. For example, we could get rid of the whole USB host stack by building a USB UDC emulator/fuzzer within QEMU, directly enabling UDC driver fuzzing. However, this does not mean any DMA write can be translated into a valid USB request to the USB gadget drivers in the upper layer. As a result, the fuzzing path is indeed shorter, but we still have to hope that the mutation algorithm and the code coverage granularity will save us. In the end, we might need different fuzzers for different layers within the stack, making sure all fuzzing inputs are applied to the target without being filtered. E.g., we may need to build a USB host emulator/fuzzer that sends USB requests to different USB gadget drivers directly.

c. Android USB fuzzing

Android might be the heaviest user of the USB device/gadget stack, maintaining its own branch of the kernel and implementing extra USB gadget function drivers (e.g., MTP). The OTG/DRD support within Android devices also doubles the attack surface compared to a typical USB host machine. The fundamental challenge is to run an Android kernel image with the corresponding UDC/DRD drivers used by real-world Android devices under QEMU. Running non-AOSP kernels in QEMU imposes extra difficulties due to SoC customizations and variations. That is why a lot of Android fuzzing still requires a physical device.

d. Protocol-guided/Stateful fuzzing

In section b, we talked about why we might want to shorten the fuzzing path: we want to avoid fuzzing inputs being filtered before hitting the target. It turns out to be more complicated than that. If we look again at the figure above on USB device/gadget fuzzing as imagined in Syzkaller, the fuzzer inputs start as syscalls and pass through different USB host device drivers before finally being delivered to the USB gadget stack. Yes, the fuzzing path is long, and the fuzzing input could be filtered along the way. Meanwhile, these extra layers in between guarantee that whatever fuzzing input is sent out is a legitimate USB request carrying the corresponding protocol payload triggered by the right driver state. For example, the final USB request generated by the fuzzer via the USB mass storage driver might contain a legitimate SCSI command (e.g., read) triggered by the core logic of the USB host device driver rather than by the initialization part.

This is what I call “protocol-guided/stateful” fuzzing. As you can tell, it is essential for going “deeper” vertically within a layer, e.g., exploring parts of a kernel driver beyond the initialization/probing phase. Simply put, to fuzz either USB host or device/gadget drivers, we need to establish a virtual connection with the target (e.g., making sure the kernel driver is initialized and ready to process inputs), which is the stateful part, and teach the fuzzer the structure of the input (e.g., the SCSI protocol in USB Mass Storage), which is the protocol-guided part. In the end, it is a trade-off between including other layers to reuse the existing protocol and state handling (thus increasing the fuzzing path and complexity) and implementing lightweight protocol-aware/stateful fuzzing directly within the fuzzer (to shorten the fuzzing path). Both have their pros and cons.

e. Type-C/USBIP/WUSB fuzzing

There is more to USB than the USB host and USB device/gadget stacks, including USB Type-C, USBIP, WUSB, etc. While we could reuse some of the lessons learned from USB fuzzing, these technologies introduce different software stacks and may require different attention to address their quirks.

4. Summary

This post looks into USB fuzzing, a recent hot topic in both software security and operating system security. Instead of treating USB as just another piece of software, we started by understanding what the USB stacks are and why USB covers a bigger picture than people often imagine. We then surveyed some previous work on USB fuzzing, from using specialized hardware to running QEMU. We concluded with what is missing and a look at the future.

P.S. This blog post was long overdue. I promised to have it done a month ago but never made it. I also underestimated how much time was needed to finish it. A lesson learned (again, for myself) is to start early and focus. Anyway, better late than never :)

References:

[1] https://github.com/google/syzkaller
[2] https://github.com/google/syzkaller/blob/e90d7ed8d240b182d65b1796563d887a4f9dc2f6/docs/linux/found_bugs_usb.md
[3] https://docs.google.com/presentation/d/1z-giB9kom17Lk21YEjmceiNUVYeI6yIaG5_gZ3vKC-M/edit?usp=sharing
[4] http://goodfet.sourceforge.net/hardware/facedancer21/
[5] https://github.com/nccgroup/umap2
[6] https://github.com/schumilo/vUSBf
[7] https://www.usenix.org/conference/woot17/workshop-program/presentation/patrick-evans
[8] https://elinux.org/Tims_USB_Notes
[9] https://www.usenix.org/conference/woot14/workshop-program/presentation/van-tonder
[10] https://davejingtian.org/2017/06/01/understanding-kcov-play-with-fsanitize-coveragetrace-pc-from-the-user-space/
[11] https://blogs.synopsys.com/tousbornottousb/2018/05/03/usb-dual-role-replaces-usb-on-the-go/
[12] https://github.com/google/syzkaller/commit/e90d7ed8d240b182d65b1796563d887a4f9dc2f6
[13] https://github.com/xairy/linux/commit/ff543afbf78902acea566fa4c635240ede651f77
[14] https://github.com/xairy/linux/commit/700fb65580efc049133628e7b9f65453bb686231


Speculations on Intel SGX Card

One of the exciting things Intel brought to RSA 2019 is the Intel SGX Card [2]. Yet there is not much information about this upcoming hardware. This post collects some related documentation from Intel and speculates about what could happen within the Intel SGX Card, with a focus on software architecture, cloud deployment, and security analysis. NOTE: all the figures come from public Intel blog posts and documentation, and there is no warranty for my speculations on the Intel SGX Card! Read with caution!

1. Intel SGX Card

According to [2], “Though Intel SGX technology will be available on future multi-socket Intel® Xeon® Scalable processors, there is pressing demand for its security benefits in this space today. Intel is accelerating deployment of Intel SGX technology for the vast majority of cloud servers deployed today with the Intel SGX Card. Additional benefits offer access to larger, non-enclave memory spaces, and some additional side-channel protections when compartmentalizing sensitive data to a separate processor and associated cache.”

Simply put, Intel SGX Card is introduced to address 3 problems with SGX usage in the cloud:

  1. Older servers/CPUs that do not support SGX
  2. Small EPC memory pool
  3. Side-channel attacks

Accordingly, Intel SGX Card is designed as a PCIe card, which can be plugged into old servers. This solves the first problem. But what about the second and third problems? How could Intel SGX Card have a larger EPC memory pool and defend against side-channel attacks? To answer these questions, we need to look into the internals of Intel SGX Card.

2. Intel VCA

According to [1], Intel SGX Card is actually built upon Intel VCA, the Intel® Visual Compute Accelerator (Intel® VCA) card [3]. Moreover, “Intel VCA is a purpose-built accelerator designed to boost performance of visual computing workloads like media transcoding, object recognition and tracking, and cloud gaming, originally developed as a way to improve video creation and delivery. In the Intel® SGX Card, the graphics accelerator has been disabled and the system re-optimized specifically for security purposes. In order to take advantage of Intel SGX technology, three Intel Xeon E processors are hosted in the card, which can fit inside existing, multi-socket server platforms being used in data centers today.”

Alright, so Intel SGX Card is essentially Intel VCA with the graphics accelerator disabled. Now it is time to learn what Intel VCA is. After some digging online, I found two valuable documents describing the hardware specification [4] and the software guide [5], respectively. Readers are encouraged to give these documents a careful read. Below is the TL;DR version.

[Figure: Intel VCA hardware block diagram with per-CPU DRAM (vca-hw-dimm)]

The Intel VCA (or VCA 2) is a PCIe card with 3 Xeon CPUs. As shown in the figure above, each CPU has its own DRAM instead of sharing RAM. The internal architecture below better shows the nature of this card: 3 computers within a PCIe card.

[Figure: Intel VCA 2 internal architecture (vca2-hw-internal)]

These 3 CPUs not only have their own DRAM but also their own PCH chipsets and flash. They are connected and multiplexed by a PCIe bridge that connects to the host machine. Note that VCA 2 also supports optional NVMe (M.2) storage, as shown in the figure above. Let’s take a look at the software stack.

[Figure: Intel VCA software stack (vca-sw-arch)]

Did I say “3 computers within a PCIe card”? I actually meant it. Each CPU within the VCA card runs its own software stack, including UEFI/BIOS, operating system, drivers, SDKs, and applications. The operating system can be Linux or Windows. Hypervisors are also supported, including KVM and Xen. Even “better”, each CPU is also equipped with Intel SPS and ME. If you count ME as a microcomputer as well, we now have 3 microcomputers running inside 3 computers within 1 PCIe card.

[Figure: Intel VCA virtual network interfaces exposed to the host (vca-sw-net)]

Each computer within VCA is also called a node; therefore, there are 3 nodes within 1 VCA card. Unlike typical PCIe cards, VCA exposes itself to the host machine as virtual network interfaces. For example, 2 VCA cards (6 nodes) add 6 different virtual eth interfaces to the host machine, as shown in the figure above. These virtual eth interfaces are implemented as MMIO over PCIe. Given that each node is indeed an independent computer system with a full software stack, this virtual network interface concept might be a reasonable abstraction. I was worried about the overhead of going through the TCP/IP stack, but then I realized that Intel could provide dedicated drivers on both the host and the node side to bypass the TCP/IP stack, which is very possible, as suggested by those VCA drivers. It would be interesting to see what “packets” are sent and received over these virtual NICs. To support high bandwidth and throughput, the MMIO region is 4GB at minimum. This means each node takes a 4GB memory space from the main system memory, as well as from its own internal memory.

3. Speculations on Intel SGX Card

Now that we have some basic understanding of Intel VCA, we can speculate about what Intel SGX Card could be. Depending on what Intel meant by “disabling graphics accelerators”, it could mean removing those VCA drivers and SDKs within each node. Once we did that, we would have a prototype Intel SGX Card, where 3 SGX nodes, each running a typical operating system, connect to the host machine via PCIe. Now, what could we do?

To reuse most of the software stack already developed for VCA, I would probably keep the virtual network interface instead of creating a different device on the host machine. As such, the host still talks with the SGX card through virtual eth. Within each node of the SGX card, we could install the typical Intel SGX PSW and SDK without any trouble, since each node is an SGX machine. Then each node has all the necessary runtime to support SGX applications. On the host side, we could still install the Intel SGX SDK to support compilation “locally”, although we might not be able to install the PSW, assuming an old Xeon processor. But this is not a problem, because we will relay the compiled SGX application to the SGX card. To achieve this, a new SGX kernel driver is needed on the host machine to send the SGX application to one of the nodes within the SGX card via the virtual eth interface.

So far we have speculated about how to use the Intel SGX card within a host (or server). It is time to review the design goals of the Intel SGX card again:

  1. Enable older servers to support SGX
  2. Enlarge EPC memory pool
  3. Protect from side-channel attacks

The first goal can be achieved easily, given the PCIe design and the fact that each node within the Intel SGX card is a self-contained SGX-enabled computer. However, the scalability of this solution is still limited by the number of PCIe (x16) slots available within a server and the number of CPU nodes within an Intel SGX Card. The number of usable PCIe slots is also limited by the power supply of the system. Unless we are talking about some crazy GPU-oriented motherboard [6], 4 PCIe x16 slots seems to be a reasonable estimate. Multiplied by 3 (the number of nodes within an Intel SGX card), we would have 12 SGX-enabled CPU nodes available within a server.

The second goal is a byproduct of the independent DRAM of each node within the Intel SGX card. Recall that each node has a maximum of 32GB of memory available. If the Intel SGX card is based upon Intel VCA 2, each node has a maximum of 64GB of memory available. Because this 32GB (or 64GB) of memory is dedicated to the node for SGX computation, instead of being a portion of the main system memory within the server, we can anticipate the EPC being large for each node. For instance, a typical EPC size within an SGX-enabled machine is 128MB. Because of the Merkle tree used to maintain the integrity of each page and other housekeeping metadata, only around 90MB is left for real enclave allocations. This means the overhead of the EPC is roughly 1/4 in general. If we assume 32GB for each node within an Intel SGX card, we could easily have 16GB for the EPC, of which 4GB would be used for EPC management and 12GB for enclave allocations. Why only 16GB, you might ask? Well, remember that each node is a running system. We need some memory for the OS and applications, including the non-enclave part of SGX applications. Moreover, due to the MMIO requirement, a 4GB memory space is reserved on both the main system memory and the node’s memory for each node. As a result, we have roughly 12GB left for the OS and applications on each node. Of course, we could push for more, but you get the point. We will see the actual EPC size once the Intel SGX card is available.

The third goal is described as an “additional benefit” of using the Intel SGX card. Because the 3 nodes within an Intel SGX card each have their own independent RAM and cache (which are also separated from the main system if the host supports SGX as well), we could definitely have better security guarantees for SGX applications. First, SGX applications can run within a node, thus isolating themselves from other processes running on the main system. Second, different SGX applications can run on different nodes, thus reducing the impact of enclave-based malware or side-channel attacks. Everything sounds good! What could possibly go wrong?

4. Speculations on security

First of all, SGX applications running within the Intel SGX card are still vulnerable to the same attacks as before, because each node within the card is still a computer system with a full software stack. Unless this whole software stack is within the TCB, an SGX application is still vulnerable to attacks from all other processes and even from the OS or hypervisor running within the same node. From an SGX application’s point of view, nothing has really changed.

The other question is how a cloud service provider (CSP) could distribute SGX workloads. A straightforward solution would be based on load balancing, where a CSP distributes different SGX applications to different nodes for performance reasons, regardless of the security levels of different end users. Again, this is no different from an SGX-enabled host machine running different SGX applications from different users. Another solution would be mapping each node to one user, meaning that SGX applications from the same user run within the same node. While this solution reduces attacks from other end users, we can easily run into scalability issues, given the limited number of nodes available within a system and a possibly large number of end users. The other problem with this solution would be load imbalance. User A might have only 1 SGX application running on node N-A while user B might have 100 SGX applications running on node N-B. I would not be surprised if user B yells at the cloud.

That being said, I do not think Intel would take either approach. Instead, a VM-based approach might be used, where SGX applications from the same user run within the same VM and different users get different VMs. We can then achieve load balancing easily by assigning a similar number of VMs to each node. This approach is technically doable, since we have seen SGX support for KVM [7] and the nodes within the Intel SGX card support KVM too. It is also possible that Clear Linux [8] will be used to reduce the overhead of VMs by using KVM-based containers. The only question is whether a VM or container is enough to isolate potential attacks from other cloud tenants, e.g., cache-based attacks, and to defend against attacks from the OS and hypervisor, e.g., control-channel attacks.

5. Conclusion

This post tries to speculate about what the Intel SGX card would look like and how it would be used within a cloud environment. I have no doubt that some of the speculations could turn out to be totally wrong once we are able to see the real product. Nevertheless, I hope this post sheds some light on this new security product, what could/should be done, and what is still missing. All opinions are my own.

References:

[1] https://itpeernetwork.intel.com/sgx-data-protection-cloud-platforms/
[2] https://newsroom.intel.com/news/rsa-2019-intel-partner-ecosystem-offer-new-silicon-enabled-security-solutions/
[3] https://www.intel.com/content/www/us/en/products/servers/accelerators.html
[4] https://www.intel.com/content/dam/support/us/en/documents/server-products/server-accessories/VCA_Spec_HW_Users_Guide.pdf
[5] https://www.intel.com/content/dam/support/us/en/documents/server-products/server-accessories/VCA_SoftwareUserGuide.pdf
[6] https://www.pcgamer.com/asus-has-a-motherboard-that-supports-up-to-19-gpus/
[7] https://github.com/intel/kvm-sgx
[8] https://clearlinux.org/


Syscall hijacking in 2019

Whether you need to implement a kernel rootkit or inspect syscalls for intrusion detection, in a lot of cases you might need to hijack syscalls from a kernel module. This post summarizes the detailed procedure and provides a working example for both the x86_64 and aarch64 architectures on recent kernel versions. All the code can be found at [1]. Happy hacking~

1. Syscall hijacking

There are different ways to hijack a syscall, as summarized by [2]. The essence is to modify the sys_call_table within the kernel, overwriting the original address of a certain syscall with the address of one implemented by yourself. Here we use kallsyms_lookup_name to find the location of sys_call_table. However, 2 more things (or maybe 3, depending on the architecture; we will talk about that later) need to be considered. First, is the page holding sys_call_table writable? Recent kernels enforce read-only (RO) text pages, so we need to make the page writable (RW) again in our kernel module. Second, SMP environments require us to synchronize the sys_call_table modification with all cores. This can be achieved by disabling preemption.
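
Below is a minimal sketch of this flow on x86_64. It assumes a pre-4.17-style syscall signature (newer x86_64 kernels pass a single struct pt_regs * to each handler), a kernel that still exports kallsyms_lookup_name(), and a write_cr0() that does not yet pin the WP bit (it does from 5.3 on); see [1] for the complete, working module.

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/linkage.h>
#include <linux/kallsyms.h>
#include <linux/preempt.h>
#include <linux/unistd.h>
#include <asm/processor-flags.h>	/* X86_CR0_WP */
#include <asm/special_insns.h>		/* read_cr0()/write_cr0() */

static unsigned long **sct;		/* the real sys_call_table */
static asmlinkage long (*orig_read)(unsigned int fd, char __user *buf, size_t count);

static asmlinkage long hooked_read(unsigned int fd, char __user *buf, size_t count)
{
	/* inspect fd/buf/count here (see the PoC sketch in the next section) */
	return orig_read(fd, buf, count);
}

static void set_wp(int enable)
{
	unsigned long cr0 = read_cr0();

	/* CR0.WP keeps read-only pages read-only even for ring 0 */
	write_cr0(enable ? (cr0 | X86_CR0_WP) : (cr0 & ~X86_CR0_WP));
}

static int __init hook_init(void)
{
	sct = (unsigned long **)kallsyms_lookup_name("sys_call_table");
	if (!sct)
		return -ENOENT;

	preempt_disable();		/* keep the table update atomic w.r.t. other cores */
	set_wp(0);
	orig_read = (void *)sct[__NR_read];
	sct[__NR_read] = (unsigned long *)hooked_read;
	set_wp(1);
	preempt_enable();
	return 0;
}

static void __exit hook_exit(void)
{
	preempt_disable();
	set_wp(0);
	sct[__NR_read] = (unsigned long *)orig_read;
	set_wp(1);
	preempt_enable();
}

module_init(hook_init);
module_exit(hook_exit);
MODULE_LICENSE("GPL");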

2. Hijacking read syscall

Once we hijack a certain syscall, we are able to see all its parameters from user space. For example, we can see the file descriptor (FD), the user buffer, and the number of bytes (count) passed to the read syscall. The real meat of syscall hijacking comes from what we can do with these parameters. As a proof-of-concept (PoC), we trace the file name back from the FD and prevent users from reading a specific file by returning something else. In our implementation, we stop users from reading the README.md file (yup) and return a bunch of 7s. The good news is that we limit our target process to the testing program instead of hooking every process; since a syscall happens within a process context, “current” is always available. Accordingly, intrusion detection, system profiling, etc., are made possible thanks to the different syscall parameters.
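
Continuing the sketch above, this is roughly how the hooked read can map the FD back to a file name and confine the effect to the test program; the process name "test_read" and the return handling are assumptions, and the real PoC in [1] may differ (the module would also need #include <linux/file.h>, <linux/dcache.h>, <linux/err.h>, <linux/uaccess.h>, <linux/string.h>, and <linux/sched.h>).

static asmlinkage long hooked_read(unsigned int fd, char __user *buf, size_t count)
{
	long ret = -1;			/* -1 means "not handled, fall through" */
	struct fd f;
	char *page, *name;

	/* only interfere with our test program ("test_read" is a hypothetical name) */
	if (strcmp(current->comm, "test_read") != 0)
		return orig_read(fd, buf, count);

	f = fdget(fd);
	if (!f.file)
		return orig_read(fd, buf, count);

	page = (char *)__get_free_page(GFP_KERNEL);
	if (page) {
		name = d_path(&f.file->f_path, page, PAGE_SIZE);
		if (!IS_ERR(name) && strstr(name, "README.md")) {
			/* hand back a buffer of '7's instead of the real content;
			 * a real PoC would also fake an EOF at some point */
			size_t n = min_t(size_t, count, PAGE_SIZE);

			memset(page, '7', n);
			ret = copy_to_user(buf, page, n) ? -EFAULT : (long)n;
		}
		free_page((unsigned long)page);
	}
	fdput(f);

	return (ret == -1) ? orig_read(fd, buf, count) : ret;
}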

3. Architecture difference

Architecture makes a difference. Intel has a control bit within CR0 to write-protect read-only memory on x86_64. As a result, besides adding the W permission to the sys_call_table page, we also need to disable write protection within CR0. ARM, on the other hand, does not have this constraint. On the aarch64 board with kernel 4.4 that I used, the text page even allows writes.

Nevertheless, in case of page write protection, we will need to implement set_memory_rw and set_memory_ro (for recovery) ourselves, because neither of these functions is exported to kernel modules [3]. Essentially, we need to call apply_to_page_range and implement flush_tlb_kernel_range within our kernel module. This also reminds me of a potential bug within the current x86_64 implementation, where a TLB flush should be issued after we update the PTE, to synchronize the other CPU cores by triggering IPIs.
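
For reference, here is a sketch of a module-local set_memory_rw() mirroring the 4.x arch/arm64/mm/pageattr.c code in [3]. Note that the pte_fn_t callback signature changed in later kernels, and that &init_mm is not exported to modules either (it can be resolved via kallsyms just like sys_call_table), so treat this purely as an illustration.

#include <linux/mm.h>
#include <asm/pgtable.h>
#include <asm/tlbflush.h>

static int make_rw_cb(pte_t *ptep, pgtable_t token, unsigned long addr, void *data)
{
	pte_t pte = *ptep;

	pte = clear_pte_bit(pte, __pgprot(PTE_RDONLY));
	pte = set_pte_bit(pte, __pgprot(PTE_WRITE));
	set_pte(ptep, pte);
	return 0;
}

static int my_set_memory_rw(unsigned long addr, int numpages)
{
	unsigned long size = (unsigned long)numpages << PAGE_SHIFT;
	int ret;

	ret = apply_to_page_range(&init_mm, addr, size, make_rw_cb, NULL);
	/* flush stale TLB entries on all cores (this is what triggers the IPIs) */
	flush_tlb_kernel_range(addr, addr + size);
	return ret;
}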

References:

[1] https://github.com/daveti/syscallh
[2] https://blog.trailofbits.com/2019/01/17/how-to-write-a-rootkit-without-really-trying/
[3] https://lxr.missinglinkelectronics.com/linux/arch/arm64/mm/pageattr.c


Kernel build on Nvidia Jetson TX1

This post introduces a native Linux kernel build on the Nvidia Jetson TX1 dev board. The scripts are based on the jetsonhacks/buildJetsonTX1Kernel tools. Our target is JetPack 3.3 (the latest SDK supporting TX1 at the time of writing). All the scripts are available at [2]. Have fun~

1. Kernel build on TX1

Nvidia devtalk has some general information about kernel builds for TX1 [3], covering both native builds and cross compilation (e.g., from a TFTP server). Here we focus on the native build. The procedure roughly follows a) installing dependencies, b) downloading the kernel source, c) generating the config, d) making the build, and e) installing the new kernel image.

Unlike a typical kernel build on the x86-64 architecture, the most confusing part is figuring out the right kernel version supported by the board. TX1 uses Nvidia L4T [5], which is a customized kernel for the Tegra SoC. Depending on the JetPack version running on your TX1 board, a different L4T version is needed. As you can tell, a lot of preparation needs to be done before we can kick off the build.

2. buildJetsonTX1Kernel

JetsonHacks provides a bunch of scripts, called buildJetsonTX1Kernel [1], to ease and automate the different steps mentioned above. By detecting the Tegra chip id (sysfs) and the Tegra release note (/etc), these scripts can figure out the model of the board (e.g., TX1) and the version of JetPack installed (e.g., 3.2), and thus download the right version of the L4T kernel source. Please refer to [4] for detailed usage of these scripts.

3. One-click build

The buildJetsonTX1Kernel scripts are great and useful, but I realized that my TX1 setup was different and I needed some customizations to make my life (hopefully yours too) easier [2]. The first issue was the use of JetPack 3.3. I have submitted a patch to JetsonHacks for JetsonUtilities to correctly detect this latest JetPack version supported by TX1. Unfortunately, the buildJetsonTX1Kernel scripts still only support up to JetPack 3.2. Things get more complicated because both JetPack 3.2 and 3.3 use the same L4T kernel version.

The original scripts assume the use of the eMMC to hold all the kernel build artifacts, which does not hold in my TX1 environment, where a 64G SD card is mounted. Accordingly, I have updated all the scripts to use my SD card instead of the default /usr/src/ directory.

I have also created a one-click build script (kbuild.sh) to automate the whole process. Simply running ./kbuild.sh generates a new kernel image ready for reboot. I have also replaced xconfig with menuconfig, since I use SSH to connect to the TX1. A simple hello-world kernel module is also included as a starting point for module development.

References:

[1] https://github.com/jetsonhacks/buildJetsonTX1Kernel
[2] https://github.com/daveti/buildJetsonTX1Kernel
[3] https://devtalk.nvidia.com/default/topic/762653/-howto-build-own-kernel-for-jetson-tk1/
[4] https://www.jetsonhacks.com/2018/04/21/build-kernel-and-modules-nvidia-jetson-tx1/
[5] https://developer.nvidia.com/embedded/linux-tegra


Setting up Nvidia Jetson TX1

Starting with this post, I will share my learning and hacking experience on the Nvidia Jetson TX1 dev board. This post discusses an installation issue with JetPack [4] and post-installation configurations for TX1. We assume users follow the JetPack 3.3 installation guide to set up the TX1.

1. DHCP Issue

One of the two possible configurations for setting up JetPack on TX1 is to use DHCP, where the host machine is the DHCP server and the TX1 is the client. This connection model is needed when there is no switch available and only the host machine has an Internet connection. In my case, the host machine connects to the Internet via WiFi, and the Ethernet port is used to connect to the TX1. Everything looks fine until the TX1 tries to get an IP address from the host: “can’t determine the target IP” is returned in the terminal, and all the following JetPack installation steps on the TX1 fail (although the L4T has already been flashed to the TX1 successfully). It turns out this is a known bug due to argument changes in nmcli between Ubuntu 14.04 and 16.04 [1]. A detailed workaround is also provided there:

[Screenshot: nmcli workaround from the devtalk thread [1]]

Although the issue was reported on JetPack 3.2 for TX2, JetPack 3.3 still has this issue on TX1. JetPack 4.0 will hopefully fix this configuration bug.

2. Mount SD Card

TX1 comes with 16G of eMMC storage. After a full installation of JetPack, only 5.3G is left. As a result, we need extra storage to do anything useful, e.g., compiling the Linux kernel on the TX1. Again, the devtalk forum has a good discussion [2]. I used gparted to partition and format a 64G SD card with EXT4, then found the UUID using blkid. Once we have the UUID for the new partition, we can put it into /etc/fstab for auto mounting.

[Screenshot: /etc/fstab entry for the SD card]

3. Setup An Account

After a full deployment of JetPack on TX1, we have 2 accounts ready for use: “ubuntu/ubuntu” and “nvidia/nvidia”. We can use the latter to do CUDA development. However, to support multiple users on the board, we need to create new users using adduser. The first thing you see after logging in as a new user on TX1 might be “nvcc not found” – Duh! Since “nvidia” already has the CUDA environment set up, let’s copy its .bashrc and .profile into the new account. We can then compile CUDA programs using nvcc. But when we run the CUDA program, it seg faults – “unhandled level 3 permission fault (11)”:

[Screenshot: “unhandled level 3 permission fault (11)” when running the CUDA program]

It turns out that all the GPU device files under /dev (/dev/nvhost-*) belong to the group “video”, and running id as “nvidia” shows this group as well. Adding the new user to the “video” group (sudo usermod -aG video newuser) solves this permission issue.

References:

[1] https://devtalk.nvidia.com/default/topic/1023680/jetson-tx2/dhcp-not-working-and-no-internet-connection-from-tx2-after-installing-/1
[2] https://devtalk.nvidia.com/default/topic/1009267/jetson-tx2/mount-sd-card-into-jetson/
[3] https://devblogs.nvidia.com/even-easier-introduction-cuda/
[4] https://developer.nvidia.com/embedded/jetpack


Hacking Valgrind

This post talks about 3 commits I have recently added to my own valgrind tree [1]: support for the fsgsbase instructions, support for the rdrand/rdseed instructions, and a new trapdoor (client request) that provides a gdb-like add-symbol-file command. Note that none of these new features is available in mainstream valgrind at the time of writing, and I am not planning to work on upstreaming them anyway. Nevertheless, feel free to patch your own valgrind if needed. My work is supported by Fortanix [5].

1. Support for fsgsbase

The fsgsbase instructions allow user space to read [6] or write [7] the FS or GS register base on the x86_64 architecture, enabling indirect addressing using FS/GS, such as “mov %GS:0x10, %rax”. Surprisingly, the most challenging part (for me) was the decoding of amd64/x86_64 instructions. I am not interested in repeating how fucked-up this encoding mechanism is, but will only remind readers that the opcode is USELESS on this architecture. Anyway, once we figure out how to decode the fsgsbase instructions in valgrind, we are able to generate the corresponding VEX IRs.

Although FS/GS base updates from user space were not supported, valgrind has FS/GS base registers built into the guest VM state. Valgrind even hooks the arch_prctl() syscall to update those guest registers. For us, we need to remove all those constraints assuming a constant FS/GS, and allow the fsgsbase instructions to update the FS/GS base registers in the guest. Because valgrind is emulating FS/GS in the guest, there is no need to check for real hardware support for these instructions on the host. For details of the patch, please check [2].
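
For reference, this is what a client program exercising these instructions might look like; natively it needs both CPU and kernel FSGSBASE support, while under the patched valgrind the emulated guest state takes care of it.

/* build with: gcc -mfsgsbase -o fsgs fsgs.c */
#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>

int main(void)
{
	uint64_t base = _readgsbase_u64();		/* RDGSBASE */

	printf("GS base: %#lx\n", (unsigned long)base);
	_writegsbase_u64(base);				/* WRGSBASE: write the same value back */
	return 0;
}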

2. Support for rdrand/rdseed

rdrand calls the TRNG available inside the CPU to generate a random number [8]. rdseed is similar, although it focuses on providing random seeds for PRNGs [9]. The difference between them can be found at [10]. Unlike the fsgsbase instructions, valgrind needs to check whether or not the host CPU supports rdrand/rdseed when encountering these instructions in the client program, and delegate the actual execution to the real CPU on the host. (Although we could emulate these instructions in valgrind as well, faithfully executing them is preferred, especially when the CPU supports them.)

Once we have extended the CPUID handling to detect these instructions on the host CPU, we can start to write “dirty helpers” for rdrand/rdseed, which run the actual rdrand/rdseed instructions on the real CPU. Because these instructions may fail (they are non-blocking, with the carry flag not set), we need to loop on the carry flag, making sure we return the right random value/seed to the guest. Similarly, a sane implementation of rdrand/rdseed within the client program should also loop on the carry flag. This means we need to set the carry flag in the rflags of the guest VM state to help the client program move forward. It turns out this is not easy to do in valgrind, because rflags is not listed explicitly like the other registers of the guest VM state. Instead, all these flags need to be computed based on the operation of the current instruction.
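
For example, this is the carry-flag retry loop a well-behaved client program is expected to use (a small sketch built on the compiler intrinsic, which returns 1 only when CF was set):

/* build with: gcc -mrdrnd -o rd rd.c */
#include <stdio.h>
#include <immintrin.h>

int main(void)
{
	unsigned long long r;
	int tries = 10;

	while (tries--) {
		if (_rdrand64_step(&r)) {		/* 1 iff the carry flag was set */
			printf("rdrand: %#llx\n", r);
			return 0;
		}
	}
	fprintf(stderr, "rdrand kept failing\n");
	return 1;
}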

BTW, rdrand/rdseed is also a good example of the pathological design of x86_64 instruction encoding: they share the same opcode as cmpxchg8b and cmpxchg16b. For details of this patch, please check [3].

3. A new trapdoor: add-symbol-file

GDB supports loading symbols manually using the add-symbol-file command. It is useful when GDB cannot figure out what was loaded at a certain VA range (hence the ??? in the backtrace). Unfortunately, valgrind does not have such a mechanism. As a result, valgrind cannot recognize any memory mapping not directly triggered by the mmap() syscall, e.g., a memcpy from VA1 to VA2. It also means valgrind cannot recognize a binary doing a relocation by itself after the first mmap(), such as a loader. Based on these considerations, we add a new valgrind trapdoor (client request), VALGRIND_ADD_SYMBOL_FILE, allowing a client program to pass the memory mapping information to valgrind. It accepts 3 arguments: the file name of the mapping (e.g., a shared object), the starting mapping address (page aligned), and the length of the mapping. For details of this patch, please check [4].
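
A hypothetical usage sketch from the client side, assuming the patched valgrind.h from [1]/[4] exposes the request as a macro taking the three arguments described above (the file name, destination buffer, and helper below are made up for illustration):

#include <string.h>
#include <valgrind/valgrind.h>	/* the patched header from [1]; mainstream valgrind lacks this request */

/* tell valgrind that the bytes we just copied to dst are really libfoo.so;
 * dst is assumed to be page aligned */
static void load_by_hand(void *dst, const void *src, size_t len)
{
	memcpy(dst, src, len);	/* a relocation valgrind cannot see via mmap() */
	VALGRIND_ADD_SYMBOL_FILE("libfoo.so", dst, len);
}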

References:

[1] https://github.com/daveti/valgrind
[2] https://github.com/daveti/valgrind/commit/16ccd1974ce2ca13e10adac9906de5bc689c509d
[3] https://github.com/daveti/valgrind/commit/5986cc4a0c6bf2d41822df15e8f074437c32e391
[4] https://github.com/daveti/valgrind/commit/baa7d6b344a539b8842d7c157ab67af990213500
[5] https://fortanix.com/
[6] https://www.felixcloutier.com/x86/RDFSBASE:RDGSBASE.html
[7] https://www.felixcloutier.com/x86/WRFSBASE:WRGSBASE.html
[8] https://www.felixcloutier.com/x86/RDRAND.html
[9] https://www.felixcloutier.com/x86/RDSEED.html
[10] https://software.intel.com/en-us/blogs/2012/11/17/the-difference-between-rdrand-and-rdseed
