Speculations on Intel SGX Card

One of the exciting things Intel brought to RSA 2019 is the Intel SGX Card [2]. Yet there is not much information about this upcoming hardware. This post collects some related documentation from Intel and speculates about what could happen within the Intel SGX Card, with a focus on software architecture, cloud deployment, and security analysis. NOTE: all the figures come from public Intel blog posts and documentation, and there is no warranty for my speculations on Intel SGX Card! Read with caution!

1. Intel SGX Card

According to [2], “Though Intel SGX technology will be available on future multi-socket Intel® Xeon® Scalable processors, there is pressing demand for its security benefits in this space today. Intel is accelerating deployment of Intel SGX technology for the vast majority of cloud servers deployed today with the Intel SGX Card. Additional benefits offer access to larger, non-enclave memory spaces, and some additional side-channel protections when compartmentalizing sensitive data to a separate processor and associated cache.”

Simply put, the Intel SGX Card is introduced to address 3 problems with SGX usage in the cloud:

  1. Older servers/CPUs that do not support SGX
  2. Small EPC memory pool
  3. Side-channel attacks

Accordingly, the Intel SGX Card is designed as a PCIe card, which can be plugged into old servers. This solves the first problem. But what about the second and the third problems? How could the Intel SGX Card have a larger EPC memory pool and defend against side-channel attacks? To answer these questions, we need to look into the internals of the Intel SGX Card.

2. Intel VCA

According to [1], Intel SGX Card is actually built upon Intel VCA, the Intel® Visual Compute Accelerator (Intel® VCA) card [3]. Moreover, “Intel VCA is a purpose-built accelerator designed to boost performance of visual computing workloads like media transcoding, object recognition and tracking, and cloud gaming, originally developed as a way to improve video creation and delivery. In the Intel® SGX Card, the graphics accelerator has been disabled and the system re-optimized specifically for security purposes. In order to take advantage of Intel SGX technology, three Intel Xeon E processors are hosted in the card, which can fit inside existing, multi-socket server platforms being used in data centers today.”

Alright, so Intel SGX Card is essentially Intel VCA with the graphics accelerator disabled. Now it is time to learn what Intel VCA is. After some digging online, I found 2 precious documents describing the hardware specification [4] and the software guide [5] respectively. Readers are encouraged to give these documents a careful read. Below is the TL;DR version.


The Intel VCA (or VCA 2) is a PCIe card with 3 Xeon CPUs. As shown in the figure above, each CPU has its own DRAM, instead of sharing RAM. The internal architecture below better shows the nature of this card: 3 computers within a PCIe card.


These 3 CPUs not only have their own DRAM but also their own PCH chipsets and flash storage. They are connected and multiplexed by a PCIe bridge that links to the host machine. Note that VCA 2 also supports optional NVMe M.2 storage, as shown in the figure above. Let’s take a look at the software stack.


Did I say “3 computers within a PCIe card”? I actually mean it. Each CPU within the VCA card runs its own software stack, including UEFI/BIOS, operating system, drivers, SDKs, and applications. These operating systems could be Linux or Windows. Hypervisors are also supported including KVM and Xen. Even “better”, each CPU is also equipped with Intel SPS and ME. If you count ME as a microcomputer as well, now we have 3 microcomputers running inside 3 computers within 1 PCIe card.


Each computer within VCA is also called a node. Therefore, there are 3 nodes within 1 VCA card. Unlike typical PCIe cards, VCA exposes itself as virtual network interfaces to the host machine. For example, 2 VCA cards (6 nodes) add 6 different virt eth interfaces to the host machine, as shown in the figure above. These virt eth interfaces are implemented as MMIO over PCIe. Given that each node is indeed an independent computer system with a full software stack, this virtual network interface concept might be a reasonable abstraction. I was worried about the overhead of going through the TCP/IP stack. Then I realized that Intel could provide dedicated drivers on both the host and the node side to bypass the TCP/IP stack, which is very possible, as suggested by those VCA drivers. It would be interesting to see what “packets” are sent and received through these virtual NICs. To support high bandwidth and throughput, the MMIO region is 4GB minimum. This means each node consumes a 4GB memory space from the main system memory, as well as a matching region of its own internal memory.

3. Speculations on Intel SGX Card

Now that we have some basic understanding of Intel VCA, we can speculate about what the Intel SGX Card could be. Depending on what Intel meant by “disabling graphics accelerators”, it could be as simple as removing the VCA drivers and SDK within each node. Once we do that, we have a prototype Intel SGX Card, where 3 SGX nodes each run a typical operating system and connect with the host machine via PCIe. Now, what could we do?

To reuse most of the software stack already developed for VCA, I would probably keep the virtual network interface instead of creating a different device within the host machine. As such, the host still talks with the SGX card via virt eth. Within each node of the SGX card, we could install the typical Intel SGX PSW and SDK without any trouble, since each node is an SGX machine. Then each node has all the necessary runtime to support SGX applications. On the host side, we could still install the Intel SGX SDK to support compilation “locally”, although we might not be able to install the PSW, assuming an old Xeon processor. But this is not a problem, because we will relay the compiled SGX application to the SGX card. To achieve this, a new SGX kernel driver is needed on the host machine to send the SGX application to one node within the SGX card via the virt eth interface.

So far we have speculated how to use Intel SGX card within a host (or server). It is time to review the design goals of Intel SGX card again:

  1. Enable older servers to support SGX
  2. Enlarge EPC memory pool
  3. Protect from side-channel attacks

The first goal is achieved easily thanks to the PCIe design and the fact that each node within the Intel SGX card is a self-contained SGX-enabled computer. However, the scalability of this solution is still limited by the number of PCIe (x16) slots available within a server and the number of CPU nodes within an Intel SGX Card. The number of PCIe slots is also limited by the power supply within the system. Unless we are talking about some crazy GPU-oriented motherboard [6], 4 PCIe x16 slots seems a reasonable estimate. Multiplied by 3 (the number of nodes within an Intel SGX card), we would have 12 SGX-enabled CPU nodes available within a server.

The second goal is a byproduct of the independent DRAM of each node within the Intel SGX card. Recall that each node has a maximum of 32GB of memory available. If the Intel SGX card is based upon Intel VCA 2, each node instead has a maximum of 64GB. Because this 32GB (or 64GB) is dedicated to the node for SGX computation instead of being a portion of the main system memory within the server, we can anticipate a large EPC for each node. For instance, a typical EPC size within an SGX-enabled machine is 128MB. Because of the Merkle tree used to maintain the integrity of each page and other housekeeping metadata, only around 90MB is available for real enclave allocations. This means the overhead of the EPC is roughly 1/4 in general. If we assume 32GB for each node within an Intel SGX card, we could easily have a 16GB EPC, of which 4GB is used for EPC management and 12GB for enclave allocations. Why only 16GB, you might ask? Well, remember that each node is a running system. We need some memory for both the OS and applications, including the non-enclave part of SGX applications. Moreover, due to the MMIO requirement, a 4GB memory space is reserved on both the main system memory and the node’s memory for each node. As a result, we have roughly 12GB left for the OS and applications on each node. Of course, we could push for more, but you get the point. We will see the real EPC size once the Intel SGX card is available.

The third goal is described as an “additional benefit” of using the Intel SGX card. Because each of the 3 nodes within an Intel SGX card has its own independent RAM and cache (which are also separate from the main system, if the host supports SGX as well), we could definitely have better security guarantees for SGX applications. First, SGX applications can run within a node, thus isolating themselves from other processes running on the main system. Second, different SGX applications can run on different nodes, thus reducing the impact of enclave-based malware or side-channel attacks. Everything sounds good! What could possibly go wrong?

4. Speculations on security

First of all, SGX applications running within the Intel SGX card are still vulnerable to the same attacks as before, because each node within the card is still a computer system with a full software stack. Unless this whole software stack is within the TCB, an SGX application is still vulnerable to attacks from all other processes and even the OS or hypervisor running within the same node. From the SGX application's point of view, nothing has changed, really.

The other question is how a cloud service provider (CSP) could distribute SGX workloads. A straightforward solution would be based on load balancing, where a CSP distributes different SGX applications to different nodes for performance reasons, regardless of the security levels of different end users. Again, this is no different from an SGX-enabled host machine running different SGX applications from different users. Another solution would be mapping a node to one user, meaning that SGX applications from the same user run within the same node. While this solution reduces attacks from other end users, we can easily run into scalability issues given the limited number of nodes available within a system and a possibly large number of end users. The other problem with this solution would be load imbalance. User A might only have 1 SGX application running on node N-A while user B might have 100 SGX applications running on node N-B. I am not surprised if user B yells at the cloud.

That being said, I do not think Intel would take either approach. Instead, a VM-based approach might be used, where SGX applications from the same user run within the same VM and different users have different VMs. We can then achieve load balancing easily by assigning a similar number of VMs to each node. This approach is technically doable since we have seen SGX support for KVM [7], and nodes within the Intel SGX card support KVM too. It is also possible that Clear Linux [8] will be used to reduce the overhead of VMs by using KVM-based containers. The only question is whether a VM or container is enough to isolate potential attacks from other cloud tenants, e.g., cache-based attacks, and defend against attacks from the OS and hypervisor, e.g., control-channel attacks.

5. Conclusion

This post tries to speculate what Intel SGX card would look like and how it would be used within a cloud environment. I have no doubt that some of the speculations could be totally wrong once we are able to see the real product. Nevertheless, I hope this post could shed some light on this new security product and what could/should be done and what is still missing. All opinions are my own.


[1] https://itpeernetwork.intel.com/sgx-data-protection-cloud-platforms/
[2] https://newsroom.intel.com/news/rsa-2019-intel-partner-ecosystem-offer-new-silicon-enabled-security-solutions/
[3] https://www.intel.com/content/www/us/en/products/servers/accelerators.html
[4] https://www.intel.com/content/dam/support/us/en/documents/server-products/server-accessories/VCA_Spec_HW_Users_Guide.pdf
[5] https://www.intel.com/content/dam/support/us/en/documents/server-products/server-accessories/VCA_SoftwareUserGuide.pdf
[6] https://www.pcgamer.com/asus-has-a-motherboard-that-supports-up-to-19-gpus/
[7] https://github.com/intel/kvm-sgx
[8] https://clearlinux.org/


Syscall hijacking in 2019

Whether you need to implement a kernel rootkit or inspect syscalls for intrusion detection, in a lot of cases you might need to hijack syscalls in a kernel module. This post summarizes the detailed procedure and provides a working example for both x86_64 and aarch64 architectures on recent kernel versions. All the code can be found at [1]. Happy hacking~

1. Syscall hijacking

There are different ways to hijack a syscall, as summarized by [3]. The essence is to modify the sys_call_table within the kernel, replacing the original address of a certain syscall with one implemented by yourself. Here we use kallsyms_lookup_name to find the location of sys_call_table. However, 2 more things (or maybe 3, depending on the architecture; we will talk about that later) need to be considered. First, is the page of sys_call_table writable? Recent kernels have enforced read-only (RO) text pages, so we need to make the page writable (RW) again in our kernel module. Second, SMP environments require us to synchronize the sys_call_table modification with all cores. This can be achieved by disabling preemption.

2. Hijacking read syscall

Once we hijack a certain syscall, we are able to see all the parameters from user space. For example, we are able to see the file descriptor (FD), user buffer, and number of bytes (count) within the read syscall. The real meat of syscall hijacking comes from what we could do using these parameters. As a proof-of-concept (PoC), we trace back the file name from the FD and prevent users from reading a specific file by returning something else. In our implementations, we stop users from reading the README.md file (yup) and return a bunch of 7s. The good news is we limit our target process to be the testing procedure instead of any process. Since a syscall happens within the process context, “current” is always available. Accordingly, intrusion detection, system profiling, etc. are made possible thanks to the different syscall parameters.

3. Architecture difference

Architecture makes a difference. Intel has a control bit within CR0 to write-protect read-only memory on x86_64. As a result, besides adding the W permission to the sys_call_table page, we also need to disable the write protection within CR0. ARM, on the other hand, does not have this constraint. On the aarch64 board with kernel 4.4 that I used, the text page is also writable.

Nevertheless, in case of page write protection, we have to implement set_memory_rw and set_memory_ro (for recovery) by ourselves, because neither of these functions is exported to kernel modules [3]. Essentially, we need to call apply_to_page_range and implement flush_tlb_kernel_range within our kernel module. This also reminds me of a potential bug within the current x86_64 implementation, where a TLB flush should be needed after we update the PTE, to synchronize the other CPU cores by triggering IPIs.


[1] https://github.com/daveti/syscallh
[2] https://blog.trailofbits.com/2019/01/17/how-to-write-a-rootkit-without-really-trying/
[3] https://lxr.missinglinkelectronics.com/linux/arch/arm64/mm/pageattr.c


Kernel build on Nvidia Jetson TX1

This post introduces a native Linux kernel build on the Nvidia Jetson TX1 dev board. The scripts are based on the jetsonhacks/buildJetsonTX1Kernel tools. Our target is JetPack 3.3 (the latest SDK supporting TX1 at the time of writing). All the scripts are available at [2]. Have fun~

1. Kernel build on TX1

Nvidia devtalk has some general information about kernel build for TX1 [3], including both native build and cross compile (e.g., from a TFTP server). Here we focus on the native build. The procedure roughly follows a) installing dependencies, b) downloading the kernel src, c) generating config, d) making build, and e) installing the new kernel image.

Unlike a typical kernel build on the x86-64 architecture, the most confusing part is figuring out the right kernel version supported by the board. TX1 uses Nvidia L4T [5], which is a customized kernel for the Tegra SoC. Depending on the JetPack version running on your TX1 board, a different L4T version is needed. As you can tell, there are a lot of preparations to be done before we can kick off the build.

2. buildJetsonTX1Kernel

JetsonHacks provides a bunch of scripts, called buildJetsonTX1Kernel [1], to ease and automate the different steps mentioned above. By detecting the Tegra chip id (sysfs) and the Tegra release note (/etc), these scripts can figure out the model of the board (e.g., TX1) and the version of JetPack installed (e.g., 3.2), and thus download the right version of the L4T kernel source. Please refer to [4] for detailed usage of these scripts.

3. One-click build

The buildJetsonTX1Kernel scripts are great and useful, but somehow I realized that my TX1 setup was different and I needed some customizations to make my life (hopefully yours too) easier [2]. The first issue was the usage of JetPack 3.3. I have submitted a patch to JetsonHacks for JetsonUtilities to correctly detect this latest JetPack version supported by TX1. Unfortunately, buildJetsonTX1Kernel scripts still only support up to JetPack 3.2. Things get more complicated when both JetPack 3.2 and 3.3 use the same L4T kernel version.

The original scripts assume the usage of eMMC to hold all the kernel build artifacts, which does not hold in my TX1 environment where a 64G SD card is mounted. Accordingly, I have updated all the scripts to use my SD card instead of the default /usr/src/ directory.

I have also created a one-click build script (kbuild.sh) to automate the whole process within one script. Simply running ./kbuild.sh would generate a new kernel image ready to reboot. I have also replaced xconfig with menuconfig since I use SSH to connect with TX1. A simple hello world kernel module is also included as a starting point for module development.


[1] https://github.com/jetsonhacks/buildJetsonTX1Kernel
[2] https://github.com/daveti/buildJetsonTX1Kernel
[3] https://devtalk.nvidia.com/default/topic/762653/-howto-build-own-kernel-for-jetson-tk1/
[4] https://www.jetsonhacks.com/2018/04/21/build-kernel-and-modules-nvidia-jetson-tx1/
[5] https://developer.nvidia.com/embedded/linux-tegra


Setting up Nvidia Jetson TX1

Starting from this post, I will share my learning and hacking experience on the Nvidia Jetson TX1 dev board. This post discusses an installation issue of JetPack [4] and post-installation configurations for TX1. We assume users follow the JetPack 3.3 installation guide to set up the TX1.

1. DHCP Issue

One of the two possible configurations to set up JetPack on TX1 is to use DHCP, where the host machine is the DHCP server and the TX1 is the client. This connection model is needed when there is no switch available and only the host machine has an Internet connection. In my case, the host machine connects to the Internet via Wi-Fi, and the Ethernet port is used to connect with TX1. Everything looks fine until TX1 tries to get an IP address from the host. “can’t determine the target IP” is returned from the terminal, and all the following JetPack installation on the TX1 will fail (although we have already flashed the L4T to the TX1 successfully). Turns out this is a known bug due to the argument changes within nmcli between Ubuntu 14.04 and 16.04 [1]. A detailed workaround is also provided there:


Although the issue was reported on JetPack 3.2 for TX2, JetPack 3.3 still has this issue on TX1. JetPack 4.0 hopefully would fix this configuration bug.

2. Mount SD Card

TX1 comes with 16G eMMC storage. After a full installation of JetPack, only 5.3G is left. As a result, we need extra storage to do something useful, e.g., compiling the Linux kernel on TX1. Again, the devtalk forum has a good discussion [2]. I used gparted to partition and format a 64G SD card with EXT4, then found the UUID using blkid. Once we have the UUID for the new partition, we can put it into /etc/fstab for auto mounting.
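For reference, the resulting /etc/fstab entry looks like this (the UUID and mount point are placeholders — substitute whatever blkid reports and wherever you mount the card):

```
# <file system>                            <mount point>  <type> <options> <dump> <pass>
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx /media/sdcard  ext4   defaults  0      2
```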


3. Setup An Account

After a full deployment of JetPack on TX1, we have 2 accounts ready for use, “ubuntu/ubuntu” and “nvidia/nvidia”. We can use the latter to do CUDA development. However, to support multiple users on the board, we need to create new users using adduser. The first thing after logging in as a new user on TX1 might be “nvcc not found” – Duh! Since “nvidia” has the CUDA environment set up already, let’s copy its .bashrc and .profile into the new account. We can then compile a CUDA program using nvcc. But when we run the CUDA program, it seg faults – “unhandled level 3 permission fault (11)”:


Turns out all the GPU device files under /dev (/dev/nvhost-*) belong to group “video”, and id on “nvidia” shows this group as well. Adding the new user into “video” group (sudo usermod -aG video newuser) solves this permission issue.


[1] https://devtalk.nvidia.com/default/topic/1023680/jetson-tx2/dhcp-not-working-and-no-internet-connection-from-tx2-after-installing-/1
[2] https://devtalk.nvidia.com/default/topic/1009267/jetson-tx2/mount-sd-card-into-jetson/
[3] https://devblogs.nvidia.com/even-easier-introduction-cuda/
[4] https://developer.nvidia.com/embedded/jetpack


Hacking Valgrind

This post talks about 3 commits I have recently added to my own valgrind tree [1]: support for the fsgsbase instructions, support for the rdrand/rdseed instructions, and a new trapdoor (client request) supporting a gdb-like add-symbol-file command. Note that all these new features are not available in mainstream valgrind at the time of writing, and I am not planning to work on mainstreaming anyway. Nevertheless, feel free to patch your own valgrind if needed. My work is supported by Fortanix [5].

1. Support for fsgsbase

The fsgsbase instructions allow user space to read [6] or write [7] the FS or GS register base on the x86_64 architecture, enabling the indirect addressing mode using FS/GS, such as “mov %GS:0x10, %rax”. Surprisingly, the most challenging part (for me) was the decoding of amd64/x86_64 instructions. I am not interested in repeating how fucked-up this encoding mechanism is, but will only remind readers that the opcode is USELESS on this architecture. Anyway, once we figure out how to decode the fsgsbase instructions in valgrind, we are able to generate the corresponding VEX IRs.

Although FS/GS base updates from user space are not supported, valgrind has FS/GS base registers built into the guest VM state. Valgrind even hooks the arch_prctl() syscall to update those guest registers. For us, we need to remove all the constraints assuming a constant FS/GS, and allow the fsgsbase instructions to update the FS/GS base registers in the guest. Because valgrind is emulating FS/GS in the guest, there is no need to check for real hardware support for these instructions on the host. For details of the patch, please check [2].

2. Support for rdrand/rdseed

rdrand calls the TRNG available inside the CPU to generate a random number [8]. rdseed is similar, although it focuses on providing random seeds for PRNGs [9]. The difference between them can be found at [10]. Unlike the fsgsbase instructions, valgrind needs to check whether or not the host CPU supports rdrand/rdseed when encountering these instructions in the client program, and delegate the actual execution to the real CPU on the host. (Although we could emulate these instructions in valgrind as well, faithfully executing them is preferred, especially when the CPU supports these instructions.)

Once we have extended CPUID to detect these instructions on the host CPU, we can start to write “dirty helpers” for rdrand/rdseed, which run the actual rdrand/rdseed instructions on the real CPU. Because these instructions may fail (non-blocking, carry flag not set), we need to loop on the carry flag, making sure we return the right rand/seed to the guest. Similarly, a sane implementation of rdrand/rdseed within the client program should also loop on the carry flag. This means we need to set the carry flag in the rflags of the guest VM state to help the client program move forward. Turns out it is not easy to do this in valgrind, because rflags is not listed explicitly like the other registers of the guest VM state. Instead, all these flags need to be computed based on the operation of the current instruction.

BTW, rdrand/rdseed is also a good example of the pathological design of x86_64 instruction encoding. They have the same opcode as cmpxchg8b and cmpxchg16b. For details of this patch, please check [3].

3. A new trapdoor: add-symbol-file

GDB supports loading symbols manually using the add-symbol-file command. It is useful when GDB cannot figure out what was loaded at a certain VA range (thus ??? in the backtrace). Unfortunately, valgrind does not have such a mechanism. As a result, valgrind cannot recognize any memory mapping not directly triggered by the mmap() syscall, e.g., a memcpy from VA1 to VA2. It also means valgrind cannot recognize a binary doing a relocation by itself after the first mmap(), such as a loader. Based on these considerations, we add a new valgrind trapdoor (client request) — VALGRIND_ADD_SYMBOL_FILE, allowing a client program to pass the memory mapping information to valgrind. It accepts 3 arguments: the file name of the mapping, e.g., a shared object, the starting mapping address (page aligned), and the length of the mapping. For details of this patch, please check [4].


[1] https://github.com/daveti/valgrind
[2] https://github.com/daveti/valgrind/commit/16ccd1974ce2ca13e10adac9906de5bc689c509d
[3] https://github.com/daveti/valgrind/commit/5986cc4a0c6bf2d41822df15e8f074437c32e391
[4] https://github.com/daveti/valgrind/commit/baa7d6b344a539b8842d7c157ab67af990213500
[5] https://fortanix.com/
[6] https://www.felixcloutier.com/x86/RDFSBASE:RDGSBASE.html
[7] https://www.felixcloutier.com/x86/WRFSBASE:WRGSBASE.html
[8] https://www.felixcloutier.com/x86/RDRAND.html
[9] https://www.felixcloutier.com/x86/RDSEED.html
[10] https://software.intel.com/en-us/blogs/2012/11/17/the-difference-between-rdrand-and-rdseed


Valgrind trapdoor and fun

Valgrind has a client request mechanism, which allows a client to pass some information back to valgrind. This includes asking valgrind to log a message in its own environment, telling valgrind about a range of VA being used as a new stack, etc. [1]. This mechanism is essentially a trapdoor built into VEX during the binary translation. We start with a typical usage of the valgrind trapdoor to add a log line into valgrind from the client. We then remove the dependency on the valgrind header files and manually craft the trapdoor by ourselves. Note that this post is NOT about how to use the client request mechanism (read the damn manual:), nor about how to add a new client request (which I will talk about in another post on valgrind hacking in general). Last, the code can be found at [2]; have fun:)

1. A typical usage

To use the valgrind trapdoor, we need to include the header <valgrind/valgrind.h>. Let’s take VALGRIND_PRINTF as an example, which asks valgrind to log a message for the client. The code below prints out the magic number in the valgrind logging:

#define valgrind_printf_fmt_str "daveti: trapdoor, magic [%d]\n"
int magic = 777;
/* Normal valgrind trapdoor */
int ret = VALGRIND_PRINTF(valgrind_printf_fmt_str, magic);
printf("daveti: ret [%d]\n", ret);

When running with valgrind, the output looks like below:

**7382** daveti: trapdoor, magic [777]
daveti: ret [30]

The return value is the total length of the string printed out. Nothing fancy here, but do remember that this logging is done by valgrind rather than by the client program.

2. A better understanding

Alright, so what is that damn VALGRIND_PRINTF thing? Let’s have a deeper view using objdump (-S):

0000000000400527 :
  400527:	55                   	push   %rbp
  400528:	48 89 e5             	mov    %rsp,%rbp
  40052b:	48 81 ec a8 00 00 00 	sub    $0xa8,%rsp
  400532:	48 89 bd e8 fe ff ff 	mov    %rdi,-0x118(%rbp)
  400539:	48 89 b5 58 ff ff ff 	mov    %rsi,-0xa8(%rbp)
  400540:	48 89 95 60 ff ff ff 	mov    %rdx,-0xa0(%rbp)
  400547:	48 89 8d 68 ff ff ff 	mov    %rcx,-0x98(%rbp)
  40054e:	4c 89 85 70 ff ff ff 	mov    %r8,-0x90(%rbp)
  400555:	4c 89 8d 78 ff ff ff 	mov    %r9,-0x88(%rbp)
  40055c:	84 c0                	test   %al,%al
  40055e:	74 20                	je     400580
  400560:	0f 29 45 80          	movaps %xmm0,-0x80(%rbp)
  400564:	0f 29 4d 90          	movaps %xmm1,-0x70(%rbp)
  400568:	0f 29 55 a0          	movaps %xmm2,-0x60(%rbp)
  40056c:	0f 29 5d b0          	movaps %xmm3,-0x50(%rbp)
  400570:	0f 29 65 c0          	movaps %xmm4,-0x40(%rbp)
  400574:	0f 29 6d d0          	movaps %xmm5,-0x30(%rbp)
  400578:	0f 29 75 e0          	movaps %xmm6,-0x20(%rbp)
  40057c:	0f 29 7d f0          	movaps %xmm7,-0x10(%rbp)
  400580:	c7 85 30 ff ff ff 08 	movl   $0x8,-0xd0(%rbp)
  400587:	00 00 00
  40058a:	c7 85 34 ff ff ff 30 	movl   $0x30,-0xcc(%rbp)
  400591:	00 00 00
  400594:	48 8d 45 10          	lea    0x10(%rbp),%rax
  400598:	48 89 85 38 ff ff ff 	mov    %rax,-0xc8(%rbp)
  40059f:	48 8d 85 50 ff ff ff 	lea    -0xb0(%rbp),%rax
  4005a6:	48 89 85 40 ff ff ff 	mov    %rax,-0xc0(%rbp)
  4005ad:	48 c7 85 f0 fe ff ff 	movq   $0x1403,-0x110(%rbp)
  4005b4:	03 14 00 00
  4005b8:	48 8b 85 e8 fe ff ff 	mov    -0x118(%rbp),%rax
  4005bf:	48 89 85 f8 fe ff ff 	mov    %rax,-0x108(%rbp)
  4005c6:	48 8d 85 30 ff ff ff 	lea    -0xd0(%rbp),%rax
  4005cd:	48 89 85 00 ff ff ff 	mov    %rax,-0x100(%rbp)
  4005d4:	48 c7 85 08 ff ff ff 	movq   $0x0,-0xf8(%rbp)
  4005db:	00 00 00 00
  4005df:	48 c7 85 10 ff ff ff 	movq   $0x0,-0xf0(%rbp)
  4005e6:	00 00 00 00
  4005ea:	48 c7 85 18 ff ff ff 	movq   $0x0,-0xe8(%rbp)
  4005f1:	00 00 00 00
  4005f5:	48 8d 85 f0 fe ff ff 	lea    -0x110(%rbp),%rax
  4005fc:	b9 00 00 00 00       	mov    $0x0,%ecx
  400601:	89 ca                	mov    %ecx,%edx
  400603:	48 c1 c7 03          	rol    $0x3,%rdi
  400607:	48 c1 c7 0d          	rol    $0xd,%rdi
  40060b:	48 c1 c7 3d          	rol    $0x3d,%rdi
  40060f:	48 c1 c7 33          	rol    $0x33,%rdi
  400613:	48 87 db             	xchg   %rbx,%rbx
  400616:	48 89 d0             	mov    %rdx,%rax
  400619:	48 89 85 28 ff ff ff 	mov    %rax,-0xd8(%rbp)
  400620:	48 8b 85 28 ff ff ff 	mov    -0xd8(%rbp),%rax
  400627:	48 89 85 48 ff ff ff 	mov    %rax,-0xb8(%rbp)
  40062e:	48 8b 85 48 ff ff ff 	mov    -0xb8(%rbp),%rax
  400635:	c9                   	leaveq
  400636:	c3                   	retq

A quick code walk-through shows that this function does “NOTHING”, except saving some registers on the stack before updating them. This is actually by design – the valgrind trapdoor should not change any registers or memory when the client program does not run under valgrind. In other words, only valgrind is able to interpret this trapdoor and do something with a side effect. Let’s dive into this function.

The first roughly 20 lines are a typical usage of va_list, because VALGRIND_PRINTF accepts variable-length arguments like printf. Then we see a bunch of values pushed onto the stack, including this magic value 0x1403:

movq $0x1403, -0x110(%rbp)

And then more “useless” code near the end of the function:

rol $0x3, %rdi
rol $0xd, %rdi
rol $0x3d, %rdi
rol $0x33, %rdi
xchg %rbx, %rbx

After all those rotations, rdi is unchanged (3 + 13 + 61 + 51 = 128, i.e., two full 64-bit turns), and rbx is untouched by the self-exchange as well. Now it is time to look at the valgrind.h file [3] to sort things out, and here it goes:

#define __SPECIAL_INSTRUCTION_PREAMBLE                            \
                     "rolq $3,  %%rdi ; rolq $13, %%rdi\n\t"      \
                     "rolq $61, %%rdi ; rolq $51, %%rdi\n\t"

#define VALGRIND_DO_CLIENT_REQUEST_EXPR(                          \
        _zzq_default, _zzq_request,                               \
        _zzq_arg1, _zzq_arg2, _zzq_arg3, _zzq_arg4, _zzq_arg5)    \
    __extension__                                                 \
    ({ volatile unsigned long int _zzq_args[6];                   \
    volatile unsigned long int _zzq_result;                       \
    _zzq_args[0] = (unsigned long int)(_zzq_request);             \
    _zzq_args[1] = (unsigned long int)(_zzq_arg1);                \
    _zzq_args[2] = (unsigned long int)(_zzq_arg2);                \
    _zzq_args[3] = (unsigned long int)(_zzq_arg3);                \
    _zzq_args[4] = (unsigned long int)(_zzq_arg4);                \
    _zzq_args[5] = (unsigned long int)(_zzq_arg5);                \
    __asm__ volatile(__SPECIAL_INSTRUCTION_PREAMBLE               \
                     /* %RDX = client_request ( %RAX ) */         \
                     "xchgq %%rbx,%%rbx"                          \
                     : "=d" (_zzq_result)                         \
                     : "a" (&_zzq_args[0]), "0" (_zzq_default)    \
                     : "cc", "memory"                             \
                    );                                            \
    _zzq_result;                                                  \
    })

It turns out those rotation instructions are the essential trapdoor for x86_64. The xchg is used to ask valgrind to perform a client request, where the request number is _zzq_args[0] (reached via the pointer in rax) and the return value is saved into rdx. As you might have guessed, the request number is 0x1403 for VALGRIND_PRINTF.

In summary, when valgrind sees those rol instructions followed by an xchg, it recognizes the trapdoor and reads the arguments from the stack (via the pointer in rax). The first argument, the request number, determines which valgrind function will be called internally. The return value is held in rdx and then further propagated to the client program via rax.

3. Fun

Once we know what the trapdoor looks like, we can get rid of the dependency on the valgrind.h header file and craft our own trapdoor. Say we want to make our own VALGRIND_PRINTF. Then all we need are a va_list, filled with the format string and the variable-length arguments, and the trapdoor instructions followed by the xchg:

#define valgrind_printf_code		0x1403
#define valgrind_printf_fmt_str		"daveti: trapdoor, magic [%d]\n"
#define valgrind_trapdoor_code		\
	"rol $0x3, %%rdi\n\t"		\
	"rol $0xd, %%rdi\n\t"		\
	"rol $0x3d, %%rdi\n\t"		\
	"rol $0x33, %%rdi\n\t"

static unsigned long valgrind_printf_manual(char *fmt, ...)
{
	unsigned long args[6] = {0};
	unsigned long ret = 0;
	va_list vargs;

	/* Follow valgrind ABI */
	va_start(vargs, fmt);
	args[0] = (unsigned long)valgrind_printf_code;
	args[1] = (unsigned long)fmt;
	args[2] = (unsigned long)&vargs;

	/* rdx = client_req(rax); */
	asm volatile ("mov $0x0, %%rdx\n\t"
			valgrind_trapdoor_code
			"xchg %%rbx, %%rbx\n\t"
			: "=d"(ret)
			: "a"(&args[0])
			: "cc", "memory");
	va_end(vargs);
	return ret;
}

That’s it. Now we have a homemade VALGRIND_PRINTF – valgrind_printf_manual – which behaves exactly as the original does, and we do not need to include the valgrind.h header file at all.

NOTE 1: For client requests other than VALGRIND_PRINTF, building the arguments should be more straightforward. VALGRIND_PRINTF is tricky due to its use of variable-length arguments. We decided to use it because of its obvious side effect (log printing in valgrind).

NOTE 2: While the trapdoor mechanism is the same across architectures, the trapdoor instructions differ among architectures. We limit our focus to x86_64. Nevertheless, all these trapdoor instructions follow the same design goal – no changes to registers or memory when invoked without valgrind.

4. Security vs. Obfuscation

The design of the valgrind trapdoor is delicate and useful. It gives a client program an opportunity to pass some useful information to valgrind, e.g., to suppress false positives in memcheck. Meanwhile, because we can craft the trapdoor manually, leaving no trace of valgrind in the binary, a client program is essentially able to detect whether it is running under valgrind. Based on the detection result, the client program may do something totally different (for a PoC, please check out [2]).

From a security perspective, a client program can detect the valgrind runtime environment and thus skip malicious behaviors that might be caught by a certain valgrind plugin, similar to the VM detection techniques used by malware. From an obfuscation perspective, a client program can also hide critical functionality from being analyzed by valgrind at runtime. Although I have not seen a strong motivation to detect valgrind the way malware detects VMs, the trapdoor mechanism already provides a neat technique to achieve this.


[1] http://valgrind.org/docs/manual/manual-core-adv.html#manual-core-adv.clientreq
[2] https://github.com/daveti/valtrap
[3] https://github.com/daveti/valgrind/blob/zircon/include/valgrind.h


Some notes on SGX OwnerEpoch and Sealing

Intel SGX has been on the market for a while. Yet there are still a lot of misunderstandings and mysteries around this technology. This post provides an introduction to Intel SGX OwnerEpoch and Sealing, discusses their security impacts, and speculates on future usages. Note that this post assumes a general understanding of Intel SGX and its key hierarchy.

1. Intro

SGX OwnerEpoch is a 128-bit value used in key derivation, as shown in the key-derivation figure in [1].


According to [1], this value is “loaded into the SGXOWNEREPOCH0 and SGXOWNEREPOCH1 MSRs when Intel SGX is booted”. The whole purpose of this value is to “provide a user with the ability to add personal entropy into the key derivation process”. As a result, based on [1], it is included in all key derivations by the egetkey leaf instruction, such as for the Sealing key.

While an enclave provides runtime integrity and confidentiality, it cannot persist a secret across reboots. This is where sealing helps. Intel SGX Sealing uses the egetkey leaf instruction to derive the sealing key on the platform. This sealing key is then used to encrypt the secret within the enclave before it is written to disk. Depending on the sealing policy, either the public key of the enclave signer (MRSIGNER) or the measurement of the enclave (MRENCLAVE) is used to derive the key, meaning that only enclaves from the same signer, or ones with the exact same measurement, can unseal (decrypt) the secret. Note that both sealing and unsealing should happen inside an enclave.

2. Security Impacts

Since the OwnerEpoch is also included in deriving the sealing key, changing this register would cause unsealing failures on the same platform. Consequently, a malicious cloud provider can easily launch DoS attacks against all SGX sealed secrets by updating the OwnerEpoch. In the more realistic case where there is a contract between the cloud provider and the user, the cloud provider needs to guarantee that no code outside the TCB can update the OwnerEpoch (which is usually the case, since wrmsr is a privileged instruction and hypervisors can trap it), and that no code outside the TCB can trick the TCB into updating the OwnerEpoch (e.g., a confused deputy attack or kernel exploitation). In the worst case, the current in-use OwnerEpoch should always have a backup to help restore the value for unsealing.

Although we could have two platforms with the exact same model of SGX CPU and the exact same OwnerEpoch value (also assuming the same CPUSVN, etc.), data sealed on one platform cannot be unsealed on the other due to the unique device key per CPU package. This means SGX sealing does not support offline cross-platform data migration. As a workaround, SGX remote attestation is needed to establish a shared secret as the sealing key rather than using the egetkey leaf instruction.

3. Speculations

A question comes naturally for the OwnerEpoch – why do we need it and what can we do with it? By definition, it is used to provide “user” entropy to the key derivation process. It also implies that the “user” should be the “owner” of the platform (CPU), since both rdmsr and wrmsr are privileged instructions. In a cloud environment, however, this “user == owner” relationship breaks: cloud users are the “user” while cloud providers are the “owner”.

In a physical environment where the user “owns” the infrastructure (IaaS), the user should be able to set the OwnerEpoch to whatever value he wants. It is the same case as people running SGX applications on their own laptops. Here, rather than merely providing entropy, the OwnerEpoch might be used as a personal secret to protect sealed data. For example, Alice saves the current OwnerEpoch value after sealing, and resets it to a random value. Eve cannot unseal the data, even with root permission on Alice’s machine, without the right OwnerEpoch.

In a container environment where different users run different containers on the same platform, none of the users would have permission to update the OwnerEpoch. Instead, the cloud provider sets the value, and all users share the same OwnerEpoch during their key derivations. In this case, the OwnerEpoch seems meaningless for both cloud providers and users, except for adding more entropy.

In a hypervisor environment where different users run different guest OSes managed by a hypervisor that has sole control of the hardware (e.g., Xen and KVM), it is possible to virtualize the OwnerEpoch per guest (e.g., by adding the OwnerEpoch into the VMCS). Each user can provide his own OwnerEpoch for SGX key derivation. Note that this per-guest OwnerEpoch is only known to the guest and the cloud provider. As long as the cloud provider is trusted, this per-guest OwnerEpoch can be used as a personal secret as well. This secret usage might be really useful when different users run the same enclave signed by the same ISV. In this case, similar to the physical environment, data sealed by Alice cannot be unsealed by Eve even if they are running on the same platform.

4. Reality

While SGXv1 introduced the OwnerEpoch, it is not activated – we cannot write into it. SGXv2 claims support for updating the OwnerEpoch based on [2], but my testing on an SGXv2 CPU said no. In fact, it behaves just like SGXv1 – the first OwnerEpoch read throws an unchecked MSR access error; a following write enables the read operation, although the value read is always 0, no matter what value is written. To test the OwnerEpoch on your platform, please git clone [4]. My general feeling is that the OwnerEpoch is still not activated for some reason (at least on the SGXv2 CPU I tested). A comment in coreboot also suggests that the OwnerEpoch update mechanism is not determined yet [3]. Another update from [2] also shows that the Provisioning and Provisioning Sealing keys no longer rely on the OwnerEpoch.

5. Conclusion

We looked into the OwnerEpoch and its connection with key derivations, e.g., SGX sealing. As discussed above, the introduction of the OwnerEpoch as extra entropy seems really vague. Nevertheless, we speculated on its usage as a personal secret in cloud environments. Our trial on SGXv2 seems to suggest that its usage is still unclear.


[1] https://software.intel.com/sites/default/files/managed/48/88/329298-002.pdf
[2] https://software.intel.com/en-us/articles/intel-sdm
[3] https://github.com/coreboot/coreboot/blob/master/src/soc/intel/common/block/sgx/sgx.c
[4] https://github.com/daveti/soe
