Setting up Nvidia Jetson TX1

Starting from this post, I will share my learning and hacking experience on Nvidia Jetson TX1 dev board. This post discusses the installation issue of JetPack [4] and post-installation configurations for TX1. We assume users follow the JetPack 3.3 installation guide to setup the TX1.

1. DHCP Issue

One of the two possible configurations to setup JetPack on TX1 is to use DHCP, where the host machine is the DHCP server and the TX1 is the client. This connecting model is needed when there is no switch available and only the host machine has the Internet connection. In my case, the host machine connects with the Internet with Wifi and the Eithernet port is used to connect with TX1. Everything looks fine until TX1 tries to get an IP address from the host. “can’t determine the target IP” will be returned from the terminal and all the following JetPack installation on the TX1 will fail (although we have already flashed the L4T to the TX1 successfully). Turns out this is a known bug due to the argument changes within nmcli between Ubuntu 14.04 and 16.04 [1]. A detailed workaround is also provided there:


Although the issue was reported on JetPack 3.2 for TX2, JetPack 3.3 still has this issue on TX1. JetPack 4.0 hopefully would fix this configuration bug.

2. Mount SD Card

TX1 comes with 16G eMMC storage. After full installation of JetPack, only 5.3G is left. As a result, we need extra storage to do something useful, e.g., compiling the Linux kernel on TX1. Again, the devtalk forum has a good discussion [2]. I used gparted to partition and format a 64G SD card with EXT4. Then find the UUID using blkid. Once we have the UUID for the new partition, we can put it into /etc/fstab for auto mounting.


3. Setup An Account

After full deployment of JetPack on TX1, we have 2 accounts ready for use “ubuntu/ubuntu” and “nvidia/nvidia”. We can use the later one to do CUDA development. However, to support multiple users on the board, we need to create new users using adduser. The first thing after logging as a new user on TX1 might be “nvcc not found” – Duh! Since “nvidia” has CUDA environment setup already, let’s copy its .bashrc and .profile into the new account. We can then compile CUDA program using nvcc. But when we run the CUDA program, it is seg fault – “unhandled level 3 permission fault (11)”:


Turns out all the GPU device files under /dev (/dev/nvhost-*) belong to group “video”, and id on “nvidia” shows this group as well. Adding the new user into “video” group (sudo usermod -aG video newuser) solves this permission issue.



Posted in Embedded System, gpu | Tagged , , , , , , , | Leave a comment

Hacking Valgrind

This post talks about 3 commits I have recently added into my own valgrind tree [1], including the support for fsgsbase instructions, rdrand/rdseed instructions, and adding a new trapdoor (client request) to support gdb-like add-symbol-file command. Note that all these new features are not available in the mainstream valgrind by the time of writing, and I am not planning to work on mainstreaming anyway. Nevertheless, feel free the patch your own valgrind if needed. My work is supported by Fortanix [5].

1. Support for fsgsbase

fsgsbase instructions allow user space to read [6] or write [7] the FS or GS register base on x86_64 architecture, enabling indirect addressing mode using FS/GS, such as “mov %GS:0x10, %rax”. Surprisingly, the most challenging part (for me) was the decoding of amd64/x86_64 instructions. I am not interested in repeating how fucked-up this encoding mechanism is but only remind readers that opcode is USELESS on this architecture. Anyway, once we figure how to decode fsgsbase instructions in valgrind, we are able to generate the corresponding VEX IRs.

Although FS/GS base update from the user space is not supported, valgrind has FS/GS base registers built inside the guest VM state. Valgrind even hooks arch_prctl() syscall to update those guest registers. For us, we need to remove all those constrains assuming a constant FS/GS, and allow fsgsbase instructions to update FS/GS base registers in the guest. Because valgrind is emulating FS/GS in the guest, there is no need to check for the real hardware support for these instructions on the host. For details of the patch, please check [2].

2. Support for rdrand/rdseed

rdrand call the TRNG available inside the CPU to generate a random number [8]. rdseed is similar although focusing on providing random seed for PRNG [9]. The difference between them can be found at [10]. Unlike the fsgsbase instructions, valgrind needs to check whether or not the host CPU supports rdrand/rdseed when encountering these instructions in the client program, and delegate the acutal execution to the real CPU on the host. (Although we could emulate these instructions in valgrind as well, faithfully executing them is more preferred especially when the CPU supports these instructions.)

Once we have extended CPUID to detect these instructions on the host CPU, we can start to write down “dirtyhelpers” for rdrand/rdseed, which are the actual rdrand/rdseed instructions running on the real CPU. Because these instructions may fail (non-block, carry flag not set), we need to do a loop on the carry flag, making sure we return the right rand/seed to the guest. Similarly, a sane implementation of rdrand/rdseed within the client program should also do a loop on the carry flag. This means we need set the carry flag in the rflags of the guest VM state to help the client program move forward. Turns out it is not easy to do this in valgrind, because the rflags is not listed as other registers of the guest VM state explicitly. Instead, all these flags need to be computed based on the operation of the current instruction.

BTW, rdrand/rdseed is also a good example of the pathologicial design of x86_64 instruction encoding. They have the same opcode as cmpxchg8b and cmpxchg16b. For details of this patch, please check [3].

3. A new trapdoor: add-symbol-file

GDB supports loading symbols manually using add-symbol-file command. It is useful when GDB could not figure out what was loaded at certain VA range (thus ??? in the backtrace). Unfortunately, valgrind does not have such a mechanism. As a result, valgrind could not recognize any memory mapping not directly triggered by mmap() syscall, e.g., memcpy from VA1 to VA2. It also means valgrind could not recognize a binary doing a reloation by itself after the first mmap(), such as loader. Based on these considerations, we add a new valgrind trapdoor (client request) — VALGRIND_ADD_SYMBOL_FILE, allowing a client program to pass the memory mapping information to valgrind. It accepts 3 arguments, the file name of the mapping, e.g., a shared object, the starting mapping address (page aligned), and the length of the mapping. For details of this this patch, please check [4].



Posted in Dave's Tools, Programming | Tagged , , , , , , , , , | Leave a comment

Valgrind trapdoor and fun

Valgrind has a client request mechanism, which allows a client to pass some information back to valgrind. This includes asks valgrind to do a logging in its own environment, tells valgrind a range of VA being used as a new stack, and etc [1]. This mechansim is essentially a trapdoor built into VEX during the binary translation. We starts with a typical usage of valgrind trapdoor to add a logging into valgrind from the client. We then remove the dependency of valgrind header files and manually craft the trapdoor by ourselves. Note that this post is NOT about how to use the client request mechanism (read the damn manual:), nor on how to add a new client request (which I will talk in another post on valgrind hacking in general). Last, the code can be found at [2], and have fun:)

1. A typical usage

To use the valgrind trapdoor, we need to include the header: <valgrind/valgrind.h>. Let’s take VALGRIND_PRINTF as an example, which asks valgrind to add a logging for the client. The code below prints out the magic number in the valgrind logging:

#define valgrind_printf_fmt_str "daveti: trapdoor, magic [%d]\n"
int magic = 777;
/* Normal valgrind trapdoor */
ret = VALGRIND_PRINTF(valgrind_printf_fmt_str, magic);
printf("daveti: ret [%d]\n", ret);

When running with valgrind, the output looks like below:
**7382** daveti: trapdoor, magic [777]
daveti: ret [30]
The return value is the total length of the string printed out. Nothing fancy here but do remember that this logging is done by valgrind rather than the client program.

2. A better understanding

Alright, so what is that damn VALGRIND_PRINTF thing? Let’s have a deeper view using objdump (-S):

0000000000400527 :
  400527:	55                   	push   %rbp
  400528:	48 89 e5             	mov    %rsp,%rbp
  40052b:	48 81 ec a8 00 00 00 	sub    $0xa8,%rsp
  400532:	48 89 bd e8 fe ff ff 	mov    %rdi,-0x118(%rbp)
  400539:	48 89 b5 58 ff ff ff 	mov    %rsi,-0xa8(%rbp)
  400540:	48 89 95 60 ff ff ff 	mov    %rdx,-0xa0(%rbp)
  400547:	48 89 8d 68 ff ff ff 	mov    %rcx,-0x98(%rbp)
  40054e:	4c 89 85 70 ff ff ff 	mov    %r8,-0x90(%rbp)
  400555:	4c 89 8d 78 ff ff ff 	mov    %r9,-0x88(%rbp)
  40055c:	84 c0                	test   %al,%al
  40055e:	74 20                	je     400580
  400560:	0f 29 45 80          	movaps %xmm0,-0x80(%rbp)
  400564:	0f 29 4d 90          	movaps %xmm1,-0x70(%rbp)
  400568:	0f 29 55 a0          	movaps %xmm2,-0x60(%rbp)
  40056c:	0f 29 5d b0          	movaps %xmm3,-0x50(%rbp)
  400570:	0f 29 65 c0          	movaps %xmm4,-0x40(%rbp)
  400574:	0f 29 6d d0          	movaps %xmm5,-0x30(%rbp)
  400578:	0f 29 75 e0          	movaps %xmm6,-0x20(%rbp)
  40057c:	0f 29 7d f0          	movaps %xmm7,-0x10(%rbp)
  400580:	c7 85 30 ff ff ff 08 	movl   $0x8,-0xd0(%rbp)
  400587:	00 00 00
  40058a:	c7 85 34 ff ff ff 30 	movl   $0x30,-0xcc(%rbp)
  400591:	00 00 00
  400594:	48 8d 45 10          	lea    0x10(%rbp),%rax
  400598:	48 89 85 38 ff ff ff 	mov    %rax,-0xc8(%rbp)
  40059f:	48 8d 85 50 ff ff ff 	lea    -0xb0(%rbp),%rax
  4005a6:	48 89 85 40 ff ff ff 	mov    %rax,-0xc0(%rbp)
  4005ad:	48 c7 85 f0 fe ff ff 	movq   $0x1403,-0x110(%rbp)
  4005b4:	03 14 00 00
  4005b8:	48 8b 85 e8 fe ff ff 	mov    -0x118(%rbp),%rax
  4005bf:	48 89 85 f8 fe ff ff 	mov    %rax,-0x108(%rbp)
  4005c6:	48 8d 85 30 ff ff ff 	lea    -0xd0(%rbp),%rax
  4005cd:	48 89 85 00 ff ff ff 	mov    %rax,-0x100(%rbp)
  4005d4:	48 c7 85 08 ff ff ff 	movq   $0x0,-0xf8(%rbp)
  4005db:	00 00 00 00
  4005df:	48 c7 85 10 ff ff ff 	movq   $0x0,-0xf0(%rbp)
  4005e6:	00 00 00 00
  4005ea:	48 c7 85 18 ff ff ff 	movq   $0x0,-0xe8(%rbp)
  4005f1:	00 00 00 00
  4005f5:	48 8d 85 f0 fe ff ff 	lea    -0x110(%rbp),%rax
  4005fc:	b9 00 00 00 00       	mov    $0x0,%ecx
  400601:	89 ca                	mov    %ecx,%edx
  400603:	48 c1 c7 03          	rol    $0x3,%rdi
  400607:	48 c1 c7 0d          	rol    $0xd,%rdi
  40060b:	48 c1 c7 3d          	rol    $0x3d,%rdi
  40060f:	48 c1 c7 33          	rol    $0x33,%rdi
  400613:	48 87 db             	xchg   %rbx,%rbx
  400616:	48 89 d0             	mov    %rdx,%rax
  400619:	48 89 85 28 ff ff ff 	mov    %rax,-0xd8(%rbp)
  400620:	48 8b 85 28 ff ff ff 	mov    -0xd8(%rbp),%rax
  400627:	48 89 85 48 ff ff ff 	mov    %rax,-0xb8(%rbp)
  40062e:	48 8b 85 48 ff ff ff 	mov    -0xb8(%rbp),%rax
  400635:	c9                   	leaveq
  400636:	c3                   	retq

A quick code go-thru shows that this function does “NOTHING”, except saving some registers on the stack before updating them. This is actually the design of valgrind trapdoor – it should not change any registers or memory when the client program does not run with valgrind. In other words, only valgrind is able to interpret this trapdoor and do something with side effect. Let’s dive into this function.

The first around 20 lines are a typical usage of va_list, because VALGRIND_PRINTF accepts variable-length arguments like printf. Then we see bunch of values pushed into the stack, including this magic value 0x1403:

movq $0x1403, -0x110(%rbp)

And then more “useless” code near the end of the function:

rol $0x3, %rdi
rol $0xd, %rdi
rol $0x3d, %rdi
rol $0x33, %rdi
xchg %rbx, %rbx

After all those rotations, rdi is unchanged, as well as rbx. Now it is time to look at the valgrind.h file [3] to sort things out, and here it goes:

#define __SPECIAL_INSTRUCTION_PREAMBLE                            \
                     "rolq $3,  %%rdi ; rolq $13, %%rdi\n\t"      \
                     "rolq $61, %%rdi ; rolq $51, %%rdi\n\t"

#define VALGRIND_DO_CLIENT_REQUEST_EXPR(                          \
        _zzq_default, _zzq_request,                               \
        _zzq_arg1, _zzq_arg2, _zzq_arg3, _zzq_arg4, _zzq_arg5)    \
    __extension__                                                 \
    ({ volatile unsigned long int _zzq_args[6];                   \
    volatile unsigned long int _zzq_result;                       \
    _zzq_args[0] = (unsigned long int)(_zzq_request);             \
    _zzq_args[1] = (unsigned long int)(_zzq_arg1);                \
    _zzq_args[2] = (unsigned long int)(_zzq_arg2);                \
    _zzq_args[3] = (unsigned long int)(_zzq_arg3);                \
    _zzq_args[4] = (unsigned long int)(_zzq_arg4);                \
    _zzq_args[5] = (unsigned long int)(_zzq_arg5);                \
    __asm__ volatile(__SPECIAL_INSTRUCTION_PREAMBLE               \
                     /* %RDX = client_request ( %RAX ) */         \
                     "xchgq %%rbx,%%rbx"                          \
                     : "=d" (_zzq_result)                         \
                     : "a" (&_zzq_args[0]), "0" (_zzq_default)    \
                     : "cc", "memory"                             \
                    );                                            \
    _zzq_result;                                                  \

Turns out those rotations instructions are the essential trapdoor for x86_64. The xchg is used to ask valgrind to do a client request where the request number is _zzq_args[0] and the return value is saved into rdx. As you might have guessed, the request number is 0x1403 for VALGRIND_PRINTF.

In summary, when valgrind sees those rol instructions followed by an xchg, it recognizes this trapdoor, and passes arguments from the stack. The first argument, request number, determines which valgrind function will be called internally. The return value will be hold in rdx and then futher propagated to the client program via rax.

3. Fun

Once we know what the trapdoor looks like, we can get rid of the dependency on valgrind.h header file, and craft our own trapdoor. Say we wanna make our own VALGRIND_PRINTF. Then what we need are a va_list, filled with format strings and variable-length arguments, and the trapdoor instructions followed by xchg:

#define valgrind_printf_code		0x1403
#define valgrind_printf_fmt_str		"daveti: trapdoor, magic [%d]\n"
#define valgrind_trapdoor_code		\
	"rol $0x3, %%rdi\n\t"		\
	"rol $0xd, %%rdi\n\t"		\
	"rol $0x3d, %%rdi\n\t"		\
	"rol $0x33, %%rdi\n\t"

static unsigned long valgrind_printf_manual(char *fmt, ...)
	unsigned long args[6] = {0};
	unsigned long ret = 0;
	va_list vargs;

	/* Follow valgrind ABI */
	va_start(vargs, fmt);
	args[0] = (unsigned long)valgrind_printf_code;
	args[1] = (unsigned long)fmt;
	args[2] = (unsigned long)&vargs;

	/* rdx = client_req(rax); */
	asm volatile ("mov $0x0, %%rdx\n\t"	\
			valgrind_trapdoor_code	\
			"xchg %%rbx, %%rbx\n\t"	\
			: "=d"(ret)		\
			: "a"(&args[0])		\
			: "cc", "memory");
	return ret;

That’s it. Now we have a homemade VALGRIND_PRINT – valgrind_printf_manual, which behaves exactly what the former does, and we do not need to include valgrind.h header file at all.

NOTE 1: For other client requests (other than VALGRIND_PRINTF), the arguments building should be more straight-forward. VALGRIND_PRINTF is tricky due to the usage of variable-length arguments. And we decided it to use it because of its obvisous side effect (log printing in valgrind).

NOTE 2: While the trapdoor mechanism is the same across architectures, the trapdoor instructions are different among different architectures. We limit out focus on x86_64. Nevertheless, all these trapdoor instructions should follow the same deign goal – no changes on registers or memory when invoked without valgrind.

4. Security vs. Obfuscation

The design of valgrind trapdoor is delicate and useful. It gives a client program and opportunity to pass some useful information to valgrind, e.g., to suppress false positives in memcheck. Meanwhile, because we could craft the trapdoor manually, leaving no trace of valgrind in the binary, a client program is able to detect if it is running under valgrind essentially. Based on the detection result, the client program may do something totally different (for PoC, please check out [2]).

From security perspectives, a client program can detect the valgrind running environment, thus skip malicious behaviors which might be found by certain valgrind plugin, similar as VM detection techniques used by malware. From obfuscation points, a client program can also hide critical functionality from being analyzed by valgrind during runtime. Although I have not seen a strong motivation to detect valgrind as VM, the trapdoor mechanism has already provided a neat technique to achieve this.



Posted in Programming, Security | Tagged , , , | Leave a comment

Some notes on SGX OwnerEpoch and Sealing

Intel SGX has been there in the market for while. Yet there are still a lot of misundrestandings and mysteries about this technology. This post provides an introduction to Intel SGX OwnerEpoch and Sealing, discusses their security impacts, and speculates future usages. Note that this post assumes a general understanding of Intel SGX and its key hierarchy.

1. Intro

SGX OwnerEpoch is a 128-bit value used in key derivation, as shown in the figure below [1]:


According to [1], this value is “loaded into the SGXOWNEREPOCH0 and SGXOWNEREPOCH1 MSRs when Intel SGX is booted”. The whole purpose of this value is to “provide a user with the ability to add personal entropy into the key derivation process”. As a result, it is included in all key derivations by egetkey leaf instruction based on [1], such as the Sealing key.

While an enclave provides runtime integrity and confidentiality, it cannot persist the secret across reboots. In such a case, sealing is to help. Intel SGX Sealing uses the egetkey leaf instruction to derive the sealing key on the platform. This sealing key is then used to encryt the secret within the enclave before it is written into the disk. Depending on the sealing policy, either public key of the enclave signer (MRSIGNER) or the measurement of the enclave (MRENCLAVE) can be used to derive the key, meaning that only the enclaves from the same signer or the ones with the exact measuremement can unseal (decrypt) the secret. Note that both sealing and unsealing should happen inside an enclave.

2. Security Impacts

Since OwnerEpoch is also included to derive the sealing key, changes of this regsiter would cause unsealing failure on the same platform. Consequently, a malicious cloud provider can launch DoS attacks against all SGX sealed secrets easily by updating the OwnerEpoch. Or in a more realistic case when there is a contract between the cloud provider and user, the cloud provider needs to guarantee that no code outside the TCB can update the OwnerEpoch (which is usually the case since wrmsr is a privilaged instruction, and hyperviosrs can trap it), and that no code outside the TCB can trick the TCB to update the OwnerEpoch (e.g., confused deputy attack and kernel exploitation). In the worst case, the current in-use OwnerEpoch should always have a backup to help restore the value for unsealing.

Although we could have 2 platforms with the exact same model of SGX CPU and exact same value for OwnerEpoch (we also assume the same CPUSVN and etc.), sealing on one platform cannot be unsealed on the other due to the unique device key per CPU package. This means SGX sealing does not support offline cross-platform data migration. As a workaround for this case, SGX remote attestation is needed to establish a shared secret as the sealing key rather than using the egetkey leaf instruction.

3. Speculations

A question comes naturally for the OwnerEpoch – why do we need it and what can we do with it? By definition, it is used to provide “user” entropy to the key derivation process. It also implies that the “user” should be the “owner” of the platform (CPU), since both rdmsr and wrmsr are privileged instructions. In a cloud environment, however, this “user==owner” relationship breaks. Cloud users are the “user” while cloud providers being the “owner”.

In a physical environment where the user “own” the infrastructure (IasS), the user should be able to set the OwnerEpoch whatever he wants. It is the same case as people running SGX applications on their own laptops. In this case, rather than providing entropy, the OwnerEpoch might be used as a peronal secret to pretect sealing data. For example, Alice saves the current OwnerEpoch value after sealing, and resets it to a random value. Eve cannot unseal the data even with root permission on Alice’s machine without the right OwnerEpoch.

In a container environment where different users running different containers on the same platform, none of the users would have the permission to update the OwnerEpoch. Instead, the cloud provider sets the value, and all users share the same OwnerEpoch during their key derivation. In this case, the OwnerEpoch seems meaningless for both cloud providers and users except adding more entropy.

In a hypervisor environment where different users running different guest OSes managed by the hypervior that has the sole control of the hardware (e.g., Xen and KVM), it is possible to virtualize the OwnerEpoch per guest (e.g., adding the OwnerEpoch into VMCS). Each user can provide his own OwnerEpoch for SGX key derivation. Note that this per-guest OwnerEpoch is only known to the guest and the cloud provider. As long as the cloud provider is trusted, this per-guest OwnerEpoch can be used as a personal secret as well. Note that this secret usage might be really useful when different users running the same enclave signed by the same ISV. In this case, similar as the physical environment, data sealed by Alice cannot be unsealed by Eve even they are running on the same platform.

4. Reality

While SGXv1 has introduced OwnerEpoch, it is not activated – we cannot write into it. SGXv2 claims the support for updating OwnerEpoch based on [2], my testing on a SGXv2 CPU said no. In fact, it behaves just like SGXv1 – The first OwnerEpoch read throws an unchecked MSR access error; the following write enables the read opertaion, although the value is always 0, no matter what value is written. To test the OnwerEpoch on your platform, please git clone [4]. My general feeling is that this OwnerEpoch is still not activated for some reason (at least on the SGXv2 CPU I tested). One comment from coreboot also suggests that the OwnerEpoch update mechanism is not determined yet [3]. Another update from [2] also shows that Provisoning and Provisioning Sealing keys do not rely on OwnerEpoch anymore.

5. Conclusion

We look into the OwnerEpoch and its connection with key derivations, e.g., SGX sealing. As we discussed above, the introducing of OwnerEpoch as extra entropy seems really vague. Nevertheless, we speculated its usage as a personal secret in the cloud environments. Our trial on SGXv2 seems to suggest that its usage is still unclear.



Posted in Security | Tagged , , , , , | Leave a comment

Kernel Code Execution Time Measurement (kcetm)

This post mainly talks about the correct usage of tsc counters provided by Intel x86/x86-64 architectures to measure the Linux kernel code execution time. Most of the content here is borrowed/inspired from [1]. Note that this is NOT a post introduing tsc but for people who need to use tsc anyway. A kernel module with us/ns/tsc measurement framework ready is also provided at the end of the post [4]. Happy measurement!

1. Why measuring time is hard?

Human beings have a long history of trying to measure time [3], although this is not our focus. We are talking about measuring a piece of code execution running at ring 0 on a modern x86/x86-64 Intel CPU. The measurement framework usually looks like below:


The most common measurement functions inside the kernel are do_gettimeofday(), which returns in macroseconds, and getnstimeofday64(), which returns in nanoseconds. Depending on the complexity of the code needs to be measured, different scaling measurement should be applied. E.g., measuring a sub macrosecond function using do_gettimeofday() is inappropreiate. Similarly, measuring a minute level function using getnstimeofday64() is also an overkill.

In most cases, when the piece of code measured is non-trivial and takes a substantial amount of time (e.g., miliseconds), these gettimeofday functions do their job. We usually ignore the delay caused by interrupt handlings, preemptions, and schedulings. However, when the code to be measured is really small (e.g., tens of assemblies), the delay caused by interrupt handlings, preemptions, and schedulings cannot be ignored anymore, since we are measuring a nanosecond level execution time. In this case, we need to disable (local) interrupt, and preemption, making sure the measurement is “atomic” without external interferences. The other thing we need is tsc (timestamp counter).

2. What is the right way to use tsc?

Here we ignore the mov instructions needed to read/merge the tsc values. Instead, we focus on different usage patterns of rdtsc and rdtscp. We explore these patterns and point out the potential issues.

Pattern A:

...       A
...       |
rdtsc;    V
rdtsc;    A
...       |
...       V

This might be the most straightforward way of using tsc. We call rdtsc to read the counter before and after the code_to_measure. Unfortunately, because of out of order execution of CPU, it is hard to guarantee that the code_to_measure within the rdtsc blocks is exactly the code we wanna measure. It turns out either the code belong to code_to_measure could be executed by the CPU ahead of the first rdtsc or after the second rdtsc, or whatever code before the first rdtsc or after the second rdtsc could be measured as part of the code_to_measure. In short, this pattern gives the measurement of timing(some_portion(code_to_measure) + some_portion(code_to_measure_before) + some_portion(code_to_measure_after)).

Pattern B:

...       A
...       X
...       V
rdtsc;    | A
...       X |
...       | |
...       V |

To surpress the effect of out of order execution, we need a processor barrier to guarantee that code_to_measure is exactly the code we wanna measure when we setup the tsc counters – no more or less (hopefully). CPUID is one such instruction that serializes the instruction flow. The first CPUID guarantees that all instructions before it have done; the second CPUID guarantees that code_to_measure is also done before we execute the second rdtsc. However, it is still possible that the code after the second rdtsc get executed after the second CPUID but before we hit the second rdtsc (we mark this as delta). Similarly, the first rdtsc, should be execuated as the first instruction once the CPUID is done. However, out of order execution within the CPUID block is still possible, making few instructions of code_to_measure execute before the first rdtsc (we mark this as the other delta). Fortunately, because rdtsc uses RAX and RDX the same, it is possible that the CPU would get stalled when it tries to excute instructions followed by rdtsc due to data hazard. Nevertheless, the biggest problem comes from the extra CPUID instruction measured. Per [1], the exeuction time of CPUID itself is NOT stable, ranging from tens of CPU cycles to hundreds. This poses issues when the code_to_measure has similar timing scale. In short, this pattern gives the measurement of timing(code_to_measure + cpuid – delta_1 + delta_2).

Pattern C:

...       A
...       X
...       V
...       A
...       X
...       V

Comparing with the previous pattern, we add CPUID after each rdtsc. The first added CPUID gurantees that rdtsc is executed before any instructions from code_to_measure; the last CPUID added guarantees that the last rdtsc is executed after code_to_measure is done and before whatever code followed. This solution elimites the measurement deltas in the price of the other extra CPUID in the measurement block. In short, this pattern gives the measurement of timing(code_to_measure + 2*cpuid).

Pattern D:

...       A
...       X
...       V
...       A
...       X
...       V

At this point, I hope it is clear that the first CPUID has to happen before the first rdtsc (otherwise we may include the code above into the measurement block since the first rdtsc is not guaranteed to get executed right before the CPUID). Moving the second rdtsc before the second CPUID gurantees that the code after the second CPUID will not get measured. There is also no extra CPUID added into the measurement block. Unfortunately, this does not work due to the possible out of order execution insided the CPUID block. In short, this pattern gives the measurement of timing(some_portion(code_to_measure)).

Pattern E:

...       A
...       X
...       V
rdtscp;   | A
...       X |
...       | |
...       V |

To deal with the measurement variance introduced by the usage of CPUID, rdtscp is introduced in later CPUs, which is essentially a combination of CPUID+rdtsc. This instruction is also able to guarantee that by the execution of itself, all the instructions before it have been done. Unfortunately, the instructions after rdtscp may still get executed before rdtscp itself. In short, this pattern gives the measurement of timing(code_to_measure – delta_1 + some_portion(code_after_rdtscp)).

Pattern F:

...       A
...       X
...       V
...       A
...       X
...       V

By appending another CPUID after rdtscp, now the code after rdtscp cannot get executed before rdtscp. In short, this pattern gives the measurement of timing(code_to_measure – delta_1). If we ignore delta here, this pattern provides the most “accurate” measurement so far.

Pattern G:

...       X
...       | A
...       V |
rdtscp;     |
...       A
...       X
...       V

Comparing to the previous pattern, we replaced the first CPUID+rdtsc with rdtscp. While the first rdtscp guarantees that all the instructions before itself have been done, it cannot stop the CPU from executing some code from code_to_measure before rdtscp. To fix this problem, we can add a CPUID right after the first rdtscp, which unfortunately defeats the purpose of avoiding CPUID inside the measurement block. In short, this pattern gives the measurement of timing(some_portion(code_to_measure)).

3. Discussion

There are other instructions like CPUID, which can serialize the instruction flow, such as lfense and mfense (which are also used to defend Meldown and Spectre attacks). For user space measurement using tsc, [2] provides examples. Note, however, the measurement from user space includes the noise from interrupt handling, preemption, and scheduling.

4. Conclusion

It is easy to use tsc, but hard to use it right. Even if we use it right (e.g., Pattern F), we still could not guarantee the measurement is 100% right, but we are getting pretty close. The kcetm [5] implements Pattern F in a kernel module and is free to use.



Posted in Dave's Tools, OS, Programming | Tagged , , , , , , , , | 1 Comment

Rowhammer Pine64

Rowhammer attacks have been well known, and gotten a lot of publications already. However, we notice that most rowhammers happened on x86 architecture due to the easy access to clflush from the user space. ARM architecture (both ARMv7 and ARMv8) is also explored by DRAMA [1], where Android Ion memory allocator is leveraged to map un-cached pages into the user space. The reason is the cache line flush instruction in ARMv7 is ring-0 instruction. However, ARMv8 exposes cache line flush instruction to the user space again, providing an opportunity to rowhammer from the user space. This post explores rowhammer attacks using ARMv8 cache line flush instructions from the user space using Pine64 board. Let’s start~

1. Setup Pine64

Pine64 [2] is a cheap ARM 64-bit SoC. The one we are using is Pine A64+, with a quad-core ARM Cortex A53 64-Bit processor, and 2G LPDDR3. Both the support for ARMv8 ISA and equipment with DDR3 memory make this board a desired target. However, setting up this board really took us a while. We have encoutered both hardware and software issues on this board – I guess this is you get what you paid:(

Issue-I: Serial console on the Ext connector does NOT work!

As a result, none of the Linux images we tried (Ubuntu, Arch, OpenSUSE, Fedora) worked. In most cases, the booting process stuck at the uboot shell, and the terminal (screen/minicom) got scrambled characters.

Solution: Use the serial console on the Euler bus [3].


Issue-II: Ubuntu/Arch/OpenSUSE images do NOT work!

It is really frustrating for us that major images from Pine A64 software release [4] just do not work. In the best cases, we were able to boot into the kernel but stuck at login. Or the board kept rebooting itself. It is still not clear for us why these mainstream images do not work on our board.

Solution: Use Fedora ARM aarch64 image [5]!

Although Fedora support is never official for Pine64, as one post [6] pointed out that Fedora 27 Minimum image works with some tweaks. We verified that it indeed works on our board. Further, we use Fedora 27 Workstation on our board, and it works like a charm. To easily burn a Fedora image into SD card, we recommend a Fedora host machine with the following procedures:

sudo dnf install uboot-images-armv8 fedora-arm-installer
sudo fedora-arm-image-installer --image=Fedora-Workstation-27-1.6.aarch64.raw.xz --target=pine64_plus --media=/dev/sdb --norootpass

Note that “–norootpass” is needed for Workstation since no user will be created during the kernel booting. “–resizefs” should also be useful to make full usage of the SD card. For more information about arm-image-installer, please refer to [7].

Once we boot into the Linux kernel, we should be able to see things below:

[daveti@localhost ~]$ uname -a
Linux localhost.localdomain 4.13.9-300.fc27.aarch64 #1 SMP Mon Oct 23 13:33:18 UTC 2017 aarch64 aarch64 aarch64 GNU/Linux
[daveti@localhost ~]$ cat /proc/cpuinfo
processor : 0
BogoMIPS : 48.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4

processor : 1
BogoMIPS : 48.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4

processor : 2
BogoMIPS : 48.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4

processor : 3
BogoMIPS : 48.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4

[daveti@localhost ~]$

2. Rowhammer from the user space

One way we decided to do rowhammer is from the user space, without the knowledge of how virtual address, physical address, bank, and row are mapped inside the DRAM. Our goal is to launch this attack as a normal user without any privillge. We also limit our goal to find bit flips within DRAM, rather than developing ful exploitations. We built our tool based on Google’s rowhammer-test, extending the support for ARMv8 64-bit (aarch64) architecture.

Unlike ARMv7, ARMv8 exposes cache line flush instructions to the user space. This provides us an opportunity to find the replacement for “clflush” for ARMv8. Unfortunately, cache line flushing in ARMv8 is not that straightforward as x86. Instead of just one instruction, more than 10 instructions are needed to flush data cache. One can refer to uboot “__asm_flush_dcache_range” [8], and ARMv8-A Programmers Guide [9]. To accelerate the cache line flush, we move the computation for cache line mask out of loops, leaving 4 instructions in total (comparing to 1 clflush). The detailed code change can be found at [10]. Note that cache line mask needs to be applied to the virtual address before flushing. We tried direct “dc civac %[addr]”, and got kernel Oops:

[ 3484.162251] rowhammer_test[9281]: unhandled level 0 translation fault (11) at 0xffffffffffffffff, esr 0x92000144, in rowhammer_test[400000+2000]
[ 3484.175317] CPU: 1 PID: 9281 Comm: rowhammer_test Not tainted 4.13.9-300.fc27.aarch64 #1
[ 3484.183453] Hardware name: sunxi sunxi/sunxi, BIOS 2017.09 10/10/2017
[ 3484.189901] task: ffff800065d90000 task.stack: ffff800063d1c000
[ 3484.195853] PC is at 0x400ad8
[ 3484.198830] LR is at 0x400a8c
[ 3484.201822] pc : [<0000000000400ad8>] lr : [<0000000000400a8c>] pstate: 60000000
[ 3484.209228] sp : 0000ffffee7a9010
[ 3484.212573] x29: 0000ffffee7a9060 x28: 0000ffffee7a9050
[ 3484.217894] x27: 0000ffffee7a9010 x26: 0000ffffee7a9050
[ 3484.223232] x25: 0000ffffee7a9060 x24: 0000000000000001
[ 3484.228553] x23: 000000000000000a x22: 0000000000400000
[ 3484.233890] x21: 0000000000400e40 x20: 0000000000400f40
[ 3484.239211] x19: 0000000000420000 x18: 0000000000000020
[ 3484.244548] x17: 0000ffffa3f0d3d0 x16: 0000000000420040
[ 3484.249869] x15: 0000ffffa3ed4a60 x14: 0003c5c40caf2f6b
[ 3484.255206] x13: 00000003e8000000 x12: 0000ffffa3ed5c34
[ 3484.260543] x11: 0000ffffa3ee3dd8 x10: 0000ffffa3ed6d28
[ 3484.265862] x9 : 003b9aca00000000 x8 : 0000000000000000
[ 3484.271199] x7 : 00000000e961b8a2 x6 : 0000ffffa40730d0
[ 3484.276519] x5 : 0000ffffa4073080 x4 : 0000000000083d60
[ 3484.281855] x3 : 0000ffff9843a000 x2 : 0000ffffee7a9018
[ 3484.287176] x1 : 0000000000000000 x0 : ffffffffffffffff
[ 5689.587275] rowhammer_test[9536]: unhandled level 2 translation fault (11) at 0x00000000, esr 0x92000146, in rowhammer_test[400000+2000]
[ 5689.599618] CPU: 1 PID: 9536 Comm: rowhammer_test Not tainted 4.13.9-300.fc27.aarch64 #1
[ 5689.607753] Hardware name: sunxi sunxi/sunxi, BIOS 2017.09 10/10/2017
[ 5689.614201] task: ffff800065d95a00 task.stack: ffff800079290000
[ 5689.620154] PC is at 0x400ad8
[ 5689.623131] LR is at 0x400a8c
[ 5689.626106] pc : [<0000000000400ad8>] lr : [<0000000000400a8c>] pstate: 20000000
[ 5689.633531] sp : 0000ffffe3ae6600
[ 5689.636853] x29: 0000ffffe3ae6650 x28: 0000ffffe3ae6640
[ 5689.642190] x27: 0000ffffe3ae6600 x26: 0000ffffe3ae6640
[ 5689.647531] x25: 0000ffffe3ae6650 x24: 0000000000000001
[ 5689.652852] x23: 000000000000000a x22: 0000000000400000
[ 5689.658190] x21: 0000000000400e40 x20: 0000000000400f40
[ 5689.663509] x19: 0000000000420000 x18: 0000000000000020
[ 5689.668846] x17: 0000ffff86a133d0 x16: 0000000000420040
[ 5689.674167] x15: 0000ffff869daa60 x14: 002256158d7d8201
[ 5689.679506] x13: 00000003e8000000 x12: 0000ffff869dbc34
[ 5689.684827] x11: 0000ffff869e9dd8 x10: 0000ffff869dcd28
[ 5689.690164] x9 : 003b9aca00000000 x8 : 0000000000000000
[ 5689.695483] x7 : 00000000e961b8a2 x6 : 0000ffff86b790d0
[ 5689.700822] x5 : 0000ffff86b79080 x4 : 0000000000083d60
[ 5689.706143] x3 : 0000000000000000 x2 : 0000ffffe3ae6670
[ 5689.711481] x1 : 0000000000000000 x0 : 0000000000000000
[daveti@localhost ~]$

3. Cache line flushing is expensive!

We ran our code with cache line flushing enabled and disabled (Note that bit flips do NOT depend on cache line flushing; however, the detection of rowhammer needs it). We show some samples runs of rowhammer attacks here:

[daveti@localhost rowhammer-test]$ ./rowhammer_test
cache line mask [0x3f]
Iteration 0 (after 0.00s)
Took 1007.3 ms per address set
Took 10.0728 sec in total for 10 address sets
Took 233.167 nanosec per memory access (for 43200000 memory accesses)
This gives 34310 accesses per address per 64 ms refresh period
Checking for bit flips took 1.095023 sec
Iteration 1 (after 11.17s)
Took 1006.9 ms per address set
Took 10.069 sec in total for 10 address sets
Took 233.080 nanosec per memory access (for 43200000 memory accesses)
This gives 34323 accesses per address per 64 ms refresh period
Checking for bit flips took 1.092644 sec
Iteration 2 (after 22.33s)
Took 1009.1 ms per address set
Took 10.0913 sec in total for 10 address sets
Took 233.595 nanosec per memory access (for 43200000 memory accesses)
This gives 34247 accesses per address per 64 ms refresh period
Checking for bit flips took 1.092391 sec
Iteration 3 (after 33.51s)
Took 1005.2 ms per address set
Took 10.052 sec in total for 10 address sets
Took 232.685 nanosec per memory access (for 43200000 memory accesses)
This gives 34381 accesses per address per 64 ms refresh period
Checking for bit flips took 1.081366 sec 

[daveti@localhost rowhammer-test]$ ./rowhammer_test
cache line mask [0x3f]
Iteration 0 (after 0.00s)
Took 90.4 ms per address set
Took 0.903528 sec in total for 10 address sets
Took 20.915 nanosec per memory access (for 43200000 memory accesses)
This gives 382500 accesses per address per 64 ms refresh period
Checking for bit flips took 1.093423 sec
Iteration 1 (after 2.00s)
Took 121.6 ms per address set
Took 1.21641 sec in total for 10 address sets
Took 28.158 nanosec per memory access (for 43200000 memory accesses)
This gives 284113 accesses per address per 64 ms refresh period
Checking for bit flips took 1.096945 sec
Iteration 2 (after 4.31s)
Took 104.5 ms per address set
Took 1.04498 sec in total for 10 address sets
Took 24.189 nanosec per memory access (for 43200000 memory accesses)
This gives 330723 accesses per address per 64 ms refresh period
Checking for bit flips took 1.089356 sec
Iteration 3 (after 6.45s)
Took 133.6 ms per address set
Took 1.33556 sec in total for 10 address sets
Took 30.916 nanosec per memory access (for 43200000 memory accesses)
This gives 258768 accesses per address per 64 ms refresh period
Checking for bit flips took 1.089574 sec
Iteration 4 (after 8.87s)
Took 122.0 ms per address set
Took 1.22025 sec in total for 10 address sets
Took 28.247 nanosec per memory access (for 43200000 memory accesses)
This gives 283220 accesses per address per 64 ms refresh period
Checking for bit flips took 1.077736 sec 

The first one enabled cache flusing; the second disabled it. Comparing the results, the cache flushing in ARMv8 took around 200 nanosec per memory access. That means 200 CPU cycles for a 1 GHz CPU.



Posted in Linux Distro, OS, Security | Tagged , , , , , | Leave a comment

“make deb-pkg” broken

Last time when I hacked the Linux kernel on Ubuntu, it was 4.4 on LTS 14.04. Time flies. Now I need to hack the kernel 4.13 again on LTS 16.04, and find the kernel build broken. It is fine that we could not have a stable kernel API since at least the ABI is stable. But what about kernel build?

1. “make deb-pkg” broken

This used to be my way to build a kernel on Ubuntu. If something wrong happens and stops the build, just fix the error and repeat the command again — it was incremental build. But now it is broken, as it launches “make clean” by default. Why? Because it needs to generate a source tar at first, which requires a “clean” to exclude all build artifacts[1]. Instead, use “make bindeb-pkg“.

2. VirtualBox build broken

I have no idea when Ubuntu decided to include VirtualBox guest/drivers into the kernel tree, but it is broken, because the source directory structure is different from the ones used by the makefiles. An “ln” based workaround is available[2], although I just hacked those makefiles to fix errors.


Posted in Linux Distro | Tagged , , | 1 Comment

Running Multics on Linux (Fedora 27)

This post follows the “Multics Simulator Instructions”[1] (with some tweaks) to setup Multics simulator dps8m and run Multics on my Fedora 27. Other Linux distro (Ubuntu/Debian/Raspbian) may need some changes but basically work the same way. Experience the cutting-edge secure operating system made on 1960s by yourself!


1. Install deps – libuv

While libuv can be built from source[2], it would be more easier to install the library using your package manager. For Fedora 27:

dnf install libuv libuv-devel

2. Build the simulator – dps8

Once libuv and its header files are installed, go ahead to build the simulator following the instructions[2]:

git clone git:// dps8m-code
cd dps8m-code
git checkout R1.0

Once the build is done, the dps8 binary can be found under dps8m-code/src/dps8.

3. Run dps8

mkdir -p $HOME/multics/linux
cp $HOME/dps8m-code/src/dps8/dps8 $HOME/multics/linux
cp $HOME/dps8m-code/src/dps8/Devices.txt $HOME/multics/linux

This should start the simulator. Type “quit” to exit the simulator:

[daveti@daveti linux]$ ./dps8
DPS8/M emulator (git b7a50ffc)
Production build
DPS8M system session id is 18555
Please register your system at
or create the file ‘serial.txt’ containing the line ‘sn: 0’.
FNP telnet server port set to 6180

DPS8M simulator V4.0-0 Beta git commit id: c420925a
sim> quit
[daveti@daveti linux]$

4. Run Multics

Download the QuickStart disk image, unzip it, and mv/rename disk image directory as multics/qs:

unzip $HOME/QuickStart*.zip
mv $HOME/QuickStart_MR12.6f $HOME/multics/qs
cd $HOME/multics/qs
../linux/dps8 MR12.6f_boot.ini

The whole booting process takes a while, but you should be able to see something below if Multics starts correctly:

[daveti@daveti qs]$ ../linux/dps8 MR12.6f_boot.ini
DPS8/M emulator (git b7a50ffc)
Production build
DPS8M system session id is 18555
Please register your system at
or create the file ‘serial.txt’ containing the line ‘sn: 0’.
FNP telnet server port set to 6180

DPS8M simulator V4.0-0 Beta git commit id: c420925a
TAPE: unit is read only
bootload_0: Booting system MR12.6f generated 01/09/17 1119.1 pst Mon.
0426.6 announce_chwm: 428. pages used of 512. in wired environment.
0426.6 announce_chwm: 706. words used of 1024. in int_unpaged_page_tables.
find_rpv_subsystem: Enter RPV data: M-> [auto-input] rpv a11 ipc 3381 0a

0426.7 load_mst: 946. out of 1048. pages used in disk mst area.
bce (early) 0426.7: M-> [auto-input] bce

Multics Y2K. System was last shudown/ESD at:
Wednesday, January 17, 2018 17:07:31 pst
Current system time is: Wednesday, January 17, 2018 20:26:43 pst.
Is this correct? M-> [auto-input] yes

The current time is more than the supplied boot_delta hours beyond the
unmounted time recorded in the RPV label. Is this correct? M-> [auto-input] yes

bce (boot) 2026.7: M-> [auto-input] yes

bce: Unrecognizable request. Type lr for a list of requests.
bce (boot) 2026.7: M-> [auto-input] boot star

Multics MR12.6f – 01/17/18 2027.0 pst Wed
2027.0 Loading FNP d, >user_dir_dir>SysAdmin>a>mcs.7.6c>site_mcs 7.6c
Received BOOTLOAD command…
listening to 6180
2027.0 FNP d loaded successfully

scavenge_vol: No volumes found
2027 as as_init_: Multics MR12.6f; Answering Service 17.0
2027 as LOGIN IO.SysDaemon dmn cord (create)
2027 as LOGIN Backup.SysDaemon dmn bk (create)
2027 as LOGIN IO.SysDaemon dmn prta (create)
2027 as LOGIN Utility.SysDaemon dmn ut (create)
2027 as LOGIN Volume_Dumper.Daemon dmn vinc (create)
fnpuv_open_slave 3.31
listening on port 6131
2027 as as_mcs_mpx_: Load signalled for FNP d.

2027 cord Enter command: coordinator, driver, or logout:
–> cord
2027 bk
2027 prta Enter command: coordinator, driver, or logout:
–> prta
2027 ut copy_dump: Attempt to re-copy an invalid dump.
2027 bk r 20:27 0.446 37
2027 bk
–> bk
2027 vinc
2027 vinc r 20:27 0.450 25
2027 vinc
–> vinc
2027 as sc_admin_command_: Utility.SysDaemon.z: delete_old_pdds
2027 ut send_admin_command: Execution started …
2027 ut completed.
2027 ut monitor_quota: The requested action was not performed.
2027 ut The quota of >dumps is 0, a record limit needs to be specified.
2027 ut
2027 ut Records Left % VTOCEs Left % PB/PD LV Name
2027 ut
2027 ut 166172 100086 60 42218 34130 81 pb root
2027 ut
2027 ut r 20:27 2.481 498
2027 ut
–> ut
2027.2 RCP: Attached tapa_00 for Utility.SysDaemon.z
2027.2 RCP: Detached tapa_00 from Utility.SysDaemon.z
2027.2 RCP: Attached rdra for Utility.SysDaemon.z
2027.2 RCP: Detached rdra from Utility.SysDaemon.z
2027.2 RCP: Attached puna for Utility.SysDaemon.z
2027.2 RCP: Detached puna from Utility.SysDaemon.z
2027.2 RCP: Attached prta for Utility.SysDaemon.z
2027.2 RCP: Detached prta from Utility.SysDaemon.z

Once we have Multics running, we can try to login the system with the help of telnet and the login command in a new terminal. For the first-time login as “Repair” user, the default passwd is “repair”. A new passwd is asked immediately by the system used to replace the old passwd. NOTE: you need to create a new passwd everytime logging into the system! Once logged in, type simple “who” command, and then “logout”:

[daveti@daveti ~]$ telnet localhost 6180
Trying ::1…
telnet: connect to address ::1: Connection refused
Connected to localhost.
Escape character is ‘^]’.
HSLA Port (d.h000,d.h001,d.h002,d.h003,d.h004,d.h005,d.h006,d.h007,d.h008,d.h009,d.h010,d.h011,d.h012,d.h013,d.h014,d.h015,d.h016,d.h017,d.h018,d.h019,d.h020,d.h021,d.h022,d.h023,d.h024,d.h025,d.h026,d.h027,d.h028,d.h029)?
Attached to line d.h000

Multics MR12.6f: Installation and location (Channel d.h000)
Load = 5.0 out of 90.0 units: users = 5, 01/17/18 2035.3 pst Wed

login Repair -cpw
New Password:
New Password Again:
Password changed.
You are protected from preemption.
Repair.SysAdmin logged in 01/17/18 2035.6 pst Wed from ASCII terminal “none”.

New messages in message_of_the_day:

Welcome to the Multics System.

print_motd: Created >user_dir_dir>SysAdmin>Repair>Repair.value.
r 20:35 0.469 33

who -a -lg

Multics MR12.6f; Installation and location
Load = 6.0 out of 90.0 units; users = 6, 1 interactive, 5 daemons.
Absentee users = 0 background; Max background absentee users = 3
System up since 01/17/18 2027.0 pst Wed
Last shutdown was at 01/17/18 1707.4 pst Wed

Login at TTY Load User ID

01/17/18 20:27 cord 1.0 IO.SysDaemon
20:27 bk 1.0 Backup.SysDaemon
20:27 prta 1.0 IO.SysDaemon
20:27 ut 1.0 Utility.SysDaemon
20:27 vinc 1.0 Volume_Dumper.Daemon
20:35 none 1.0 Repair.SysAdmin

r 20:36 0.102 3

Repair.SysAdmin logged out 01/17/18 2036.8 pst Wed
CPU usage 1 sec, memory usage 0.1 units, cost $0.07.
Multics has disconnected you
Connection closed by foreign host.

Meanwhile, you should be able to see the logging information on the Multics ternimal:

<>CONNECT to d.h000
2035 as LOGIN Repair.SysAdmin int d.h000 (create)
2036 as LOGOUT Repair.SysAdmin.a int d.h000 0: 0 $0.07 (logo)

5. Multics 101

All commands can be found at [3]. Go ahead to try “ls”, “who”, “date”, “time”, and etc. There is also an Emacs available (no Vi apparently). The command mapping between Multics and Linux may also be useful to have a quick understanding of Multics commands[4]. For instance, if you want to take a look at the source file of Multics (HINT: neither B nor C):

cwd >ldd>mcs>source
pr ic_sampler.map355

(Another Hint: Ctrl-C can be your friend~)

6. Now what?

Once we “logout”, it is time to learn how to quit Multics (gracefully). Yes, you need to learn how to quit – take a look at [5] and use 5 ESCs and 5 commands to shutdown Multics:

M-> logout * * *


2110 as LOGOUT IO.SysDaemon.z dmn cord 0: 0 $0.03 (looc)
2110 as LOGOUT Backup.SysDaemon.z dmn bk 0: 0 $0.04 (looc)
2110 as LOGOUT IO.SysDaemon.z dmn prta 0: 0 $0.03 (looc)
2110 as LOGOUT Utility.SysDaemon.z dmn ut 0: 3 $1.90 (looc)
2110 as LOGOUT Volume_Dumper.Daemon.z dmn vinc 0: 0 $0.04 (looc)
M-> shut

2110 as act_ctl_: shutdown, 8 0.05 0.05 0.00 0.02 0:3:21 $16.99
2110.9 shutdown complete
DBG(16336227444863)> ERR ERR: Need status register data format
DBG(16336227444911)> ERR ERR: doPayloadChan expected IDCW 10 (12)
bce (boot) 2110.9: M-> die

Do you really wish bce to die? M-> yes

BCE DIS causes CPU halt

simCycles = 16336308761570

cpuCycles = 698896257
Timer runout faults = 21587
Derail faults = 1
Lockup faults = 65
Connect faults = 7242
Illegal procedure faults = 5
Directed fault 0 faults = 754
Directed fault 1 faults = 6926
Access violation faults = 370
Fault tag 2 faults = 3052

Halt, IC: 000012 (000000000000)
sim> quit
[daveti@daveti qs]$



Posted in OS, Security | Tagged , , , | 1 Comment

Some notes on the Monotonic Counter in Intel SGX and ME

SGX sealing is vulnerable to rollback attacks as the enclave is not able to tell if the sealed data is the latest or a old copy. To mitigate this attack, monotonic counter (MC) has been introduced in Intel SGX SDK 1.8. This post looks into some implementation details inside Intel SGX SDK.

1. sgx_create_monotonic_counter[1]

Look at the description of this API, two things show up – first, the counter creation involves non-volatile memory (NVM) writing; second, limited number of MCs can be created per enclave due to the quota. Apparently, unlike the older version of Intel SGX SDK, where MC was only available in the simulation mode, the MC here is based on real NVM, which also seems to have limited capacity. Let’s go deeper.

2. PSE (Platform Service/Security Enclave?)

The MC creation essentially calls sgx_create_monotonic_counter_ex[2] inside SDK. This function builds two pse_message_t messages for the request and reply, which are both passed to crypt_invoke. The crypt_invoke function establishes a PSE session (secure and authenticated channel) with aesmd, encrypts the request using the shared secret, and calls uae_invoke_service with both the request and reply passed, which eventually calls invoke_service_ocall[3] in the PSE provided by PSW other than the SDK.

3. OAL (OS Abstraction Layer?)

The PSE request and reply are then delivered to oal_invoke_service[4], which essentially transforms the PSE messages into AEInvokeServiceRequest and AEInvokeServiceResponse, and sends the AE messages to aesmd via UNIX socket ‘/var/run/aesmd/aesm.socket’. This MC creation request will then be handled by PSE_MC_SERVICE as a PSE_MC_CREATE command[1]

4. PSE_OP (PSE OPeration?)

The service handler table defined in pse_op.cpp[5] lists all the MC related command PSE_MC_SERVICE supports:

static const struct service_handler_t
    uint16_t  service_id;
    uint16_t  service_cmd;
    uint16_t  req_size;
    uint16_t  resp_size;
    srv_pfn_t srv_pfn;
} service_handler[] = {
    {PSE_MC_SERVICE, PSE_MC_CREATE, sizeof(pse_mc_create_req_t), sizeof(pse_mc_create_resp_t), pse_mc_create},
    {PSE_MC_SERVICE, PSE_MC_READ,   sizeof(pse_mc_read_req_t),   sizeof(pse_mc_read_resp_t),   pse_mc_read  },
    {PSE_MC_SERVICE, PSE_MC_INC,    sizeof(pse_mc_inc_req_t),    sizeof(pse_mc_inc_resp_t),    pse_mc_inc   },
    {PSE_MC_SERVICE, PSE_MC_DEL,    sizeof(pse_mc_del_req_t),    sizeof(pse_mc_del_resp_t),    pse_mc_del   },
    {PSE_TRUSTED_TIME_SERVICE, PSE_TIMER_READ,   sizeof(pse_timer_read_req_t), sizeof(pse_timer_read_resp_t), pse_read_timer},

And pse_mc_create is our target handler.

5. VMC (Virtual Monotonic Counter?)

Now we are into create_vmc[6], described as “create a VMC in SQLite Database and return UUID of the VMC to the caller”. Huh, so the MC is essentially an entry in a SQLite DB. So how could a SQLite DB defend against rollback attacks given that it is just a normal file on the disk? Hang on. Let’s continue. When this is the first MC creation request ever happend, we need to create the VMC DB first, via initialize_sqlite_database_file[7]. This very first thing it does is to call read_rpdata[8]. As its file name implies, the SQLite DB accesses hardware (NV) to get the hash of the DB at certain epoch, which is essentailly a HW MC. We are getting close to the real HW…

6. PSDA_Service[9]

This file contains all functions used by the VMC SQLite DB to create/delete/update HW MC in the so-far-unknown HW via invoke_psda_service. An ephemeral session is established in the function to provide secure and authenticated communication with the mystery HW, through psda_invoke_service_ocall[10], and eventually psda_invoke_service[11].

7. Intel ME

Just looking at the psda_invoke_service is really insightful:

    JVM_COMM_BUFFER commBuf;
    commBuf.TxBuf->buffer = psda_req_msg;
    commBuf.TxBuf->length = psda_req_msg_size;
    commBuf.RxBuf->buffer = psda_resp_msg;
    commBuf.RxBuf->length = psda_resp_msg_size;
    int response_code;
    ae_error_t ret;

    ret = PSDAService::instance().send_and_recv(

JVM? JHI? Finally, we have reached the dark side of the MC. Somehow some Java shit involves in communicating with the secret HW[13] — Yes, it is Intel ME! JHI stands for Java Host Interface, which is used to install Java applets into Dynamic Application Loader (DAL) firmware (inside ME) and communicate with them[12]. Yes, the SGX MC is essentially implemented in ME as a Java applet called “PSDA”!

8. PSDA.dalp

PSDAService::install_psda[14] installs the psda applet into ME using JHI. Not surprisingly, the source code for psda is not available. Instead, the script is used to download the applet binary directly, since it needs to signed by Intel — Right! Intel does not want anyone to install any applet into ME, for security reasons!!! After running the download script, PSDA.dalp will appear under psw/ae/data/prebuilt. It is encoded as Base64:


Copying out the Base64 encoded applet and running decoding, we have a chance to peek into the implementation of the applet:
                                  CodeWrapper	Constants#Constants$CSE_PLATFORM_CAPABILITIES!Constants$CSE_SERVICE_RESP_STATUS!Constants$EPHEMERAL_SESSION_STATEConstants$LT_SESSION_STATEConstants$MEDIAPATH_IDonstants$PSDA_COMMARPDataServiceServiceFactoryServiceMessageFactoryTrustedTimeServiceServiceHandlerPSDACommonMessage$PSDA_SERVICE_ID
EphSessionEphSessionManager	LTSessionLTSessionManager
                                                                     DALApiProxy	DebugUtil	FlashUtilSGXMonotonicCounterProxyCryptoException	EpidAlgExHashAlgIllegalParameterExceptionNotSupportedExceptionOperationFailedExceptionRandom
UtilExceptionimeExceptionObjectString                 IntelAppletSEMonotonicCounter
StringBuilder   Throwable            StringBuffer
pairing_sk	payloadIv
timerEpochversionwrappedEncryptKeabandonSessionclearcreateServiceCommandcreateServiceMessage	deriveIDseraseAllFlashData
                                                                                                             printreadFlashDatareadInreadLongonprocessLTSessioreadByte	readBytes
readObjectreadPairingCounter	readShortresetresponseWithMessageseekset!setFlashStoragePropertyAntiReplay
                                                                                                          setResponse	setSeqNumsetTimesigmaSetPropertyAppendSVNsizestartEphemeralSessionstartLongTermSession
               updateRPData	writeByte
writeByteswriteFlashData_0writeFlashData_writeInt	writeLong

9. Intel ME Again

Although we only looked at the MC creation in SGX, this post has been long enough. Last thing about SGX MC is still about Intel ME. If you have seen dmesg like “Device doesn’t have valid ME Interface”, and neither /dev/mei nor /dev/mei0 is created, it is very likely that you are running a server platform with a different ME firmware (SPS) comparing to desktops/laptops[15], and the corresponding DAL firmware is not available in SPS. This means SGX MC is NOT available for servers!!! Well, I guess this does not really matter as people are in a rush to disable ME anyway… On the other hand, implementing SGX MC using ME seems to be the most reasonable choice (at least, we are still inside the CPU package), although this makes “SGX defends against ME attacks” statement in vain.



Posted in Security | Tagged , , , | 12 Comments

Why list and charlist are confusing in Elixir

This post talks about list and charlist in Elixir, and discusses one specific issue in Elixir.

Let’s look at the confusing thing in Elixir:

iex(2)> a = [7]
iex(3)> b = ‘\a’
iex(4)> a == b
iex(5)> is_list(a)
iex(6)> is_list(b)

First, a list is shown as a charlist in iex; second, (without surprise if we believe in iex), the charlist equals the list; last, both the list and the charlist are lists.

Why? Because charlist in represented as a list with code point(s) inside, and those encoding scheme is UTF-8 (rather than UTF-16 in Erlang). To convince yourself:

iex(7)>, fn x -> [x] end)
[[0], [1], [2], [3], [4], [5], [6], ‘\a’, ‘\b’, ‘\t’, ‘\n’, ‘\v’, ‘\f’, ‘\r’,
[14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26],
‘\e’, [28], [29], [30], [31], ‘ ‘, ‘!’, ‘”‘, ‘#’, ‘$’, ‘%’, ‘&’, ‘\”, ‘(‘,
‘)’, ‘*’, ‘+’, ‘,’, ‘-‘, ‘.’, ‘/’, ‘0’, ‘1’, …]

Looks familiar? Yes, those are ASCII codes. So the conclusion seems to be that list and charlist are the same in Elixir. Well, this is mostly right, as we have seen, but not exactly right. Run the file below to inspect into list and charlist internal using pry:

require IEx

value_list = [7]
value_list2 = [1]
value_char = '\a'


daveti@Daves-MacBook-Pro:~/elixir/list_issue$ iex list_issue.exs
Erlang/OTP 20 [erts-9.0] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]

Request to pry #PID<0.83.0> at list_issue.exs:7

5: value_char = ‘\a’
7: IEx.pry

Allow? [Yn]
Interactive Elixir (1.5.1) – press Ctrl+C to exit (type h() ENTER for help)
pry(1)> i value_list
Data type
This is a list of integers that is printed as a sequence of characters
delimited by single quotes because all the integers in it represent valid
ASCII characters. Conventionally, such lists of integers are referred to as
“charlists” (more precisely, a charlist is a list of Unicode codepoints,
and ASCII is a subset of Unicode).
Raw representation
Reference modules
Implemented protocols
IEx.Info, Collectable, Enumerable, Inspect, List.Chars, String.Chars
pry(3)> i value_list2
Data type
Reference modules
Implemented protocols
IEx.Info, Collectable, Enumerable, Inspect, List.Chars, String.Chars
pry(5)> i value_char
Data type
This is a list of integers that is printed as a sequence of characters
delimited by single quotes because all the integers in it represent valid
ASCII characters. Conventionally, such lists of integers are referred to as
“charlists” (more precisely, a charlist is a list of Unicode codepoints,
and ASCII is a subset of Unicode).
Raw representation
Reference modules
Implemented protocols
IEx.Info, Collectable, Enumerable, Inspect, List.Chars, String.Chars

Now you see the difference between list and charlist. Both have data type list, which explains why they are equivalent. Moreover, when the numbers inside the list could be interpreted as valid ASCII code points, Elixir shows it as a charlist rather than a list, although the raw presentation is always a list.

Why this is confusing? Let’s say you wanna print out a list, e.g., [7], but what Elixir displays is ‘\a’. Although it does not affect the internal computation, e.g., map, this display issue is really annoying and confusing. In this case, Erlang seems doing sane:

Eshell V9.0 (abort with ^G)
1> [7].
2> ‘\a’.
3> [7] == ‘\a’.

Posted in Programming | Tagged , , , , | Leave a comment