Kernel Code Execution Time Measurement (kcetm)

This post mainly talks about the correct usage of tsc counters provided by Intel x86/x86-64 architectures to measure the Linux kernel code execution time. Most of the content here is borrowed/inspired from [1]. Note that this is NOT a post introduing tsc but for people who need to use tsc anyway. A kernel module with us/ns/tsc measurement framework ready is also provided at the end of the post [4]. Happy measurement!

1. Why measuring time is hard?

Human beings have a long history of trying to measure time [3], although this is not our focus. We are talking about measuring a piece of code execution running at ring 0 on a modern x86/x86-64 Intel CPU. The measurement framework usually looks like below:


The most common measurement functions inside the kernel are do_gettimeofday(), which returns in macroseconds, and getnstimeofday64(), which returns in nanoseconds. Depending on the complexity of the code needs to be measured, different scaling measurement should be applied. E.g., measuring a sub macrosecond function using do_gettimeofday() is inappropreiate. Similarly, measuring a minute level function using getnstimeofday64() is also an overkill.

In most cases, when the piece of code measured is non-trivial and takes a substantial amount of time (e.g., miliseconds), these gettimeofday functions do their job. We usually ignore the delay caused by interrupt handlings, preemptions, and schedulings. However, when the code to be measured is really small (e.g., tens of assemblies), the delay caused by interrupt handlings, preemptions, and schedulings cannot be ignored anymore, since we are measuring a nanosecond level execution time. In this case, we need to disable (local) interrupt, and preemption, making sure the measurement is “atomic” without external interferences. The other thing we need is tsc (timestamp counter).

2. What is the right way to use tsc?

Here we ignore the mov instructions needed to read/merge the tsc values. Instead, we focus on different usage patterns of rdtsc and rdtscp. We explore these patterns and point out the potential issues.

Pattern A:

...       A
...       |
rdtsc;    V
rdtsc;    A
...       |
...       V

This might be the most straightforward way of using tsc. We call rdtsc to read the counter before and after the code_to_measure. Unfortunately, because of out of order execution of CPU, it is hard to guarantee that the code_to_measure within the rdtsc blocks is exactly the code we wanna measure. It turns out either the code belong to code_to_measure could be executed by the CPU ahead of the first rdtsc or after the second rdtsc, or whatever code before the first rdtsc or after the second rdtsc could be measured as part of the code_to_measure. In short, this pattern gives the measurement of timing(some_portion(code_to_measure) + some_portion(code_to_measure_before) + some_portion(code_to_measure_after)).

Pattern B:

...       A
...       X
...       V
rdtsc;    | A
...       X |
...       | |
...       V |

To surpress the effect of out of order execution, we need a processor barrier to guarantee that code_to_measure is exactly the code we wanna measure when we setup the tsc counters – no more or less (hopefully). CPUID is one such instruction that serializes the instruction flow. The first CPUID guarantees that all instructions before it have done; the second CPUID guarantees that code_to_measure is also done before we execute the second rdtsc. However, it is still possible that the code after the second rdtsc get executed after the second CPUID but before we hit the second rdtsc (we mark this as delta). Similarly, the first rdtsc, should be execuated as the first instruction once the CPUID is done. However, out of order execution within the CPUID block is still possible, making few instructions of code_to_measure execute before the first rdtsc (we mark this as the other delta). Fortunately, because rdtsc uses RAX and RDX the same, it is possible that the CPU would get stalled when it tries to excute instructions followed by rdtsc due to data hazard. Nevertheless, the biggest problem comes from the extra CPUID instruction measured. Per [1], the exeuction time of CPUID itself is NOT stable, ranging from tens of CPU cycles to hundreds. This poses issues when the code_to_measure has similar timing scale. In short, this pattern gives the measurement of timing(code_to_measure + cpuid – delta_1 + delta_2).

Pattern C:

...       A
...       X
...       V
...       A
...       X
...       V

Comparing with the previous pattern, we add CPUID after each rdtsc. The first added CPUID gurantees that rdtsc is executed before any instructions from code_to_measure; the last CPUID added guarantees that the last rdtsc is executed after code_to_measure is done and before whatever code followed. This solution elimites the measurement deltas in the price of the other extra CPUID in the measurement block. In short, this pattern gives the measurement of timing(code_to_measure + 2*cpuid).

Pattern D:

...       A
...       X
...       V
...       A
...       X
...       V

At this point, I hope it is clear that the first CPUID has to happen before the first rdtsc (otherwise we may include the code above into the measurement block since the first rdtsc is not guaranteed to get executed right before the CPUID). Moving the second rdtsc before the second CPUID gurantees that the code after the second CPUID will not get measured. There is also no extra CPUID added into the measurement block. Unfortunately, this does not work due to the possible out of order execution insided the CPUID block. In short, this pattern gives the measurement of timing(some_portion(code_to_measure)).

Pattern E:

...       A
...       X
...       V
rdtscp;   | A
...       X |
...       | |
...       V |

To deal with the measurement variance introduced by the usage of CPUID, rdtscp is introduced in later CPUs, which is essentially a combination of CPUID+rdtsc. This instruction is also able to guarantee that by the execution of itself, all the instructions before it have been done. Unfortunately, the instructions after rdtscp may still get executed before rdtscp itself. In short, this pattern gives the measurement of timing(code_to_measure – delta_1 + some_portion(code_after_rdtscp)).

Pattern F:

...       A
...       X
...       V
...       A
...       X
...       V

By appending another CPUID after rdtscp, now the code after rdtscp cannot get executed before rdtscp. In short, this pattern gives the measurement of timing(code_to_measure – delta_1). If we ignore delta here, this pattern provides the most “accurate” measurement so far.

Pattern G:

...       X
...       | A
...       V |
rdtscp;     |
...       A
...       X
...       V

Comparing to the previous pattern, we replaced the first CPUID+rdtsc with rdtscp. While the first rdtscp guarantees that all the instructions before itself have been done, it cannot stop the CPU from executing some code from code_to_measure before rdtscp. To fix this problem, we can add a CPUID right after the first rdtscp, which unfortunately defeats the purpose of avoiding CPUID inside the measurement block. In short, this pattern gives the measurement of timing(some_portion(code_to_measure)).

3. Discussion

There are other instructions like CPUID, which can serialize the instruction flow, such as lfense and mfense (which are also used to defend Meldown and Spectre attacks). For user space measurement using tsc, [2] provides examples. Note, however, the measurement from user space includes the noise from interrupt handling, preemption, and scheduling.

4. Conclusion

It is easy to use tsc, but hard to use it right. Even if we use it right (e.g., Pattern F), we still could not guarantee the measurement is 100% right, but we are getting pretty close. The kcetm [5] implements Pattern F in a kernel module and is free to use.



Posted in Dave's Tools, OS, Programming | Tagged , , , , , , , , | 1 Comment

Rowhammer Pine64

Rowhammer attacks have been well known, and gotten a lot of publications already. However, we notice that most rowhammers happened on x86 architecture due to the easy access to clflush from the user space. ARM architecture (both ARMv7 and ARMv8) is also explored by DRAMA [1], where Android Ion memory allocator is leveraged to map un-cached pages into the user space. The reason is the cache line flush instruction in ARMv7 is ring-0 instruction. However, ARMv8 exposes cache line flush instruction to the user space again, providing an opportunity to rowhammer from the user space. This post explores rowhammer attacks using ARMv8 cache line flush instructions from the user space using Pine64 board. Let’s start~

1. Setup Pine64

Pine64 [2] is a cheap ARM 64-bit SoC. The one we are using is Pine A64+, with a quad-core ARM Cortex A53 64-Bit processor, and 2G LPDDR3. Both the support for ARMv8 ISA and equipment with DDR3 memory make this board a desired target. However, setting up this board really took us a while. We have encoutered both hardware and software issues on this board – I guess this is you get what you paid:(

Issue-I: Serial console on the Ext connector does NOT work!

As a result, none of the Linux images we tried (Ubuntu, Arch, OpenSUSE, Fedora) worked. In most cases, the booting process stuck at the uboot shell, and the terminal (screen/minicom) got scrambled characters.

Solution: Use the serial console on the Euler bus [3].


Issue-II: Ubuntu/Arch/OpenSUSE images do NOT work!

It is really frustrating for us that major images from Pine A64 software release [4] just do not work. In the best cases, we were able to boot into the kernel but stuck at login. Or the board kept rebooting itself. It is still not clear for us why these mainstream images do not work on our board.

Solution: Use Fedora ARM aarch64 image [5]!

Although Fedora support is never official for Pine64, as one post [6] pointed out that Fedora 27 Minimum image works with some tweaks. We verified that it indeed works on our board. Further, we use Fedora 27 Workstation on our board, and it works like a charm. To easily burn a Fedora image into SD card, we recommend a Fedora host machine with the following procedures:

sudo dnf install uboot-images-armv8 fedora-arm-installer
sudo fedora-arm-image-installer --image=Fedora-Workstation-27-1.6.aarch64.raw.xz --target=pine64_plus --media=/dev/sdb --norootpass

Note that “–norootpass” is needed for Workstation since no user will be created during the kernel booting. “–resizefs” should also be useful to make full usage of the SD card. For more information about arm-image-installer, please refer to [7].

Once we boot into the Linux kernel, we should be able to see things below:

[daveti@localhost ~]$ uname -a
Linux localhost.localdomain 4.13.9-300.fc27.aarch64 #1 SMP Mon Oct 23 13:33:18 UTC 2017 aarch64 aarch64 aarch64 GNU/Linux
[daveti@localhost ~]$ cat /proc/cpuinfo
processor : 0
BogoMIPS : 48.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4

processor : 1
BogoMIPS : 48.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4

processor : 2
BogoMIPS : 48.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4

processor : 3
BogoMIPS : 48.00
Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 cpuid
CPU implementer : 0x41
CPU architecture: 8
CPU variant : 0x0
CPU part : 0xd03
CPU revision : 4

[daveti@localhost ~]$

2. Rowhammer from the user space

One way we decided to do rowhammer is from the user space, without the knowledge of how virtual address, physical address, bank, and row are mapped inside the DRAM. Our goal is to launch this attack as a normal user without any privillge. We also limit our goal to find bit flips within DRAM, rather than developing ful exploitations. We built our tool based on Google’s rowhammer-test, extending the support for ARMv8 64-bit (aarch64) architecture.

Unlike ARMv7, ARMv8 exposes cache line flush instructions to the user space. This provides us an opportunity to find the replacement for “clflush” for ARMv8. Unfortunately, cache line flushing in ARMv8 is not that straightforward as x86. Instead of just one instruction, more than 10 instructions are needed to flush data cache. One can refer to uboot “__asm_flush_dcache_range” [8], and ARMv8-A Programmers Guide [9]. To accelerate the cache line flush, we move the computation for cache line mask out of loops, leaving 4 instructions in total (comparing to 1 clflush). The detailed code change can be found at [10]. Note that cache line mask needs to be applied to the virtual address before flushing. We tried direct “dc civac %[addr]”, and got kernel Oops:

[ 3484.162251] rowhammer_test[9281]: unhandled level 0 translation fault (11) at 0xffffffffffffffff, esr 0x92000144, in rowhammer_test[400000+2000]
[ 3484.175317] CPU: 1 PID: 9281 Comm: rowhammer_test Not tainted 4.13.9-300.fc27.aarch64 #1
[ 3484.183453] Hardware name: sunxi sunxi/sunxi, BIOS 2017.09 10/10/2017
[ 3484.189901] task: ffff800065d90000 task.stack: ffff800063d1c000
[ 3484.195853] PC is at 0x400ad8
[ 3484.198830] LR is at 0x400a8c
[ 3484.201822] pc : [<0000000000400ad8>] lr : [<0000000000400a8c>] pstate: 60000000
[ 3484.209228] sp : 0000ffffee7a9010
[ 3484.212573] x29: 0000ffffee7a9060 x28: 0000ffffee7a9050
[ 3484.217894] x27: 0000ffffee7a9010 x26: 0000ffffee7a9050
[ 3484.223232] x25: 0000ffffee7a9060 x24: 0000000000000001
[ 3484.228553] x23: 000000000000000a x22: 0000000000400000
[ 3484.233890] x21: 0000000000400e40 x20: 0000000000400f40
[ 3484.239211] x19: 0000000000420000 x18: 0000000000000020
[ 3484.244548] x17: 0000ffffa3f0d3d0 x16: 0000000000420040
[ 3484.249869] x15: 0000ffffa3ed4a60 x14: 0003c5c40caf2f6b
[ 3484.255206] x13: 00000003e8000000 x12: 0000ffffa3ed5c34
[ 3484.260543] x11: 0000ffffa3ee3dd8 x10: 0000ffffa3ed6d28
[ 3484.265862] x9 : 003b9aca00000000 x8 : 0000000000000000
[ 3484.271199] x7 : 00000000e961b8a2 x6 : 0000ffffa40730d0
[ 3484.276519] x5 : 0000ffffa4073080 x4 : 0000000000083d60
[ 3484.281855] x3 : 0000ffff9843a000 x2 : 0000ffffee7a9018
[ 3484.287176] x1 : 0000000000000000 x0 : ffffffffffffffff
[ 5689.587275] rowhammer_test[9536]: unhandled level 2 translation fault (11) at 0x00000000, esr 0x92000146, in rowhammer_test[400000+2000]
[ 5689.599618] CPU: 1 PID: 9536 Comm: rowhammer_test Not tainted 4.13.9-300.fc27.aarch64 #1
[ 5689.607753] Hardware name: sunxi sunxi/sunxi, BIOS 2017.09 10/10/2017
[ 5689.614201] task: ffff800065d95a00 task.stack: ffff800079290000
[ 5689.620154] PC is at 0x400ad8
[ 5689.623131] LR is at 0x400a8c
[ 5689.626106] pc : [<0000000000400ad8>] lr : [<0000000000400a8c>] pstate: 20000000
[ 5689.633531] sp : 0000ffffe3ae6600
[ 5689.636853] x29: 0000ffffe3ae6650 x28: 0000ffffe3ae6640
[ 5689.642190] x27: 0000ffffe3ae6600 x26: 0000ffffe3ae6640
[ 5689.647531] x25: 0000ffffe3ae6650 x24: 0000000000000001
[ 5689.652852] x23: 000000000000000a x22: 0000000000400000
[ 5689.658190] x21: 0000000000400e40 x20: 0000000000400f40
[ 5689.663509] x19: 0000000000420000 x18: 0000000000000020
[ 5689.668846] x17: 0000ffff86a133d0 x16: 0000000000420040
[ 5689.674167] x15: 0000ffff869daa60 x14: 002256158d7d8201
[ 5689.679506] x13: 00000003e8000000 x12: 0000ffff869dbc34
[ 5689.684827] x11: 0000ffff869e9dd8 x10: 0000ffff869dcd28
[ 5689.690164] x9 : 003b9aca00000000 x8 : 0000000000000000
[ 5689.695483] x7 : 00000000e961b8a2 x6 : 0000ffff86b790d0
[ 5689.700822] x5 : 0000ffff86b79080 x4 : 0000000000083d60
[ 5689.706143] x3 : 0000000000000000 x2 : 0000ffffe3ae6670
[ 5689.711481] x1 : 0000000000000000 x0 : 0000000000000000
[daveti@localhost ~]$

3. Cache line flushing is expensive!

We ran our code with cache line flushing enabled and disabled (Note that bit flips do NOT depend on cache line flushing; however, the detection of rowhammer needs it). We show some samples runs of rowhammer attacks here:

[daveti@localhost rowhammer-test]$ ./rowhammer_test
cache line mask [0x3f]
Iteration 0 (after 0.00s)
Took 1007.3 ms per address set
Took 10.0728 sec in total for 10 address sets
Took 233.167 nanosec per memory access (for 43200000 memory accesses)
This gives 34310 accesses per address per 64 ms refresh period
Checking for bit flips took 1.095023 sec
Iteration 1 (after 11.17s)
Took 1006.9 ms per address set
Took 10.069 sec in total for 10 address sets
Took 233.080 nanosec per memory access (for 43200000 memory accesses)
This gives 34323 accesses per address per 64 ms refresh period
Checking for bit flips took 1.092644 sec
Iteration 2 (after 22.33s)
Took 1009.1 ms per address set
Took 10.0913 sec in total for 10 address sets
Took 233.595 nanosec per memory access (for 43200000 memory accesses)
This gives 34247 accesses per address per 64 ms refresh period
Checking for bit flips took 1.092391 sec
Iteration 3 (after 33.51s)
Took 1005.2 ms per address set
Took 10.052 sec in total for 10 address sets
Took 232.685 nanosec per memory access (for 43200000 memory accesses)
This gives 34381 accesses per address per 64 ms refresh period
Checking for bit flips took 1.081366 sec 

[daveti@localhost rowhammer-test]$ ./rowhammer_test
cache line mask [0x3f]
Iteration 0 (after 0.00s)
Took 90.4 ms per address set
Took 0.903528 sec in total for 10 address sets
Took 20.915 nanosec per memory access (for 43200000 memory accesses)
This gives 382500 accesses per address per 64 ms refresh period
Checking for bit flips took 1.093423 sec
Iteration 1 (after 2.00s)
Took 121.6 ms per address set
Took 1.21641 sec in total for 10 address sets
Took 28.158 nanosec per memory access (for 43200000 memory accesses)
This gives 284113 accesses per address per 64 ms refresh period
Checking for bit flips took 1.096945 sec
Iteration 2 (after 4.31s)
Took 104.5 ms per address set
Took 1.04498 sec in total for 10 address sets
Took 24.189 nanosec per memory access (for 43200000 memory accesses)
This gives 330723 accesses per address per 64 ms refresh period
Checking for bit flips took 1.089356 sec
Iteration 3 (after 6.45s)
Took 133.6 ms per address set
Took 1.33556 sec in total for 10 address sets
Took 30.916 nanosec per memory access (for 43200000 memory accesses)
This gives 258768 accesses per address per 64 ms refresh period
Checking for bit flips took 1.089574 sec
Iteration 4 (after 8.87s)
Took 122.0 ms per address set
Took 1.22025 sec in total for 10 address sets
Took 28.247 nanosec per memory access (for 43200000 memory accesses)
This gives 283220 accesses per address per 64 ms refresh period
Checking for bit flips took 1.077736 sec 

The first one enabled cache flusing; the second disabled it. Comparing the results, the cache flushing in ARMv8 took around 200 nanosec per memory access. That means 200 CPU cycles for a 1 GHz CPU.



Posted in Linux Distro, OS, Security | Tagged , , , , , | Leave a comment

“make deb-pkg” broken

Last time when I hacked the Linux kernel on Ubuntu, it was 4.4 on LTS 14.04. Time flies. Now I need to hack the kernel 4.13 again on LTS 16.04, and find the kernel build broken. It is fine that we could not have a stable kernel API since at least the ABI is stable. But what about kernel build?

1. “make deb-pkg” broken

This used to be my way to build a kernel on Ubuntu. If something wrong happens and stops the build, just fix the error and repeat the command again — it was incremental build. But now it is broken, as it launches “make clean” by default. Why? Because it needs to generate a source tar at first, which requires a “clean” to exclude all build artifacts[1]. Instead, use “make bindeb-pkg“.

2. VirtualBox build broken

I have no idea when Ubuntu decided to include VirtualBox guest/drivers into the kernel tree, but it is broken, because the source directory structure is different from the ones used by the makefiles. An “ln” based workaround is available[2], although I just hacked those makefiles to fix errors.


Posted in Linux Distro | Tagged , , | Leave a comment

Running Multics on Linux (Fedora 27)

This post follows the “Multics Simulator Instructions”[1] (with some tweaks) to setup Multics simulator dps8m and run Multics on my Fedora 27. Other Linux distro (Ubuntu/Debian/Raspbian) may need some changes but basically work the same way. Experience the cutting-edge secure operating system made on 1960s by yourself!


1. Install deps – libuv

While libuv can be built from source[2], it would be more easier to install the library using your package manager. For Fedora 27:

dnf install libuv libuv-devel

2. Build the simulator – dps8

Once libuv and its header files are installed, go ahead to build the simulator following the instructions[2]:

git clone git:// dps8m-code
cd dps8m-code
git checkout R1.0

Once the build is done, the dps8 binary can be found under dps8m-code/src/dps8.

3. Run dps8

mkdir -p $HOME/multics/linux
cp $HOME/dps8m-code/src/dps8/dps8 $HOME/multics/linux
cp $HOME/dps8m-code/src/dps8/Devices.txt $HOME/multics/linux

This should start the simulator. Type “quit” to exit the simulator:

[daveti@daveti linux]$ ./dps8
DPS8/M emulator (git b7a50ffc)
Production build
DPS8M system session id is 18555
Please register your system at
or create the file ‘serial.txt’ containing the line ‘sn: 0’.
FNP telnet server port set to 6180

DPS8M simulator V4.0-0 Beta git commit id: c420925a
sim> quit
[daveti@daveti linux]$

4. Run Multics

Download the QuickStart disk image, unzip it, and mv/rename disk image directory as multics/qs:

unzip $HOME/QuickStart*.zip
mv $HOME/QuickStart_MR12.6f $HOME/multics/qs
cd $HOME/multics/qs
../linux/dps8 MR12.6f_boot.ini

The whole booting process takes a while, but you should be able to see something below if Multics starts correctly:

[daveti@daveti qs]$ ../linux/dps8 MR12.6f_boot.ini
DPS8/M emulator (git b7a50ffc)
Production build
DPS8M system session id is 18555
Please register your system at
or create the file ‘serial.txt’ containing the line ‘sn: 0’.
FNP telnet server port set to 6180

DPS8M simulator V4.0-0 Beta git commit id: c420925a
TAPE: unit is read only
bootload_0: Booting system MR12.6f generated 01/09/17 1119.1 pst Mon.
0426.6 announce_chwm: 428. pages used of 512. in wired environment.
0426.6 announce_chwm: 706. words used of 1024. in int_unpaged_page_tables.
find_rpv_subsystem: Enter RPV data: M-> [auto-input] rpv a11 ipc 3381 0a

0426.7 load_mst: 946. out of 1048. pages used in disk mst area.
bce (early) 0426.7: M-> [auto-input] bce

Multics Y2K. System was last shudown/ESD at:
Wednesday, January 17, 2018 17:07:31 pst
Current system time is: Wednesday, January 17, 2018 20:26:43 pst.
Is this correct? M-> [auto-input] yes

The current time is more than the supplied boot_delta hours beyond the
unmounted time recorded in the RPV label. Is this correct? M-> [auto-input] yes

bce (boot) 2026.7: M-> [auto-input] yes

bce: Unrecognizable request. Type lr for a list of requests.
bce (boot) 2026.7: M-> [auto-input] boot star

Multics MR12.6f – 01/17/18 2027.0 pst Wed
2027.0 Loading FNP d, >user_dir_dir>SysAdmin>a>mcs.7.6c>site_mcs 7.6c
Received BOOTLOAD command…
listening to 6180
2027.0 FNP d loaded successfully

scavenge_vol: No volumes found
2027 as as_init_: Multics MR12.6f; Answering Service 17.0
2027 as LOGIN IO.SysDaemon dmn cord (create)
2027 as LOGIN Backup.SysDaemon dmn bk (create)
2027 as LOGIN IO.SysDaemon dmn prta (create)
2027 as LOGIN Utility.SysDaemon dmn ut (create)
2027 as LOGIN Volume_Dumper.Daemon dmn vinc (create)
fnpuv_open_slave 3.31
listening on port 6131
2027 as as_mcs_mpx_: Load signalled for FNP d.

2027 cord Enter command: coordinator, driver, or logout:
–> cord
2027 bk
2027 prta Enter command: coordinator, driver, or logout:
–> prta
2027 ut copy_dump: Attempt to re-copy an invalid dump.
2027 bk r 20:27 0.446 37
2027 bk
–> bk
2027 vinc
2027 vinc r 20:27 0.450 25
2027 vinc
–> vinc
2027 as sc_admin_command_: Utility.SysDaemon.z: delete_old_pdds
2027 ut send_admin_command: Execution started …
2027 ut completed.
2027 ut monitor_quota: The requested action was not performed.
2027 ut The quota of >dumps is 0, a record limit needs to be specified.
2027 ut
2027 ut Records Left % VTOCEs Left % PB/PD LV Name
2027 ut
2027 ut 166172 100086 60 42218 34130 81 pb root
2027 ut
2027 ut r 20:27 2.481 498
2027 ut
–> ut
2027.2 RCP: Attached tapa_00 for Utility.SysDaemon.z
2027.2 RCP: Detached tapa_00 from Utility.SysDaemon.z
2027.2 RCP: Attached rdra for Utility.SysDaemon.z
2027.2 RCP: Detached rdra from Utility.SysDaemon.z
2027.2 RCP: Attached puna for Utility.SysDaemon.z
2027.2 RCP: Detached puna from Utility.SysDaemon.z
2027.2 RCP: Attached prta for Utility.SysDaemon.z
2027.2 RCP: Detached prta from Utility.SysDaemon.z

Once we have Multics running, we can try to login the system with the help of telnet and the login command in a new terminal. For the first-time login as “Repair” user, the default passwd is “repair”. A new passwd is asked immediately by the system used to replace the old passwd. NOTE: you need to create a new passwd everytime logging into the system! Once logged in, type simple “who” command, and then “logout”:

[daveti@daveti ~]$ telnet localhost 6180
Trying ::1…
telnet: connect to address ::1: Connection refused
Connected to localhost.
Escape character is ‘^]’.
HSLA Port (d.h000,d.h001,d.h002,d.h003,d.h004,d.h005,d.h006,d.h007,d.h008,d.h009,d.h010,d.h011,d.h012,d.h013,d.h014,d.h015,d.h016,d.h017,d.h018,d.h019,d.h020,d.h021,d.h022,d.h023,d.h024,d.h025,d.h026,d.h027,d.h028,d.h029)?
Attached to line d.h000

Multics MR12.6f: Installation and location (Channel d.h000)
Load = 5.0 out of 90.0 units: users = 5, 01/17/18 2035.3 pst Wed

login Repair -cpw
New Password:
New Password Again:
Password changed.
You are protected from preemption.
Repair.SysAdmin logged in 01/17/18 2035.6 pst Wed from ASCII terminal “none”.

New messages in message_of_the_day:

Welcome to the Multics System.

print_motd: Created >user_dir_dir>SysAdmin>Repair>Repair.value.
r 20:35 0.469 33

who -a -lg

Multics MR12.6f; Installation and location
Load = 6.0 out of 90.0 units; users = 6, 1 interactive, 5 daemons.
Absentee users = 0 background; Max background absentee users = 3
System up since 01/17/18 2027.0 pst Wed
Last shutdown was at 01/17/18 1707.4 pst Wed

Login at TTY Load User ID

01/17/18 20:27 cord 1.0 IO.SysDaemon
20:27 bk 1.0 Backup.SysDaemon
20:27 prta 1.0 IO.SysDaemon
20:27 ut 1.0 Utility.SysDaemon
20:27 vinc 1.0 Volume_Dumper.Daemon
20:35 none 1.0 Repair.SysAdmin

r 20:36 0.102 3

Repair.SysAdmin logged out 01/17/18 2036.8 pst Wed
CPU usage 1 sec, memory usage 0.1 units, cost $0.07.
Multics has disconnected you
Connection closed by foreign host.

Meanwhile, you should be able to see the logging information on the Multics ternimal:

<>CONNECT to d.h000
2035 as LOGIN Repair.SysAdmin int d.h000 (create)
2036 as LOGOUT Repair.SysAdmin.a int d.h000 0: 0 $0.07 (logo)

5. Multics 101

All commands can be found at [3]. Go ahead to try “ls”, “who”, “date”, “time”, and etc. There is also an Emacs available (no Vi apparently). The command mapping between Multics and Linux may also be useful to have a quick understanding of Multics commands[4]. For instance, if you want to take a look at the source file of Multics (HINT: neither B nor C):

cwd >ldd>mcs>source
pr ic_sampler.map355

(Another Hint: Ctrl-C can be your friend~)

6. Now what?

Once we “logout”, it is time to learn how to quit Multics (gracefully). Yes, you need to learn how to quit – take a look at [5] and use 5 ESCs and 5 commands to shutdown Multics:

M-> logout * * *


2110 as LOGOUT IO.SysDaemon.z dmn cord 0: 0 $0.03 (looc)
2110 as LOGOUT Backup.SysDaemon.z dmn bk 0: 0 $0.04 (looc)
2110 as LOGOUT IO.SysDaemon.z dmn prta 0: 0 $0.03 (looc)
2110 as LOGOUT Utility.SysDaemon.z dmn ut 0: 3 $1.90 (looc)
2110 as LOGOUT Volume_Dumper.Daemon.z dmn vinc 0: 0 $0.04 (looc)
M-> shut

2110 as act_ctl_: shutdown, 8 0.05 0.05 0.00 0.02 0:3:21 $16.99
2110.9 shutdown complete
DBG(16336227444863)> ERR ERR: Need status register data format
DBG(16336227444911)> ERR ERR: doPayloadChan expected IDCW 10 (12)
bce (boot) 2110.9: M-> die

Do you really wish bce to die? M-> yes

BCE DIS causes CPU halt

simCycles = 16336308761570

cpuCycles = 698896257
Timer runout faults = 21587
Derail faults = 1
Lockup faults = 65
Connect faults = 7242
Illegal procedure faults = 5
Directed fault 0 faults = 754
Directed fault 1 faults = 6926
Access violation faults = 370
Fault tag 2 faults = 3052

Halt, IC: 000012 (000000000000)
sim> quit
[daveti@daveti qs]$



Posted in OS, Security | Tagged , , , | 1 Comment

Some notes on the Monotonic Counter in Intel SGX and ME

SGX sealing is vulnerable to rollback attacks as the enclave is not able to tell if the sealed data is the latest or a old copy. To mitigate this attack, monotonic counter (MC) has been introduced in Intel SGX SDK 1.8. This post looks into some implementation details inside Intel SGX SDK.

1. sgx_create_monotonic_counter[1]

Look at the description of this API, two things show up – first, the counter creation involves non-volatile memory (NVM) writing; second, limited number of MCs can be created per enclave due to the quota. Apparently, unlike the older version of Intel SGX SDK, where MC was only available in the simulation mode, the MC here is based on real NVM, which also seems to have limited capacity. Let’s go deeper.

2. PSE (Platform Service/Security Enclave?)

The MC creation essentially calls sgx_create_monotonic_counter_ex[2] inside SDK. This function builds two pse_message_t messages for the request and reply, which are both passed to crypt_invoke. The crypt_invoke function establishes a PSE session (secure and authenticated channel) with aesmd, encrypts the request using the shared secret, and calls uae_invoke_service with both the request and reply passed, which eventually calls invoke_service_ocall[3] in the PSE provided by PSW other than the SDK.

3. OAL (OS Abstraction Layer?)

The PSE request and reply are then delivered to oal_invoke_service[4], which essentially transforms the PSE messages into AEInvokeServiceRequest and AEInvokeServiceResponse, and sends the AE messages to aesmd via UNIX socket ‘/var/run/aesmd/aesm.socket’. This MC creation request will then be handled by PSE_MC_SERVICE as a PSE_MC_CREATE command[1]

4. PSE_OP (PSE OPeration?)

The service handler table defined in pse_op.cpp[5] lists all the MC related command PSE_MC_SERVICE supports:

static const struct service_handler_t
    uint16_t  service_id;
    uint16_t  service_cmd;
    uint16_t  req_size;
    uint16_t  resp_size;
    srv_pfn_t srv_pfn;
} service_handler[] = {
    {PSE_MC_SERVICE, PSE_MC_CREATE, sizeof(pse_mc_create_req_t), sizeof(pse_mc_create_resp_t), pse_mc_create},
    {PSE_MC_SERVICE, PSE_MC_READ,   sizeof(pse_mc_read_req_t),   sizeof(pse_mc_read_resp_t),   pse_mc_read  },
    {PSE_MC_SERVICE, PSE_MC_INC,    sizeof(pse_mc_inc_req_t),    sizeof(pse_mc_inc_resp_t),    pse_mc_inc   },
    {PSE_MC_SERVICE, PSE_MC_DEL,    sizeof(pse_mc_del_req_t),    sizeof(pse_mc_del_resp_t),    pse_mc_del   },
    {PSE_TRUSTED_TIME_SERVICE, PSE_TIMER_READ,   sizeof(pse_timer_read_req_t), sizeof(pse_timer_read_resp_t), pse_read_timer},

And pse_mc_create is our target handler.

5. VMC (Virtual Monotonic Counter?)

Now we are into create_vmc[6], described as “create a VMC in SQLite Database and return UUID of the VMC to the caller”. Huh, so the MC is essentially an entry in a SQLite DB. So how could a SQLite DB defend against rollback attacks given that it is just a normal file on the disk? Hang on. Let’s continue. When this is the first MC creation request ever happend, we need to create the VMC DB first, via initialize_sqlite_database_file[7]. This very first thing it does is to call read_rpdata[8]. As its file name implies, the SQLite DB accesses hardware (NV) to get the hash of the DB at certain epoch, which is essentailly a HW MC. We are getting close to the real HW…

6. PSDA_Service[9]

This file contains all functions used by the VMC SQLite DB to create/delete/update HW MC in the so-far-unknown HW via invoke_psda_service. An ephemeral session is established in the function to provide secure and authenticated communication with the mystery HW, through psda_invoke_service_ocall[10], and eventually psda_invoke_service[11].

7. Intel ME

Just looking at the psda_invoke_service is really insightful:

    JVM_COMM_BUFFER commBuf;
    commBuf.TxBuf->buffer = psda_req_msg;
    commBuf.TxBuf->length = psda_req_msg_size;
    commBuf.RxBuf->buffer = psda_resp_msg;
    commBuf.RxBuf->length = psda_resp_msg_size;
    int response_code;
    ae_error_t ret;

    ret = PSDAService::instance().send_and_recv(

JVM? JHI? Finally, we have reached the dark side of the MC. Somehow some Java shit involves in communicating with the secret HW[13] — Yes, it is Intel ME! JHI stands for Java Host Interface, which is used to install Java applets into Dynamic Application Loader (DAL) firmware (inside ME) and communicate with them[12]. Yes, the SGX MC is essentially implemented in ME as a Java applet called “PSDA”!

8. PSDA.dalp

PSDAService::install_psda[14] installs the psda applet into ME using JHI. Not surprisingly, the source code for psda is not available. Instead, the script is used to download the applet binary directly, since it needs to signed by Intel — Right! Intel does not want anyone to install any applet into ME, for security reasons!!! After running the download script, PSDA.dalp will appear under psw/ae/data/prebuilt. It is encoded as Base64:


Copying out the Base64 encoded applet and running decoding, we have a chance to peek into the implementation of the applet:
                                  CodeWrapper	Constants#Constants$CSE_PLATFORM_CAPABILITIES!Constants$CSE_SERVICE_RESP_STATUS!Constants$EPHEMERAL_SESSION_STATEConstants$LT_SESSION_STATEConstants$MEDIAPATH_IDonstants$PSDA_COMMARPDataServiceServiceFactoryServiceMessageFactoryTrustedTimeServiceServiceHandlerPSDACommonMessage$PSDA_SERVICE_ID
EphSessionEphSessionManager	LTSessionLTSessionManager
                                                                     DALApiProxy	DebugUtil	FlashUtilSGXMonotonicCounterProxyCryptoException	EpidAlgExHashAlgIllegalParameterExceptionNotSupportedExceptionOperationFailedExceptionRandom
UtilExceptionimeExceptionObjectString                 IntelAppletSEMonotonicCounter
StringBuilder   Throwable            StringBuffer
pairing_sk	payloadIv
timerEpochversionwrappedEncryptKeabandonSessionclearcreateServiceCommandcreateServiceMessage	deriveIDseraseAllFlashData
                                                                                                             printreadFlashDatareadInreadLongonprocessLTSessioreadByte	readBytes
readObjectreadPairingCounter	readShortresetresponseWithMessageseekset!setFlashStoragePropertyAntiReplay
                                                                                                          setResponse	setSeqNumsetTimesigmaSetPropertyAppendSVNsizestartEphemeralSessionstartLongTermSession
               updateRPData	writeByte
writeByteswriteFlashData_0writeFlashData_writeInt	writeLong

9. Intel ME Again

Although we only looked at the MC creation in SGX, this post has been long enough. Last thing about SGX MC is still about Intel ME. If you have seen dmesg like “Device doesn’t have valid ME Interface”, and neither /dev/mei nor /dev/mei0 is created, it is very likely that you are running a server platform with a different ME firmware (SPS) comparing to desktops/laptops[15], and the corresponding DAL firmware is not available in SPS. This means SGX MC is NOT available for servers!!! Well, I guess this does not really matter as people are in a rush to disable ME anyway… On the other hand, implementing SGX MC using ME seems to be the most reasonable choice (at least, we are still inside the CPU package), although this makes “SGX defends against ME attacks” statement in vain.



Posted in Security | Tagged , , , | 12 Comments

Why list and charlist are confusing in Elixir

This post talks about list and charlist in Elixir, and discusses one specific issue in Elixir.

Let’s look at the confusing thing in Elixir:

iex(2)> a = [7]
iex(3)> b = ‘\a’
iex(4)> a == b
iex(5)> is_list(a)
iex(6)> is_list(b)

First, a list is shown as a charlist in iex; second, (without surprise if we believe in iex), the charlist equals the list; last, both the list and the charlist are lists.

Why? Because charlist in represented as a list with code point(s) inside, and those encoding scheme is UTF-8 (rather than UTF-16 in Erlang). To convince yourself:

iex(7)>, fn x -> [x] end)
[[0], [1], [2], [3], [4], [5], [6], ‘\a’, ‘\b’, ‘\t’, ‘\n’, ‘\v’, ‘\f’, ‘\r’,
[14], [15], [16], [17], [18], [19], [20], [21], [22], [23], [24], [25], [26],
‘\e’, [28], [29], [30], [31], ‘ ‘, ‘!’, ‘”‘, ‘#’, ‘$’, ‘%’, ‘&’, ‘\”, ‘(‘,
‘)’, ‘*’, ‘+’, ‘,’, ‘-‘, ‘.’, ‘/’, ‘0’, ‘1’, …]

Looks familiar? Yes, those are ASCII codes. So the conclusion seems to be that list and charlist are the same in Elixir. Well, this is mostly right, as we have seen, but not exactly right. Run the file below to inspect into list and charlist internal using pry:

require IEx

value_list = [7]
value_list2 = [1]
value_char = '\a'


daveti@Daves-MacBook-Pro:~/elixir/list_issue$ iex list_issue.exs
Erlang/OTP 20 [erts-9.0] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:10] [hipe] [kernel-poll:false] [dtrace]

Request to pry #PID<0.83.0> at list_issue.exs:7

5: value_char = ‘\a’
7: IEx.pry

Allow? [Yn]
Interactive Elixir (1.5.1) – press Ctrl+C to exit (type h() ENTER for help)
pry(1)> i value_list
Data type
This is a list of integers that is printed as a sequence of characters
delimited by single quotes because all the integers in it represent valid
ASCII characters. Conventionally, such lists of integers are referred to as
“charlists” (more precisely, a charlist is a list of Unicode codepoints,
and ASCII is a subset of Unicode).
Raw representation
Reference modules
Implemented protocols
IEx.Info, Collectable, Enumerable, Inspect, List.Chars, String.Chars
pry(3)> i value_list2
Data type
Reference modules
Implemented protocols
IEx.Info, Collectable, Enumerable, Inspect, List.Chars, String.Chars
pry(5)> i value_char
Data type
This is a list of integers that is printed as a sequence of characters
delimited by single quotes because all the integers in it represent valid
ASCII characters. Conventionally, such lists of integers are referred to as
“charlists” (more precisely, a charlist is a list of Unicode codepoints,
and ASCII is a subset of Unicode).
Raw representation
Reference modules
Implemented protocols
IEx.Info, Collectable, Enumerable, Inspect, List.Chars, String.Chars

Now you see the difference between list and charlist. Both have data type list, which explains why they are equivalent. Moreover, when the numbers inside the list could be interpreted as valid ASCII code points, Elixir shows it as a charlist rather than a list, although the raw presentation is always a list.

Why this is confusing? Let’s say you wanna print out a list, e.g., [7], but what Elixir displays is ‘\a’. Although it does not affect the internal computation, e.g., map, this display issue is really annoying and confusing. In this case, Erlang seems doing sane:

Eshell V9.0 (abort with ^G)
1> [7].
2> ‘\a’.
3> [7] == ‘\a’.

Posted in Programming | Tagged , , , , | Leave a comment

A PoC of DoS attack in Elixir Actor Model

The naive way of using the Actor model in Elixir is using “receive” in a loop, which is then “spawn”d as a Erlang process. Unfortunately, a potential DoS attack could happen if the pattern matching is not coded carefully with security in mind. This post looks into this potential vulnerability, and provides a PoC implementation.

A typical “receive” block in Elixir provides different pattern matchings, like below:

receive do
      {sender_pid, {:msg, _msg}} ->
        send(sender_pid, {:ok, "gotcha"})
      _ ->
        IO.puts "WTF"

Following the same order of matchings, the message goes through each matching one by one till finding a match. Otherwise, “_” (matching all) catchs whatever received. While this defensive matching all seems optional, it is vital to the security of the Erlang VM. As turns out, the message would stay in the mailbox forever, if no matching is able to “catch” the message. The potential result is DoS attack, caused by OOM (out of memory) of the mailbox.

To observe this potential DoS attack, please download the PoC ( As shown below, unmatched messages keep accumulating in the mailbox (message queue) when running edos:

msg 24645 sent
worker msg queue len {:message_queue_len, 24645}
msg 24646 sent
worker msg queue len {:message_queue_len, 24646}
msg 24647 sent
worker msg queue len {:message_queue_len, 24647}
msg 24648 sent

“top” also shows the memory usage (of beam.smp) keep increasing as more messages are sent to the mailbox. This DoS could turn into DDoS easily in a distributed environment.

As usual, Elixir compiler does not complain about this during the compilation time, although it would be really useful for the compiler to give some warnings. Nothing is gonna happen in runtime as well, except the memory consumption of the mailbox, which is “invisible” to the user. Even worse, dialyxir also reports nothing regarding this potential issue.

One more thing needs to verify is whether a GenServer implementation without handle_info/2 would also cause DoS in the mailbox. I suspect it the same as above.

Posted in Programming, Security, Static Code Analysis | Tagged , , , , | Leave a comment