Intel SGX CPUs (starting from Skylake) have been around for a while. The good news is that there is still no known exploitation against SGX itself, though there are some exploitations against enclave code and the Intel SGX SDK. In general, SGX is still believed to provide a strong security guarantee for the data/code in the enclave. If something is really messed up in SGX, it has to be in the CPU logic/microcode. This post peeks into a specific bug reported by Intel in its SGX CPU implementation. Moving forward, we investigate possible mitigations and add new features to the well-known Intel platform security tool chipsec. Cheers.

1. SKL012

In the spec update [1] released by Intel in Sep 2016, there are 6 CPU bugs related to SGX in general [2]. None of them is treated seriously by Intel, and thus no fix is planned for any of them. Among them, we especially look at SKL012:


The SMSW Instruction May Execute Within an Enclave


The SMSW instruction is illegal within an SGX (Software Guard Extensions) enclave, and an attempt to execute it within an enclave should result in a #UD (invalid-opcode exception). Due to this erratum, the instruction executes normally within an enclave and does not cause a #UD.


The SMSW instruction provides access to CR0 bits 15:0 and will provide that information inside an enclave. These bits include NE, ET, TS, EM, MP and PE.


None identified. If SMSW execution inside an enclave is unacceptable, system software should not enable SGX.


For the steppings affected, see the Summary Table of Changes.

My interpretation of the SKL012 bug is that SGX was designed to disallow the SMSW instruction inside an enclave, even though the instruction can be executed by ring-0 and ring-3 code. Unfortunately, due to this bug, SMSW “may” still be executed in the enclave.

2. So what's the fuss?

SMSW is one of the security-sensitive instructions that can be run by ring-3 code and reveal security-sensitive information about the platform and the kernel to user space. These instructions are usually leveraged by malware to detect VM environments or to exploit the system/kernel configuration [3][4]. Similar to SMSW, other security-sensitive instructions include SGDT, SLDT, SIDT, and STR. All these instructions can be run by ring-3 code without any problem. So what about these instructions under SGX, besides the SMSW case mentioned in SKL012?

3. Verify the bug(s)

Here we implement a tool called sgxbug [7], based on the enclave creation sample code provided by the Intel SGX SDK for Linux [8], by adding the SMSW instruction into the enclave and retrieving the results from the application. We also cover the other 4 security-sensitive instructions mentioned above. For implementation details, please check out the commit ( For normal ring-3 testing code without SGX, one could refer to [6] for details.

Here is the result of running sgxbug app:

root@sgx2-HP-ENVY-x360-m6-Convertible:~/git/sgxbug# ./app 0
Got senstive instruction idx: 0
Checksum(0x0x7ffe4ba35c30, 100) = 0xfffd4143
Info: executing thread synchronization, please wait…
Start sensitive instruction testing…
GDT: limit=0127, base=ffff880273c49000
Sensitive instruction testing done…
Info: SampleEnclave successfully returned.
Enter a character before exit …

root@sgx2-HP-ENVY-x360-m6-Convertible:~/git/sgxbug# ./app 1
Got senstive instruction idx: 1
Checksum(0x0x7ffd677a11f0, 100) = 0xfffd4143
Info: executing thread synchronization, please wait…
Start sensitive instruction testing…
IDT: limit=4095, base=ffffffffff578000
Sensitive instruction testing done…
Info: SampleEnclave successfully returned.
Enter a character before exit …

root@sgx2-HP-ENVY-x360-m6-Convertible:~/git/sgxbug# ./app 2
Got senstive instruction idx: 2
Checksum(0x0x7fff399040a0, 100) = 0xfffd4143
Info: executing thread synchronization, please wait…
Start sensitive instruction testing…
LDT: ffffffffffff0000
Sensitive instruction testing done…
Info: SampleEnclave successfully returned.
Enter a character before exit …

root@sgx2-HP-ENVY-x360-m6-Convertible:~/git/sgxbug# ./app 3
Got senstive instruction idx: 3
Checksum(0x0x7fffb1b31300, 100) = 0xfffd4143
Info: executing thread synchronization, please wait…
Start sensitive instruction testing…
MSW: ffffffffffff0033
Sensitive instruction testing done…
Info: SampleEnclave successfully returned.
Enter a character before exit …

root@sgx2-HP-ENVY-x360-m6-Convertible:~/git/sgxbug# ./app 4
Got senstive instruction idx: 4
Checksum(0x0x7ffcfe43bee0, 100) = 0xfffd4143
Info: executing thread synchronization, please wait…
Start sensitive instruction testing…
TR: ffffffffffff0040
Sensitive instruction testing done…
Info: SampleEnclave successfully returned.
Enter a character before exit …


(Maybe not) Surprisingly, not only SMSW works smoothly, but also SGDT, SLDT, SIDT, and STR. In summary, there is no limitation at all on code in the enclave executing these security-sensitive instructions.


Now what? It turns out the latest Intel CPUs have a thing called UMIP – User Mode Instruction Prevention [5], which, as its name implies, can block some security-sensitive instructions from running at ring 3. The blocked instructions currently include all 5 instructions mentioned above. That means, once UMIP is enabled, a user-space application (or malware) is not able to run these instructions anymore. This is really good in my opinion, and I would recommend enabling it if possible to reduce the attack surface from user space. For this reason, we add some new features to the CHIPSEC tool [9]: detecting the UMIP feature, checking if the feature is enabled, and enabling UMIP for all cores if possible. For implementation details, please refer to the commit ( and the commit (

5. What if…

With the knowledge of UMIP, we can start thinking what would happen for SKL012 when UMIP is enabled. We have verified that without UMIP, all these instructions work in the enclave. Will they work again when UMIP is enabled? Unfortunately, my SGX CPU is apparently not “latest” enough to have such a feature:

root@sgx2-HP-ENVY-x360-m6-Convertible:~/git/chipsec# chipsec_util cpu umip detect

### ##
### CHIPSEC: Platform Hardware Security Assessment Framework ##
### ##
[CHIPSEC] Version 1.2.5
****** Chipsec Linux Kernel module is licensed under GPL 2.0
[CHIPSEC] API mode: using CHIPSEC kernel module API
[CHIPSEC] Executing command ‘cpu’ with args [‘umip’, ‘detect’]

[cpu] CPUID out: EAX=0x00000000, EBX=0x029C67AF, ECX=0x00000000, EDX=0x00000000
[CHIPSEC] UMIP available for CPU: False
[CHIPSEC] (cpu) time elapsed 0.000

Here is what I imagine: because of the SKL012 bug in SGX, it is possible that UMIP cannot prevent the execution of these security-sensitive instructions within the enclave. If this is true, with UMIP enabled, malware would need to use SGX to guarantee the execution of these instructions (e.g., for VM detection).




getdelays – get delay accounting information from the kernel

Top may be the most common tool in use whenever a performance issue is hit. It is simple, quick, and dumb. Besides heavy-metal stuff like perf and gprof, another really useful and simple tool is getdelays, which provides per-process/task latency statistics for CPU, memory, and I/O.

1. Where to get it
As mentioned in the comment at the top of the source file, it needs to be compiled with:

gcc -I/usr/src/linux/include getdelays.c -o getdelays

Since it uses the netlink socket, it requires root permission to run as well.

2. What it does

Getdelays does a simple job – creating a netlink socket, sending a request to the kernel for the task statistics, and printing out the reply. Essentially, this netlink socket exposes the kernel taskstats structure to user space. For more information about the taskstats struct, please refer to

3. How it looks like


In the example above, it shows the delay information for httpd, which seems to be working fine without any memory or I/O issues, except some minor delays from the CPU since it is a background process rather than an interactive shell. If an application has shown latency issues, getdelays should be able to show some numbers in “delay total” and “delay average”, which should help narrow the scope of the performance issue to CPU, memory, or I/O.

4. Note

Do not expect too much from getdelays: it simply prints some counters from the kernel, but that should be enough to know where the problem might be. To find the actual performance bottleneck, strace/ltrace/dtrace/lttng/perf/gprof should be considered as the next step.


Making USB Great Again with USBFILTER – a USB layer firewall in the Linux kernel

USENIX Security '16

Our paper “Making USB Great Again with USBFILTER” has been accepted by USENIX Security '16. This post provides a summary of usbfilter. For details, please read the damn paper or download the presentation video/slides from the USENIX website. I will head to TX next week, and see you there~

0. Why USB is not great anymore?

We CANNOT trust a USB device based on its appearance anymore. One of the typical BadUSB attacks is a USB drive with a keyboard functionality that injects a malicious script into the host machine once plugged in. The root cause of the problem is that (almost) anyone can change the USB device firmware to add new functionalities as desired. And people will just plug in USB flash drives found somewhere out of curiosity (“Users Really Do Plug in USB Drives They Find”, Oakland '16). Even worse, this also puts enterprise infrastructure in danger – however powerful the network firewall may be, a suspicious USB device used by an employee can turn everything into vain. As a result, enterprise settings usually forbid the usage of external USB devices except the original keyboards/mice. Most normal users just ignore these threats, or try to plug unknown USB devices into someone else's machines…(this is how friendship breaks). Note that cellphones are also USB devices, and what would you do when someone needs to charge his/her phone using your machine?

1. Our solution – usbfilter

The more we play with USB, the more we realize that USB is just another transport protocol, like TCP/IP for networking devices. Moreover, it is USB packets that are transmitted between the USB host controller and devices. Inspired by netfilter in the Linux kernel, we then made up something like below, and all we needed was to make it work.


2. The design and implementation of usbfilter

One of the key features of usbfilter is its ability to trace a USB packet back to its originating process. This is non-trivial. For instance, because of the generic block layer and the I/O scheduler within the kernel, all USB packets operating on (reading/writing) USB storage devices are handled by the usb-storage kernel thread for performance reasons. Similarly, USB networking devices usually have their own Rx/Tx queues to buffer skbs (IP packets) before they are encapsulated by the USB stack in their drivers. Because usbfilter works at the lowest level of the USB abstraction in the kernel, the pid it sees usually belongs either to a kernel thread (device drivers) or to an IRQ context (null). As one can imagine, we hacked into different subsystems of the kernel and saved the originating pid into the urb (USB packet) before it was lost due to asynchronous I/O. Once we fixed that, we had a more concrete picture of usbfilter:


Now all we need to do is implement a user-space tool, called usbtables, to communicate with the usbfilter component in the kernel and enforce the rules/policies pushed from user space. To make sure there are no conflicts or contradictions within the rules, usbtables also has an internal Prolog engine to reason about each new rule before it is pushed into the kernel.


3. So what can usbfilter do?

Here is the fun part. We list a bunch of cool use cases here. For a complete list of case studies, please refer to our paper. In general, just like iptables, with the help of usbtables, users can write rules to regulate the functionalities of USB devices.

A Listen-only USB headset

usbtables -a logitech-headset -v ifnum=2,product=
      "Logitech USB Headset",manufacturer=Logitech -k
      direction=1 -t drop

A Logitech webcam C310, which can only be used by Skype

usbtables -a skype -o uid=1001,comm=skype -v
      serial=B4482A20 -t allow
usbtables -a nowebcam -v serial=B4482A20 -t drop

A USB port dedicated for charging

usbtables -a charger -v busnum=1,portnum=4 -t drop

There are 2 possible settings for these rules, since users can use usbtables to set the default action when no rule matches. If the default action is DROP, users can use usbtables to build a whitelist, permitting certain devices with certain functionalities. This provides the strongest security guarantee, since each USB device needs at least one rule to work. If the default action is ALLOW, users have to use usbtables to build a blacklist, blocking undesired functionalities from certain devices. This is less secure but provides the best usability.

4. What is LUM?

If you look at the usbfilter architecture figure again, you will notice a thing called usbfilter modules, or Linux usbfilter modules (LUMs). This is another powerful feature of usbfilter. Just like netfilter, usbfilter enables kernel developers to write kernel modules that look into and play with the USB packet as they wish, plug them into usbfilter, and enable new rules using these kernel modules. Check out the example LUM in the code repo, which detects the SCSI write command within the USB packet ( With the help of this LUM, one can write rules to stop data exfiltration from the host machine to a Kingston USB flash drive for user 1001:

usbtables -a nodataexfil2 -o uid=1001
      -v manufacturer=Kingston
      -l name=block_scsi_write -t drop

With the default action set to block any SCSI write to any USB storage device, a whitelist can help permit a limited number of trusted devices in use while preventing data exfiltration when an unknown USB storage device is plugged in.

5. Todo…

There is still a long way to go before usbfilter can be officially accepted into the mainline. Applications may hang forever waiting for a response USB packet whose request USB packet has been filtered by usbfilter, though this may be an implementation issue in the applications. Some USB devices can also become stale in the kernel even after they have been unplugged, if the USB packet used to release the resources is also filtered. Even though usbfilter introduces minimal overhead, using BPF may be mandatory for it to be accepted upstream.

6. Like it?

To download the full paper, please go to my publication page. The complete usbfilter implementation, including the usbfilter kernel for Ubuntu 14.04 LTS, the user-space tool usbtables, and the example LUM to block writes to USB storage devices, is available on my github: For any questions, please go ahead and open an issue in the code repo, and I will try my best to answer it in time.


Fedora Upgrade from 21 to 24

After almost 5 hours of upgrading, my server has been successfully upgraded from Fedora 21 to Fedora 24, which uses the latest stable kernel 4.6. There is an online post demonstrating how to upgrade from Fedora 21 to 23 using fedup. This post talks about upgrading Fedora from 21 to 24 using dnf. NOTE: please do back up your data before action!

0. yum update

This is usually not a problem for Fedora 21, whose support expired a long time ago. Anyway, run it just in case.

1. dnf

According to the Fedora official wiki (, dnf is recommended for system upgrades. Apparently, fedup has been ditched. What we need here are 3 dnf commands:

sudo dnf upgrade --refresh
sudo dnf install dnf-plugin-system-upgrade
sudo dnf system-upgrade download --refresh --releasever=24

The last dnf command should list any errors that block the upgrade. The errors I encountered were obsolete packages that are not supported in the Fedora 24 repo. As you can tell, the only way to move the upgrade forward is to remove all these obsolete packages, using “yum remove” + the unsupported package name reported by dnf.

Once all the errors are cleaned up, dnf is able to download all the required packages for Fedora 24. On my server, this was about 4GB, so you need at least that much space left to hold all these new packages. More importantly, dnf requires another 5GB under root during the package installation. Make sure you keep dnf happy.

2. Keys

Before dnf was able to install the newly downloaded packages, I got this error:

Couldn’t open file /etc/pki/rpm-gpg/RPM-GPG-KEY-fedora-24-x86_64

There is a bug report talking about the possible causes of this issue and the corresponding fixes ( However, if you find that manual key importing does not work, go take a look at the /etc/pki/rpm-gpg directory. What happened on my server was simply that there were no key files for Fedora 24 at all. Oops. The fix is also easy – creating the key files ourselves. Go to and find the key files (primary/secondary). Create these key files and symlink the x86_64 one (the arch of my server) to the primary. That's it.

3. dnf again

Reboot the machine to start the upgrade:

sudo dnf system-upgrade reboot

Hint: yum is now deprecated. Run “dnf update” once you are into the new system.


Malware Reverse Engineering – Part II

While most tools for MRE are straightforward, some of them require time, patience, and skill to show their full power. For static analysis, this means IDA; for dynamic analysis, it is OllyDbg (and WinDbg for Windows kernel debugging). In this post, we will play with disassembly code heavily in both tools. Remember – the key point of MRE is not to fully understand every line of disassembly, but rather to construct a big picture of the malware in a high-level programming language, e.g., C/C++. If you have a Hex-Rays decompiler already, use it to make your life easier. Otherwise, read this post.

0. Report header

Apr 11, 2016. GNV, FL.

1. Download the malware – play at your own risk!

Git clone my git repo ( and copy the malware_g.7z into the Windows VM. NOTE: there is no password protection for this malware.

2. Summary

This malware G and the accompanying jellydll.dll are a proof-of-concept GPU-based rootkit called JellyCuda ( It leverages the Nvidia GPU's non-volatile memory to hide the malicious jellydll.dll and make it persistent without being detected by scanning the hard disk of the host machine. When the host is infected by JellyCuda for the first time, it loads jellydll.dll into the GPU memory, creates a file called jellyboot.vbs in the startup folder, writes itself into the pre-formatted VBScript to make sure the malware runs every time the machine is booted, and finally removes jellydll.dll. After the machine is rebooted, the malware looks for jellydll.dll. If the DLL file is still available, the malware repeats the previous procedure to hide the malicious DLL file in the GPU memory. Otherwise, the malware reads the GPU memory, finds the memory block containing the jellydll.dll contents, reconstructs the DLL file in memory, replaces the current process memory with the contents of the DLL, and finally calls the DllMain() entry function of jellydll.dll, which simply prints out warnings about the existence of the GPU RAT.

Since this is a proof-of-concept malware, specific signatures or remediations for this malware may not be interesting or useful. However, JellyCuda does give us some hints to think about GPU-based rootkit in general:

  1. Calls to CUDA/OpenCL – normal applications usually do not deal with GPU directly.
  2. cuMemAlloc, cuMemcpyHtoD, cuMemcpyDtoH (or the OpenCL equivalents) – this means there is memory block transmission between the main RAM and the GPU memory.
  3. New file created – either the registry and/or the startup folder or the prefetch folder may be changed to include the malware itself, making it persistent across reboots.

To remove JellyCuda from the system, one needs to clean up the residue in the GPU memory first, locate the malware itself based on the modified registry/startup/prefetch, and remove it. The good news is that my Avast was able to recognize JellyCuda as malware when I tried to copy it into the VM for analysis on my Mac.

NOTE: this report focuses on the IDA and OllyDbg analysis, rather than other straightforward tools. The IDA analysis shows the complete picture of the malware, and OllyDbg digs into the malicious payload (jellydll.dll), which could not be analyzed by IDA.

3. Static Analysis

  • Is it packed?

No. PEiD shows a packer named Pelles C for this malware, but that is the compiler which compiled the binary, not a packer.


And nothing found for the accompanied dll:


  • Compilation date?

Malware_g.exe: 2015/05/09.


Jellydll.dll: 2015/05/09


  • GUI or CLI?

Malware_g.exe: PEiD thinks it is a Win32 GUI and PEview thinks the same way.


jellydll.dll: PEiD reports it as Win32 GUI and PEview agrees.


  • Imports?




File manipulation:

CreateFile, WriteFile, CloseHandle, GetFileSize, ReadFile, DeleteFile, GetFileAttributes, GetFileType, GetStdHandle, DuplicateHandle, SetHandleCount,

Memory manipulation:

VirtualAlloc, GlobalAlloc, HeapAlloc, GlobalFree, HeapCreate, HeapDestroy, HeapReAlloc, HeapFree, HeapSize, HeapValidate, VirtualQuery

Process manipulation:

GetProcAddress, GetModuleHandle, GetProcessHeap, GetModuleFileName, GetCurrentProcess, ExitProcess,

Library manipulation:

LoadLibrary, FreeLibrary,


Strlen, strcat, GetLastError, GetStartupInfo, RtlUnwind, GetSystemTimeAsFileTime, GetCommandLine, GetEnvironmentStrings, FreeEnvironmentStrings, UnhandledExceptionFilter, WideCharToMultiByte, SetConsoleCtrlHandler


MessageBox, wsprintf, ExitWindowsEx


OpenProcessToken, LookupPrivilegeValue, AdjustTokenPrivileges









File manipulation:

GetFileType, GetStdHandle, DuplicateHandle, SetHandleCount,

Memory manipulation:

VirtualAlloc, VirtualFree, HeapCreate, HeapDestroy, HeapReAlloc, HeapFree, HeapSize, HeapValidate, VirtualQuery

Process manipulation:

GetCurrentProcess, ExitProcess,


GetStartupInfo, GetSystemTimeAsFileTime, GetCommandLine, GetModuleFileName, GetEnvironmentStrings, FreeEnvironmentStrings

  • Strings?


Process: svchost
Jellyboot.vbs, malware_g.exe

Files generated by the compiler:





Error handling:














  • Sections and contents?

malware_g.exe: there are 3 sections in total

.text: it looks like there is code in it.

.rdata: Warning strings, windows commands, CUDA functions, and interesting stuffs


.data: IAT, and a bunch of debug sections, including COFF



Jellydll.dll: there are 4 sections.

.text: normal code

.rdata: malware writer’s kind reminder


.data: IAT


.reloc: relocation table

(g) Resource

ResourceHacker found nothing for either the malware_g.exe or jellydll.dll.

(h) IDA Pro


The first entry function of malware_g.exe is WinMainCRTStartup(), which is generated by the Pelles C compiler for Windows.


It sets up an exception handler, which calls RtlUnwind(), usually generated by the compiler for try/except. It then moves on to allocate space on the heap using HeapCreate(), called by __bheapinit(). If that fails, it exits. Otherwise, system setup continues.



If everything is still good, we reach the second entry function WinMain(), which is the real function implemented by the malware.


The first thing WinMain() tries to do is to call LoadCuda().


If the loading fails, the malware exits. Otherwise, it continues with calls to dword_40595C, dword_405958, dword_405954, and jc. Since all these are indirect calls, we need to figure out what these memory addresses are by looking into LoadCuda().


As its name implies, LoadCuda() starts with loading nvcuda.dll using LoadLibrary(), and exits if the loading fails.


When nvcuda.dll is successfully loaded, the memory address jc is loaded into %eax and then into the local variable lpAddress. Looking at that memory address, we realize the connection among all those memory addresses: jc is the start address of a struct at address 0x405950, and dword_405954, dword_405958, dword_40595C, …, dword_40596C are the following members of the struct. Since all members are dwords (4 bytes) and invoked by the call instruction, this jc struct contains a bunch of function pointers.



Once jc is loaded into lpAddress, a loop starts over the szFuncNames array. For each name in szFuncNames, GetProcAddress() is called with the library handle returned by LoadLibrary() and the name. The return value is assigned to the current entry of lpAddress.


Looking into the szFuncNames, we see the CUDA functions we have seen in the strings.


Once LoadCuda() is done, struct jc is initialized with all these CUDA functions in order. So back in WinMain(), after LoadCuda() successfully returns, cuInit(), cuDeviceGetCount(), cuDeviceGet(), and cuCtxCreate_v2() are called one by one. Any call failure frees the loaded CUDA library and exits the malware. When CUDA is successfully initialized, GetFileAttributes() is called with jellydll.dll and the return value is checked against 0xffffffff (-1), which is INVALID_FILE_ATTRIBUTES. GetLastError() is then called and the return value is checked against 2, which is ERROR_FILE_NOT_FOUND. When both errors happen, SearchJellyDustOnGPU() is called; otherwise, SprayJellyDustToGPU() is called. Then FreeLibrary() is called and WinMain() returns.


SearchJellyDustOnGPU() calls AllocateGPUMemory() first, which calls dword_405960, which is essentially the 5th member of struct jc – cuMemAlloc_v2().


If AllocateGPUMemory() fails, SearchJellyDustOnGPU() exits. Otherwise, it continues calling GlobalAlloc(), dword_405964 (cuMemcpyDtoH_v2()), and dword_40596C (cuMemFree_v2()), which copy the GPU memory into host memory. Note that the copied memory size is expected to be >= 0x1000C (65548) bytes.


The copied memory is then examined against the magic number 0x5DAB355 in a loop.



If the memory block starts with the magic number, some further checks pass, and GetDustCheckSum() passes as well, we hit the core of SearchJellyDustOnGPU() – GetProcessHeap(), HeapAlloc(), and ExecuteJellyDust(). Note that the ‘rep movsb’ copies the memory block we found, with offset 0xC, into a local variable lpvDust, which is then passed into ExecuteJellyDust().


The ExecuteJellyDust() function calls VirtualAlloc(), LoadLibrary(), and GetProcAddress() in a big loop. Based on the naming of the local variables involved – pImport and pRelocBase – one can guess that this loop is used to reconstruct a library from the memory block. Finally, ExecuteJellyDust() loads ntdll.dll and calls NtFlushInstructionCache(), with parameters (-1, 0, 0), which is undocumented and clears the old code in the cache. Then an indirect call to %eax is made with parameters (lpvTarget, 1, 0). Note that %eax is derived from pNt with offset 0x28, which is the offset of the DllMain() entry point relative to the PE signature. So we know that this final call invokes the entry function of the library created on the fly. Now the question is: what is that library?


The last function we haven't looked at is SprayJellyDustToGPU(), which is called when the malware is able to find jellydll.dll. The only parameter of this function is “jellydll.dll”. First, it calls CreateFile() to open jellydll.dll, then GetFileSize(). Then GetProcessHeap() and HeapAlloc() are called to allocate enough memory for jellydll.dll, which is then read into memory via ReadFile(). AllocateGPUMemory() is called next, followed by GetDustCheckSum() and GlobalAlloc(). Note that the magic number 0x5DAB355 is prepended to the memory block of jellydll.dll.


The JellyDust (magic number + tweak(jellydll.dll)) is then copied into the memory allocated by GlobalAlloc(), and later copied into the GPU memory via dword_405968 (cuMemcpyHtoD_v2()).


At last, the file jellydll.dll is closed and deleted via CloseHandle() and DeleteFile(), before Reboot() is called, which is the last piece of the malware_g.exe puzzle. This function calls SHGetKnownFolderPath() to open _FOLDERDIR_Startup, which is %APPDATA%\Microsoft\Windows\Start Menu\Programs\StartUp.


The startup path is then converted from wide chars into multibyte using wcstombs(), appended with byte 0x5C ('\'), and null-terminated.


Then the file jellyboot.vbs is created under that startup directory.


After the new file is created, GetModuleFileName() is called to get the file path of malware_g.exe itself. The jellyboot.vbs file is then written via WriteFile() with command lines formatted by wsprintf() using the file path of the malware itself, and finally closed via CloseHandle(). The command lines create a COM object using VBScript to run the malware itself and then remove itself.


The last thing Reboot() does is call GetCurrentProcess(), OpenProcessToken(), LookupPrivilegeValue(), and AdjustTokenPrivileges() to gain the permission to reboot the machine using ExitWindowsEx().



Now we know that jellydll.dll is the RAT, and its DllMain() entry function is executed by malware_g.exe. However, IDA screws up the analysis of this library. The DLL entry function tries to call sub_10001030, which is an address in the .rdata section.



4. Dynamic Analysis


We are not able to run malware_g.exe, not only because of the CUDA requirement, but also because the procedure below could not be located. Why? This function is only available on Windows Vista and above.



To see what the heck jellydll.dll is doing in its DllMain() entry function, we load jellydll.dll into OllyDbg, which asks if we want to load LoadDLL.exe to run the library. After clicking yes, we finally see the RAT.


Then we break at the new module loading time and find the exact DllMain entry function, which is at 0x7C901187.


Then we break at the DllMain() function to examine the stack. %esp is 0x0006F8AC, and %ebp is 0x0006F8C4. The first parameter of the function is at the top of the stack, which is address 0x0006F8AC. The second parameter is address 0x0006F8B0. The third parameter is address 0x0006F8B4. The function call is ss:[ebp+8], which is address 0x0006F8CC.


Moving on to look back at the stack, we have:

First parameter (hinstDLL) – 0x0006F8AC: 0x10000000 –  should be the handle to the loaddll.exe itself.

Second parameter (fdwReason) – 0x0006F8B0: 0x00000001 – that is the REASON code DLL_PROCESS_ATTACH.

Third parameter (lpvReserved) – 0x0006F8B4: 0x00000000 – NULL for dynamic loads.

Function call – 0x0006F8CC: 0x10001140 – that is the correct address of DllEntryPoint() shown in IDA.

There we go, let us step into DllMain(). The real function call in the DLL entry is at address 0x1000117E, with the instruction “call 10001000”. So we break at this line again and examine the stack.


Now an interesting thing happens. When we try to set a breakpoint at that address, OllyDbg tells us that we are looking at code in the data section rather than the code section, which may explain why IDA screws up. Anyway, set the breakpoint and step in.


We finally see the final function called in the DllMain() of the jellydll.dll. It is a call to MessageBox with the capital string and the RAT string.


5. Indicators of compromise

Since this is a proof-of-concept of GPU-based malware, it is easy to know the machine is compromised when the warning window shows up. In reality, the indicator could be non-trivial to find, depending on the implementation of the GPU payload (jellydll.dll). If it is a rootkit, it may stay in the machine for a long time without detection, and even AV may not help. If it is a RAT, we may be able to find unfamiliar socket connections to the outside. If it is a ransomware, we know when we know.

6. Disinfection and remedies

It is not clear so far what the best solution would be for GPU-based malware (and I am going to dig deeper to see if there is a potential paper there). Since current prototypes of GPU-based malware require a ‘helper’ in the host system to work, Intel does not think it is a threat ( On the other hand, my Avast on Mac was able to detect JellyCuda when I tried to move it into the VM for analysis. The best I can think of for now is a system tool/mechanism that looks into the GPU memory for malware detection, just like AV does on the host machine. We may also reconsider the access control for the GPU from a security point of view. Yeah, I am talking about the pitch of a potential paper trying to defend against GPU malware. Will see how it goes:)

Posted in Security, Static Code Analysis

Malware Reverse Engineering – Part I

I took a “Malware Reverse Engineering (MRE)” class last semester and it was fun, partly because I was not a Windows person – and I am still not. What seems ridiculous to me is how trivially one can write into any process on Windows XP, which was apparently designed for malware! Regardless of all that Windows crap, this post shares a general workflow of malware reverse engineering on the Windows (XP) platform, and the corresponding tools. Note that this report is by no means a good one. Instead, this was my first trial and I intended to put in as much information as possible. If you are interested in MRE or want a job doing it, do buy this book and give it a complete read. Have fun and stick with Linux~

0. Report header

Feb 12, 2016, GNV FL.

1. Download the malware – play at your own risk!

All the malware samples can be found on my GitHub. All malware binaries are compressed by 7Zip with password “malware” protection. This post is about the first malware/ransomware uploaded. Before you start, make sure you have a Windows VM (KVM/VirtualBox/VMware) ready, with networking disconnected from the host machine.

2. Summary

This malware is a kind of ransomware. The IP/domain used for networking is encoded instead of stored in plain text. The encryption routine is also a DIY method that does not call existing crypto libraries. Most imports are ReadFile/CreateFile/DeleteFile. Two new entries are added to the registry, and one of them is the malware itself. All *.doc, *.txt, *.jpg, etc. under C: are encrypted. A DNS query is also triggered for domain “”. The file “CryptoLogFile.txt” may be used to detect this malware, since it is created first to log all the files encrypted. “” does not seem to be a helpful signature, since it is a valid domain.

3. Static analysis

  • Is it packed?

Seems not.


PEiD: Nothing Found (but shows Win32 GUI as its subsystem).


PEview: A lot of imports can be found.


pestudio: same as PEview

If PEiD is able to detect a packer, it provides information about the packer, which can be used to find the unpacker; if PEiD fails, we have to fall back to PEview/pestudio and investigate the imports and section contents manually.

  • Compilation date?


PEview: 2009/10/09.

  • GUI or CLI?

Seems CLI.

PEview: There are no GUI-related functions or DLLs found in the text section.


Depends: It only depends on user32.dll, kernel32.dll, and shell32.dll.

  • Imports?

PEview: file related operations (CreateFile, DeleteFile, FindFile, ReadFile, WriteFile), a bunch of ‘get’ functions (GetCommandLine, GetEnvironmentVariable, GetFileSize, GetLogicalDrives, GetWindowsDirectory), and some string operations (lstrcat, lstrcmp, lstrcpy). A wild guess would be that this malware goes into the Windows directory, removes the target files, and also creates some new files.


  • Strings?

Process: NA
File: user32.dll, kernel32.dll, shell32.dll, CryptLogFile.txt, wallpaper.bmp, .txt, .doc, .xls, .db, .mp3, .waw, .jpg, .rtf, .pdf, .zip,




  • Sections and contents?

PEview: text seems OK; rdata contains the import address table, directory table, and name table, as shown in (d); data contains 2 interesting file names (CryptLogFile.txt, wallpaper.bmp); rsrc contains some icons, which seem fine.


ResourceHacker: the rsrc section appears to have no code embedded.



  • The first file created

There are 3 subroutines, plus main, calling CreateFile:



The main function prepares the file name to be “c:\windows\CryptoLogFile.txt”,


and then saves it at byte_403F28 after preprocessing – removing the char 22h,


and then calls sub_4015B5, which makes the first CreateFile call, creating the file using the filename at byte_403F28 – the CryptoLogFile.txt.


  • dword 0xCA6B93C9

The DIY encryption routine uses this table to look up a value, which is then XORed with the original value to achieve the encryption. If the encryption were public-key based instead, this secret data buffer could hold keys – e.g., a public key used to encrypt the files, while the attacker keeps the matching private key needed for decryption until the ransom is paid.

  • sub_401000

This routine starts with FindFirstFileA to find a specific file, returns if the search fails (locret_401261), and keeps looping until all the target files have been gone through.


  • sub_40140D

I would name it: read the file, encrypt it into a new file, and remove the original file.


It also calls the shell del command to remove the file:


  • sub_401263

I would rename it to – DIY encryption routine – especially after I saw operations like:

xor eax, the_secret_look_up_table[edx]
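The lookup-then-XOR pattern above can be sketched in C. This is an illustrative reconstruction only – the real sample’s table contents, size, and indexing scheme are unknown, so the table below is made up. Since XOR is its own inverse, running the routine twice restores the plaintext, which is also why such DIY crypto is breakable once the table is recovered from the binary.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the secret lookup table (the dword at
 * 0xCA6B93C9 in the sample); real contents are unknown. */
static const uint8_t secret_table[16] = {
    0xC9, 0x93, 0x6B, 0xCA, 0x12, 0x34, 0x56, 0x78,
    0x9A, 0xBC, 0xDE, 0xF0, 0x11, 0x22, 0x33, 0x44,
};

/* XOR each byte with a table entry selected by its offset,
 * mirroring "xor eax, the_secret_look_up_table[edx]".
 * Calling this twice on the same buffer restores the original. */
static void xor_crypt(uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= secret_table[i % sizeof(secret_table)];
}
```

The same function serves as both encryptor and decryptor, which matches the single routine (sub_401263) seen in the disassembly.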

4. Dynamic analysis


REMnux: start inetsim


start apateDNS


start Process Explorer


start Procmon (then pause and clear)


start RegShot (the 1st shot)


Unpause Procmon; execute the malware; pause Procmon (it seems to hang every time…)


Take the 2nd RegShot

  • Interesting behaviors that occur after the malware has executed.


  • Machines and services the malware attempts to contact by IP or domain or host name.


  • Registry keys created/modified by the malware


  • Files created/modified by the malware


There are also files encrypted outside the Windows directory, e.g., the Dynamic Analysis directory on the desktop. Since I was scanning only the dir under c:\windows, these files are not shown in RegShot. However, CryptoLogFile lists all the encrypted files (how nice is that).

  • Processes started by the malware

Notepad, and maybe others (Procmon got stuck when pausing…)


5. Indicators of compromise

A lot of files have been encrypted, as listed in CryptoLogFile.txt. For example, one of the README.txt files looks like below. And, for sure, comes the “new” wallpaper with an introduction to the ransomware, and ways to pay the ransom.



6. Disinfection and remedies

To make sure this ransomware will not start again, we need to clean up the registry. If there is a data backup (there should be), or a system snapshot, do a restore – yeah, problem solved. If there is no data backup, and I am able to break the encryption routine (DIY crypto could be vulnerable compared to other common crypto methods and implementations), then it is time to learn math and assembly. Otherwise, which may be the most common way, pay the ransom.

Posted in Security, Static Code Analysis

gcc, llvm, and Linux kernel

This post talks about what happened recently in Linux kernel mailing list discussions. While it does not dig into compiler internals or the whole picture between the Linux kernel and compilers, we discuss 2 specific issues, one from gcc and one from llvm. The gcc issue may be a quirk, but the llvm issue is definitely a bug. Keep reading…

1. leal %P1(%%esp),%0

The title is the inline assembly used at arch/x86/boot/main.c line 121. The thing that seems weird is the ‘P’ in ‘%P1’, which is uncommon compared to the ‘%1’ we are used to seeing in gcc inline assembly. So what the heck is it[1]? Let us put this kernel inline into a main function where we can play with gcc easily:

#include <stdio.h>
#define STACK_SIZE	512
static int stack_end;

int main()
{
	asm("leal %P1(%%esp),%0"
		: "=r" (stack_end)
		: "i" (-STACK_SIZE));

	return 0;
}

Then we assemble the code (gcc -S) and look at the assembly, where we can see the inline is interpreted as follows:

leal -512(%esp),%eax

This is exactly the thing we want for ‘leal’. In a word, gcc does not complain anything about this ‘P’. What if we remove the ‘P’ and look at the assembly again? After a quick trial, here is the inline generated by gcc:

leal $-512(%esp),%eax

Oops, gcc recognizes that ‘%1’ is an immediate value and appends ‘$’ (AT&T style) automatically. This may be right in most cases but is definitely wrong for ‘lea’. As a matter of fact, if I try to compile the code directly, gcc would not let me. Now it is clear that the tricky ‘P’ in ‘%P1’ is used to make gcc happy and working. Note that I am using gcc 4.9.2. The latest gcc (5/6?) seems to have fixed this quirk already – generating the same, correct assembly with or without the mysterious ‘P’. Go try yourself.

2. pushf/popf

The original issue was reported from usbhid testing using an llvm-compiled kernel[2]. With kernel developers’ further debugging, the root cause of the bug is clear, pointing to llvm rather than the kernel code itself[3]. Let us go through the example described in the llvm mailing list. Here is the source file:

#include <stdlib.h>
#include <stdbool.h>

/* Assume foo changes the IF in EFLAGS */
void foo(void);
int a;

int bar(void)
{
	foo();
	bool const zero = a -= 1;
	asm volatile ("" : : : "cc");
	foo();
	if (zero) {
		return EXIT_FAILURE;
	}
	foo();
	return EXIT_SUCCESS;
}
The point is that foo() may (or may not) change the IF in EFLAGS. Compile it to generate the object file (clang -O2 -c -o ) and disassemble it as shown below (objdump -S):

[daveti@daveti c]$ objdump -S llvm_if_issue.o

llvm_if_issue.o:     file format elf64-x86-64

Disassembly of section .text:

0000000000000000 <bar>:
   0:	53                   	push   %rbx
   1:	e8 00 00 00 00       	callq  6 <bar+0x6>
   6:	ff 0d 00 00 00 00    	decl   0x0(%rip)        # c <bar+0xc>
   c:	9c                   	pushfq
   d:	5b                   	pop    %rbx
   e:	e8 00 00 00 00       	callq  13 <bar+0x13>
  13:	b8 01 00 00 00       	mov    $0x1,%eax
  18:	53                   	push   %rbx
  19:	9d                   	popfq
  1a:	75 07                	jne    23 <bar+0x23>
  1c:	e8 00 00 00 00       	callq  21 <bar+0x21>
  21:	31 c0                	xor    %eax,%eax
  23:	5b                   	pop    %rbx
  24:	c3                   	retq

Let us focus on the interesting part:

   c:	9c                   	pushfq
   d:	5b                   	pop    %rbx
   e:	e8 00 00 00 00       	callq  13 <bar+0x13>
  13:	b8 01 00 00 00       	mov    $0x1,%eax
  18:	53                   	push   %rbx
  19:	9d                   	popfq

As you can see, before bar() calls foo(), it saves EFLAGS on the stack using ‘pushf’. After foo() is done, it restores EFLAGS from the stack using ‘popf’. Remember our assumption – foo() may change the IF in EFLAGS! Now we can explain the bug found in usbhid. The foo() is spin_lock_irq(), and the bar() is usbhid_close(). While spin_lock_irq() makes sure interrupts are disabled, usbhid_close() restored the old value of EFLAGS, undoing what happened in spin_lock_irq().

3. Summary

The gcc quirk may reflect a hackish fix in gcc from the early days to satisfy kernel compilation requirements. After all, gcc is the only compiler able to compile the Linux kernel without any kernel patches. As such, the Linux kernel is the only project leveraging gcc features that other projects would never bother with. On the other hand, llvm is catching up. There are kernel patches already to make llvm compile the kernel, and people are testing llvm kernel images. Nevertheless, the EFLAGS clobbering issue in llvm optimization may be a showstopper. Most user-space applications do not care about interrupts; however, they are a core requirement for the kernel to work as expected. As Linus pointed out – “Using pushf/popf in generated code is completely insane (unless done very localized in a controlled area).”

4. Reference


Posted in OS, Stuff about Compiler

Defending Against Malicious USB Firmware with GoodUSB

Finally, 4 months after our paper was accepted by ACSAC’15, I can now write a blog post about our work – GoodUSB – and release the code, after some software patent bul*sh*t. (I sincerely think software patents should be abolished from the very start!) Anyway, this post is all about malicious USB firmware, BadUSB attacks, and our defense solution in the Linux kernel – GoodUSB. Go ahead, download GoodUSB, and play with it. Any question, shoot me an email.

0. To remember the old paper title given by Dr. Bates

GoodUSB: How I Learned to Stop Worrying and Love the Rubber Ducky

1. A quote from a chat in Skype[1]

“I read an article about how a dude in the subway fished out a USB flash drive from the outer pocket of some guy’s bag. The USB drive had “128” written on it. He came home, inserted it into his laptop and burnt half of it down. He wrote “129” on the USB drive and now has it in the outer pocket of his bag…”

2. BadUSB attacks

If you read the reference link, you will find that the “USB flash drive” (or USB Killer) has some embedded capacitors supporting a high negative voltage (-110V). Once charged, these capacitors are able to cause overcurrent in the USB signal lines. While we are not going to talk about this in detail (probably in another post), it does leave us a question: “Arriving at work, you find a USB drive on your table. What would you do?”[1]

In reality, data exfiltration or backdoor injection is preferred to burning down the machine. This is what the USB rubber ducky[2] is designed for. As a penetration testing tool, the USB rubber ducky looks like a USB thumb drive, but quacks like a keyboard, and types like a keyboard. Therefore, it is a keyboard. The only difference between a keyboard and a USB rubber ducky, besides the appearance, is that unlike normal keyboards, a USB rubber ducky does not need a human being to type the keystrokes – an adversary can write a malicious script, compile and load it into the ducky, and the ducky will execute it once plugged in. How cool is that! A more powerful programmable USB device is Teensy[3]. Teensy 3.1 development has been integrated into the Arduino IDE. As with the USB Killer, the best possible defense would be to open the case and look at the PCB carefully (as long as you know what you are looking at…).

Unfortunately, the BadUSB[4] attacks at BlackHat 2014 made our best try so far in vain. Rather than requiring a specific USB microcontroller, people can write malicious firmware themselves for common USB microcontrollers, thanks to existing firmware building tools[5]. This means that a USB flash drive could behave as a storage device and a keyboard at the same time. While the storage part provides the normal usage, the keyboard part is essentially a USB rubber ducky. The problem now is that we will not know if the firmware is malicious or not until it is plugged in. And most of the time, we do not even know that there is a keyboard enabled, since it happens within the OS. Now let us repeat the question again: “Arriving at work, you find a USB drive on your table. What would you do?”

3. Root Cause Analysis

The root of BadUSB attacks originates from the USB spec. A USB device is allowed to have multiple functionalities (interfaces). Think about a USB headset, which contains audio functionalities (speaker + microphone) and an input/keyboard functionality (volume control). Therefore, there is no violation from the spec’s point of view for a USB storage device to have a keyboard functionality (and for some storage devices, this extra keyboard functionality may be needed, as we will talk about later). In reality, when a USB device is plugged into the host machine, it can report any functionalities (interfaces) that need the OS’s support. The OS tries its best to find the corresponding driver to serve each functionality. Think about a BadUSB thumb drive. When it is plugged into the host machine, it reports itself with both a storage and an input (keyboard) interface during USB enumeration (the procedure by which the host machine recognizes the device). The OS then loads a storage driver and an input driver to make the device function. Once the input driver is loaded, the BadUSB device types a malicious script (like a human being), which is executed by the OS automatically. All of this happens in the OS within a second while the user is going through the files saved in the storage.
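The multi-interface point can be made concrete with a (heavily simplified) view of USB interface descriptors. The class codes are the standard ones from the USB spec (0x08 = mass storage, 0x03 = HID); the composite “BadUSB thumb drive” device below is hypothetical, just to show that one configuration can legally expose both a storage and a keyboard interface.

```c
#include <stdint.h>

/* Simplified USB interface descriptor fields (per the USB 2.0 spec). */
struct usb_if_desc {
    uint8_t bInterfaceNumber;
    uint8_t bInterfaceClass;     /* 0x08 = mass storage, 0x03 = HID */
    uint8_t bInterfaceSubClass;
    uint8_t bInterfaceProtocol;
};

/* A hypothetical BadUSB thumb drive: one configuration reporting
 * both a storage interface and a keyboard (HID) interface. */
static const struct usb_if_desc badusb_ifaces[] = {
    { 0, 0x08, 0x06, 0x50 },  /* mass storage, SCSI, bulk-only  */
    { 1, 0x03, 0x01, 0x01 },  /* HID, boot subclass, keyboard   */
};

/* Count how many interfaces of a given class the device exposes -
 * roughly what the host does when deciding which drivers to load. */
static int count_class(const struct usb_if_desc *d, int n, uint8_t cls)
{
    int cnt = 0;
    for (int i = 0; i < n; i++)
        if (d[i].bInterfaceClass == cls)
            cnt++;
    return cnt;
}
```

Nothing in enumeration flags the second interface as suspicious – the OS just loads usb-storage for interface 0 and a HID driver for interface 1, which is exactly the gap GoodUSB interposes on.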

4. GoodUSB

The OS knows nothing about the USB device but is able to load different drivers to make the device happy (work); the user knows something about the device, e.g., from the appearance of the device, but is not able to interpose between the OS and the device. To bridge this semantic gap, ideally, we need a way to let the user and the OS talk:

User: I have just plugged in a USB flash drive.
OS: OK. I will not allow it to have a keyboard functionality then.

Essentially, this is GoodUSB.

As an end-to-end and systematic solution defending against malicious USB firmware, GoodUSB includes not only a customized Linux kernel but also a user-space daemon supporting a GUI and a honeypot KVM (HoneyUSB) for redirecting suspicious devices at run time or start time. While I am not going to list technical details here, I put the GoodUSB architecture figure here for a flavor and redirect further interest to our paper.


When the USB device is plugged into the host machine for the first time, the device class identifier in the kernel tries to fingerprint the firmware to get a signature (SHA1). The kernel then suspends further actions and sends the information about the device to the user-space daemon before enabling the device. The GoodUSB user-space daemon (gud) pops up a GUI asking for the user’s expectation about this device, as shown in the figure below:


Note that the choices in the GUI are high-level descriptions of the device without any low-level USB spec terms. One beauty of GoodUSB is that once the user gives a general description of the device, the policy engine within gud is able to find the possible functionalities (interfaces) required to enable the device with the least “permission” (if we treat drivers as permissions). For instance, if the user chooses “USB Storage”, no keyboard (input) functionality will be enabled for sure. After this, another GUI pops up letting the user bind this device to a security picture (just like the security picture used when logging into an online banking system). Now gud has all the information it needs. Besides updating the local device database, it relays all the information to the kernel, which can further configure the device as needed and expected by the user. When the device is plugged in for the 2nd time, the kernel is able to recognize it and asks for confirmation from the user via gud:


However, if the device is shown with a green dinosaur but the user knows it should be a red one, then the user is aware that the firmware of the device has been changed (to mimic the device bound to the green dinosaur). In that case, after “This is NOT my device!” is clicked, the device will be redirected into HoneyUSB, where we have implemented a USB profiler (usbpro) to inspect the behaviors of the device. Even though GoodUSB was designed against BadUSB attacks, its ability to customize the functionalities to be enabled for a USB device is also invaluable in daily use. E.g., GoodUSB is able to shut down the microphone in a USB headset while leaving the speaker working as usual.

5. Limitations

As with other 0-day stuff, GoodUSB is not able to defend against 0-day malicious firmware. If a device the user expects to be a keyboard inputs scripts automatically, there is nothing GoodUSB can do. As readers may have realized, GoodUSB relies on trusting the drivers. If the driver is malicious, GoodUSB does not work. Another thing I have to mention here is USB quirks. Although we have tried to cover as many devices as possible, there are always USB quirks that will not function properly with GoodUSB. One example is the Yubikey, which looks like a thumb drive, has a USB hub functionality, and behaves like a keyboard. The last limitation comes from us – human beings. GoodUSB uses a GUI and security pictures with the hope of helping users make a better judgement. Again, this is our hope. We have not done, and will not do, any user study to show the validity of using a GUI and security pictures. Usability is beyond the scope of the paper.

6. RtDC


Posted in OS, Security

Linux kernel hacking – one relay file for all CPUs

I wrote a post about kernel relay 2 years ago. However, I have realized that I did not understand relay until recently, when I was debugging a relay-related bug. Though I was working on the RHEL 2.6.32 kernel, this post also applies to the latest 4.3 kernel at the time of writing. After all, kernel relay has been stable for more than a decade. May this post help you understand kernel relay a little bit better.

0. When relay is init’d normally

As my old post described, when relay is initialized normally, there should be a relay file under /sys/kernel/debug for each CPU. As you would expect, there is a per-cpu buffer in the struct rchan to avoid potential locking among different CPUs.

  68        struct rchan_buf *buf[NR_CPUS]; /* per-cpu channel buffers */

The user-space code then has to go through all the relay files (e.g., using select()) to receive the data from the kernel.
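The per-cpu read loop can be sketched in user space. As an assumption for illustration, two pipes stand in for two per-cpu relay files (the real code would open /sys/kernel/debug/<name>0..N instead); the select()-and-drain structure is the same.

```c
#include <sys/select.h>
#include <unistd.h>

#define NCPU 2  /* pretend we have two per-cpu relay files */

/* select() over all "relay" fds and drain whichever are readable;
 * returns the total number of bytes read across all of them. */
static ssize_t drain_relay(const int fds[NCPU])
{
    fd_set rset;
    int maxfd = -1;

    FD_ZERO(&rset);
    for (int i = 0; i < NCPU; i++) {
        FD_SET(fds[i], &rset);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    struct timeval tv = { 0, 0 };   /* poll once, do not block */
    ssize_t total = 0;
    if (select(maxfd + 1, &rset, NULL, NULL, &tv) > 0) {
        char buf[256];
        for (int i = 0; i < NCPU; i++)
            if (FD_ISSET(fds[i], &rset))
                total += read(fds[i], buf, sizeof(buf));
    }
    return total;
}
```

This per-file fan-in is exactly the user-space burden that the single global buffer discussed below removes.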

1. Could we just have 1 relay file in the user-space?

When relay_open() is called to start the relay, the per-cpu buffers are created one by one, given the total number of CPUs online:

 603        for_each_online_cpu(i) {
 604                chan->buf[i] = relay_open_buf(chan, i);
 605                if (!chan->buf[i])
 606                        goto free_bufs;
 607        }

When the kernel starts to relay something in relay_write():

 207        buf = chan->buf[smp_processor_id()];

smp_processor_id() returns the id of the CPU the code is currently running on, and the corresponding per-cpu buffer is used for that CPU to hold the data. Now the question is: could we make all the CPUs use just one “per-cpu” buffer?

2. A dirty hack!

The short answer is yes. This is done by a dirty hack in the struct rchan_callbacks:

 143        struct dentry *(*create_buf_file)(const char *filename,
 144                                          struct dentry *parent,
 145                                          umode_t mode,
 146                                          struct rchan_buf *buf,
 147                                          int *is_global);

In short, besides all the relay code we had, we also need to tune the callback named create_buf_file() and mark is_global as 1 (true). Besides the opportunity to customize the location of the relay file, this callback also gives us a chance to let the kernel know that we want a “global” buffer for all CPUs.

Here is the reason why I call it dirty:

 442        if (chan->is_global)
 443                return chan->buf[0];
 445        buf = relay_create_buf(chan);
 446        if (!buf)
 447                return NULL;
 449        if (chan->has_base_filename) {
 450                dentry = relay_create_buf_file(chan, buf, cpu);
 451                if (!dentry)
 452                        goto free_buf;
 453                relay_set_buf_dentry(buf, dentry);
 454        }

When relay_open_buf() is called by relay_open() for CPU0, the is_global flag saved in the struct rchan is still 0 (false) after initialization. So the per-cpu buffer will be created for CPU0, and relay_create_buf_file() will be called to create the relay file in the filesystem. But before relay_create_buf_file() returns, it calls our create_buf_file callback (finally!):

 423        dentry = chan->cb->create_buf_file(tmpname, chan->parent,
 424                                           S_IRUSR, buf,
 425                                           &chan->is_global);

Remember that we have fixed is_global to 1 in our callback? Here we pass the value from our callback to the is_global flag saved in the struct rchan in the kernel – how dirty is that! When relay_open_buf() tries to create a “per-cpu” buffer for CPU1, it recognizes the is_global flag and sets the CPU1 buffer pointing to CPU0’s.

3. Global buffer vs. per-cpu buffer

A global buffer is friendly to the user space, since no select() is needed. However, because all CPUs try to write to the same buffer, some locking mechanism is needed to serialize the access, as well as a big buffer to satisfy all CPUs over a given period. Moreover, if the system is NUMA, a global buffer is apparently a bad idea, and one should stick with per-cpu buffers to take advantage of NUMA.

Posted in Linux Distro, OS

Linux kernel hacking – support SO_PEERCRED for local TCP socket connections

In my old post, we talked about how to retrieve the peer PID from a Unix domain socket using struct ucred. A smarter way to do this is using the getsockopt() syscall with the option SO_PEERCRED directly. As you expected (or not), this mechanism only works for Unix domain sockets. After all, why would we be interested in the PID of a peer socket on another machine? But what about local TCP/UDP connections? Why couldn’t we have this mechanism as well? This post gives the technical details of how to implement SO_PEERCRED support for local TCP socket connections within the Linux kernel. For more information, please R.t.D.C.

0. Finding the PID given the socket in the user space

To motivate a little bit, please consider the task as titled. I am sure that most sysadmins have had a similar experience – finding the process using a specific socket. The most common way is to use netstat and grep. It works, though it is pretty slow. Using libc system() with a simple netstat script yields an overhead of around 80 ms. Still, this is fine if the task is a one-time shot and is not the bottleneck of the whole program. Otherwise, we should ask if we could do better.

In my opinion, this is part of the reason why ss was created. ss leverages a kernel module called tcp_diag, which uses the Linux kernel inet diagnostic interface to hook into TCP sockets, to accelerate retrieving TCP connection information from the kernel with the help of the inet diag netlink socket, rather than digging around /proc rudely (what netstat does). Thanks to tcp_diag, ss is able to know the backend file descriptor (FD) of the socket, based on which a /proc/X(pid)/fd/ search can reveal the right PID. A typical ss usage to find the PID using TCP port 22 (SSH) takes around 8 ms. Note that you have to make sure the tcp_diag kernel module is loaded; otherwise, ss will do the same as netstat. The problem with ss is that it still needs to go through all of /proc/X/ to build the mapping between PID and FD, which is not scalable. Besides, 8 ms is still a big overhead for some user-space applications. So, can we make it faster?

1. Supporting SO_PEERCRED for local TCP socket connections in the Linux kernel

Finally, we are getting to the core of this post! Yes, we can make it faster. I mean really fast – less than 30 us! You are now finally interested in what I have done, right? Let us recall what we have done for Unix domain sockets. To retrieve the PID of the peer socket, all we need is a getsockopt() syscall with option SO_PEERCRED. Therefore, the overhead seen from the user space is just the overhead of the getsockopt() syscall. Doesn’t this sound exciting! What we are going to do is implement a similar mechanism for local TCP sockets. Warning: this may require some Linux kernel networking knowledge beforehand for a better understanding. E.g., it is good to know what an skb is. Nevertheless, I will try to make things easier to understand while not offending other kernel hackers:) Ready? Go!

a. Look into SO_PEERCRED

When the getsockopt() syscall is called with SO_PEERCRED in the user space, the code path goes into sock_getsockopt() in net/core/sock.c. You will find this code snippet in Linux kernel 2.6.32:

        case SO_PEERCRED:
 867                if (len > sizeof(sk->sk_peercred))
 868                        len = sizeof(sk->sk_peercred);
 869                if (copy_to_user(optval, &sk->sk_peercred, len))
 870                        return -EFAULT;
 871                goto lenout;

As one can tell, what it does is just copy sk->sk_peercred, which is a struct ucred containing pid/uid/gid, to the user space. This code works for Unix domain sockets, and now we will make it work for TCP sockets. The take-away here is that now we know where we should put the PID. BTW, sk is a struct sock, the network-layer representation of a socket in the kernel.
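For reference, this is what the user-space side of SO_PEERCRED already looks like for Unix domain sockets (and what the patched kernel lets local TCP sockets do as well). As an assumption for the demo, a socketpair() stands in for a real client/server, so the “peer” is our own process.

```c
#define _GNU_SOURCE          /* for struct ucred */
#include <sys/socket.h>
#include <unistd.h>

/* Retrieve the peer's pid via getsockopt(SO_PEERCRED);
 * returns -1 on error. For a socketpair created within one
 * process, the peer is that same process. */
static pid_t peer_pid(int fd)
{
    struct ucred cred;
    socklen_t len = sizeof(cred);

    if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, &cred, &len) < 0)
        return -1;
    return cred.pid;
}
```

The whole cost seen from user space is one getsockopt() syscall, which is why this beats both the netstat and the ss approaches by orders of magnitude.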

b. Make a TCP connection

The next question we need to answer is where a new TCP connection happens, since we want to find the peer PID as soon as a new connection comes in. The kernel API tcp_v4_conn_request() in net/ipv4/tcp_ipv4.c is the answer. This function receives 2 parameters: a struct sock *sk, standing for the TCP server, and a struct sk_buff *skb, standing for a packet passing through the whole TCP/IP stack within the kernel (yep, you heard me – skb is the key to Linux kernel networking hacking, though I am not going to talk more about it).

int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)

What this function does is accept or reject a new TCP connection request from the skb. Another interesting thing in this function is a security hook:

        if (security_inet_conn_request(sk, skb, req))
1279                goto drop_and_free;

This security hook gives LSM (Linux Security Module) a chance to grant/deny the TCP connection based on security policies. To make our kernel hacking as unintrusive as possible, I decided to instrument the selinux_inet_conn_request() API in security/selinux/hooks.c, since CentOS uses SELinux as its LSM.

static int selinux_inet_conn_request(struct sock *sk, struct sk_buff *skb,
4299                                     struct request_sock *req)

c. Assume the world is perfect

Look at selinux_inet_conn_request() again. We have got a struct sock (*sk) and a connection request packet from the peer (*skb). Moving forward, we can find that the skb also keeps a back reference to its parent struct sock. Since we are dealing with local connections, we (at least myself) assume that we should be able to trace back the struct sock from the skb. Then the question is how to retrieve the PID from the struct sock. The answer is skb->sk->socket->file->f_owner->pid, which shows a possible path from the skb back to the backend file of the socket (VFS), where the PID is trivial to get. However, the world is not perfect. We cannot even get the reference to the struct sock within the skb. On the other hand, we are so sure that skb->sk should point back to its parent struct sock when the skb (packet) is generated from the sock (socket). What is wrong?

d. “I am a strange loop”

All packets are finally queued in a network device for sending and receiving. Because we only consider local connections, all IP packets with a target IP belonging to the local machine are essentially “transmitted” using a loopback device. Let us go to the device driver for this loopback device – loopback_xmit() in drivers/net/loopback.c.

  68 /*
  69  * The higher levels take care of making this non-reentrant (it's
  70  * called with bh's disabled).
  71  */
  72 static netdev_tx_t loopback_xmit(struct sk_buff *skb,
  73                                  struct net_device *dev)
  74 {
  75         struct pcpu_lstats *pcpu_lstats, *lb_stats;
  76         int len;
  77 
  78         skb_orphan(skb);
  79 
  80         skb->protocol = eth_type_trans(skb, dev);
  81 
  82         /* it's OK to use per_cpu_ptr() because BHs are off */
  83         pcpu_lstats = dev->ml_priv;
  84         lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());
  85 
  86         len = skb->len;
  87         if (likely(netif_rx(skb) == NET_RX_SUCCESS)) {
  88                 lb_stats->bytes += len;
  89                 lb_stats->packets++;
  90         } else
  91                 lb_stats->drops++;
  92 
  93         return NETDEV_TX_OK;
  94 }

When a new packet is to be sent locally, the network core calls loopback_xmit() to transmit the packet to the target, which is ourselves! Therefore, it calls netif_rx(), which just pushes the packet into its receiving queue directly, to send this packet. A software IRQ will then be raised to notify the CPU to handle this “new” packet. A more interesting thing in this function is skb_orphan(). I will let you guess what it does. Yes, it removes the back reference to the parent struct sock from the skb!

e. “Mercy Mercy Me”

OK, let’s try not to “orphan” the skb in the loopback device. Hmm, it still does not work. Now we are getting smarter. Let’s search for skb_orphan() in the whole kernel source. Oops, there are tons of calls throughout the TCP networking implementation. E.g., when the packet is passed to the IP layer, ip_rcv() in net/ipv4/ip_input.c “orphans” the packet because of tproxy (transparent proxy). On one hand, this again explains why we cannot trace back to the struct sock from the skb even for local connections; on the other hand, it implies that the kernel basically does not distinguish local packets from non-local packets at the level of skb processing once the packet is received.

f. K.I.S.S.

Though I am personally not in favor of this solution due to the potential cache impact, it is clear that we need a new field in the skb to save the PID. Then during loopback_xmit(), we find the PID and assign it to the new skb field, leaving all those “orphan”s to do whatever they wanna do. To find the PID from the struct sock, we have already learned to use sk->socket->file->f_owner->pid. Unfortunately, there is still a problem: the pid within f_owner is NULL! (WTF!) Now we (at least I) are so angry that we go straight into sock_alloc_file() in net/socket.c, where the backend file of the socket is created, and add the damn PID to the damn f_owner->pid. Finally, the world is getting better:)
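The resulting change could be sketched roughly as follows. This is my own sketch against 2.6.32, not the exact patch from the repo: the field name pid is invented for illustration, the three fragments belong to three different files, and locking/error handling is elided.

```c
/* include/linux/skbuff.h -- a new field that survives skb_orphan() */
struct sk_buff {
        /* ...existing fields... */
        pid_t pid;              /* sender's PID, set on loopback xmit */
};

/* drivers/net/loopback.c -- record the sender before orphaning */
static netdev_tx_t loopback_xmit(struct sk_buff *skb, struct net_device *dev)
{
        struct sock *sk = skb->sk;

        if (sk && sk->sk_socket && sk->sk_socket->file)
                skb->pid = pid_vnr(sk->sk_socket->file->f_owner.pid);
        skb_orphan(skb);
        /* ...rest unchanged... */
}

/* net/socket.c, sock_alloc_file() -- populate f_owner.pid, which is
 * otherwise NULL until someone calls fcntl(F_SETOWN) */
__f_setown(file, task_pid(current), PIDTYPE_PID, 1);
```

On the receive side, the saved skb->pid can then be copied into the socket and handed out via getsockopt(SO_PEERCRED).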

2. Code

Within the code repo, there are 2 directories. The kernel directory contains a complete Linux kernel 2.6.32 patched with this cool feature, which can be used directly by CentOS 6.7. The user directory contains a simple TCP server/client, where the TCP server uses getsockopt() with SO_PEERCRED to retrieve the PID of the TCP client. The kernel log is also included for debugging purposes.

3. What about UDP?

So far, I have neither talked about UDP nor investigated a possible hacking implementation. The implementation for UDP could be similar to the one for Unix domain sockets, since both are datagram based; it is also possible, however, that the hacking would be heavily intrusive, since UDP is connection-less. Until I can find some time to dig around the UDP implementation, all I can say for now is TBD:)

4. K.R.K.C.

I hope you enjoy this post. This should be my longest post so far, since it covers a lot of kernel hacking knowledge and took me the whole night to write. Any comment is welcome. Finally, life is short; please hack the kernel!
