Our paper “Trustworthy Whole-System Provenance for the Linux Kernel” has been accepted to USENIX Security 2015. The details can be found in the paper (link below). In this post I would like to give some background on LPM (the Linux Provenance Module, our main contribution in the paper), describe its current status, and share some personal thoughts. Please note that this post does NOT represent the opinion of the whole LPM team; it is just my itch to write as a developer and co-author.
0. The paper (linked from Adam’s website)
The most successful security feature in the Linux kernel could be LSM (Linux Security Modules). I am sure people would argue about how I define “the most successful”. My rule is simple: unlike grsecurity, LSM is in the mainline, and most of the code under the security subdirectory of the Linux kernel source belongs to LSMs, e.g., SELinux and AppArmor. Basically, LSM is a set of kernel hooks placed almost everywhere in the kernel source where an operation is security sensitive and needs MAC (Mandatory Access Control). To enforce a policy, one writes a module that implements these hooks. Like it or not, major Linux distros usually ship some LSM implementation: RHEL/Fedora uses SELinux, while Ubuntu loves AppArmor. Even Android has something called SE Android, which in essence is a kind of LSM. Even though LSM was designed for mandatory access control, and most implementations are indeed MAC, a simple question is: what else could we do with LSM?
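To make the hook idea concrete, here is a toy userspace model in Python. It is not kernel code and not how LSM is actually implemented; it is only a sketch of the pattern "the kernel calls a hook at a sensitive point, and a module decides". The names `HookFramework`, `deny_shadow_reads`, `sys_open`, and the policy itself are all made up for illustration.

```python
from typing import Callable, Dict, List

# Toy model only: a hook takes (subject, obj) and returns 0 to allow
# or a negative errno-style value to deny.
Hook = Callable[[str, str], int]

EACCES = 13  # errno value for "permission denied"


class HookFramework:
    """Hypothetical stand-in for the set of hooks the kernel exposes."""

    def __init__(self) -> None:
        self._hooks: Dict[str, List[Hook]] = {}

    def register(self, name: str, hook: Hook) -> None:
        # A "security module" plugs its implementation into a named hook.
        self._hooks.setdefault(name, []).append(hook)

    def call(self, name: str, subject: str, obj: str) -> int:
        # Invoked at the security-sensitive point; the first denial wins.
        for hook in self._hooks.get(name, []):
            rc = hook(subject, obj)
            if rc != 0:
                return rc
        return 0


def deny_shadow_reads(subject: str, obj: str) -> int:
    # Made-up MAC policy: only "root" may open /etc/shadow.
    if obj == "/etc/shadow" and subject != "root":
        return -EACCES
    return 0


framework = HookFramework()
framework.register("file_open", deny_shadow_reads)


def sys_open(subject: str, path: str) -> int:
    # The hook is consulted before the operation is carried out.
    rc = framework.call("file_open", subject, path)
    if rc != 0:
        return rc
    return 0  # the real open would proceed here
```

In this sketch the policy lives entirely in the registered function, so swapping in a different module changes the decision without touching `sys_open`.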
When we talk about provenance, we mean data provenance, i.e., the lineage of data. Check out Wikipedia for a quick overview: https://en.wikipedia.org/wiki/Provenance. My personal interpretation of provenance is “extra logging” that can be used to track either the information flow or the workflow of a program or piece of data, and that provides a trusted basis for forensic analysis if something goes wrong. Most previous provenance work targeted particular applications, making them provenance-aware. The application-level approach assumes that the application itself can be trusted, yet applications have a long history of being vulnerable. The question here is: could we have better provenance for an application without trusting the application? The most inspiring work in system provenance is Hifi, which places hooks within the Linux kernel and collects logs of kernel activities. The Hifi system inspired our work from the very start.
The combination of the two questions above produced LPM, the Linux Provenance Module, which takes advantage of the LSM framework and provides trustworthy system-level provenance for both the kernel and applications.
[picture from early draft of the paper, made by Adam Bates]
The good thing about building LPM on the LSM framework is that we are now standing on the shoulders of giants. Tons of research, including formal methods, has been applied to LSM to make sure it satisfies the three properties of a reference monitor: complete mediation, tamperproofness, and verifiability. Unlike Hifi, LPM is able to make these arguments as well, because of the nature of the LSM framework. With the LPM framework in place, we moved on to create the ‘real’ LPM module, provmon, which collects all the kernel activities and relays the logging information to user space. For example applications, such as data loss prevention, please refer to the paper.
4. Current status
A working provenance-enabled 2.6 kernel based on RHEL will be published soon. LPM coexists with LSM. Actually, the LPM hooks are invoked after the LSM hooks, acting as a second level of ‘policy’ enforcement. It is also possible to have both SELinux and provmon working at the same time. The team is also investigating new security features to ‘stack’ LPM with LSM if possible, trying to make it available to all kernel releases in a non-intrusive way.
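The ordering described above can be sketched with another toy Python model. This is only an illustration under my own assumptions: the function names, the example policy, and the choice to log only operations that the access-control check allowed are all hypothetical, not taken from the LPM source.

```python
from typing import List, Tuple

# Toy provenance log: (subject, operation, object) tuples.
provenance_log: List[Tuple[str, str, str]] = []


def lsm_file_permission(subject: str, obj: str, op: str) -> int:
    # First level: a made-up MAC policy.
    # Only "root" may write under /etc; everything else is allowed.
    if op == "write" and obj.startswith("/etc/") and subject != "root":
        return -13  # -EACCES
    return 0


def lpm_file_permission(subject: str, obj: str, op: str) -> int:
    # Second level: the provenance hook records the event.
    # In this sketch it never denies anything.
    provenance_log.append((subject, op, obj))
    return 0


def mediated_access(subject: str, obj: str, op: str) -> int:
    # The access-control hook runs first; the provenance hook runs after it.
    rc = lsm_file_permission(subject, obj, op)
    if rc != 0:
        return rc  # denied: the provenance hook is never reached here
    return lpm_file_permission(subject, obj, op)
```

Under these assumptions, a denied write to `/etc/passwd` leaves the log untouched, while an allowed write to `/tmp/a` appends one record.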
5. More considerations (NOTE: personal opinions)
During the development of LPM, an interesting question kept bothering us: what could LPM do for applications? While an application may be interested in its own data structures, such as a ‘record’, the kernel is, unfortunately, agnostic about this information. Instead, the kernel talks about inodes, sockets, packets, etc. In the paper, we instrumented an application to bridge the semantic gap between the application and the kernel. But could we do better? Would a new set of syscalls designed for provenance be a better way to give applications provenance with kernel support? Imagine future applications that need kernel-level provenance support simply calling a provenance syscall to get the full picture of information flow and workflow at kernel-level granularity.
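To illustrate the semantic gap, here is a purely hypothetical Python sketch. Nothing in it is an existing kernel interface or the mechanism used in the paper: `ToyProvenanceStore`, `disclose`, and `history` are invented names that model an application telling a provenance store which kernel objects back one of its own items, so that kernel-level events can later be looked up per item.

```python
from collections import defaultdict
from typing import DefaultDict, List, Set, Tuple

Event = Tuple[str, str, str]  # (actor, operation, kernel_object)


class ToyProvenanceStore:
    """Hypothetical model: kernel events plus app-supplied mappings."""

    def __init__(self) -> None:
        self.kernel_events: List[Event] = []
        self.disclosed: DefaultDict[str, Set[str]] = defaultdict(set)

    def record_kernel_event(self, actor: str, op: str, kernel_object: str) -> None:
        # What the kernel sees: processes touching inodes, sockets, etc.
        self.kernel_events.append((actor, op, kernel_object))

    def disclose(self, app_item: str, kernel_object: str) -> None:
        # What the application adds: "my record lives in this kernel object".
        self.disclosed[app_item].add(kernel_object)

    def history(self, app_item: str) -> List[Event]:
        # Kernel-level events, filtered to objects mapped to the app item.
        objs = self.disclosed.get(app_item, set())
        return [e for e in self.kernel_events if e[2] in objs]


store = ToyProvenanceStore()
store.record_kernel_event("pid:100", "write", "inode:42")
store.record_kernel_event("pid:200", "read", "inode:42")
store.record_kernel_event("pid:200", "send", "socket:7")
store.disclose("record:alice", "inode:42")
```

The kernel-side events never mention `record:alice`; only the disclosed mapping lets `store.history("record:alice")` pick out the two events on `inode:42`.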
Another thing I have been thinking about is whether LSM was the best foundation for provenance in the first place. LSM can be viewed as a horizontal layer in the kernel that provides unified access control. However, when we rethink the information flow and workflow from the point of view of applications, we may realize that we need vertical layers (not just a single layer) in the kernel to provide the complete picture: how a data structure in the application is mapped onto different kernel objects, and how these objects interleave with and are transformed into one another.