Linux kernel hacking – support SO_PEERCRED for local TCP socket connections

In my old post (https://davejingtian.org/2015/02/17/retrieve-pid-from-the-packet-in-unix-domain-socket-a-complete-use-case-for-recvmsgsendmsg/), we talked about how to retrieve the peer PID from a Unix domain socket using struct ucred. A smarter way to do this is to call getsockopt() with the SO_PEERCRED option directly. As you expected (or not), this mechanism only works for Unix domain sockets. After all, why would we be interested in the PID of a peer socket on another machine? But what about local TCP/UDP connections? Why couldn’t we have this mechanism there as well? This post gives the technical details of how to implement SO_PEERCRED support for local TCP socket connections within the Linux kernel. For more information, please R.t.D.C.

0. Finding the PID given the socket in the user space

To motivate things a little, consider the task in the title. I’m sure most sysadmins have had a similar experience: finding the process that is using a specific socket. The most common way is to use netstat and grep. It works, though it is pretty slow. Calling a simple netstat script through libc system() yields an overhead of around 80 ms. Still, this is fine if the task is a one-time shot and is not the bottleneck of the whole program. Otherwise, we can ask whether we could do better.

In my opinion, this is part of the reason why ss was created. ss leverages a kernel module called tcp_diag, which uses the Linux kernel inet diagnostic interface to hook into TCP sockets. This lets ss retrieve TCP connection information from the kernel over the inet diag netlink socket, rather than rudely digging around /proc (which is what netstat does). Thanks to tcp_diag, ss is able to identify the backend file of the socket (by its inode), based on which a /proc/X(pid)/fd/ search can reveal the right PID. A normal ss run to find the PID using TCP port 22 (SSH) takes around 8 ms. Note that you have to make sure the tcp_diag kernel module is loaded. Otherwise, ss will do the same as netstat. The problem with ss is that it still needs to go through every /proc/X/ to build the mapping between PID and FD, which is not scalable. Besides, 8 ms is still a big overhead for some user-space applications. So, can we make it faster?

1. Supporting SO_PEERCRED for local TCP socket connections in the Linux kernel

Finally, we are getting to the core of this post! Yes, we can make it faster. I mean really fast, less than 30 us! You are now finally interested in what I have done, right? Let us recall what is done for Unix domain sockets. To retrieve the PID of the peer socket, all we need is a getsockopt() syscall with the SO_PEERCRED option. Therefore, the overhead seen from user space is just the overhead of the getsockopt() syscall. Doesn’t this sound exciting! What we are going to do is implement a similar mechanism for local TCP sockets. Warning: this may require some Linux kernel networking knowledge beforehand for a better understanding. E.g., it is good to know what an skb is. Nevertheless, I will try to make things easier to understand while not offending other kernel hackers:) Ready? Go!

a. Look into SO_PEERCRED

When the getsockopt() syscall is called with SO_PEERCRED from user space, the code path goes into sock_getsockopt() in net/core/sock.c. Here is the relevant snippet from Linux kernel 2.6.32:

        case SO_PEERCRED:
                if (len > sizeof(sk->sk_peercred))
                        len = sizeof(sk->sk_peercred);
                if (copy_to_user(optval, &sk->sk_peercred, len))
                        return -EFAULT;
                goto lenout;

As one can tell, all it does is copy sk->sk_peercred, a struct ucred containing pid/uid/gid, to user space. This code works for Unix domain sockets, and now we will make it work for TCP sockets. The takeaway here is that we now know where we should put the PID. BTW, sk is a struct sock, the network-layer representation of a socket in the kernel.
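For reference, this is roughly what the user-space side of that code path looks like on a Unix domain socket, where it already works on a stock kernel. It is a minimal sketch: get_peer_cred() and peer_pid_of_unix_pair() are helper names I made up for illustration, and the socketpair() trick simply gives us a connected pair without needing a second process.

```c
#define _GNU_SOURCE             /* exposes struct ucred in <sys/socket.h> */
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Ask the kernel for the credentials of the peer of a connected socket.
 * Returns 0 and fills *cred on success, -1 on failure.
 */
int get_peer_cred(int fd, struct ucred *cred)
{
    socklen_t len = sizeof(*cred);

    if (getsockopt(fd, SOL_SOCKET, SO_PEERCRED, cred, &len) != 0)
        return -1;
    return len == sizeof(*cred) ? 0 : -1;
}

/*
 * Demo helper: build a connected AF_UNIX pair inside this process and
 * return the peer PID as seen from one end, or -1 on error.
 */
pid_t peer_pid_of_unix_pair(void)
{
    struct ucred cred;
    pid_t pid = -1;
    int sv[2];

    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) != 0)
        return -1;
    if (get_peer_cred(sv[0], &cred) == 0)
        pid = cred.pid;         /* both ends belong to us, so this is our PID */
    close(sv[0]);
    close(sv[1]);
    return pid;
}
```

Since both ends of the pair were created by the calling process, peer_pid_of_unix_pair() should return the same value as getpid(). The whole cost is one syscall, which is exactly the property we want for TCP.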

b. Make a TCP connection

The next question we need to answer is where a new TCP connection is set up, since we want to find the peer PID as soon as a new connection comes in. The kernel function tcp_v4_conn_request() in net/ipv4/tcp_ipv4.c is the answer. This function receives 2 parameters: a struct sock *sk, standing for the TCP server, and a struct sk_buff *skb, standing for a packet passing through the whole TCP/IP stack within the kernel (yep, you heard me: skb is the key to Linux kernel networking hacking, though I am not going to say more about it here).

int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
{

This function accepts or rejects a new TCP connection request carried by skb. Another interesting thing in this function is a security hook:

        if (security_inet_conn_request(sk, skb, req))
                goto drop_and_free;

This security hook gives the LSM (Linux Security Module) a chance to grant or deny the TCP connection based on security policies. To make our kernel hacking as unintrusive as possible, I decided to instrument selinux_inet_conn_request() in security/selinux/hooks.c, since CentOS uses SELinux as its LSM.

static int selinux_inet_conn_request(struct sock *sk, struct sk_buff *skb,
                                     struct request_sock *req)
{

c. Assume the world is perfect

Look at selinux_inet_conn_request() again. We have got a struct sock (*sk) and a connection request packet from the peer (*skb). Looking further, we find that an skb also keeps a back reference to its parent struct sock. Since we are dealing with local connections, we (at least myself) assume that we should be able to trace back to the sender’s struct sock from the skb. Then the question would be how to retrieve the PID from the struct sock. The answer is skb->sk->sk_socket->file->f_owner.pid, which shows a possible path from the skb back to the backend file of the socket (VFS), where the PID should be trivial to get. However, the world is not perfect. We could not even get the reference to the struct sock from the skb: skb->sk is NULL. On the other hand, we are sure that skb->sk should point back to its parent struct sock when the skb (packet) is generated from the sock (socket). What is wrong?

d. “I am a strange loop”

All packets are eventually queued on a network device for sending and receiving. Because we only consider local connections, all IP packets whose destination is a local address (127.0.0.1 included) are essentially “transmitted” through the loopback device. Let us look at the driver for this device: loopback_xmit() in drivers/net/loopback.c.

/*
 * The higher levels take care of making this non-reentrant (it's
 * called with bh's disabled).
 */
static netdev_tx_t loopback_xmit(struct sk_buff *skb,
                                 struct net_device *dev)
{
        struct pcpu_lstats *pcpu_lstats, *lb_stats;
        int len;

        skb_orphan(skb);

        skb->protocol = eth_type_trans(skb, dev);

        /* it's OK to use per_cpu_ptr() because BHs are off */
        pcpu_lstats = dev->ml_priv;
        lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());

        len = skb->len;
        if (likely(netif_rx(skb) == NET_RX_SUCCESS)) {
                lb_stats->bytes += len;
                lb_stats->packets++;
        } else
                lb_stats->drops++;

        return NETDEV_TX_OK;
}

When a new packet is to be sent locally, the network core calls loopback_xmit() to transmit the packet to the target, which is ourselves! Therefore, to “send” the packet it calls netif_rx(), which simply pushes the packet onto the receive queue. A software IRQ is then raised to tell the CPU to handle this “new” packet. The more interesting thing in this function is skb_orphan(). I will let you guess what it does. Yes, it removes the back reference to the parent struct sock from the skb!
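If you prefer code to guessing, here is a tiny user-space model of the orphaning step. Everything prefixed with mock_ is a simplified stand-in I wrote for illustration, not the real kernel layout; only the three-step shape of mock_skb_orphan() (run the destructor, clear it, clear sk) mirrors the kernel helper.

```c
#include <stddef.h>

/* Simplified stand-ins for the kernel types; NOT the real layouts. */
struct mock_sock {
    int wmem_alloc;     /* bytes charged to this socket for in-flight packets */
};

struct mock_skb {
    struct mock_sock *sk;                       /* back reference to the owner */
    void (*destructor)(struct mock_skb *skb);   /* releases the owner's accounting */
    int truesize;                               /* memory charged for this packet */
};

/* Plays the role of a write-side destructor: uncharge the sending socket. */
void mock_wfree(struct mock_skb *skb)
{
    skb->sk->wmem_alloc -= skb->truesize;
}

/* Charge the packet to its socket and remember the owner. */
void mock_set_owner(struct mock_skb *skb, struct mock_sock *sk)
{
    skb->sk = sk;
    skb->destructor = mock_wfree;
    sk->wmem_alloc += skb->truesize;
}

/* Same shape as skb_orphan(): run the destructor, then forget the owner. */
void mock_skb_orphan(struct mock_skb *skb)
{
    if (skb->destructor)
        skb->destructor(skb);
    skb->destructor = NULL;
    skb->sk = NULL;
}
```

After mock_skb_orphan() the sender gets its memory accounting back, which is the whole point of orphaning, but the packet no longer knows who sent it. That is exactly the state our skb is in by the time it reaches the receiving side.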

e. “Mercy Mercy Me”

OK, let’s try not to “orphan” the skb in the loopback device. Urr, it still does not work. Now we are getting smarter. Let’s do a code search for skb_orphan() across the whole kernel source. Oops, there are tons of calls around the TCP networking implementation. E.g., when the packet is passed to the IP layer, ip_rcv() in net/ipv4/ip_input.c “orphans” the packet because of tproxy (Transparent Proxy). On one hand, this explains again why we cannot trace back to the struct sock from the skb even for local connections; on the other hand, it implies that the kernel basically does not distinguish local packets from non-local packets at the level of skb processing once the packet is received.

f. K.I.S.S.

Though I am personally not in favor of this solution due to the potential cache impact, it is clear that we need a new field in the skb to save the PID. Then, in loopback_xmit(), we find the PID and assign it to the new skb field before the skb is orphaned, leaving all those “orphan”s to do whatever they wanna do. To find the PID from the struct sock, we have already learned to use sk->sk_socket->file->f_owner.pid. Unfortunately, there is still a problem: the pid within f_owner is NULL! (WTF!) (f_owner is normally only filled in when somebody claims ownership of the file for signal delivery, e.g. via fcntl() with F_SETOWN, so for an ordinary socket nobody has set it.) Now we (at least myself) are so angry that we go straight into sock_alloc_file() in net/socket.c, where the backend file of the socket is created, and add the damn PID to the damn f_owner.pid. Finally, the world is getting better:)
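The overall flow can be modeled in a few lines of user-space C. This is only a sketch of the idea, not the patch from my repo: all mock_ types, the sender_pid and peer_pid fields, and the function names are hypothetical, and the model glosses over where exactly the real kernel code stores the value on the receiving side.

```c
#include <stddef.h>
#include <sys/types.h>

/* User-space stand-ins; field names are hypothetical, not the kernel's. */
struct mock_file   { pid_t owner_pid; };            /* filled in at socket file creation */
struct mock_socket { struct mock_file *file; };
struct mock_sock   { struct mock_socket *socket; pid_t peer_pid; };
struct mock_skb    { struct mock_sock *sk; pid_t sender_pid; };   /* sender_pid: the new field */

/* Follow sock -> socket -> file, tolerating missing links; 0 means "unknown". */
pid_t mock_sock_owner_pid(const struct mock_sock *sk)
{
    if (!sk || !sk->socket || !sk->socket->file)
        return 0;
    return sk->socket->file->owner_pid;
}

/* Transmit side: stash the PID while skb->sk is still valid, then orphan. */
void mock_loopback_xmit(struct mock_skb *skb)
{
    skb->sender_pid = mock_sock_owner_pid(skb->sk);
    skb->sk = NULL;             /* the effect of skb_orphan() on the back reference */
}

/* Receive side: the back reference is gone, but the stashed PID survives. */
void mock_conn_request(struct mock_sock *server, const struct mock_skb *skb)
{
    server->peer_pid = skb->sender_pid;
}
```

The design choice is simply to copy the one value we care about before anybody has a chance to orphan the packet, instead of fighting every skb_orphan() call in the stack. The price is an extra field in every skb, which is why I am not fully happy with it.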

2. Code

Within the code repo (https://github.com/daveti/tcpSockHack), there are 2 directories. The kernel directory contains a complete Linux kernel 2.6.32 patched with this cool feature, which can be used directly on CentOS 6.7. The user directory contains a simple TCP server/client, where the TCP server uses getsockopt() with SO_PEERCRED to retrieve the PID of the TCP client. The kernel log is also included for debugging purposes.

3. What about UDP?

So far, I have neither talked about UDP nor investigated a possible implementation for it. It is possible that the implementation for UDP could be similar to the one for Unix domain sockets, since both can be datagram based; it is also possible, however, that the hack would be heavily intrusive, since UDP is connectionless. Until I find some time to dig around the UDP implementation, all I can say for now is TBD:)

4. K.R.K.C.

I hope you enjoyed this post. This should be my longest post so far, since I have covered a lot of kernel hacking knowledge and it took me the whole night to write. Any comments are welcome. Finally, life is short; please hack the kernel!

About daveti

Interested in kernel hacking, compilers, machine learning and guitars.

5 Responses to Linux kernel hacking – support SO_PEERCRED for local TCP socket connections

  1. Kernel_DEV says:

    Hi,

    i have been working on the same but instead of selinux it is smack.
    is this or some other solution available in the mainline kernel?
    i am mainly looking for a mainline accepted patch for this.
    Can you please guide me on the same.

    • daveti says:

      The main reason i instrument selinux is to avoid tainting other part of the kernel. U could always do similar things in smack, either instrumenting the same hook or implementing the hook if it is not there. On the other hand, you can hack the tcp stack directly.

  2. Kernel_DEV says:

    Thanks for your reply & time

  3. Kernel_DEV says:

    Your changes are directly merged into the base code, and there is no separate patch i can find.
    I am working on a 3.10 based kernel, so it is very difficult to check your changes.
    I will be very thankful if you share the patch, it will help me a lot, please.

    is there any mainline solution available?

    • daveti says:

      R u using the official kernel? If so, i probably can do a patch some time. There are changes in mainline on how to retrieve the peer sock. But again, kernel only supports unix socket.
