Linux kernel hacking – one relay file for all CPUs

I wrote a post about kernel relay 2 years ago (https://davejingtian.org/2013/06/29/relay-linux-kernel-relay-filesystem/). However, I realized only recently, while debugging a relay-related bug, that I had not really understood relay. Though I was working on a RHEL 2.6.32 kernel, this post also applies to 4.3, the latest kernel at the time of writing. After all, kernel relay has been stable for more than a decade. May this post help you understand kernel relay a little better.

0. When relay is init’d normally

As my old post described, when relay is initialized normally, there should be one relay file under /sys/kernel/debug for each CPU. As you would expect, struct rchan holds a per-cpu buffer to avoid locking among different CPUs.

    struct rchan_buf *buf[NR_CPUS]; /* per-cpu channel buffers */

The user-space code then has to go through all the relay files (e.g. with select()) to receive the data from the kernel.
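To make the per-CPU reading loop concrete, here is a minimal user-space sketch. It assumes the relay files are named <base><cpu> inside one directory (relay appends the CPU number to the base filename); the helper name drain_relay_files_once(), the MAX_CPUS limit and the paths used with it are hypothetical, and a real reader would loop forever instead of draining once.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/select.h>
#include <unistd.h>

#define MAX_CPUS 64 /* assumption: upper bound chosen for this sketch */

/*
 * Hypothetical helper: open <dir>/<base>0 .. <dir>/<base>(ncpus-1),
 * wait with select() until at least one file is readable, then drain
 * every readable file once. Returns the total number of bytes read,
 * or -1 on error.
 */
long drain_relay_files_once(const char *dir, const char *base, int ncpus)
{
    int fds[MAX_CPUS];
    char path[512], data[4096];
    fd_set rfds;
    int maxfd = -1, opened = 0;
    long total = 0;

    assert(dir != NULL && base != NULL);
    if (ncpus < 1 || ncpus > MAX_CPUS)
        return -1;

    FD_ZERO(&rfds);
    for (int i = 0; i < ncpus; i++) {
        snprintf(path, sizeof(path), "%s/%s%d", dir, base, i);
        fds[i] = open(path, O_RDONLY | O_NONBLOCK);
        if (fds[i] < 0) {
            total = -1;
            goto out;
        }
        opened++;
        FD_SET(fds[i], &rfds);
        if (fds[i] > maxfd)
            maxfd = fds[i];
    }

    /* Block until at least one per-CPU file has data. */
    if (select(maxfd + 1, &rfds, NULL, NULL, NULL) < 0) {
        total = -1;
        goto out;
    }

    for (int i = 0; i < ncpus; i++) {
        ssize_t n;

        if (!FD_ISSET(fds[i], &rfds))
            continue;
        /* Stop on 0 (nothing left) or -1 (e.g. EAGAIN). */
        while ((n = read(fds[i], data, sizeof(data))) > 0)
            total += n;
    }
out:
    for (int i = 0; i < opened; i++)
        close(fds[i]);
    return total;
}
```

With a single global relay file, the same job collapses to one open() and a plain read() loop.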

1. Could we just have 1 relay file in the user-space?

When relay_open() is called to start the relay, the per-cpu buffers are created one by one, one for each online CPU:

    for_each_online_cpu(i) {
            chan->buf[i] = relay_open_buf(chan, i);
            if (!chan->buf[i])
                    goto free_bufs;
    }

When the kernel starts to relay something in relay_write():

    buf = chan->buf[smp_processor_id()];

smp_processor_id() returns the id of the CPU the code is currently running on, and the corresponding per-cpu buffer is used to hold the data from that CPU. Now the question is: could we make all the CPUs use just one “per-cpu” buffer?

2. A dirty hack!

The short answer is yes. This is done by a dirty hack in the struct rchan_callbacks:

    struct dentry *(*create_buf_file)(const char *filename,
                                      struct dentry *parent,
                                      umode_t mode,
                                      struct rchan_buf *buf,
                                      int *is_global);

In short, besides all the relay code we already had, we also need to tune the callback named create_buf_file() and set *is_global to 1 (true). Besides letting us customize the location of the relay file, this callback also gives us a chance to tell the kernel that we want a “global” buffer for all CPUs.

Here is the reason why I call it dirty:

    if (chan->is_global)
            return chan->buf[0];

    buf = relay_create_buf(chan);
    if (!buf)
            return NULL;

    if (chan->has_base_filename) {
            dentry = relay_create_buf_file(chan, buf, cpu);
            if (!dentry)
                    goto free_buf;
            relay_set_buf_dentry(buf, dentry);
    }

When relay_open_buf() is called by relay_open() for CPU0, the is_global flag saved in the struct rchan is still 0 (false) after initialization. So the per-cpu buffer is created for CPU0, and relay_create_buf_file() is called to create the relay file in the filesystem. But before relay_create_buf_file() returns, it calls our create_buf_file callback (finally!):

    dentry = chan->cb->create_buf_file(tmpname, chan->parent,
                                       S_IRUSR, buf,
                                       &chan->is_global);

Remember that we set is_global to 1 in our callback? Here the kernel hands our callback a pointer to the is_global flag saved in the struct rchan, so the callback writes straight into the channel's own state – how dirty is that! When relay_open_buf() then tries to create a “per-cpu” buffer for CPU1, it sees the is_global flag and makes the CPU1 buffer point to CPU0’s.
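The whole handoff is easier to see in a small user-space model. This is only a sketch that mimics the control flow described above; it is not kernel code. All model_* names, the 4-CPU assumption and the simplified callback signature (no filename, parent or mode) are mine. In a real module the callback would also have to create the file, typically by returning debugfs_create_file(filename, mode, parent, buf, &relay_file_operations).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define MODEL_NR_CPUS 4 /* assumption: 4 online CPUs */

struct model_buf { int cpu; }; /* stands in for struct rchan_buf */

struct model_chan { /* stands in for struct rchan */
    struct model_buf *buf[MODEL_NR_CPUS];
    int is_global;
    /* stands in for rchan_callbacks.create_buf_file; 0 means success */
    int (*create_buf_file)(struct model_buf *buf, int *is_global);
    int files_created;
};

/* "Our" callback: ask for one global buffer by writing through the pointer. */
int global_create_buf_file(struct model_buf *buf, int *is_global)
{
    (void)buf;
    *is_global = 1;
    return 0;
}

/* Default behaviour: leave the flag at 0, so every CPU gets its own buffer. */
int percpu_create_buf_file(struct model_buf *buf, int *is_global)
{
    (void)buf;
    (void)is_global;
    return 0;
}

/* Mirrors the shape of relay_open_buf(). */
struct model_buf *model_open_buf(struct model_chan *chan, int cpu)
{
    struct model_buf *buf;

    if (chan->is_global) /* set as a side effect of the CPU0 call */
        return chan->buf[0];

    buf = malloc(sizeof(*buf));
    if (!buf)
        return NULL;
    buf->cpu = cpu;

    /* The callback receives a pointer into the channel itself. */
    if (chan->create_buf_file(buf, &chan->is_global) != 0) {
        free(buf);
        return NULL;
    }
    chan->files_created++;
    return buf;
}

/* Mirrors the for_each_online_cpu() loop in relay_open(). */
int model_open(struct model_chan *chan,
               int (*cb)(struct model_buf *, int *))
{
    chan->is_global = 0;
    chan->files_created = 0;
    chan->create_buf_file = cb;
    for (int i = 0; i < MODEL_NR_CPUS; i++)
        chan->buf[i] = NULL;
    for (int i = 0; i < MODEL_NR_CPUS; i++) {
        chan->buf[i] = model_open_buf(chan, i);
        if (!chan->buf[i])
            return -1;
    }
    return 0;
}

/* Free each distinct buffer exactly once. */
void model_close(struct model_chan *chan)
{
    int n = chan->is_global ? 1 : MODEL_NR_CPUS;

    for (int i = 0; i < n; i++)
        free(chan->buf[i]);
}
```

Opening the model channel with global_create_buf_file creates a single file and leaves every buf[i] pointing at buf[0]; opening it with percpu_create_buf_file creates one buffer and one file per CPU.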

3. Global buffer vs. per-cpu buffer

A global buffer is friendly to user space, since there is only one file to read and no select() across multiple files is needed. However, because all CPUs write to the same buffer, some locking mechanism is needed to serialize the access, and the buffer has to be big enough to hold the data from all CPUs over a given period. Moreover, if the system is NUMA, a global buffer is apparently a bad idea, and one should stick with per-cpu buffers to take advantage of NUMA.
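To illustrate the serialization cost, here is a user-space sketch of several writers sharing one buffer behind a tiny spinlock built on C11 atomic_flag. The struct, the function names and the 4096-byte size are assumptions for illustration only; in a kernel module the likely analogue would be a spinlock taken around relay_write().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <string.h>

#define GLOBAL_BUF_SIZE 4096 /* assumption: must fit the data of all writers */

struct global_buf {
    atomic_flag lock; /* every writer contends on this one lock */
    size_t offset;    /* next free byte in data[] */
    char data[GLOBAL_BUF_SIZE];
};

void global_buf_init(struct global_buf *b)
{
    atomic_flag_clear(&b->lock);
    b->offset = 0;
}

/*
 * Append one record. Returns 0 on success, or -1 if the record does
 * not fit, in which case it is dropped and the buffer is unchanged.
 */
int global_buf_write(struct global_buf *b, const void *rec, size_t len)
{
    int ret = -1;

    /* Spin until we own the lock. */
    while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
        ;

    if (len <= GLOBAL_BUF_SIZE - b->offset) {
        memcpy(b->data + b->offset, rec, len);
        b->offset += len;
        ret = 0;
    }

    atomic_flag_clear_explicit(&b->lock, memory_order_release);
    return ret;
}
```

With per-cpu buffers neither the lock nor the shared offset exists, which is exactly the contention the default relay layout avoids.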

About daveti

Interested in kernel hacking, compilers, machine learning and guitars.