This post mainly talks about the correct usage of tsc counters provided by Intel x86/x86-64 architectures to measure the Linux kernel code execution time. Most of the content here is borrowed/inspired from [1]. Note that this is NOT a post introduing tsc but for people who need to use tsc anyway. A kernel module with us/ns/tsc measurement framework ready is also provided at the end of the post [4]. Happy measurement!
1. Why measuring time is hard?
Human beings have a long history of trying to measure time [3], although this is not our focus. We are talking about measuring a piece of code execution running at ring 0 on a modern x86/x86-64 Intel CPU. The measurement framework usually looks like below:
start_measurement(&s);
func_needs_to_measure();
end_measurement(&e);
print_measurement(e-s);
The most common measurement functions inside the kernel are do_gettimeofday(), which returns in macroseconds, and getnstimeofday64(), which returns in nanoseconds. Depending on the complexity of the code needs to be measured, different scaling measurement should be applied. E.g., measuring a sub macrosecond function using do_gettimeofday() is inappropreiate. Similarly, measuring a minute level function using getnstimeofday64() is also an overkill.
In most cases, when the piece of code measured is non-trivial and takes a substantial amount of time (e.g., miliseconds), these gettimeofday functions do their job. We usually ignore the delay caused by interrupt handlings, preemptions, and schedulings. However, when the code to be measured is really small (e.g., tens of assemblies), the delay caused by interrupt handlings, preemptions, and schedulings cannot be ignored anymore, since we are measuring a nanosecond level execution time. In this case, we need to disable (local) interrupt, and preemption, making sure the measurement is “atomic” without external interferences. The other thing we need is tsc (timestamp counter).
2. What is the right way to use tsc?
Here we ignore the mov instructions needed to read/merge the tsc values. Instead, we focus on different usage patterns of rdtsc and rdtscp. We explore these patterns and point out the potential issues.
Pattern A:
... A
... |
rdtsc; V
code_to_measure;
rdtsc; A
... |
... V
This might be the most straightforward way of using tsc. We call rdtsc to read the counter before and after the code_to_measure. Unfortunately, because of out of order execution of CPU, it is hard to guarantee that the code_to_measure within the rdtsc blocks is exactly the code we wanna measure. It turns out either the code belong to code_to_measure could be executed by the CPU ahead of the first rdtsc or after the second rdtsc, or whatever code before the first rdtsc or after the second rdtsc could be measured as part of the code_to_measure. In short, this pattern gives the measurement of timing(some_portion(code_to_measure) + some_portion(code_to_measure_before) + some_portion(code_to_measure_after)).
Pattern B:
... A
... X
... V
cpuid;
rdtsc;
code_to_measure;
cpuid;
rdtsc; | A
... X |
... | |
... V |
To surpress the effect of out of order execution, we need a processor barrier to guarantee that code_to_measure is exactly the code we wanna measure when we setup the tsc counters – no more or less (hopefully). CPUID is one such instruction that serializes the instruction flow. The first CPUID guarantees that all instructions before it have done; the second CPUID guarantees that code_to_measure is also done before we execute the second rdtsc. However, it is still possible that the code after the second rdtsc get executed after the second CPUID but before we hit the second rdtsc (we mark this as delta). Similarly, the first rdtsc, should be execuated as the first instruction once the CPUID is done. However, out of order execution within the CPUID block is still possible, making few instructions of code_to_measure execute before the first rdtsc (we mark this as the other delta). Fortunately, because rdtsc uses RAX and RDX the same, it is possible that the CPU would get stalled when it tries to excute instructions followed by rdtsc due to data hazard. Nevertheless, the biggest problem comes from the extra CPUID instruction measured. Per [1], the exeuction time of CPUID itself is NOT stable, ranging from tens of CPU cycles to hundreds. This poses issues when the code_to_measure has similar timing scale. In short, this pattern gives the measurement of timing(code_to_measure + cpuid – delta_1 + delta_2).
Pattern C:
... A
... X
... V
cpuid;
rdtsc;
cpuid;
code_to_measure;
cpuid;
rdtsc;
cpuid;
... A
... X
... V
Comparing with the previous pattern, we add CPUID after each rdtsc. The first added CPUID gurantees that rdtsc is executed before any instructions from code_to_measure; the last CPUID added guarantees that the last rdtsc is executed after code_to_measure is done and before whatever code followed. This solution elimites the measurement deltas in the price of the other extra CPUID in the measurement block. In short, this pattern gives the measurement of timing(code_to_measure + 2*cpuid).
Pattern D:
... A
... X
... V
cpuid;
rdtsc;
code_to_measure;
rdtsc;
cpuid;
... A
... X
... V
At this point, I hope it is clear that the first CPUID has to happen before the first rdtsc (otherwise we may include the code above into the measurement block since the first rdtsc is not guaranteed to get executed right before the CPUID). Moving the second rdtsc before the second CPUID gurantees that the code after the second CPUID will not get measured. There is also no extra CPUID added into the measurement block. Unfortunately, this does not work due to the possible out of order execution insided the CPUID block. In short, this pattern gives the measurement of timing(some_portion(code_to_measure)).
Pattern E:
... A
... X
... V
cpuid;
rdtsc;
code_to_measure;
rdtscp; | A
... X |
... | |
... V |
To deal with the measurement variance introduced by the usage of CPUID, rdtscp is introduced in later CPUs, which is essentially a combination of CPUID+rdtsc. This instruction is also able to guarantee that by the execution of itself, all the instructions before it have been done. Unfortunately, the instructions after rdtscp may still get executed before rdtscp itself. In short, this pattern gives the measurement of timing(code_to_measure – delta_1 + some_portion(code_after_rdtscp)).
Pattern F:
... A
... X
... V
cpuid;
rdtsc;
code_to_measure;
rdtscp;
cpuid;
... A
... X
... V
By appending another CPUID after rdtscp, now the code after rdtscp cannot get executed before rdtscp. In short, this pattern gives the measurement of timing(code_to_measure – delta_1). If we ignore delta here, this pattern provides the most “accurate” measurement so far.
Pattern G:
... X
... | A
... V |
rdtscp; |
code_to_measure;
rdtscp;
cpuid;
... A
... X
... V
Comparing to the previous pattern, we replaced the first CPUID+rdtsc with rdtscp. While the first rdtscp guarantees that all the instructions before itself have been done, it cannot stop the CPU from executing some code from code_to_measure before rdtscp. To fix this problem, we can add a CPUID right after the first rdtscp, which unfortunately defeats the purpose of avoiding CPUID inside the measurement block. In short, this pattern gives the measurement of timing(some_portion(code_to_measure)).
3. Discussion
There are other instructions like CPUID, which can serialize the instruction flow, such as lfense and mfense (which are also used to defend Meldown and Spectre attacks). For user space measurement using tsc, [2] provides examples. Note, however, the measurement from user space includes the noise from interrupt handling, preemption, and scheduling.
4. Conclusion
It is easy to use tsc, but hard to use it right. Even if we use it right (e.g., Pattern F), we still could not guarantee the measurement is 100% right, but we are getting pretty close. The kcetm [5] implements Pattern F in a kernel module and is free to use.
References:
[1] https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-32-ia-64-benchmark-code-execution-paper.pdf
[2] https://github.com/andyphillips/time_stamp_counters/blob/master/tsc.c
[3] https://nrich.maths.org/6070
[4] https://github.com/daveti/kcetm
Note that getnstimeofday64() has been deprecated and replaced by ktime_get_real_ts64().