Last time, I was asking myself what the point of using RDTSCP was. Why shouldn’t I just call RDTSC and call it a day?

How to Benchmark Code Execution Times (Intel Guide) explains the problem:

When we call the RDTSC instruction, we pretend that the instruction will be executed exactly at the beginning and at the end of code being measured. The solution is to call a serializing instruction before calling the RDTSC one. A serializing instruction is an instruction that forces the CPU to complete every preceding instruction… before continuing the program execution.

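… which is to say: on an out-of-order CPU, RDTSC can be reordered relative to the instructions around it, so wrapping your code in a pair of bare RDTSC “fences” and taking the delta very likely isn’t measuring what you intend to measure.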
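For concreteness, the naive pattern looks roughly like this in C. This is a minimal sketch, assuming GCC or Clang on x86-64, where `__rdtsc()` comes from `<x86intrin.h>`. The names `busy_work` and `naive_cycles` are made up for illustration.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc(); assumes GCC or Clang targeting x86 */

/* Hypothetical stand-in for the code being measured. */
static void busy_work(void) {
    volatile unsigned sink = 0;
    for (unsigned i = 0; i < 1000; i++) {
        sink += i;
    }
}

/* Naive timing: nothing stops the CPU from reordering either RDTSC
   relative to the instructions that make up fn(). */
static uint64_t naive_cycles(void (*fn)(void)) {
    uint64_t start = __rdtsc();
    fn();
    uint64_t end = __rdtsc();
    return end - start;
}
```

Calling `naive_cycles(busy_work)` returns a cycle count, but one that may include or exclude work you didn't mean it to.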

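The serializing instruction du jour is CPUID. Despite the similar name, it isn’t the latter half of RDTSCP: the P there stands for the processor ID that RDTSCP returns in ECX (read from IA32_TSC_AUX) alongside the counter.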

RDTSCP, unfortunately, is only “pseudo” serializing:

The RDTSCP instruction waits until all previous instructions have been executed before reading the counter. However, subsequent instructions may begin execution before the read operation is performed.

… which means that instructions after the RDTSCP can start before the counter is read, so you may end up measuring more than you intend.

The guide suggests putting yet another CPUID right after the closing RDTSCP, at the very end of the code you are trying to benchmark - something like:

CPUID
RDTSC
<your code here>
RDTSCP
CPUID

That way, code that comes after the RDTSCP in program order can’t be executed before the RDTSCP clock read.
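Here is how that sequence might look in C with GCC/Clang-style inline assembly. Treat it as a sketch under assumptions rather than the guide's exact code: it assumes an x86-64 CPU that supports RDTSCP, and the helper names `serialize`, `tsc_begin` and `tsc_end` are my own.

```c
#include <stdint.h>

/* CPUID is a serializing instruction; its outputs are ignored here. */
static inline void serialize(void) {
    uint32_t eax = 0, ebx, ecx = 0, edx;
    __asm__ volatile("cpuid"
                     : "+a"(eax), "=b"(ebx), "+c"(ecx), "=d"(edx)
                     :
                     : "memory");
    (void)ebx;
    (void)edx;
}

/* Start checkpoint: CPUID, then RDTSC. */
static inline uint64_t tsc_begin(void) {
    uint32_t lo, hi;
    serialize();
    __asm__ volatile("rdtsc" : "=a"(lo), "=d"(hi) : : "memory");
    return ((uint64_t)hi << 32) | lo;
}

/* End checkpoint: RDTSCP, then CPUID. RDTSCP also writes
   IA32_TSC_AUX into ECX, which is discarded here. */
static inline uint64_t tsc_end(void) {
    uint32_t lo, hi, aux;
    __asm__ volatile("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux) : : "memory");
    serialize();
    (void)aux;
    return ((uint64_t)hi << 32) | lo;
}
```

Usage would be `uint64_t t0 = tsc_begin();`, then the code under test, then `uint64_t cycles = tsc_end() - t0;`. The fences themselves cost cycles, so the delta includes some fixed overhead on top of the code being measured.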

One question though: why don’t they use RDTSCP for the start time checkpointing? Doesn’t the CPUID at the end ensure that instructions are run serially?

This and more timestamping adventures next time - after I get my ice cream.