• TSC is a register that increments on every CPU tick.
  • RDTSC is the assembly instruction that reads TSC into EDX:EAX.
  • RDTSCP is the assembly instruction that reads TSC into EDX:EAX and also reads the processor ID (IA32_TSC_AUX) into ECX. Equivalent to doing RDTSC + RDPID in a single instruction, with one extra property: RDTSCP waits until all prior instructions have executed before reading the counter.

Benchmarking a critical section of code is as simple as

start = rdtsc() // get current counter value
.... beep boop, executing code ....
cycles = rdtsc() - start // theoretically the number of cycles

Sadly, reality has a surprising amount of detail.

Even a simple conversion from CPU cycle time to wall clock time is hairier than expected. It’s theoretically simple: take the clock speed of your processor (e.g. 4.4 GHz), take the reciprocal for the seconds per cycle (1 / 4.4 GHz ~= 0.227 ns), and then multiply by the number of cycles.

However, modern CPUs have different clock speeds depending on the “mode” that they are in. If the CPU is idle, then it might actually be chugging along at 2.9 GHz, rather than the stated 4.4 GHz.

That means a cycle count, converted to wall-clock time using the nominal frequency, may not be correct.

Not only that, modern CPUs execute instructions out-of-order. Just because you wrote A -> B -> C in your code doesn’t guarantee that the CPU will literally execute A -> B -> C.

This out-of-order-ness extends to our naive benchmarking code above. It’s possible that the second rdtsc() was executed before some portions of the main code block finished.

What’s the point of using RDTSCP? Does using the syscall clock_gettime help? Isn’t there overhead to syscalls? How does timestamping on the NIC level work? Is it different?

For now, I’m going to get some ice cream.

Resources