- TSC is a register that increments on every CPU tick.
- RDTSC is the assembly instruction that reads TSC into EDX:EAX.
- RDTSCP is the assembly instruction that reads TSC into EDX:EAX and also reads the processor ID (IA32_TSC_AUX) into ECX. Equivalent to doing RDTSC + RDPID in a single instruction.
Benchmarking a critical section of code is as simple as:
start = rdtsc() // get current counter value
.... beep boop, executing code ....
cycles = rdtsc() - start // theoretically the number of cycles
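For concreteness, here is roughly what that pseudocode could look like in C. This is a minimal sketch, assuming GCC or Clang on x86-64, where `<x86intrin.h>` provides the `__rdtsc()` intrinsic. The names `work` and `naive_ticks` are hypothetical, made up for illustration.

```c
#include <stdint.h>
#include <x86intrin.h> /* __rdtsc() on GCC and Clang (assumed toolchain) */

/* Hypothetical stand-in for the code under test: sums 0..n-1.
   The volatile accumulator keeps the compiler from deleting the loop. */
static uint64_t work(uint64_t n) {
    volatile uint64_t sum = 0;
    for (uint64_t i = 0; i < n; i++) {
        sum += i;
    }
    return sum;
}

/* Naive measurement: TSC ticks elapsed around one call to work(). */
static uint64_t naive_ticks(uint64_t n) {
    uint64_t start = __rdtsc(); /* get current counter value */
    (void)work(n);              /* beep boop, executing code */
    return __rdtsc() - start;   /* theoretically the number of cycles */
}
```

Calling `naive_ticks(100000)` returns a tick count for one run. A single sample like this is noisy, so in practice you would repeat it many times and look at the distribution.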
Sadly, reality has a surprising amount of detail.
Even a simple conversion from CPU cycles to wall-clock time is hairier than expected. It’s theoretically simple: take the clock speed of your processor (e.g. 4.4 GHz), take the reciprocal to get the time per cycle (1 / 4.4 GHz ~= 0.227 ns), and then multiply by the number of cycles.
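The arithmetic can be sketched in a few lines of C. The helper name `cycles_to_ns` is hypothetical, and the sketch assumes the clock frequency is fixed and known, which is exactly the assumption that breaks down next.

```c
#include <stdint.h>

/* Convert a cycle count to nanoseconds at a fixed clock frequency.
   1 GHz is one cycle per nanosecond, so the time per cycle is 1 / ghz ns
   and the total is cycles / ghz. Assumes the frequency never changes. */
static double cycles_to_ns(uint64_t cycles, double ghz) {
    return (double)cycles / ghz;
}
```

For example, `cycles_to_ns(4400, 4.4)` comes out at about 1000 ns, i.e. one microsecond.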
However, modern CPUs have different clock speeds depending on the “mode” that they are in. If the CPU is idle, then it might actually be chugging along at 2.9 GHz, rather than the stated 4.4 GHz.
That means the cycle count, once converted to wall-clock time using the stated clock speed, may not be correct.
Not only that, modern CPUs execute instructions out-of-order. Just because you wrote A -> B -> C in your code doesn’t guarantee that the CPU will literally execute A -> B -> C.
This out-of-order-ness extends to our naive benchmarking code above. It’s possible that the second rdtsc() was executed before some portions of the main code block finished.
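One commonly described mitigation is to fence the counter reads. Intel's documentation for RDTSC says that an LFENCE immediately before it makes the read wait for earlier instructions, and an LFENCE immediately after it keeps later instructions from starting before the read. The sketch below assumes GCC or Clang on x86-64; `fenced_rdtsc`, `fenced_ticks`, and `work` are hypothetical names, and this is not necessarily the exact sequence any particular guide recommends.

```c
#include <stdint.h>
#include <x86intrin.h> /* __rdtsc() and _mm_lfence() on GCC and Clang (assumed) */

/* Hypothetical stand-in for the code under test: sums 0..n-1. */
static uint64_t work(uint64_t n) {
    volatile uint64_t sum = 0;
    for (uint64_t i = 0; i < n; i++) {
        sum += i;
    }
    return sum;
}

/* Read the TSC with an LFENCE on both sides so the read is ordered
   relative to the instructions before and after it. */
static inline uint64_t fenced_rdtsc(void) {
    _mm_lfence();           /* earlier instructions execute before the read */
    uint64_t t = __rdtsc();
    _mm_lfence();           /* later instructions start only after the read */
    return t;
}

/* Same measurement as the naive version, using the fenced reads. */
static uint64_t fenced_ticks(uint64_t n) {
    uint64_t start = fenced_rdtsc();
    (void)work(n);
    return fenced_rdtsc() - start;
}
```

The fences add some overhead of their own, so very short code blocks will still measure higher than their true cost.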
What’s the point of using RDTSCP?
Does using the syscall clock_gettime help? Isn’t there overhead to syscalls?
How does timestamping on the NIC level work? Is it different?
For now, I’m going to get some ice cream.
Resources
- How to Benchmark Code Execution Times (Intel Guide)
- Linux vDSO Overview
- Using the TSC in C
- Go Runtime clock_gettime Implementation
- RDTSC Instruction Reference
- RDTSCP Instruction Reference
- RDPID Instruction Reference
- Faster Equivalent of gettimeofday()
- Idiomatic Performance Evaluation
- Getting CPU Cycle Count in x86-64
- Fastest Way to Get a Timestamp