IT Hare: Infographic - Operation Costs on CPU Clock Cycles

A place to discuss the science of computers and programs, from algorithms to computability.

Formal proofs preferred.

Moderators: phlip, Moderators General, Prelates

Posts: 2
Joined: Fri Jan 22, 2016 6:32 pm UTC

IT Hare: Infographic - Operation Costs on CPU Clock Cycles

Postby Rafferty » Mon Sep 12, 2016 3:30 pm UTC ... ck-cycles/

This article opens with a graphic to show the cost of certain CPU operations. To me, it looks like an XKCD cartoon just waiting to happen.

The writer is also looking for feedback to make the chart more accurate, there's links at the site.

-- R.

User avatar
You have done something you shouldn't. Or are about to.
Posts: 4060
Joined: Thu Jan 28, 2016 7:00 pm UTC
Location: 53-1

Re: IT Hare: Infographic - Operation Costs on CPU Clock Cycles

Postby Soupspoon » Mon Sep 12, 2016 5:20 pm UTC


Couldn't find NOP, though...

Posts: 495
Joined: Sun Jul 04, 2010 8:40 pm UTC

Re: IT Hare: Infographic - Operation Costs on CPU Clock Cycles

Postby korona » Sun Sep 18, 2016 9:07 am UTC

Keep in mind that while it might be possible to specify the latency of individual operations on such a chart it is very difficult to estimate the performance of combinations of instructions. CPUs do not execute instructions sequentially. Also: The costs of individual operations differ by CPU architecture and model. In this chart function call costs might be accurate for x86 cdecl but calls are much faster on x86_64 (on the order of a correctly predicted jump + memory write). syscall costs also vary between x86 and x86_64. Thread switches are much faster if PCIDs are used etc.

EDIT3: The method of measuring a thread switch as sizeof(L3) * miss-cost(L3) seems to be pretty stupid. Yes, that can happen if you're running two compute-heavy threads that saturate memory bandwidth on the same CPU. But then you're already overcommiting which don't want to do anyways if you care about performance. A MUCH more acurate measurement would be the cost of a TLB flush. The situation thread wake -> does a bit of work -> sleeps again is much more common than thread 1 wake -> exhaust entire time slice -> thread preempted -> thread 2 wake -> exhaust entire time slice -> ...

A problem of the chart is that is uses numbers from various sources that might all be individually correct but are from very different moments in CPU history.

To get a more precise measurement: "perf stat" on Linux + modern Intel CPUs gives a nice overview of your programs performance. I use it do profile some performance critical application at work. For small workloads (i.e. fitting into L3 cache) the program actually archives 1.2 instructions / cycle. I guess that this number could be way higher for codes that perform linear SIMD operations on a large amount of data (our program is quite branch-heavy and often saturates memory bandwidth).

EDIT: Also note for things like L1 accesses: Yes, an individual L1 access might have a latency of 4 cycles but those accesses are pipelined so that the processor is able to read data much faster than once per 4 cycles. Otherwise it would not even be possible to fetch one instruction / cycle and the CPU would be idle most of the time in linear pipeline-friendly codes.

EDIT2: I also heavily disagree with the use of always_inline for optimizing code (it's a nice tool for forcing inlining in debug builds though and in some low level code it might be needed to access the correct stack frame using things like __builtin_return_address). I have yet to see some code where always_inline improves performance. Compilers do not ignore inline for fun but because they are better at estimating what needs to be inlined than programmers (in C and especially C++ it's quite hard to estimate the L1 footprint of generated code).

User avatar
Posts: 152
Joined: Sat Feb 16, 2013 1:46 pm UTC

Re: IT Hare: Infographic - Operation Costs on CPU Clock Cycles

Postby 3rdtry » Mon Sep 19, 2016 6:17 pm UTC

Would it be possible to reduce context switching cost by switching to a cooperative model? (assuming the code was trusted)

As I understand, context switching means:
- Store registers to memory (including instruction pointer)
- Change page table
- Load other process' registers to memory (including instruction pointer)

If the compiler sprinkled "interruption points" in "evenly spaced" points of the program (yes, I know that's more difficult than it sounds, but let's pretend) that stored registers to memory and jumped to a predefined memory address containing the program scheduler, and the linker used different base memory addresses for every process, you could have "soft context switches" which would be just a bit more expensive than function calls.

But, since most of the cost seems to come from cache misses, this wouldn't actually help much, would it?

Posts: 495
Joined: Sun Jul 04, 2010 8:40 pm UTC

Re: IT Hare: Infographic - Operation Costs on CPU Clock Cycles

Postby korona » Tue Sep 20, 2016 7:11 am UTC

Yes, context switch costs could be reduced by such mechanisms. However the effectiveness varies by approach.

There are several costs associated with context switches:
  • There is a switch to supervisor mode. This usually includes a pipeline flush. The cost of the switch may vary between different entry mechanisms. On x86_64 syscall is much faster than an interrupt.
  • General purpose registers need to be saved. This is not very costly.
  • FPU and other special purpose registers (e.g. tracing, performance monitoring and debug registers) need to be saved. This will generally be much slower than saving GPRs. On recent processors FPU state might be somewhat large. AVX-512 has 16 512 bit registers.
  • Architecture specific registers need to saved. For example on x86_64 the FSBASE/GSBASE MSRs might need to be saved. Reading MSRs is quite costly.
  • The address space needs to be swapped. On processors without ASIDs/PCIDs this includes a TLB flush.
  • Because the target thread has a different working set caches might be stale.
The costs of the first four points can certainly be reduced by cooperation. In order to reduce the cost of the address space switch threads would need to run in the same address space which is only possible for processes that trust each other.

Return to “Computer Science”

Who is online

Users browsing this forum: No registered users and 4 guests