Runtime costs of small operations in C#.
Thank you to the smart people who helped me understand this: Vance Morrison, Kevin Frei, Dave Detlefs, Jan Kotas, Sean Selitrennikoff, Nathan Iarovich, and Dave Driver.
One of my colleagues read this post and pointed out that for really small operations, e.g. { i++ }, the cost of the measurement system exceeds the cost of the operation. I agree completely, and recommend using MeasureIt and reading its help if you need to measure small things.
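To see why measurement overhead dominates, here is a minimal sketch (not the post's actual harness) that times an empty delegate against an incrementing one with Stopwatch. The class and method names are my own for illustration:

```csharp
using System;
using System.Diagnostics;

class MeasurementOverhead
{
    // Times 'action' over 'iterations' calls. Note that the loop and the
    // delegate invocation are included in everything we measure.
    public static double TimeMs(Action action, int iterations)
    {
        action(); // warm up: force JIT compilation before timing
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++) action();
        sw.Stop();
        return sw.Elapsed.TotalMilliseconds;
    }

    static int counter;

    static void Main()
    {
        const int N = 10 * 1000 * 1000;
        double nop = TimeMs(() => { }, N);
        double inc = TimeMs(() => { counter++; }, N);
        // Both numbers are dominated by loop + delegate-call overhead,
        // so their difference says little about the cost of i++ itself.
        Console.WriteLine($"nop: {nop:F1} ms, counter++: {inc:F1} ms");
    }
}
```

Because an increment is a single instruction while a delegate call is several, the two timings come out nearly identical, which is exactly the trap the results below fall into.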
My colleague also noticed something interesting in this code (full code here; proof of concept in LINQPad using MeasureIt):
```csharp
class Program
{
    static int interlocked = 0;
    static int staticInt = 0;

    static void DoNop() { }
    static void IncrementStaticInt() { staticInt++; }

    // This is an array to make it easier to debug in cdb.
    static NamedAction[] namedActions;

    static void Main(string[] args)
    {
        int localInt = 0;
        namedActions = new NamedAction[]
        {
            new NamedAction { name = "lambda function: InterlockedIncrement()", action = () => Interlocked.Add(ref interlocked, interlocked) },
            new NamedAction { name = "static function: staticInt++", action = IncrementStaticInt },
            new NamedAction { name = "static function: nop", action = DoNop },
            new NamedAction { name = "lambda function: staticInt++", action = () => { staticInt++; } },
            new NamedAction { name = "lambda function: localInt++", action = () => { localInt++; } },
            new NamedAction { name = "lambda function: nop", action = () => { } }
        };
        int Million = 1000 * 1000;
        int iterations = 10 * Million;
        Console.WriteLine("Milliseconds to perform {0} iterations", iterations);
        ...
```

We get the surprising result that incrementing a local number is faster than doing nothing:
```
C:\hgs\ig2600_blog\PerformanceOfSmallThings\bin\Release>PerformanceOfSmallThings.exe
Milliseconds to perform 10000000 iterations
lambda function: InterlockedIncrement() took 221.1698 ms
static function: staticInt++ took 47.3857 ms
static function: nop took 43.1259 ms
lambda function: nop took 43.0301 ms
lambda function: staticInt++ took 42.7559 ms
lambda function: localInt++ took 35.0208 ms
```
What the heck is going on? Why would incrementing a local variable be faster than running a nop, and why would incrementing a static variable take the same time as a nop? To answer that we need to see the assembly code. To find the assembly code, read this.
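The post linked above explains the full procedure; as a minimal sketch of the idea, you can warm a method up so the JIT compiles it, then trap into a native debugger (cdb/windbg) at a known point and use the SOS extension to find and disassemble the jitted code. The class name here is my own:

```csharp
using System;
using System.Diagnostics;

class BreakForDisassembly
{
    public static int staticInt;

    public static void IncrementStaticInt() { staticInt++; }

    static void Main()
    {
        // Warm up so the JIT has compiled the method before we stop.
        for (int i = 0; i < 1000; i++) IncrementStaticInt();

        // Under cdb/windbg this traps here; you can then use SOS
        // (!name2ee, !U) to locate and disassemble the jitted method.
        if (Debugger.IsAttached) Debugger.Break();

        Console.WriteLine(staticInt);
    }
}
```

The warm-up loop matters: without it you may disassemble a stub rather than the optimized code.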
Here's what the assembly looks like:
staticInt++:

```
000007f9`7fd40940 ff05262deeff    inc     dword ptr [000007f9`7fc2366c]
000007f9`7fd40946 c3              ret
```

static nop:

```
000007f9`7fd40960 c3              ret
```

lambda nop:

```
000007f9`7fd50950 c3              ret
```

localInt++:

```
000007f9`7fd50930 8b4108          mov     eax,dword ptr [rcx+8]
000007f9`7fd50933 ffc0            inc     eax
000007f9`7fd50935 894108          mov     dword ptr [rcx+8],eax
000007f9`7fd50938 c3              ret
```
Maybe the caller is different - let's check:
```
Breakpoint 0 hit
000007f9`7fd50930 8b4108          mov     eax,dword ptr [rcx+8] ds:000000fa`9f582fb8=01312d0d
0:000> k
Child-SP          RetAddr           Call Site
000000fa`9d37e9a8 000007f9`d32d3c5c 0x7f9`7fd50930
000000fa`9d37e9b0 000007f9`7fd506a8 mscorlib_ni+0x1263c5c
000000fa`9d37ea00 000007f9`df419133 0x7f9`7fd506a8
0:000> g
Breakpoint 1 hit
000007f9`7fd50950 c3              ret
0:000> k
Child-SP          RetAddr           Call Site
000000fa`9d37e9a8 000007f9`d32d3c5c 0x7f9`7fd50950
000000fa`9d37e9b0 000007f9`7fd506a8 mscorlib_ni+0x1263c5c
000000fa`9d37ea00 000007f9`df419133 0x7f9`7fd506a8
```
Nope! Now what? We need to trace execution from the caller into our function: breakpoint the caller and trace execution.
```
0:000> ub 000007f9`d32d3c5c
mscorlib_ni+0x1263c48:
000007f9`d32d3c48 94              xchg    eax,esp
000007f9`d32d3c49 ab              stos    dword ptr [rdi]
000007f9`d32d3c4a 2aff            sub     bh,bh
000007f9`d32d3c4c 3c00            cmp     al,0
000007f9`d32d3c4e 0000            add     byte ptr [rax],al
000007f9`d32d3c50 488b54ca18      mov     rdx,qword ptr [rdx+rcx*8+18h]
000007f9`d32d3c55 488b4f08        mov     rcx,qword ptr [rdi+8]
000007f9`d32d3c59 ff5718          call    qword ptr [rdi+18h]
0:000> bp 000007f9`d32d3c59
0:000> g
Breakpoint 2 hit
0:000> ta
000000fa`9d9c0960 488bc4          mov     rax,rsp
000000fa`9d9c0963 4c8bd9          mov     r11,rcx
000000fa`9d9c0966 4889d1          mov     rcx,rdx
000000fa`9d9c0969 4d8b5320        mov     r10,qword ptr [r11+20h] ds:000000fa`9f5897e8=000007f97fc3c070
000000fa`9d9c096d 4983c320        add     r11,20h
000000fa`9d9c0971 49ffe2          jmp     r10 {000007f9`7fc3c070}
000007f9`7fc3c070 e99b501100      jmp     000007f9`7fd51110
000007f9`7fd51110 488b4110        mov     rax,qword ptr [rcx+10h] ds:000000fa`9f583068=000000fa9f583078
000007f9`7fd51114 488b4808        mov     rcx,qword ptr [rax+8] ds:000000fa`9f583080=000000fa9f583078
000007f9`7fd51118 488b4018        mov     rax,qword ptr [rax+18h] ds:000000fa`9f583090=000000fa9d9c0900
000007f9`7fd5111c 48ffe0          jmp     rax {000000fa`9d9c0900}
000000fa`9d9c0900 488bc4          mov     rax,rsp
000000fa`9d9c0903 4c8bd9          mov     r11,rcx
000000fa`9d9c0906 4d8b5320        mov     r10,qword ptr [r11+20h] ds:000000fa`9f583098=000007f97fc3c040
000000fa`9d9c090a 4983c320        add     r11,20h
000000fa`9d9c090e 49ffe2          jmp     r10 {000007f9`7fc3c040}
000007f9`7fc3c040 e90b491100      jmp     000007f9`7fd50950
Breakpoint 1 hit
000007f9`7fd50950 c3              ret
0:000> g
Breakpoint 2 hit
mscorlib_ni+0x1263c59:
000007f9`d32d3c59 ff5718          call    qword ptr [rdi+18h] ds:000000fa`9f5897e0=000000fa9d9c0960
0:000> ta
000000fa`9d9c0960 488bc4          mov     rax,rsp
000000fa`9d9c0963 4c8bd9          mov     r11,rcx
000000fa`9d9c0966 4889d1          mov     rcx,rdx
000000fa`9d9c0969 4d8b5320        mov     r10,qword ptr [r11+20h] ds:000000fa`9f5897e8=000007f97fc3c070
000000fa`9d9c096d 4983c320        add     r11,20h
000000fa`9d9c0971 49ffe2          jmp     r10 {000007f9`7fc3c070}
000007f9`7fc3c070 e99b501100      jmp     000007f9`7fd51110
000007f9`7fd51110 488b4110        mov     rax,qword ptr [rcx+10h] ds:000000fa`9f583008=000000fa9f583018
000007f9`7fd51114 488b4808        mov     rcx,qword ptr [rax+8] ds:000000fa`9f583020=000000fa9f582fb0
000007f9`7fd51118 488b4018        mov     rax,qword ptr [rax+18h] ds:000000fa`9f583030=000007f97fc3c1e8
000007f9`7fd5111c 48ffe0          jmp     rax {000007f9`7fc3c1e8}
000007f9`7fc3c1e8 e943471100      jmp     000007f9`7fd50930
Breakpoint 0 hit
000007f9`7fd50930 8b4108          mov     eax,dword ptr [rcx+8] ds:000000fa`9f582fb8=01312d0d
0:000>
```

Yeah, I know, painful to read. Let's look at that in VIMDiff:
What do we see? When running the nop function we execute 5 extra instructions. Those instructions are a shuffle thunk (see Figure 2 here), which is required when calling static delegates.
OK, so why is incrementing a local number faster than doing nothing? Because we're not measuring those things; we're measuring the cost of delegate invocation. In this case, we're actually comparing the cost of static vs. instance delegates, and instance is faster. You see, there are two performance optimizations interacting from an execution-time perspective:
1) The JIT makes instance delegates faster than static delegates (I assume because it's the common case).
2) The compiler optimizes the nop lambda to be a static delegate (I assume to save a potential object instantiation).
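You can observe optimization (2) for yourself by inspecting the delegate's backing method. A caveat for anyone trying this today: the compilers of this era emitted non-capturing lambdas as static methods, but later Roslyn versions deliberately emit them as instance methods on a cached closure object, precisely to avoid the static-delegate shuffle-thunk cost described above, so the first result may differ depending on your compiler. A minimal sketch (names are my own):

```csharp
using System;

class DelegateKinds
{
    public static (bool nopIsStatic, bool captureIsStatic) Inspect()
    {
        int local = 0;
        Action nop = () => { };          // captures nothing
        Action inc = () => { local++; }; // captures 'local'

        // A delegate over a static method has no target object; the
        // capturing lambda targets the compiler-generated closure.
        return (nop.Method.IsStatic, inc.Method.IsStatic);
    }

    static void Main()
    {
        var (nopIsStatic, captureIsStatic) = Inspect();
        Console.WriteLine($"nop lambda backed by static method: {nopIsStatic}");
        Console.WriteLine($"capturing lambda backed by static method: {captureIsStatic}");
    }
}
```

The capturing lambda is always instance-backed, since its body needs the closure object to reach the captured variable.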
Comments
I'm not a professional at writing them, but if I were you I would make sure to run my tests many times and measure the mean and RMS. I would also make sure that no other program took processor time while the tests were running. I would check that no GC happened... and so on.
Before doing this I wouldn't consider the numbers reliable.
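The commenter's checklist (repeated runs, mean and RMS, watching for GC) can be sketched roughly like this; the class name and the choice of generation-0 collection count as the GC signal are my own:

```csharp
using System;
using System.Diagnostics;
using System.Linq;

class RepeatedRuns
{
    // Runs 'action' 'runs' times; returns (mean, rms deviation) in ms.
    public static (double mean, double rms) Measure(Action action, int runs)
    {
        var samples = new double[runs];
        for (int r = 0; r < runs; r++)
        {
            GC.Collect();                 // reduce the chance of a GC mid-run
            GC.WaitForPendingFinalizers();
            int gcBefore = GC.CollectionCount(0);

            var sw = Stopwatch.StartNew();
            action();
            sw.Stop();

            if (GC.CollectionCount(0) != gcBefore)
                Console.WriteLine($"warning: GC occurred during run {r}");
            samples[r] = sw.Elapsed.TotalMilliseconds;
        }
        double mean = samples.Average();
        double rms = Math.Sqrt(samples.Select(s => (s - mean) * (s - mean)).Average());
        return (mean, rms);
    }

    static void Main()
    {
        var (mean, rms) = Measure(() => { for (int i = 0; i < 1000000; i++) { } }, 10);
        Console.WriteLine($"mean {mean:F3} ms, rms {rms:F3} ms");
    }
}
```

A large RMS relative to the mean is a sign the runs were disturbed and the numbers shouldn't be trusted.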
BTW, in the "lambda function: localInt++" test you actually increment a _field_ on the object that the compiler creates for the closure.
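This comment is what the [rcx+8] in the localInt++ disassembly is about. A hand-written equivalent of what the compiler generates for the capturing lambda looks roughly like this; the class and member names are illustrative only (the real generated names look like <>c__DisplayClass...):

```csharp
using System;

// Hand-written stand-in for the compiler-generated closure class.
class DisplayClass
{
    public int localInt;                  // the captured local becomes a field
    public void Invoke() { localInt++; }  // the lambda body, as an instance method
}

class ClosureDemo
{
    public static int Run(int iterations)
    {
        var closure = new DisplayClass(); // 'int localInt = 0;' lifted onto the heap
        Action action = closure.Invoke;   // the delegate targets the closure object
        for (int i = 0; i < iterations; i++) action();
        return closure.localInt;          // the field read as [rcx+8] in the disassembly
    }

    static void Main() => Console.WriteLine(Run(10));
}
```

So the "local" increment is really a load/increment/store on a heap object's field, exactly as the three-instruction disassembly shows.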
-Kev
(dev lead for the .NET JIT compiler team)