Tuesday, December 04, 2012

Run-time costs of small operations in C#.

Thanks to the smart people who helped me understand this: Vance Morrison; Kevin Frei; Dave Detlefs; Jan Kotas; Sean Selitrennikoff; Nathan Iarovich; Dave Driver.

One of my colleagues read this post and pointed out that for really small operations, e.g. { i++ }, the cost of the measurement system exceeds the cost of the operation itself. I agree completely, and recommend using MeasureIt, and reading its help, if you need to measure small things.

My colleague also noticed something interesting in this code (full code here; proof of concept in LINQPad using MeasureIt):
    class Program
    {
        static int interlocked = 0;
        static int staticInt = 0;

        static void DoNop() { }
        static void IncrementStaticInt() { staticInt++; }

        // This is an array to make it easier to debug in cdb.
        static NamedAction[] namedActions;

        static void Main(string[] args)
        {
            int localInt = 0;
            namedActions = new NamedAction[]
            {
                new NamedAction {name="lambda function: InterlockedIncrement()", action = ()=>Interlocked.Add(ref interlocked,interlocked)},
                new NamedAction {name="static function: staticInt++", action = IncrementStaticInt},
                new NamedAction {name="static function: nop", action = DoNop},
                new NamedAction {name="lambda function: staticInt++", action = ()=>{staticInt++;}},
                new NamedAction {name="lambda function: localInt++", action = ()=>{localInt++;}},
                new NamedAction {name="lambda function: nop", action = ()=>{}}
            };

            int Million = 1000 * 1000;
            int iterations = 10 * Million;
            Console.WriteLine("Milliseconds to perform {0} iterations", iterations);
            ...
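
The elided part is just the timing harness. Here's a minimal sketch of the missing pieces, assuming a Stopwatch-based loop (the real code is in the link above; the names mirror it, but the loop details are my guess, not the author's actual code):

    using System;
    using System.Diagnostics;

    // Pairs a label with the delegate under test (mirrors the type used above).
    class NamedAction
    {
        public string name;
        public Action action;
    }

    // Sketch of the elided measurement loop: invoke each delegate
    // `iterations` times under a Stopwatch and report elapsed milliseconds.
    static void Measure(NamedAction[] namedActions, int iterations)
    {
        foreach (var namedAction in namedActions)
        {
            var sw = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
                namedAction.action();
            sw.Stop();
            Console.WriteLine("{0} took {1} ms",
                namedAction.name, sw.Elapsed.TotalMilliseconds);
        }
    }

Note that a harness like this times the delegate invocation along with the operation, which is exactly why the tiny operations below are dominated by call overhead.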
We get the surprising result that incrementing a local variable is faster than doing nothing:
 C:\hgs\ig2600_blog\PerformanceOfSmallThings\bin\Release>PerformanceOfSmallThings.exe
Milliseconds to perform 10000000 iterations
lambda function: InterlockedIncrement() took 221.1698 ms
static function: staticInt++ took 47.3857 ms
static function: nop took 43.1259 ms
lambda function: nop took 43.0301 ms
lambda function: staticInt++ took 42.7559 ms
lambda function: localInt++ took 35.0208 ms

What the heck is going on? Why would incrementing a local variable be faster than running a nop, and why would incrementing a static variable take the same time as a nop? To answer that we need to see the assembly code. To find the assembly code, read this.

Here's what the assembly looks like:
staticInt++:
000007f9`7fd40940 ff05262deeff    inc     dword ptr [000007f9`7fc2366c]
000007f9`7fd40946 c3              ret
static nop:
000007f9`7fd40960 c3              ret
lambda nop:
000007f9`7fd50950 c3              ret
localInt++:
000007f9`7fd50930 8b4108          mov     eax,dword ptr [rcx+8]
000007f9`7fd50933 ffc0            inc     eax
000007f9`7fd50935 894108          mov     dword ptr [rcx+8],eax
000007f9`7fd50938 c3              ret
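
Incidentally, that `[rcx+8]` access already hints at what's going on: because the lambda captures a local, the compiler hoists localInt into a field of a compiler-generated closure class, and the lambda body becomes an instance method on it. Roughly this (the names are illustrative, not the real generated ones):

    // Roughly what the compiler generates for "() => { localInt++; }".
    class Closure
    {
        public int localInt; // the captured local lives here, at [rcx+8]

        // mov eax,[rcx+8] / inc eax / mov [rcx+8],eax / ret
        public void Invoke() { localInt++; }
    }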

Maybe the caller is different - let's check:
Breakpoint 0 hit
000007f9`7fd50930 8b4108          mov     eax,dword ptr [rcx+8] ds:000000fa`9f582fb8=01312d0d
0:000> k
Child-SP          RetAddr           Call Site
000000fa`9d37e9a8 000007f9`d32d3c5c 0x7f9`7fd50930
000000fa`9d37e9b0 000007f9`7fd506a8 mscorlib_ni+0x1263c5c
000000fa`9d37ea00 000007f9`df419133 0x7f9`7fd506a8
0:000> g

Breakpoint 1 hit
000007f9`7fd50950 c3              ret
0:000> k
Child-SP          RetAddr           Call Site
000000fa`9d37e9a8 000007f9`d32d3c5c 0x7f9`7fd50950
000000fa`9d37e9b0 000007f9`7fd506a8 mscorlib_ni+0x1263c5c
000000fa`9d37ea00 000007f9`df419133 0x7f9`7fd506a8

Nope! Now what? We need to trace execution from the caller into our function: set a breakpoint on the caller and trace.
 0:000> ub 000007f9`d32d3c5c
mscorlib_ni+0x1263c48:
000007f9`d32d3c48 94              xchg    eax,esp
000007f9`d32d3c49 ab              stos    dword ptr [rdi]
000007f9`d32d3c4a 2aff            sub     bh,bh
000007f9`d32d3c4c 3c00            cmp     al,0
000007f9`d32d3c4e 0000            add     byte ptr [rax],al
000007f9`d32d3c50 488b54ca18      mov     rdx,qword ptr [rdx+rcx*8+18h]
000007f9`d32d3c55 488b4f08        mov     rcx,qword ptr [rdi+8]
000007f9`d32d3c59 ff5718          call    qword ptr [rdi+18h]

0:000> bp 000007f9`d32d3c59

0:000> g
Breakpoint 2 hit
0:000> ta
000000fa`9d9c0960 488bc4          mov     rax,rsp
000000fa`9d9c0963 4c8bd9          mov     r11,rcx
000000fa`9d9c0966 4889d1          mov     rcx,rdx
000000fa`9d9c0969 4d8b5320        mov     r10,qword ptr [r11+20h] ds:000000fa`9f5897e8=000007f97fc3c070
000000fa`9d9c096d 4983c320        add     r11,20h
000000fa`9d9c0971 49ffe2          jmp     r10 {000007f9`7fc3c070}
000007f9`7fc3c070 e99b501100      jmp     000007f9`7fd51110
000007f9`7fd51110 488b4110        mov     rax,qword ptr [rcx+10h] ds:000000fa`9f583068=000000fa9f583078
000007f9`7fd51114 488b4808        mov     rcx,qword ptr [rax+8] ds:000000fa`9f583080=000000fa9f583078
000007f9`7fd51118 488b4018        mov     rax,qword ptr [rax+18h] ds:000000fa`9f583090=000000fa9d9c0900
000007f9`7fd5111c 48ffe0          jmp     rax {000000fa`9d9c0900}
000000fa`9d9c0900 488bc4          mov     rax,rsp
000000fa`9d9c0903 4c8bd9          mov     r11,rcx
000000fa`9d9c0906 4d8b5320        mov     r10,qword ptr [r11+20h] ds:000000fa`9f583098=000007f97fc3c040
000000fa`9d9c090a 4983c320        add     r11,20h
000000fa`9d9c090e 49ffe2          jmp     r10 {000007f9`7fc3c040}
000007f9`7fc3c040 e90b491100      jmp     000007f9`7fd50950
Breakpoint 1 hit
000007f9`7fd50950 c3              ret

0:000> g
Breakpoint 2 hit
mscorlib_ni+0x1263c59:
000007f9`d32d3c59 ff5718          call    qword ptr [rdi+18h] ds:000000fa`9f5897e0=000000fa9d9c0960
0:000> ta
000000fa`9d9c0960 488bc4          mov     rax,rsp
000000fa`9d9c0963 4c8bd9          mov     r11,rcx
000000fa`9d9c0966 4889d1          mov     rcx,rdx
000000fa`9d9c0969 4d8b5320        mov     r10,qword ptr [r11+20h] ds:000000fa`9f5897e8=000007f97fc3c070
000000fa`9d9c096d 4983c320        add     r11,20h
000000fa`9d9c0971 49ffe2          jmp     r10 {000007f9`7fc3c070}
000007f9`7fc3c070 e99b501100      jmp     000007f9`7fd51110
000007f9`7fd51110 488b4110        mov     rax,qword ptr [rcx+10h] ds:000000fa`9f583008=000000fa9f583018
000007f9`7fd51114 488b4808        mov     rcx,qword ptr [rax+8] ds:000000fa`9f583020=000000fa9f582fb0
000007f9`7fd51118 488b4018        mov     rax,qword ptr [rax+18h] ds:000000fa`9f583030=000007f97fc3c1e8
000007f9`7fd5111c 48ffe0          jmp     rax {000007f9`7fc3c1e8}
000007f9`7fc3c1e8 e943471100      jmp     000007f9`7fd50930
Breakpoint 0 hit
000007f9`7fd50930 8b4108          mov     eax,dword ptr [rcx+8] ds:000000fa`9f582fb8=01312d0d
0:000>
Yeah, I know, painful to read - let's look at that in vimdiff:


What do we see? When running the nop function we execute 5 extra instructions. Those instructions are a shuffle thunk (see Figure 2 here), which is required when calling static delegates.

OK, so why is incrementing a local variable faster than doing nothing? Because we're not really measuring those operations; we're measuring the cost of delegate invocation. In this case we're actually comparing the cost of static vs. instance delegates, and instance delegates are faster. There are two performance optimizations in play, interacting from an execution-time perspective:

1) The JIT makes instance delegates faster than static delegates (I assume because instance delegates are the common case).
2) The compiler optimizes the nop lambda into a static delegate (I assume to save a potential object instantiation).
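
This interaction is easy to poke at directly. Here's a minimal sketch (my code, not the author's) that checks what kind of delegate each flavor produces; a null Target means a static delegate, which pays for the shuffle thunk on every invocation:

    using System;

    class DelegateKinds
    {
        static void StaticNop() { }

        static void Main()
        {
            int captured = 0;

            // Bound to a static method: Target is null, so every call
            // goes through the shuffle thunk.
            Action staticDelegate = StaticNop;

            // Captures a local: compiled to an instance method on a
            // closure object, so Target is that object and no thunk is needed.
            Action capturing = () => { captured++; };

            Console.WriteLine(staticDelegate.Target == null); // True
            Console.WriteLine(capturing.Target == null);      // False
        }
    }

The 2012-era compiler also emitted captureless lambdas like () => { } as static methods, which is why the lambda nop above behaves like the static nop; newer Roslyn compilers deliberately cache captureless lambdas as instance delegates, precisely because instance delegate invocation is faster.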

3 comments:

Konstantin Savelev said...

It is very hard to write correct microbenchmarks.

I'm not a professional at writing them, but if I were you I would make sure to run the tests many times and measure the mean and RMS. I would also make sure that no other program took processor time while the tests were running. I would check that no GC happened... and so on.

Before doing this I wouldn't consider the numbers reliable.

BTW, in the "lambda function: localInt++" test you actually increment a _field_ on the object that the compiler creates for the closure.

Igor Dvorkin said...

Konstantin, excellent point on incrementing a field - I'll update it in the next round of the post. I also suspect these numbers are reliable from a statistical perspective, and I will get to the bottom of this :)

kevinfrei said...

You're missing the most important part of a microbenchmark: What hardware are you running it on? I can certainly tell that it's 64 bit, but are you running on AMD or Intel hardware? Which exact part? There are a whole bunch of weird glass jaws that you can hit in microbenchmarks, but to diagnose it, I need hardware detail.

-Kev
(dev lead for the .NET JIT compiler team)