Wednesday, September 05, 2012

TechEd 2012 background 2 - Compiler optimizations and Volatile

Note: The goal of this post is to provide more in-depth information and references for attendees about a presentation I'm giving at Tech Ed New Zealand 2012.
The session is DEV402 - Multithreaded programming in .NET from the ground up.

This second post shows one possible effect of compiler optimizations on a C# program, and explains how to fix it using the oft-misunderstood volatile keyword.

Here's a sample program which illustrates this.
This program looks suspiciously similar to one of the samples in Joseph Albahari's threading book, because that was by far the best piece of code I found (or could invent) to illustrate the issue. All credit to Joseph for his great work.

using System;
using System.Threading;

class Program
{
    static int stopFlag = 0;

    static void ThreadProc()
    {
        Console.WriteLine("{0}: worker thread start", DateTime.Now.TimeOfDay);
        bool toggle = false;

        while (stopFlag == 0)
            toggle = !toggle;

        Console.WriteLine("{0}: worker thread done", DateTime.Now.TimeOfDay);
    }

    public static void Main(string[] args)
    {
        stopFlag = 0;
        var t = new Thread(ThreadProc);
        t.Start();

        Thread.Sleep(1000); // give the worker a second, then signal it to stop
        stopFlag = 1;

        t.Join();
    }
}

In summary, it's pretty simple. The main thread spins up a worker thread, sleeps for a second, then sets the stopFlag. The worker thread sits on a loop waiting for the stopFlag to be set, then exits.

If you run a Debug build of this code, or launch a 32-bit Release build with the debugger attached, you'll get this output:

10:38:25.0815053: worker thread start
10:38:26.0955150: worker thread done

However, if you launch a 32-bit Release build without the debugger attached, you'll see this:

10:51:47.1955922: worker thread start

The worker thread won't stop, and if you launch Task Manager you'll see that the app is thrashing away using an entire CPU. What's going on here?

To answer this, we need to attach the debugger to the already-running program, pause it and jump over to the worker thread and see what it's up to. The process for this in Visual Studio 2010/2012 goes like this:

  • From the Debug menu, choose Attach to Process...
  • Scroll down the list, select your application, and click Attach
  • Once the attach dialog closes, pause the application by clicking the pause button on the toolbar, or selecting Break All from the Debug menu. This will stop the program looping and let us look at what it's up to
  • Bring up the Threads window ( Debug / Windows / Threads ) and double-click on the worker thread

We should now see the current line of code that is being executed by this thread, which is

while (stopFlag == 0)

So, we're still in the loop waiting for the stopFlag to be set - yet, if you mouse over the stopFlag variable, or add it to the Watch window, we'll see that it has a value of 1.

The flag has been set, but our loop is still going. What gives?
To answer this, we need to bring up the Disassembly window. ( Debug / Windows / Disassembly ). This window shows us the raw x86 machine code that has been generated by the JIT compiler, and is the actual code that is being executed on the CPU.

The VS disassembly window will print each line of C# source code (if it has it), followed by the x86 assembly. For the line we're stuck on, we get this:

while (stopFlag == 0)
0000004a  mov         eax,dword ptr ds:[007A328Ch] 
0000004f  test        eax,eax 
00000051  jne         00000057 
00000053  test        eax,eax 
00000055  je          00000053 

If you know how to read x86 assembly, I congratulate you. I only know the basics, but enough that I can walk you through it and show what's going on.

So first, this line:

0000004a mov eax,dword ptr ds:[007A328Ch]

  • The instruction is mov, which is x86 for "set something". The first parameter is the target, and the second is the source.
  • The first parameter is eax, which refers to the EAX register inside the CPU. Intel x86 processors have four primary general-purpose registers - EAX through EDX - which are used to perform the vast majority of work that happens in an application.
  • The second parameter is dword ptr ds:[007A328Ch]. The dword ptr bit is a size directive ( see this page and scroll down to "Size Directives" ) and it's needed to tell the CPU how many bytes we should be moving - dword ptr simply means 32 bits. The ds: part indicates the address is relative to the DS (data segment) register, but the most important part is this: [007A328Ch]. The square brackets mean "the value in memory at the given location". Our stopFlag is static, so it's stored at a fixed memory location - in my case, 007A328C.

Translated: "Fetch 32 bits from memory at location 007A328C, and put them in the EAX register." In other words - load our stopFlag into the CPU.

The next part is this:
0000004f test eax,eax
00000051 jne 00000057

  • The test eax,eax is basically asking the question "Does EAX contain zero?"
  • The jne 00000057 is the Jump-if-not-equal instruction. It translates to "If it was NOT zero, jump to address 00000057" - which skips past the loop entirely.

We know however, that when the loop starts, the flag was in fact zero, so we'll get past this block and onto the next one:

00000053 test eax,eax
00000055 je 00000053

  • Again, "Does EAX contain zero?"
  • The je 00000053 is the Jump-if-equal instruction. It translates to "If it WAS zero, jump to address 00000053" - the start of our loop.

Putting this all together, we end up with this sequence of events:

  • Load the stopFlag from memory into EAX
  • Loop waiting for EAX to become non-zero

The problem is, at no point does any code ever set EAX again once we enter the loop.

What's going on here? The answer is that the C# and JIT compilers only consider the single-threaded case when performing optimizations. From the point of view of ThreadProc, we never set the stopFlag, and we never call any other functions, so there is no possibility that the stopFlag can change. The JIT compiler applies this reasoning and decides that it only needs to read the stopFlag a single time - leading to this infinite loop!
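To see what this optimization amounts to, here's a source-level sketch of the transformation the JIT has effectively performed. The bound on the loop is mine, added so the sketch terminates - the real optimized loop spins forever:

```csharp
using System;

class HoistingSketch
{
    static int stopFlag = 0;

    static void Main()
    {
        // What we wrote: while (stopFlag == 0) { ... }
        // What the optimizer effectively compiled: read the field once,
        // then loop on the cached register copy.
        int cached = stopFlag;              // single load into a register (EAX)

        int spins = 0;
        while (cached == 0 && spins < 3)    // bounded here so the demo exits
        {
            spins++;                        // the real loop never re-reads stopFlag
        }

        stopFlag = 1;                       // too late: 'cached' never sees this
        Console.WriteLine(cached);
    }
}
```

The write to stopFlag at the end has no effect on the loop, because the loop is testing the stale copy in `cached` - exactly the situation our worker thread is stuck in.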

To fix it, we simply mark the stopFlag variable as volatile.

This causes the assembly to change to this:

0000004a cmp dword ptr ds:[0088328Ch],0
00000051 jne 0000005C
00000053 cmp dword ptr ds:[0088328Ch],0
0000005a je 00000053

The key difference is that we now see the memory operand [0088328Ch] in the repetitive part of the loop, which means that on every iteration we go out to memory and re-read the value. So when the stopFlag gets set after a second, the loop will pick up the new value and the program behaves correctly. Great!
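For reference, here's the whole fix as a minimal self-contained program. The only difference from the original sample is the volatile keyword on the field declaration (I've shortened the sleep to 100ms just to keep the demo quick):

```csharp
using System;
using System.Threading;

class VolatileFix
{
    // 'volatile' tells the compilers they may not cache this field in a
    // register - every read in the loop below becomes a real memory read.
    static volatile int stopFlag = 0;

    static void Main()
    {
        var t = new Thread(() =>
        {
            while (stopFlag == 0) { }   // re-reads stopFlag each iteration
            Console.WriteLine("worker thread done");
        });
        t.Start();

        Thread.Sleep(100);
        stopFlag = 1;                   // the worker will now see this and exit
        t.Join();
        Console.WriteLine("main done");
    }
}
```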


There are a couple of things I'd like to go into more detail about:

Firstly, the weird double-check that the compiler generates for loops.
When the JIT compiler generates a loop it always seems to use this pattern:

  • First, check the opposite of the loop condition. If this is true, jump over the entire loop code
  • Now we enter the loop where we do the repeated action, and check the actual loop condition

Technically, there's no reason why we need to check it twice - we could get by with just the second part, where we check the positive loop condition. I'm no expert, but my guess is that this is to help with branch prediction.
My personal theory is that by splitting the loop into 2 conditions, the branch predictor can track them both separately.
If we bail out of the loop before it even gets started, the branch predictor won't need to worry about the subsequent code; but if we enter the second phase of the loop, the odds are good that we'll keep looping, so the branch predictor can use this information.

Secondly, the use of the cmp instruction rather than test in the loops.
This stackoverflow question/answer sheds some light on the topic.

The test instruction is smaller (different x86 instructions take different numbers of bytes to encode) than cmp, so the compiler will prefer it over cmp where both would achieve the same result. test works by doing a bitwise AND on the two values, whereas cmp does subtraction.

In the first case, where the value has already been loaded into a register, ANDing it with itself is a nice fast way to check for zero. In the second case however, where the value is being read from memory every time, using test would mean either burning an extra register to hold the value, or reading the memory twice for both sides of the test operation - both of which would be slower than a cmp against the constant 0.

Thirdly, what happened to the toggling inside the loop?
You may have noticed that the disassembly at no point shows anything to do with the loop body of toggle = !toggle;.

This is just the compiler optimizer doing its job.

toggle is a local variable, it's not returned, and it's not passed to any functions - therefore, whatever we do to it can have no possible effect on anything. The compiler detects this, and simply removes it entirely.

So, what does this mean about volatile?
There seems to be a lot of confusion around what the volatile keyword is supposed to do. There are many questions/answers and blog posts which attempt to explain it. Some say that it will insert memory barriers before or after writes to the variable; some say that it causes reads and writes to be translated into calls to the Thread.VolatileRead or Thread.VolatileWrite methods (both of which insert explicit barriers). But as we can see from our disassembly, it isn't inserting any such barriers - it's just doing normal, ordinary reads and writes.

I think a lot of this confusion comes from the MSDN volatile documentation. The first hit in Google for "C# Volatile Keyword" is the MSDN documentation, which states: "The system always reads the current value of a volatile object at the point it is requested, even if the previous instruction asked for a value from the same object. Also, the value of the object is written immediately on assignment."

In order for this to be true (particularly the "value is written immediately" part), one might infer that the compiler would insert memory barriers. Unfortunately, this page is the documentation for Visual Studio 2003. Subsequent revisions (2005, 2010, etc) re-word the documentation to be less misleading, but unfortunately Google is still linking everyone to the old, out-of-date page! For more, see

Additionally, the ECMA CLI spec states that reads of volatile fields have acquire semantics, and writes to volatile fields have release semantics. This could also lead people to think that memory barriers might be necessary - but reads and writes on x86 already have these semantics anyway, so the net effect of volatile there (other than suppressing compiler optimizations) is... nothing! On ARM or Itanium, however, the JIT will insert memory barriers around reads and writes to volatile fields to give you the acquire/release semantics, as those CPUs don't guarantee it natively.
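If you want those acquire/release semantics on a per-access basis, rather than on every access to a field, .NET 4.5 adds the System.Threading.Volatile class. Here's a small sketch - the field and variable names are mine, purely for illustration:

```csharp
using System;
using System.Threading;

class AcquireReleaseSketch
{
    static int flag;    // note: deliberately NOT marked volatile
    static int data;

    static void Main()
    {
        var t = new Thread(() =>
        {
            // Volatile.Read gives this load acquire semantics, and also
            // stops the JIT from hoisting it out of the loop.
            while (Volatile.Read(ref flag) == 0) { }

            // The acquire/release pairing guarantees we see data = 42 here.
            Console.WriteLine("data = {0}", data);
        });
        t.Start();

        data = 42;
        // Volatile.Write gives this store release semantics: the write to
        // 'data' above cannot be reordered to happen after it.
        Volatile.Write(ref flag, 1);
        t.Join();
    }
}
```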

Tuesday, September 04, 2012

TechEd 2012 background 1 - Race conditions and Interlocked operations

Note: The goal of this post is to provide more in-depth information and references for attendees about a presentation I'm giving at Tech Ed New Zealand 2012.
The session is DEV402 - Multithreaded programming in .NET from the ground up.

This first post goes over what is happening to cause a basic race condition, caused by the interruption and interleaving of 2 threads accessing a shared variable.

Here is a buggy C# program. Feel free to copy/paste it into visual studio and run it yourself.
I'm guessing from what I've seen that probably 98% of the .NET code people are writing is running in 32-bit mode, so that's what I'm focusing on.

using System;
using System.Threading;

public static class BasicProgram
{
    public static int value = 0;

    const uint Iterations = 100000000;

    static System.Diagnostics.Stopwatch watch = new System.Diagnostics.Stopwatch();

    static Object x = new object();

    static void Main(string[] args)
    {
        watch.Start();
        new Thread(Worker1) { IsBackground = true }.Start();
        new Thread(Worker2) { IsBackground = true }.Start();

        Console.WriteLine("Running... Press enter to quit");
        Console.ReadLine();
    }

    static void Worker2(object _)
    {
        for (uint i = 0; i < Iterations; i++)
            value++; // unsynchronized read-modify-write

        Console.WriteLine("Worker2 done at {0}ms", watch.ElapsedMilliseconds);
    }

    static void Worker1(object _)
    {
        var oldValue = value; var newValue = value;

        for (uint i = 0; i < Iterations; i++)
        {
            oldValue = value;
            value++; // unsynchronized read-modify-write
            newValue = value;

            if (newValue < oldValue) // sanity check: value should only ever grow
                Console.WriteLine("i++ went backwards! oldValue={0}, newValue={1}!", oldValue, newValue);
        }

        Console.WriteLine("Worker1 done at {0}ms", watch.ElapsedMilliseconds);
    }
}

The bug manifests itself by hitting the sanity-check in Worker1. Usually for me this happens anywhere between 1 and 10 times, but sometimes it doesn't happen at all. The loop runs 100 million times, yet the bug only happens rarely - this is a good example of just how random and hard to reproduce these kinds of threading issues can be.

Here's the output for a random run on my laptop (dual-core Intel i5 with hyperthreading):

Running... Press enter to quit
i++ went backwards! oldValue=35777960, newValue=35777925!
Worker2 done at 266ms
Worker1 done at 402ms

This is pretty much the standard textbook race condition.

I'd imagine most developers with a bit of experience will have seen this or something similar before, so I don't think the purely educational value here is that high - my main goal is to use it as a starting point.
It teaches (or reinforces) several key points that I want people to have in their minds, so I can build upon them to explain more complex threading bugs.

These fundamental lessons are:

  • In order to do any work, the computer has to Read from memory, Modify the value inside the CPU, then Write back to memory. These are 3 (or more) discrete steps.
  • Because of this, simple-looking things such as value++ actually turn out to be multi-step operations
  • The OS will interrupt the operation of your threads at arbitrary points... If a thread is in the middle of a multi-step operation, this may mean another thread can come along while it's suspended, and modify data out from underneath it. This is the root cause of the bug.
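The multi-step nature of value++ is easy to see if we write the steps out by hand. This expansion is mine, for illustration - the JIT emits the equivalent machine instructions:

```csharp
using System;

class IncrementSteps
{
    static int value = 0;

    static void Main()
    {
        // value++ expanded into its discrete steps - a thread can be
        // suspended between any two of them:
        int temp = value;    // 1. Read the value from memory into the CPU
        temp = temp + 1;     // 2. Modify the value inside the CPU
        value = temp;        // 3. Write the result back to memory

        Console.WriteLine(value);
    }
}
```

If another thread writes to value between steps 1 and 3, that write gets clobbered when step 3 stores the stale result back - which is exactly the bug the sanity check catches.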

You can fix this by using a lock - The idea of course is that if you acquire a lock before doing the read-modify-write, then other threads must wait for you to release the lock before they can touch the shared value. If a thread gets suspended in the middle of this read-modify-write operation, well, other threads must wait for it to be resumed so it can release the lock, and the whole "modifying data out from underneath it" simply can't happen.

This is great, and I want to make the point that normally this is exactly the right thing to do, but there is also another way to solve this problem - by using Atomic Operations.

What are atomic operations? All CPUs read and write to memory in certain-sized chunks of bytes:

  • A 32-bit processor (or a 64-bit processor running in 32-bit mode) will read or write to memory in blocks of 32-bits
  • A 64-bit processor running in 64-bit mode will read or write to memory in blocks of 64-bits

Handily, this is the same size as a pointer in native C or C++ code on those platforms, and .NET object references are just pointers. (You don't get to see the pointers directly - they are managed for you, hence why .NET is a "managed" language - but they're still there in exactly the same way they would be in a C++ program.) Because such a read or write happens in a single block, it can never be observed half-done - in other words, aligned reads and writes of pointer size or smaller are atomic.

Additionally, there are the functions in the System.Threading.Interlocked class, which give us atomic versions of multi-step operations such as increment, exchange, and compare-and-swap.

What I don't mention in the talk is memory alignment. Memory alignment affects atomicity: if your 32-bit int is split across 2 of these blocks due to alignment issues, the CPU has to read or write both blocks, and hence the access is no longer atomic.

I chose not to mention this because pretty much the only way to get unaligned memory access in .NET is when fiddling with a struct that uses the [StructLayout] attribute, which only really happens for P/Invoke calls.
The remaining 99.9% of .NET code is likely to have properly aligned memory access patterns, so I didn't think it worth spending time on during the talk.
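For the curious, here's a sketch of how [StructLayout] can produce a misaligned field. The struct and its name are hypothetical, purely for illustration - the int starts at byte offset 1, so it is not aligned to a 4-byte boundary:

```csharp
using System;
using System.Runtime.InteropServices;

// LayoutKind.Explicit lets us place fields at exact byte offsets,
// which is how P/Invoke structs can end up misaligned.
[StructLayout(LayoutKind.Explicit)]
struct Misaligned
{
    [FieldOffset(0)] public byte Pad;
    [FieldOffset(1)] public int Value;   // offset 1: not 4-byte aligned
}

class AlignmentSketch
{
    static void Main()
    {
        // Prints the marshalled byte offset of 'Value' within the struct.
        Console.WriteLine(Marshal.OffsetOf(typeof(Misaligned), "Value"));
    }
}
```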

Before applying the interlocked fix, first we can see what happens by simply applying a lock.

value++; becomes lock(x){ value++; }

and we get this output:

Running... Press enter to quit
Worker2 done at 3401ms
Worker1 done at 5539ms

Let's compare that with Interlocked Increment:

value++; becomes Interlocked.Increment(ref value);

Here's the output:

Running... Press enter to quit
Worker2 done at 2269ms
Worker1 done at 2276ms

Hooray! No more bugs. You'll notice, however, that it's around twice as fast as using a lock, but still much slower than no synchronization at all. Why does it behave like this? Let's dig in and find out.

If we view the disassembly produced by a call to Interlocked.Increment, we see this:

lock inc dword ptr ds:[00B2328Ch]

  • Our value variable is static, so it will be stored at a fixed location in memory. This location happens to be 00B2328C for me.
  • The dword ptr bit before it is a Size Directive ( see this page and scroll down to "Size Directives" ) - we need to tell the inc instruction how big the thing we're incrementing is - dword ptr basically means 32 bits.
  • The inc instruction does what it says it does... increments a value.

The magic comes from the lock prefix. Here's a stackoverflow question with a good answer explaining it. Basically what this does is tell the CPU to lock the memory bus for the duration of the operation. As such, the read-increment-write operation can't be interrupted.

So essentially, the Interlocked.XYZ operations really are just using locks - but they are CPU-level hardware locks, rather than the normal software locks we're used to. Pretty cool, huh? It also explains why the Interlocked version takes longer than the unsynchronized one: all that locking and unlocking of the memory bus takes time, and if one CPU locks it, the other has to wait, just like with a normal lock.

So, what about the lock version? What makes it twice as slow as Interlocked?
Microsoft makes available the Shared Source CLI 2.0. This is a "reference implementation" of .NET version 2.0, and it's sufficiently similar to .NET 4.0 and 4.5 in most areas that we can consider it to be pretty much correct. Think of it as the source code for the core native C++ parts of .NET. The garbage collector, JIT compiler, etc.

If you download this and dig through it, you'll find the code that implements the native CLR lock in \clr\src\vm\syncblk.cpp. I'll spare you the effort of having to parse it all - but what it comes down to is that in the best-case scenario (no lock contention, etc), a .NET lock must do at least one InterlockedCompareExchange to acquire the lock, and another InterlockedCompareExchange to release it.

There's obviously a lot more going on, but it's pretty easy to infer from that why the performance is different. A lock is doing two interlocked operations, and Interlocked.Increment is only doing one. This lines up quite nicely with our observed results.
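To make "acquiring a lock costs an interlocked operation" concrete, here's a toy spin lock built on Interlocked.CompareExchange. This is a sketch of the idea only, not how the real CLR lock is implemented (the real one also spins politely, falls back to an OS event under contention, tracks ownership, and so on):

```csharp
using System;
using System.Threading;

class SpinLockSketch
{
    static int lockTaken;      // 0 = free, 1 = held
    static int counter;

    static void Acquire()
    {
        // Interlocked.CompareExchange(ref loc, value, comparand):
        // atomically sets loc to 1 only if it currently holds 0,
        // and returns whatever value was there before.
        while (Interlocked.CompareExchange(ref lockTaken, 1, 0) != 0)
        {
            // someone else holds the lock; spin until it looks free
        }
    }

    static void Release()
    {
        // Hand the lock back with another interlocked operation.
        Interlocked.Exchange(ref lockTaken, 0);
    }

    static void Main()
    {
        var threads = new Thread[2];
        for (int t = 0; t < threads.Length; t++)
        {
            threads[t] = new Thread(() =>
            {
                for (int i = 0; i < 100000; i++)
                {
                    Acquire();
                    counter++;   // safe: protected by the spin lock
                    Release();
                }
            });
            threads[t].Start();
        }
        foreach (var th in threads) th.Join();
        Console.WriteLine(counter);
    }
}
```

Without Acquire/Release around the increment, this program would lose updates just like the race-condition sample; with them, two threads of 100,000 increments always total 200,000.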

I hope you enjoyed this or learned something - I know I certainly had fun figuring it all out. I will follow this post up with more in-depth details of two other causes of threading bugs: compiler optimizations, and memory re-ordering.