Wednesday, September 05, 2012

TechEd 2012 background 2 - Compiler optimizations and Volatile

Note: The goal of this post is to provide more in-depth information and references for attendees about a presentation I'm giving at Tech Ed New Zealand 2012.
The session is DEV402 - Multithreaded programming in .NET from the ground up.

This second post shows one possible effect of compiler optimizations on a C# program, and explains how to fix it using the oft-misunderstood volatile keyword.

Here's a sample program which illustrates this.
This program looks suspiciously similar to one of the samples in Joseph Albahari's threading book, because that was by far the best piece of code I found or could invent to illustrate the issue. All credit due to Joseph for his great work.

using System;
using System.Threading;

class Program
{
    static int stopFlag = 0;

    static void ThreadProc()
    {
        Console.WriteLine("{0}: worker thread start", DateTime.Now.TimeOfDay);
        bool toggle = false;

        while (stopFlag == 0)
            toggle = !toggle;

        Console.WriteLine("{0}: worker thread done", DateTime.Now.TimeOfDay);
    }

    public static void Main(string[] args)
    {
        stopFlag = 0;
        var t = new Thread(ThreadProc);
        t.Start();

        Thread.Sleep(1000);
        stopFlag = 1;
    }
}

In summary, it's pretty simple. The main thread spins up a worker thread, sleeps for a second, then sets the stopFlag. The worker thread sits in a loop waiting for the stopFlag to be set, then exits.

If you run a Debug build of this code, or launch a 32-bit Release build with the debugger attached, you'll get this output:

10:38:25.0815053: worker thread start
10:38:26.0955150: worker thread done

However, if you launch a 32-bit Release build without the debugger attached, you'll see this:

10:51:47.1955922: worker thread start

The worker thread won't stop, and if you launch Task Manager you'll see that the app is thrashing away using an entire CPU. What's going on here?

To answer this, we need to attach the debugger to the already-running program, pause it and jump over to the worker thread and see what it's up to. The process for this in Visual Studio 2010/2012 goes like this:

  • From the Debug menu, choose Attach to Process...
  • Scroll down the list, select your application, and click Attach
  • Once the attach dialog closes, pause the application by clicking the pause button on the toolbar, or selecting Break All from the Debug menu. This stops the program looping and lets us look at what it's up to
  • Bring up the Threads window (Debug / Windows / Threads) and double-click on the worker thread

We should now see the current line of code that is being executed by this thread, which is

while (stopFlag == 0)

So, we're still in the loop waiting for the stopFlag to be set. Yet if you mouse over the stopFlag variable, or add it to the Watch window, you'll see that it has a value of 1.

The flag has been set, but our loop is still going. What gives?
To answer this, we need to bring up the Disassembly window (Debug / Windows / Disassembly). This window shows us the raw x86 machine code that the JIT compiler has generated, which is the actual code being executed on the CPU.

The VS disassembly window will print each line of C# source code (if it has it), followed by the x86 assembly. For the line we're stuck on, we get this:

while (stopFlag == 0)
0000004a  mov         eax,dword ptr ds:[007A328Ch] 
0000004f  test        eax,eax 
00000051  jne         00000057 
00000053  test        eax,eax 
00000055  je          00000053 

If you know how to read x86 assembly, I congratulate you. I only know the basics, but enough that I can walk you through it and show what's going on.

So first, this line:

0000004a mov eax,dword ptr ds:[007A328Ch]

  • The instruction is mov, which is x86 for "set something". The first parameter is the target, and the second is the source.
  • The first parameter is eax, which refers to the EAX register inside the CPU. Intel x86 processors have four classic general-purpose registers - EAX through EDX - that are used to perform the vast majority of work that happens in an application.
  • The second parameter is dword ptr ds:[007A328Ch]. The dword ptr bit is a size directive (see this page and scroll down to "Size Directives"), and it's needed to tell the CPU how many bytes we should be moving; dword ptr simply means 32 bits. The ds: prefix refers to the data segment register, which you can safely ignore in a flat-memory Windows process. The most important part is this: [007A328Ch]. The square brackets mean "the value in memory at the given address". Our stop flag is static, so it's stored at a fixed memory location, in my case 007A328C

Translated: "Fetch 32 bits from memory at location 007A328C, and put them in the EAX register." In other words: load our stopFlag into the CPU.

The next part is this:
0000004f test eax,eax
00000051 jne 00000057

  • The test eax,eax is basically asking the question "Does EAX contain zero?"
  • The jne 00000057 is the Jump-if-not-equal instruction. It translates to "If it was NOT zero, jump to address 00000057"

We know however, that when the loop starts, the flag was in fact zero, so we'll get past this block and onto the next one:

00000053 test eax,eax
00000055 je 00000053

  • Again, "Does EAX contain zero?"
  • The je 00000053 is the Jump-if-equal instruction. It translates to "If it WAS zero, jump to address 00000053", which is the start of our loop

Putting this all together, we end up with this sequence of events:

  • Load the stopFlag from memory into EAX
  • Loop waiting for EAX to become non-zero

The problem is that no code ever sets EAX again once we enter the loop.

What's going on here? The answer is that the C# compiler and the JIT compiler only consider the single-threaded case when doing optimizations. From the point of view of ThreadProc, we never set the stopFlag, and we never call any other functions, so there is no possibility that the stopFlag can change. The JIT compiler applies this reasoning and decides that it only needs to read the stopFlag a single time, leading to this infinite loop!
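In C# terms, the JIT has effectively rewritten the worker loop like this (a hand-written sketch of the optimization, not actual compiler output):

```csharp
// What we wrote:
//     while (stopFlag == 0)
//         toggle = !toggle;
//
// What the JIT effectively runs: the read of stopFlag is hoisted
// out of the loop into a register, and performed only once.
int eax = stopFlag;      // single load from memory (the mov instruction)
while (eax == 0)
{
    // eax is never reloaded, so if it started out as 0,
    // this loop can never terminate.
}
```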

To fix it, we simply mark the stopFlag variable as volatile.
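The only change is to the field declaration:

```csharp
static volatile int stopFlag = 0;
```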

This causes the assembly to change to this:

0000004a cmp dword ptr ds:[0088328Ch],0
00000051 jne 0000005C
00000053 cmp dword ptr ds:[0088328Ch],0
0000005a je 00000053

The key difference is that [0088328Ch] now appears in the repetitive part of the loop: on each iteration we go to memory and re-read the value. So when the stopFlag gets set after a second, the loop sees the new value and the program behaves correctly. Great!


There are a couple of things I'd like to go into more detail about:

Firstly, the weird double-check that the compiler generates for loops.
When the JIT compiler generates a loop it always seems to use this pattern:

  • First, check the opposite of the loop condition. If this is true, jump over the entire loop code
  • Now we enter the loop where we do the repeated action, and check the actual loop condition
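Written out with explicit gotos, the pattern corresponds roughly to this C# (an illustrative sketch of the control flow, not compiler output; the field is marked volatile here so the sketch itself doesn't exhibit the hoisting bug):

```csharp
static volatile int stopFlag = 0;

static void SpinUntilSet()
{
    // Check 1: the opposite of the loop condition.
    // If the flag is already set, jump over the loop entirely.
    if (stopFlag != 0) goto done;
loop:
    // Check 2: the hot loop re-tests only the positive condition.
    if (stopFlag == 0) goto loop;
done:
    return;
}
```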

Technically, there's no reason why we need to check it twice; we could get by with just the second part, where we check the positive loop condition. This pattern is commonly called loop inversion (or loop rotation): moving the test to the bottom means each iteration executes a single conditional jump, instead of a test at the top plus an unconditional jump back. I'm no expert, but I suspect it also helps with branch prediction.
My personal theory is that by splitting the loop into two conditions, the branch predictor can track them both separately.
If we bail out of the loop before it even gets started, the branch predictor won't need to worry about the subsequent code, but if we enter the second phase of the loop, the odds are good that we'll keep looping, so the branch predictor can use this information.

Secondly, the use of the cmp instruction rather than test in the loops.
This Stack Overflow question/answer sheds some light on the topic.

The test instruction is smaller (different x86 instructions take different numbers of bytes to encode) than cmp, so the compiler will prefer it over cmp where both would achieve the same result. test works by doing a bitwise AND on the two values, whereas cmp does subtraction.

In the first case, where the value has already been loaded into a register, ANDing it with itself is a nice fast way to check for zero. In the second case, where the value is read from memory on every iteration, using test would require either an extra register to hold the value, or reading the memory twice (once for each operand of test). Either of those would be slower than a single cmp of the memory operand against the constant 0.

Thirdly, what happened to the toggling inside the loop?
You may have noticed that the disassembly at no point shows anything to do with the loop body of toggle = !toggle;.

This is just the compiler optimizer doing its job.

toggle is a local variable; it's not returned, and it's not passed to any functions. Therefore, whatever we do to it can have no observable effect on anything. The compiler detects this, and simply removes it entirely.
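The same dead-code elimination applies to any local whose value never escapes a method. A minimal, hypothetical example (the names are mine, not from the post):

```csharp
static int Sum(int[] values)
{
    int wasted = 0;
    foreach (var v in values)
        wasted += v * v;    // 'wasted' is never read afterwards, so the
                            // JIT is free to delete this entire loop
    int total = 0;
    foreach (var v in values)
        total += v;
    return total;           // only 'total' has an observable effect
}
```

Sum(new[] { 1, 2, 3 }) returns 6 either way; the optimizer only removes work that nothing can observe.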

So, what does this mean about volatile?
There seems to be a lot of confusion around what the volatile keyword is supposed to do. There are many questions/answers and blog posts which attempt to explain it. Some say that it will insert memory barriers either before or after writes to the variable; some say that it causes reads and writes to be translated into calls to Thread.VolatileRead or Thread.VolatileWrite (both of which insert explicit barriers). But as we can see from our disassembly, on x86 it isn't inserting any such barriers; it's just doing plain, ordinary reads and writes.

I think a lot of this confusion comes from the MSDN volatile documentation. The first hit on Google for "C# Volatile Keyword" is the MSDN documentation, which states: "The system always reads the current value of a volatile object at the point it is requested, even if the previous instruction asked for a value from the same object. Also, the value of the object is written immediately on assignment."

In order for this to be true (particularly the "value is written immediately" part), one might infer that the compiler would insert memory barriers. Unfortunately, that page is the documentation for Visual Studio 2003. Subsequent revisions (2005, 2010, etc.) re-word the documentation to be less misleading, but unfortunately Google is still linking everyone to the old, out-of-date page! For more, see

Additionally, the ECMA CLI spec states that reads of volatile fields have acquire semantics, and writes to volatile fields have release semantics. This could also lead people to think that memory barriers are necessary, but ordinary reads and writes on x86 already have these semantics anyway, so the net effect of volatile on x86 (other than suppressing compiler optimizations) is... nothing! On ARM or Itanium, however, the JIT will insert memory barriers around reads and writes of volatile fields to give you the acquire/release semantics, as those CPUs don't guarantee them natively.
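If you prefer the barriers to be explicit rather than implied by the field declaration, the same loop can be written with Thread.VolatileRead and Thread.VolatileWrite instead (a sketch; the 100 ms sleep is shortened from the post's one second):

```csharp
using System;
using System.Threading;

class Program
{
    static int stopFlag = 0;   // note: NOT marked volatile

    static void ThreadProc()
    {
        // Thread.VolatileRead goes to memory on every iteration and has
        // acquire semantics, so the JIT can't hoist the read out of the loop.
        while (Thread.VolatileRead(ref stopFlag) == 0)
        {
            // spin
        }
    }

    static void Main()
    {
        var t = new Thread(ThreadProc);
        t.Start();
        Thread.Sleep(100);
        Thread.VolatileWrite(ref stopFlag, 1);   // matching release write
        t.Join();
        Console.WriteLine("worker stopped");
    }
}
```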
