Monday, October 08, 2012

TechEd 2012 background 3 - Memory Barriers

Note: The goal of this post is to provide more in-depth information and references for attendees about a presentation I'm giving at Tech Ed New Zealand 2012.
The session is DEV402 - Multithreaded programming in .NET from the ground up.

This third post shows the effect of the CPU re-ordering reads and writes to memory in a C# program, and explains how to fix it using Thread.MemoryBarrier.

This sample program basically fires 2 functions at the thread pool over and over. In theory we can't tell which will run first - it's effectively random. This program just tries to provide an empirical measurement.

using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

class Program
{
    static volatile int a = 0;
    static volatile int b = 0;
    static volatile bool a_ranFirst, b_ranFirst;

    static object gate = new object();

    static void ThreadA()
    {
        a = 1;
        //Thread.MemoryBarrier();
        if (b == 1)
            b_ranFirst = true;
    }

    static void ThreadB()
    {
        b = 1;
        //Thread.MemoryBarrier();
        if (a == 1)
            a_ranFirst = true;
    }

    static void Main(string[] args)
    {
        int aFirst = 0, bFirst = 0;
        int together = 0;
        int WTF = 0;

        var sw = new Stopwatch(); sw.Start();

        const int Iterations = 500000;

        for (int i = 0; i < Iterations; i++)
        {
            a = b = 0;
            a_ranFirst = b_ranFirst = false;
            Parallel.Invoke(ThreadA, ThreadB);

            if (a_ranFirst && !b_ranFirst)
                aFirst++;
            else if (!a_ranFirst && b_ranFirst)
                bFirst++;
            else if (a_ranFirst && b_ranFirst)
                together++;
            else
                WTF++; // neither thread saw the other's write
        }
        Console.WriteLine("Done in {0}ms", sw.ElapsedMilliseconds);

        Console.WriteLine("Thread A ran first: {0} times ({1} percent)", aFirst, ((double)aFirst / (double)Iterations) * 100);
        Console.WriteLine("Thread B ran first: {0} times ({1} percent)", bFirst, ((double)bFirst / (double)Iterations) * 100);
        Console.WriteLine("Ran together: {0} times ({1} percent)", together, ((double)together / (double)Iterations) * 100);
        Console.WriteLine("WTF: {0} times ({1} percent)", WTF, ((double)WTF / (double)Iterations) * 100);
    }
}

If I run this as-is, I get something like this:
Done in 1571ms
Thread A ran first: 379388 times (75.8776 percent)
Thread B ran first: 31140 times (6.228 percent)
Ran together: 59 times (0.0118 percent)
WTF: 89413 times (17.8826 percent)
76% of the time, A runs first (this makes sense given that it's the first argument to Parallel.Invoke), 6% of the time B runs first, and a very small percentage of the time they run together in lockstep. However, 18% of the time, something else is happening. What's going on?

To break this down, the important operations that happen on each thread are:
  • Store the value 1 to the memory location A (or B for thread B)
  • Load the value from the other memory Location
  • Figure out whether this thread was first or second based on what we read
  • Set a flag to indicate whether this thread was first or second
The first 2 steps are the key to explaining the bug. Here's what we might expect to be happening:
  • When "A runs first", thread A performs both the store to A and load from B before thread B gets started.
  • When "B runs first", thread B performs both the store to B and load from A before thread A gets started.
  • When they run together, thread A stores, thread B stores, then they both load

So, what about the Fourth, "WTF" scenario? We've already marked the a and b variables as volatile, so this should tell the compiler to leave them alone. Something else is happening, but what?

Let's have a look at the assembly that's getting generated:
a = 1;
00000000  mov         dword ptr ds:[0111328Ch],1 
if (b == 1)
0000000a  cmp         dword ptr ds:[01113290h],1 
00000011  jne         0000001A 
b_ranFirst = true;
00000013  mov         byte ptr ds:[01113295h],1 
0000001a  ret 

A quick run through:
  • The first mov operation is writing 1 to memory location 0111328C (where the compiler has stored a)
  • The next cmp/jne instructions read 01113290 (where the compiler has stored b), and jump to the end of the function if it's not set to 1.
  • If it doesn't get jumped over, the final mov operation is writing 1 to location 01113295 (where b_ranFirst lives)

This all adds up - the instructions we'd logically expect are indeed there, and in the order we'd expect. Clearly neither the C# compiler nor the JIT compiler is at fault here... so what is?

Well, as it turns out, there are actually 3 more scenarios occurring.

  • Thread A performs its load from B; thread B then does its normal store/load; finally, thread A performs its store to A
  • Thread B performs its load from A; thread A then does its normal store/load; finally, thread B performs its store to B
  • Thread A loads, thread B loads, then they both store

In spite of the fact that our code very explicitly always stores before it loads, the operations are being performed in the wrong order.

This is Out-of-Order Execution in action. All Intel CPUs since 1995 (except the in-order Atom) do this, as do pretty much all AMD processors. I've got a Core i5 here, which certainly does this reordering.

Why reordering?

It takes many hundreds of CPU cycles to do a read or write to main memory. The number varies quite a lot depending on which CPU you have, but it's always significant. To compensate for this, out-of-order CPUs will rearrange reads and writes so that they can carry on doing work while waiting for data to arrive or for a write to complete. Writes can also be batched in some situations, so the CPU may hold off doing a write until a second one arrives and then issue them as a single batch.

When is reordering a factor?

For the majority of code, reordering isn't an issue.
While it is executing code, a CPU keeps track of whatever re-ordering it does, and makes sure that all operations appear to execute in the order you'd expect. If it moves something before or after another operation, it will fix up the result before the next thing needs to access it.

In computers with only a single CPU core, all code is effectively single-threaded like this. While the operating system can and does interrupt and switch the running thread unpredictably, when it does so it inserts appropriate instructions (memory barriers) which cause any re-ordered operations to be "flushed" before the thread is context-switched out.

The problem arises in multi-CPU systems.
While each CPU is keeping track of and fixing up all its shenanigans, it's not telling the other CPUs what it's doing. When other CPUs read or write, they don't get this "fixing up" logic applied; they see reads and writes in the actual (rearranged) order in which they happen.

OK, so how do we fix it?

The first thing we might like to do is somehow tell the CPU to turn this reordering off. Unfortunately (or fortunately, depending on your point of view), you can't. Furthermore, there's no tracing or other kind of diagnostics to tell you when you're being affected by it. Instead, you have to resort to good old logic and knowledge: you have to know what the reordering rules are for your CPU, analyze the code, and work out whether reordering could possibly cause your code to fail. If it could, you have to defend against it.

And how do we defend against it? This is where the Memory Barrier comes in. A memory barrier acts as a "fence" or "line in the sand", or other such metaphor. Basically, it's a wall that the CPU can't move reads or writes through.
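Applied to the sample program, the fix is to uncomment the Thread.MemoryBarrier() calls, putting a full barrier between each thread's store and its subsequent load:

```csharp
static void ThreadA()
{
    a = 1;
    Thread.MemoryBarrier(); // the store to a can't move down past this point...
    if (b == 1)             // ...and the load of b can't move up before it
        b_ranFirst = true;
}

static void ThreadB()
{
    b = 1;
    Thread.MemoryBarrier(); // same fence between the store to b and the load of a
    if (a == 1)
        a_ranFirst = true;
}
```

With both barriers in place, the "WTF" counter should drop to zero: it's no longer possible for both threads to load before either store becomes visible, so at least one thread always sees the other's write.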

There are 3 kinds of memory barriers. Most of the information that I've been able to find is quite confusingly worded, so I'll do my best to be clear:

  • Read barriers (also known as "Load" or "Acquire" barriers/fences). These prevent loads from memory from moving upwards in the instruction sequence.
  • Write barriers (also known as "Store" or "Release" barriers/fences). These prevent writes to memory from moving downwards in the instruction sequence.
  • Full barriers. These prevent all re-ordering.
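In .NET, Thread.MemoryBarrier() generates a full barrier, which is the kind this program needs: the bug is a store being reordered with a later load, and neither a read barrier nor a write barrier on its own prevents that. That's also why marking the variables volatile didn't help - C#'s volatile gives reads acquire semantics and writes release semantics, but not a full barrier. As a sketch of an alternative (not the only option), an Interlocked operation also implies a full barrier, so thread A's half of the fix could instead be written as:

```csharp
static void ThreadA()
{
    // Interlocked.Exchange implies a full barrier, so the store to a
    // becomes visible before the load of b below can execute.
    Interlocked.Exchange(ref a, 1);
    if (b == 1)
        b_ranFirst = true;
}
```

An interlocked operation is typically more expensive than a plain store, but on x86 both it and Thread.MemoryBarrier() ultimately come down to a locked instruction or fence, so the cost is in the same ballpark.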