Tuesday, September 04, 2012

TechEd 2012 background 1 - Race conditions and Interlocked operations

Note: The goal of this post is to provide more in-depth information and reference material for attendees of a presentation I'm giving at Tech Ed New Zealand 2012.
The session is DEV402 - Multithreaded programming in .NET from the ground up.

This first post goes over what happens in a basic race condition, caused by the interruption and interleaving of two threads accessing a shared variable.

Here is a buggy C# program. Feel free to copy/paste it into Visual Studio and run it yourself.
I'm guessing from what I've seen that probably 98% of the .NET code people write runs in 32-bit mode, so that's what I'm focusing on.

using System;
using System.Threading;

public static class BasicProgram
{
    // Shared variable that both worker threads increment without any synchronization
    public static int value = 0;

    const uint Iterations = 100000000;

    static System.Diagnostics.Stopwatch watch = new System.Diagnostics.Stopwatch();

    static Object x = new object(); // used later, when we apply the lock-based fix

    static void Main(string[] args)
    {
        watch.Start();
        new Thread(Worker1) { IsBackground = true }.Start();
        new Thread(Worker2) { IsBackground = true }.Start();

        Console.WriteLine("Running... Press enter to quit");
        Console.ReadLine();
    }

    static void Worker2(object _)
    {
        for (uint i = 0; i < Iterations; i++)
        {
            value++;
        }

        Console.WriteLine("Worker2 done at {0}ms", watch.ElapsedMilliseconds);
    }

    static void Worker1(object _)
    {
        var oldValue = value;
        var newValue = value;

        for (uint i = 0; i < Iterations; i++)
        {
            oldValue = value;
            value++;
            newValue = value;

            // Sanity check: value only ever gets incremented, so it should never appear to go backwards
            if (newValue < oldValue)
                Console.WriteLine("i++ went backwards! oldValue={0}, newValue={1}!", oldValue, newValue);
        }

        Console.WriteLine("Worker1 done at {0}ms", watch.ElapsedMilliseconds);
    }
}


The bug manifests itself by hitting the sanity-check in Worker1. Usually for me this happens anywhere between 1 and 10 times, but sometimes it doesn't happen at all. The loop runs 100 million times, yet the bug only happens rarely - this is a good example of just how random and hard to reproduce these kinds of threading issues can be.

Here's the output for a random run on my laptop (dual-core Intel i5 with hyperthreading):

Running... Press enter to quit
i++ went backwards! oldValue=35777960, newValue=35777925!
Worker2 done at 266ms
Worker1 done at 402ms

This is pretty much the standard textbook race condition.

I'd imagine most developers with a bit of experience will have seen this or something similar before, so the purely educational value here isn't that high - but my main goal is to use it as a starting point.
It teaches (or reinforces) several fundamental points that I want people to have in their minds, so I can build on them to explain more complex threading bugs.

These fundamental lessons are:

  • In order to do any work, the computer has to Read from memory, Modify the value inside the CPU, then Write back to memory. These are 3 (or more) discrete steps.
  • Because of this, simple-looking things such as value++ actually turn out to be multi-step operations (see the sketch just after this list).
  • The OS will interrupt your threads at arbitrary points. If a thread is suspended in the middle of a multi-step operation, another thread can come along and modify data out from underneath it. This is the root cause of the bug.
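
To make that concrete, here's roughly what value++ expands to once you write the steps out (my illustration - the actual machine code is shown later in the post):

int temp = value;   // 1. Read the current value from memory into the CPU
temp = temp + 1;    // 2. Modify the value inside the CPU
value = temp;       // 3. Write the result back to memory

If the OS suspends the thread after step 1 and the other thread performs a few thousand increments in the meantime, step 3 will overwrite all of them with a stale result - which is exactly what the sanity check in Worker1 catches.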

You can fix this by using a lock - the idea, of course, is that if you acquire a lock before doing the read-modify-write, other threads must wait for you to release the lock before they can touch the shared value. If a thread gets suspended in the middle of the read-modify-write, other threads simply have to wait until it resumes and releases the lock, so the whole "modifying data out from underneath it" scenario can't happen.
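
Under the covers, the C# lock statement is just compiler sugar over System.Threading.Monitor - roughly equivalent to this sketch (the exact compiler-generated code differs slightly between C# versions):

Monitor.Enter(x);        // acquire the lock; blocks if another thread already holds it
try
{
    value++;             // the read-modify-write is now protected
}
finally
{
    Monitor.Exit(x);     // always release the lock, even if an exception was thrown
}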

This is great, and I want to make the point that normally this is exactly the right thing to do, but there is also another way to solve this problem - by using Atomic Operations.

What are atomic operations? All CPUs read and write memory in fixed-size chunks:

  • A 32-bit processor (or a 64-bit processor running in 32-bit mode) will read or write memory in blocks of 32 bits
  • A 64-bit processor running in 64-bit mode will read or write memory in blocks of 64 bits

A read or write of one of those naturally-sized, aligned blocks happens in a single indivisible step - in other words, it's atomic. Handily, this is also the same size as a pointer in native C or C++ code on those platforms, and .NET object references are just pointers. (You don't get to see the pointers directly - they are managed for you, which is why .NET is a "managed" language - but they're still there in exactly the same way they would be in a C++ program.)
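
As a small aside of my own (not something covered in the talk), here's what that pointer-sized guarantee buys you in practice - a reader can see an old reference or a new one, but never a torn mixture of the two:

static class ReferenceAtomicityDemo
{
    static object shared = "first";

    static void Writer()
    {
        // Reference assignments are pointer-sized, aligned writes, so each one is atomic
        for (int i = 0; i < 1000000; i++)
            shared = (i % 2 == 0) ? "first" : "second";
    }

    static void Reader()
    {
        // We might read a stale reference, but always a whole, valid one
        for (int i = 0; i < 1000000; i++)
        {
            object snapshot = shared;
        }
    }
}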

On top of that, the System.Threading.Interlocked class provides a family of atomic read-modify-write operations - Increment, Decrement, Add, Exchange, CompareExchange and friends.
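
Here's a quick sketch of a few of those in action, using the post's static value field (just showing the API surface, not part of the original program):

Interlocked.Increment(ref value);                          // atomically add 1; returns the new value
Interlocked.Add(ref value, 10);                            // atomically add 10
int previous = Interlocked.Exchange(ref value, 0);         // atomically set to 0, returning what was there
int seen = Interlocked.CompareExchange(ref value, 1, 0);   // set to 1 only if the current value is 0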

What I don't mention in the talk is memory alignment. Alignment affects atomicity: if your 32-bit int straddles two of those blocks because it isn't properly aligned, the CPU has to read or write both blocks, and the access is no longer atomic.

I chose not to mention this because pretty much the only way to get unaligned memory access in .NET is by fiddling with a struct that uses the [StructLayout] attribute, which only really happens for P/Invoke calls.
The remaining 99.9% of .NET code is likely to have properly aligned memory access patterns, so I didn't think it was worth spending time on during the talk.
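
For the curious, here's a sketch of my own (not from the talk) showing how an explicit struct layout can produce an unaligned field:

using System.Runtime.InteropServices;

[StructLayout(LayoutKind.Explicit)]
struct MisalignedExample
{
    [FieldOffset(0)] public byte Header;
    [FieldOffset(1)] public int Value;  // starts at byte offset 1, so it is not 4-byte aligned
}

Reads and writes of Value may straddle two 4-byte blocks, and at that point the usual atomicity guarantee for a 32-bit int no longer applies.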

Before applying the Interlocked fix, let's first see what happens if we simply apply a lock.

value++; becomes lock(x){ value++; }

and we get this output:


Running... Press enter to quit
Worker2 done at 3401ms
Worker1 done at 5539ms

Let's compare that with Interlocked Increment:

value++; becomes Interlocked.Increment(ref value);

Here's the output:

Running... Press enter to quit
Worker2 done at 2269ms
Worker1 done at 2276ms

Hooray! No more bugs. You'll notice, however, that while it's around twice as fast as using a lock, it's still much slower than no synchronization at all. Why does it behave like this? Let's dig in and find out.

If we view the disassembly produced by a call to Interlocked.Increment, we see this:

lock inc dword ptr ds:[00B2328Ch]

  • Our value variable is static, so it will be stored at a fixed location in memory. This location happens to be 00B2328C for me.
  • The dword ptr bit before it is a Size Directive (see this page and scroll down to "Size Directives") - we need to tell the inc instruction how big the thing we're incrementing is, and dword ptr basically means 32 bits.
  • The inc instruction does what it says it does - it increments the value.

The magic comes from the lock prefix. Here's a stackoverflow question with a good answer explaining it. Basically, it tells the CPU to lock the memory bus for the duration of the operation, so the read-increment-write can't be interrupted.

So essentially, the Interlocked.XYZ operations are using locks too, but they are CPU-level hardware locks rather than the normal software locks we're used to. Pretty cool huh. It also explains why the Interlocked version takes longer than the unsynchronized one - all that locking and unlocking of the memory bus takes time, and if one CPU has the bus locked, the other one has to wait, just like with a normal lock.

So, what about the lock version? What makes it twice as slow as Interlocked?
Microsoft makes available the Shared Source CLI 2.0. This is a "reference implementation" of .NET 2.0, and it's sufficiently similar to .NET 4.0 and 4.5 in most areas that we can treat it as representative. Think of it as the source code for the core native C++ parts of .NET - the garbage collector, the JIT compiler, and so on.

If you download this and dig through it, you'll find the code that implements the native CLR lock in \clr\src\vm\syncblk.cpp. I'll spare you the effort of having to parse it all - but what it comes down to is that in the best-case scenario (no lock contention, etc), a .NET lock must do at least one InterlockedCompareExchange to acquire the lock, and another InterlockedCompareExchange to release it.

There's obviously a lot more going on, but it's pretty easy to infer from that why the performance is different. A lock is doing two interlocked operations, and Interlocked.Increment is only doing one. This lines up quite nicely with our observed results.
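
To make that concrete, here's a toy sketch of a spinlock built on CompareExchange - nothing like the real syncblk.cpp code, just an illustration of why taking and releasing a lock costs at least one interlocked operation each:

using System.Threading;

static class ToySpinLock
{
    static int lockState = 0;   // 0 = free, 1 = held

    public static void Enter()
    {
        // Interlocked op #1: atomically flip lockState from 0 to 1; spin until we win the race
        while (Interlocked.CompareExchange(ref lockState, 1, 0) != 0)
        {
            Thread.SpinWait(1);   // back off briefly and try again
        }
    }

    public static void Exit()
    {
        // Interlocked op #2: hand the lock back so another thread's CompareExchange can succeed
        Interlocked.Exchange(ref lockState, 0);
    }
}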

I hope you enjoyed this or learned something - I know I certainly had fun figuring it all out. I'll follow this post up with more in-depth details of two other causes of threading bugs: compiler optimizations and memory re-ordering.

5 comments:

Anonymous said...

I think you have the right idea, but I think you may be missing some details. The error that your code is running into is driven by the fact that you are reading and storing a value, then incrementing the original, and then reading and storing the new value. These three operations are the ones you need to be concerned about, not the actual reading and writing of the integer values in single statements.

For instance, value++, is already an atomic operation in c#. You do not need to worry about Interlocking to increment. As per the MSDN documentation, reads and writes to standard integers are by default atomic.

What you are experiencing is cache decoherence. The Interlock forces a read of the value from memory and write to memory rather than the cache, thus making sure that you get the most updated copy all the time instead of just some of the time. The same can be achieved with marking your value variable volatile.

All in all, by introducing a mutex on one line instead of the entire operation, you've simply made a race condition less likely, not impossible. To guarantee no race condition, you would have to put a lock around the entirety of the following:

oldValue = value;
value++;
newValue = value;

Still, good post! My comments are c# and .net specific but from a generic standpoint, everything you say is true. There is some good coverage of the fundamentals here and a good lesson for developers of any experience level.

Anonny Mouse said...

Volatile will not fix this.

Use LINQPad to try it, it slows things down enough that I can get the error every time, add volatile to the
"value" and it will still happen

Anonymous said...

No, volatile shouldn't, and wouldn't fix it, that is just my point. The entire block needs a lock, not just one line.

insanity said...

Hi Darrell...

you're actually flat-wrong regarding the ++ operation being thread safe. See the stackoverflow discussion here: http://stackoverflow.com/questions/4628243/is-the-operator-thread-safe

Interlocked.Increment is the way to go...

Orion Edwards said...

Thanks for the feedback Darrell and others

Regarding the atomicity of value++, without any Interlocking or locks, the JIT compiler simply emits inc dword ptr ds:[00B2328Ch] without the lock prefix.

This is a single CPU instruction, so at first glance it may appear "atomic", but in actual fact the 3-step load,inc,store is still happening.

Regarding cache decoherence - in theory this could happen, except that I'm running on an Intel Core i5, which (along with most other ordinary CPUs) implements hardware cache coherency, and that would rule it out.