Imperfect C++ Practical Solutions for Real-Life Programming
By Matthew Wilson
Chapter 10.  Threading


10.1. Synchronizing Integer Access

Since the contents of the processor registers are stored in the thread context each time a thread switch occurs, the most basic form of synchronization is that which ensures that access to a single memory location is serialized. If the memory to be read at that location is a single byte, the processor will access it atomically. The same applies to the reading of larger values, according to the rules of the given architecture. For example, a 32-bit processor will ensure that access to a 32-bit value is serialized, so long as the value is aligned on a 32-bit boundary. Serialization of access to nonaligned data may or may not be provided by any given processor. It's hard to imagine a workable architecture where such atomic operations were not provided.

The processor's guarantees are fine if you want to read or write platform-sized integers atomically, but there are many operations that you would wish to have atomic that are not so because they are actually several operations in one. The classic one of these is incrementing or decrementing variables. The statement



++i;



is really only a shorthand for



i = i + 1;



The increment of i involves a fetch of its value from memory, adding 1 to that value, and then storing the new value back in i's memory location, known as Read-Modify-Write (RMW) [Gerb2002]. Since this is a three-step process, any other thread that is concurrently attempting to manipulate the value of i can cause the result in either or both thread(s) to be invalidated. If both threads are attempting to increment i, it is possible for the threads to be out of step, as in:



          Thread 1                 Thread 2

          load i from memory
                                   load i from memory
                                   increment value
          increment value
                                   store new value to i
          store new value to i



Both threads loaded the same value from memory, and when Thread 1 stores the incremented value, it writes over the result of Thread 2. The effect is that one of the increments is lost.

In practice, different processors will have different operations for conducting these steps. For example, on the Intel processor it could be implemented as follows:



mov eax, dword ptr [i]
inc eax
mov dword ptr [i], eax



or, more succinctly, as



add dword ptr [i],1



Even in the second form, where there is a single instruction, the operation is not guaranteed atomic on multiprocessor systems, because it is logically equivalent to the first, and a thread on another processor may interleave in the manner described above.

10.1.1 Operating System Functions

Since atomic increments and decrements are the cornerstone of many important mechanisms, including reference counting, it is very important that there are facilities for conducting these operations in a thread-safe manner. As we will see later, atomic integer operations can often be all that is required to make nontrivial components thread safe,[2] which can afford considerable performance savings.

[2] The Synesis BufferStore component that I mentioned in Chapter 7 is one example, deriving its high speed from the avoidance of any kernel object synchronization.

Win32 provides the InterlockedIncrement() and InterlockedDecrement() system functions, which look like the following:

LONG InterlockedIncrement(LONG *p);
LONG InterlockedDecrement(LONG *p);



These implement preincrement and predecrement semantics. In other words, the return value reflects the new value, rather than the old. Linux provides analogous functions [Rubi2001]: atomic_inc_and_test() and atomic_dec_and_test(). Similar functions may be available on a platform-specific basis.

Using such functions, we can now rewrite our initial increment statement in a thoroughly thread-safe manner:



atomic_inc_and_test(&i); // ++i



The implementation of such a function for the Intel processor would simply incorporate the LOCK instruction prefix, as in:[3]

[3] This is not the actual instruction(s), but I'm trying to be brief!



lock add dword ptr [i], 1



The LOCK instruction prefix causes a LOCK# signal to be expressed on the bus, and prevents any other threads from affecting that memory location for the duration of the ADD instruction. (Naturally, it's a lot more complex than this, involving cache lines and all kinds of magic jiggery-pokery, but logically it makes the instruction atomic with respect to any other threads/processors.[4])

[4] On multi-CPU machines, or machines with hyperthreading/multiple-core CPUs, threads may really be running in parallel, whereas on a single CPU threads only appear to execute simultaneously. Even on a single CPU, though, an interrupt can preempt a thread between the individual instructions of a read-modify-write sequence, so the sequence is not atomic unless it is a single (atomic) instruction.

The downside to applying locking semantics is a cost in speed. The actual costs will differ for different architectures: on Win32 the cost can be roughly 200–500% that of the nonlocked variant (as we'll see later in this chapter). Thus, it is not desirable simply to make every operation thread safe; indeed, the whole point of multithreading is to allow independent processing to take place concurrently.

In practice, therefore, one normally determines whether to use atomic operations dependent on a build setting. In UNIX, one tends to use the _REENTRANT preprocessor symbol definition to indicate to C and C++ code that the link unit is to be built for multithreading. In Win32 it is _MT or __MT__ or similar, depending on the compiler. Naturally, such things are abstracted into a platform/compiler-independent symbol, for example, ACMELIB_MULTI_THREADED, which is then used to select the appropriate operations at compile time.



#ifdef ACMELIB_MULTI_THREADED
  atomic_increment(&i);
#else /* ? ACMELIB_MULTI_THREADED */
  ++i;
#endif /* ACMELIB_MULTI_THREADED */



Since this is ugly stuff to be dotted around, it's also common practice to abstract the operation into a common function within which the preprocessor discrimination can be encapsulated. Plenty of examples of this exist in common libraries, such as Boost and Microsoft's Active Template Library (ATL).

I should point out that not all operating systems provide atomic integer operations, in which case you may need to resort to an operating system synchronization object, for example a mutex, to serialize access to the integer, as we'll see later in the chapter.

10.1.2 Atomic Types

We've seen how we can simply use -- and ++ on the integer types in single-threaded contexts, but for multithreading we need to use operating system/architecture primitives. The downside of this approach is that even when we abstract the differences into a common function, say integer_increment, it relies on all uses of the integer being done atomically. It's not terribly difficult to forget one, in which case you could have a race condition in your application that's very difficult to diagnose.

C++ is a language that aims to provide uniformity of syntax by allowing user-defined types to appear as built-in types. So why not make atomic types that look like built-in types, except that they operate atomically, in all respects? There's no reason why not, and it's actually very simple to do:[5]

[5] The volatile qualifiers facilitate the use of volatile on the declaration of any such variables, which is a reasonably common practice in multithreaded code, since it prevents the compiler from caching the value in a register and thereby failing to synchronize it with the variable's actual memory location. We'll look at this in more detail in section 18.5.

Listing 10.1.

class atomic_integer
{
public:
  typedef atomic_integer class_type;
  typedef int            value_type;

  atomic_integer(int value)
    : m_value(value)
  {}

// Operations
public:
  class_type volatile &operator ++() volatile
  {
    atomic_increment(&m_value);
    return *this;
  }
  const class_type volatile operator ++(int) volatile
  {
    return class_type(atomic_postincrement(&m_value));
  }
  class_type volatile &operator --() volatile;
  const class_type volatile operator --(int) volatile;

  class_type volatile &operator +=(value_type const &value) volatile
  {
    atomic_postadd(&m_value, value);
    return *this;
  }
  class_type volatile &operator -=(value_type const &value) volatile;

private:
  volatile int m_value;
};



This is a clear area in which C++ makes threading easier and simpler. However, there is a question as to how much of the natural semantics of integer types can be made available. The code above shows how easy it is, given a library of atomic integer functions, to implement increment, decrement, addition, and subtraction. However, multiplication, division, logical operations, shifting, and the rest are considerably more complex to implement atomically, and most atomic integer libraries do not provide such operations. If you want them in your code, that'll have to be the author's get-out-of-jail-free card: an exercise for the reader.

The Boost atomic operations components take just this approach, providing platform-specific versions of a type, atomic_count, which provides only ++ and -- operations (atomic_increment/atomic_decrement), along with implicit conversion (atomic_read), so if you choose to opt out of anything much more complicated, you'll be in good company.
