10.3. Atomic Integer Performance

Before we move on to multithreading extensions (section 10.4) and Thread Specific Storage (section 10.5), I want to take a look at the performance aspects of the various atomic integer operation strategies.

10.3.1 Atomic Integer by Mutex

Where atomic integer operations are not provided by your operating system, you may need to resort to using a mutex to guard access within your own atomic integer API, as shown in Listing 10.3.

Listing 10.3.

namespace
{
  Mutex s_mx;

} // anonymous namespace

int atomic_postincrement(int volatile *p)
{
  lock_scope<Mutex>   lock(s_mx);

  return (*p)++; // note the parentheses: *p++ would increment the pointer
}

int atomic_predecrement(int volatile *p)
{
  lock_scope<Mutex>   lock(s_mx);

  return --*p;
}



The problem here is performance. Not only do you pay the sometimes considerable cost of going to the kernel in acquiring and releasing the mutex object, but you also have contention between the several threads wanting to perform their atomic operations simultaneously. Every atomic operation within your process involves the single mutex object, which naturally leads to a bottleneck.

I once witnessed a tragic attempt to ameliorate this cost by having a separate mutex for each atomic function. Unfortunately, this proved very successful in reducing waiting times. As I'm sure you've guessed, this was thoroughly tested on a single-processor Intel machine. As soon as the application was run on a multiprocessor machine, it fell in a heap.[9] Since each mutex protected the function, rather than the data, it was possible to have some threads incrementing a variable while another was decrementing it. All that was achieved was to prevent two threads doing the same thing to the same integer at the same time. As with so many things in multithreading, you cannot have confidence in your code until you've tested it on a multiprocessor machine.

[9] It would suffer the same fate on modern hyperthreading single processor machines.
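
To make the failure concrete, here is a sketch of the broken pattern just described; the names are illustrative, not from the code base in question:

namespace
{
  Mutex s_mxInc; // one mutex per *function*: this is the bug
  Mutex s_mxDec;
}

int atomic_postincrement(int volatile *p)
{
  lock_scope<Mutex> lock(s_mxInc);
  return (*p)++;
}

int atomic_predecrement(int volatile *p)
{
  lock_scope<Mutex> lock(s_mxDec); // a different mutex, so an increment and a
  return --*p;                     // decrement of the same variable can still
}                                  // interleave on a multiprocessor machine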

Despite that abject failure, there is a way to spread the contention across more than one mutex in order to reduce the bottleneck. What is required is to base the mutex selection on a property of the variable being manipulated. Well, there's only one attribute of it we can rely on: its address. (We can't very well know its value, since that is going to be changing.)

This looks something like the following:



namespace
{
  Mutex       s_mxs[NUM_MUTEXES];

} // anonymous namespace

int __stdcall Atomic_PreIncrement_By(int volatile *v)
{
  size_t            index = index_from_ptr(v, NUM_MUTEXES);
  lock_scope<Mutex> lock(s_mxs[index]);

  return ++*(v);
}



The function index_from_ptr() provides a deterministic mapping from an address to an integer in the range [0, NUM_MUTEXES). It would not be suitable simply to perform modulo division of the address, since most systems align data on boundaries of 4, 8, 16, or 32 bytes, so the low-order bits would contribute little or nothing. Something like the following might be suitable:



inline size_t index_from_ptr(void volatile *v, size_t range)
{
  return (((unsigned)v) >> 7) % range;
}



In testing on my own Win32 machine, I found a shift of 7 to give better performance than other values, but that's unlikely to translate to other platforms, so you'd have to tune it for your own.

10.3.2 Run Time Architecture Dispatching

I'd like to show you a little trick for enhancing the performance of atomic integer operations on the Intel platform. As we've learned, the Intel processor will conduct a single-instruction RMW operation (such as ADD or XADD) uninterrupted, so on single-processor machines a bus lock is not required. Conversely, the bus must be locked for these operations on multiprocessor machines. Since it's much easier to build and ship a single version of code, it would be nice for our code to pay the performance cost of bus locking only when necessary. Since the number of instructions in either case is very small, we need a very efficient way of making the choice; otherwise the test itself would cost more than the savings gained. The solution looks like the simplified form[10] shown in Listing 10.4, which is compatible with most modern Win32 compilers.

[10] The full implementations for these functions are included on the CD.

Listing 10.4.

namespace
{
  static bool s_uniprocessor = is_host_up();

} // anonymous namespace

inline __declspec(naked) void __stdcall
  atomic_increment(sint32_t volatile * /* pl */)
{
  if(s_uniprocessor)
  {
    _asm
    {
      mov ecx, dword ptr [esp + 4]  ; address of the target variable
      add dword ptr [ecx], 1        ; single-instruction RMW; no bus lock
      ret 4
    }
  }
  else
  {
    _asm
    {
      mov ecx, dword ptr [esp + 4]  ; address of the target variable
      lock add dword ptr [ecx], 1   ; bus lock, needed on multiprocessors
      ret 4
    }
  }
}



Even if you're not familiar with Intel assembler, you should be able to see how simple the mechanism is. The variable s_uniprocessor is true on uniprocessor machines, and false on multiprocessor machines. If it's true, the increment is effected without a lock. If it's false, the bus lock is used. Any possible race conditions in the initialization of s_uniprocessor are unimportant, since a static bool is zero-initialized (false) before its dynamic initialization runs, so the default case is to apply the lock, which is benign.
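
The listing assumes an is_host_up() function that determines the processor count at load time. The full implementation is on the CD; a minimal sketch of the idea, assuming the standard Win32 GetSystemInfo() API, might look like this:

#include <windows.h>

inline bool is_host_up()
{
  SYSTEM_INFO si;
  ::GetSystemInfo(&si);
  return 1 == si.dwNumberOfProcessors; // true => uniprocessor host
}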

In the performance tests below, this is the mechanism used by the Synesis Atomic_* API and the WinSTL atomic_* functions.

10.3.3 Performance Comparisons

I've done a lot of talking about the various mechanisms available for atomic integer operations, so let's look at some facts. Before we do so, I'd like to stress that the figures in this section reflect the performance of Win32 operating systems on (single- and multi-) processor Intel architecture only. Other architectures and/or operating systems may have different characteristics.

I've examined seven strategies. For each there is a common global variable that is either incremented or decremented by the threads. The first strategy, Unguarded, does no locking, and simply increments or decrements the variable via the ++ or -- operators. The next two use the architecture-dispatching technique: the Synesis Atomic_* library functions and WinSTL's atomic_* inline functions. The fourth calls the Win32 Interlocked* system functions. The final three use a synchronization object (the Win32 CRITICAL_SECTION, the WinSTL spin_mutex, and the Win32 mutex kernel object) to guard access to the critical region, within which the ++ or -- operator is used to modify the variable. The results are shown in Table 10.2, which gives the total times for 10 million operations across 31 contending threads for each strategy. Since the two sets of figures were measured on different machines, each is also expressed relative to its own unguarded case.

Table 10.2.

                                    Uniprocessor Machine        SMP Machine
Synchronization Scheme              Absolute   % of             Absolute         % of
                                    (ms)       unguarded        (ms)             unguarded
-------------------------------------------------------------------------------------------
Unguarded ++ / --                        362        100%        525 (incorrect)       100%
Synesis Atomic_* API                     509        141%                   2464       469%
WinSTL atomic_* inline functions         510        141%                   2464       469%
Win32 Interlocked* API                  2324        642%                   2491       474%
Win32 CRITICAL_SECTION                  5568       1538%                 188235     35854%
WinSTL spin_mutex                       5837       1612%                   3871       737%
Win32 MUTEX                            57977      16016%                 192870     36737%


Deliberately, the test program that exercised the various locking mechanisms spawned an odd number of threads, such that when all threads had completed, the manipulated variables should each hold a large, nonzero value equal to the number of iterations. This provided a quick validation that the operations were indeed atomic. All of the cases shown, including the Unguarded manipulation on the uniprocessor machine, behaved correctly. Confirming what we know about the nonatomic nature of some single instructions on multiprocessors, the Unguarded/SMP case produced wildly different values on each run, indicating that threads were interrupting one another's RMW cycles.
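
The validation trick is easy to reproduce. The following is a minimal sketch, assuming Win32 threads and using the Interlocked API as the strategy under test; with 31 threads (16 incrementing, 15 decrementing) the final value equals the per-thread iteration count if, and only if, the operations are atomic:

#include <windows.h>
#include <process.h>
#include <stdio.h>

namespace
{
  LONG volatile s_value    = 0;
  const int     ITERATIONS = 100000; // per-thread iteration count
}

unsigned __stdcall thread_proc(void *increment)
{
  for(int i = 0; i < ITERATIONS; ++i)
  {
    if(NULL != increment)
    {
      ::InterlockedIncrement(&s_value); // substitute the strategy under test
    }
    else
    {
      ::InterlockedDecrement(&s_value);
    }
  }
  return 0;
}

int main()
{
  const int NUM_THREADS = 31; // odd, so the expected net result is nonzero
  HANDLE    threads[NUM_THREADS];

  for(int i = 0; i < NUM_THREADS; ++i)
  {
    // 16 incrementers, 15 decrementers: expected final value == ITERATIONS
    threads[i] = (HANDLE)::_beginthreadex( NULL, 0, thread_proc
                                         , (void*)(size_t)(i < 16), 0, NULL);
  }
  ::WaitForMultipleObjects(NUM_THREADS, threads, TRUE, INFINITE);
  for(int i = 0; i < NUM_THREADS; ++i)
  {
    ::CloseHandle(threads[i]);
  }

  printf("expected %d; got %ld\n", ITERATIONS, s_value);

  return 0;
}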

In terms of the performance, several things stand out. First, the relative cost of all mechanisms is higher on the multiprocessor machine, which indicates that multiprocessor caches are things that don't like being disturbed.

Second, as far as the architecture-dispatching mechanism (the Synesis Atomic_* API and the WinSTL atomic_* inline functions) is concerned, it's fair to say that it does a very good job on the uniprocessor system, costing only 22% as much as the Win32 Interlocked* system functions, and only 141% as much as the unguarded ++/-- operators. On the multiprocessor machine, the additional cost of the processor test, over and above the bus LOCK, is very acceptable, at only an extra 1% or so. I would say that if you're writing applications that need to work on both uniprocessor and multiprocessor machines, and you want to be able to ship a single version, you're likely to see significant benefits from using this dispatching technique.

The results show the well-known fact that mutexes, as kernel objects, represent a very costly way of implementing atomic operations, and you'd be crazy to use them for that purpose on Win32 systems.

Somewhat different from the mutex is the CRITICAL_SECTION. I don't know about you, but much of the wisdom I gleaned as I was learning multithreaded programming on the Win32 platform advocated the CRITICAL_SECTION as a much better alternative to the mutex. Indeed, it appears to be about ten times as fast on the uniprocessor system. However, it has about the same (dismal) performance as the mutex on the multiprocessor machine. Once again, you need to test your code on multiprocessor systems, in this case to verify your assumptions about the efficiency of the mechanisms you are using. I would say that the CRITICAL_SECTION is not a valid mechanism with which to get atomic operations; unlike the mutex, I've actually seen a lot of use of it for this purpose in clients' code bases.

You may wonder why anyone would use a spin mutex to implement atomic operations. Well, the atomic operations provided in Linux's <asm/atomic.h> are only guaranteed to provide a range of 24 bits. Furthermore, on some flavors of Linux, functions with the correct semantics (increment the value and return the previous value) are not available. By using the conceptually simpler write/set functions, full-range atomic operations can still be provided, without incurring too high a performance penalty.
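
For reference, the essence of a spin mutex is tiny. The following is a minimal sketch in Win32 terms, assuming InterlockedExchange(); the real winstl::spin_mutex differs in its details:

class spin_mutex
{
public:
  explicit spin_mutex(LONG volatile *p)
    : m_p(p)
  {}
  void lock()
  {
    // Atomically swap in 1 until the previous value is seen to be 0
    while(0 != ::InterlockedExchange(m_p, 1))
    {
      ::Sleep(0); // yield the remainder of the time slice
    }
  }
  void unlock()
  {
    ::InterlockedExchange(m_p, 0);
  }
private:
  LONG volatile *m_p; // the (shared) lock variable
};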

I hope these results give you some pause for thought in your implementations. Remember that this is Win32-specific; other architectures may have significantly different performance behaviors. But the main lesson is to profile, and to question your assumptions.

10.3.4 Atomic Integer Operations Coda

I've focused a great deal on atomic operations in this chapter, and I make no apologies for doing so. There are three reasons for this. First, they are something that I think could be incorporated directly into C++, which probably cannot be said for other multithreading and synchronization constructs, due to too-great differences or lack of broad availability.[11]

[11] Having said that, there are some libraries, such as the excellent PThreads-Win32 (see Appendix A), which go some way to unifying the threading experience.

Second, they are an extremely useful construct, whose power is out of all proportion to their simplicity. Using atomic integer operations only, one can achieve most or all of the required thread safety in a significant number of C++ classes, as is shown in later chapters.
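
The classic example is thread-safe reference counting; here is a sketch in terms of the functions from Listing 10.3 (or any equivalent atomic API):

class RefCounted
{
public:
  void AddRef()
  {
    atomic_postincrement(&m_cRefs);
  }
  void Release()
  {
    // atomic_predecrement() returns the new value, so 0 means last reference
    if(0 == atomic_predecrement(&m_cRefs))
    {
      delete this;
    }
  }
protected:
  RefCounted()
    : m_cRefs(1)
  {}
  virtual ~RefCounted()
  {}
private:
  int volatile m_cRefs;
};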

Last, the level of discussion of atomic operations in the literature is scant, to say the least. Hopefully by providing some focus here you'll think of them more often.

Atomic operations are available on many platforms, either as system library functions, or as nonstandard libraries, or by writing your own assembler versions. (I'm aware of the irony in promoting the use of assembler in a book largely dedicated to showing advanced C++ techniques; we live in a strange world.)

Even when we elect to use a (usually PTHREADS) mutex to implement our atomic operations, there are measures available to increase efficiency. The one thing to watch out for in such cases is not to use spin mutexes: you would end up with many instances of a mutex class that is implemented in terms of an atomic integer API that is itself implemented over one, or a few, global mutexes. In such a case, you should use some preprocessor discrimination to make your chosen mutex a plain (PTHREADS-based) mutex; otherwise you'll impact negatively on performance, which won't be what you want or expect.
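
A sketch of the kind of preprocessor discrimination meant here; the macro and type names are hypothetical:

// MYLIB_ATOMIC_API_USES_MUTEX would be defined, by whatever configuration
// mechanism the library uses, on platforms whose atomic integer API is
// itself emulated over one or more global (PTHREADS) mutexes.
#if defined(MYLIB_ATOMIC_API_USES_MUTEX)
 typedef pthreads_mutex    mutex_type; // plain mutex: the safe choice
#else
 typedef spin_mutex        mutex_type; // atomics are native: spinning is cheap
#endif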

This is actually a good example of a broader truth to be found in multithreaded development. In practice we really need to consider the details of our synchronization needs and the facilities of the host system(s) on which our applications will run. It would be great if C++ did indeed stipulate atomic operations, but it's my personal opinion that providing standard higher-level synchronization primitives[12] while maintaining maximum efficiency over different architectures is a lot harder to do. As is often the case, you are best served by being aware of all the multithreading tools at your disposal ([Bute1997, Rich1997]).

[12] There are moves afoot to make this happen in a future version of the standard, at which point I hope all the foregoing will be irrelevant. I doubt it, though.

