Imperfect C++ Practical Solutions for Real-Life Programming By Matthew Wilson

	Chapter 10. Threading

10.5. Thread Specific Storage

All the discussion in the chapter has so far focused on the issues of synchronizing access to common resources from multiple threads. There is another side to threading, which is the provision of thread-specific resources or, as it is more commonly known, Thread-Specific Storage (TSS) [Schm1997].

10.5.1 Re-entrancy

In single-threaded programs, the use of a local static object within a function is a reasonable way to make the function easier to use. The C standard library makes use of this technique in several of its functions, including strtok(), which tokenizes a string based on a set of delimiter characters:



char *strtok(char *str, const char *delimiterSet);

The function maintains internal static variables that record the current tokenization point, so that subsequent calls (passing NULL for str) return successive tokens from the string.

Unfortunately, when used in multithreaded processes, such functions represent a classic race-condition. One thread may initiate a new tokenization while another is midway through the process.

Unlike other race-conditions, the answer in this case is not to serialize access with a synchronization object. That would only stop one thread from modifying the internal tokenization structures while another was using them. The interruption of one thread's tokenization by another's would still occur.

What is required is not to serialize access to thread-global variables, but rather to provide thread-local variables.^[18] This is the purpose of TSS.

^[18] And modern C and C++ run time libraries implement strtok() and similar functions using TSS.
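To make the idea concrete, here is a minimal sketch of a strtok() look-alike whose tokenization point is held per thread. It assumes a C++11 compiler, whose thread_local keyword postdates this text, and the names tss_strtok(), tokenize(), and concurrent_tokenization_is_safe() are hypothetical. It is not how any particular run time library implements strtok().

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <thread>
#include <vector>

// Hypothetical strtok() look-alike: the tokenization point lives in a
// thread_local variable, so each thread continues its own tokenization.
char *tss_strtok(char *str, const char *delimiterSet)
{
    thread_local char *next = nullptr;   // one copy per thread

    assert(delimiterSet != nullptr);
    if (str == nullptr)
        str = next;                      // continue this thread's tokenization
    if (str == nullptr)
        return nullptr;                  // nothing in progress

    str += std::strspn(str, delimiterSet);   // skip leading delimiters
    if (*str == '\0')
    {
        next = nullptr;
        return nullptr;
    }

    char *end = str + std::strcspn(str, delimiterSet);
    if (*end != '\0')
    {
        *end = '\0';                     // terminate the token in place
        next = end + 1;
    }
    else
    {
        next = nullptr;                  // that was the final token
    }
    return str;
}

// Tokenizes a private copy of text and collects the tokens.
std::vector<std::string> tokenize(std::string text, const char *delimiterSet)
{
    std::vector<std::string> tokens;
    for (char *tok = tss_strtok(&text[0], delimiterSet); tok != nullptr;
         tok = tss_strtok(nullptr, delimiterSet))
    {
        tokens.push_back(tok);
    }
    return tokens;
}

// Runs two tokenizations concurrently; neither disturbs the other.
bool concurrent_tokenization_is_safe()
{
    std::vector<std::string> a, b;
    std::thread t1([&a] { a = tokenize("red,green,blue", ","); });
    std::thread t2([&b] { b = tokenize("1 2 3 4", " "); });
    t1.join();
    t2.join();
    return a.size() == 3 && b.size() == 4 && a[2] == "blue" && b[3] == "4";
}
```

Note that no lock is needed: each thread sees only its own copy of next, so there is nothing shared to protect.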

10.5.2 Thread-Specific Data / Thread-Local Storage

Both the PTHREADS and Win32 threading infrastructures provide some degree of TSS. Just for consistency, the PTHREADS [Bute1997] version is called Thread-Specific Data (TSD), and the Win32 [Rich1997] version is called Thread-Local Storage (TLS), but it all amounts to the same thing.

They each provide the means to create a variable that will be able to contain different values in each thread in the process. In PTHREADS, the variable is known as a key; in Win32 an index. Win32 refers to the location for a key's value in each thread as a slot. I like to refer to keys, slots, and values.

PTHREADS' TSD works around the following four library functions:



int pthread_key_create(    pthread_key_t  *key
                      ,    void (*destructor)(void *));
int pthread_key_delete(    pthread_key_t  key);
void *pthread_getspecific( pthread_key_t  key);
int pthread_setspecific(   pthread_key_t  key
                       ,   const void     *value);

pthread_key_create() creates a key (of the opaque type) pthread_key_t. The caller can also pass in a cleanup function, which we'll talk about shortly. Values can be set and retrieved, on a thread-specific basis, by calling pthread_setspecific() and pthread_getspecific(). pthread_key_delete() is called to destroy a key when it is no longer needed.
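The following sketch shows the four functions working together, assuming a POSIX system and a C++11 compiler (for std::atomic). The names Job, destroy_counter(), count_in_slot(), and run_two_threads() are hypothetical. Each thread stores a private counter under the same key, and the cleanup function disposes of it when the thread exits.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <pthread.h>

static pthread_key_t    g_key;          // one key, shared by every thread
static std::atomic<int> g_cleanups(0);  // how many times the cleanup has run

struct Job { int iterations; int result; };   // hypothetical work item

// Cleanup function: invoked at thread exit with that thread's non-NULL value.
extern "C" void destroy_counter(void *value)
{
    delete static_cast<int *>(value);
    ++g_cleanups;
}

// Thread entry point: keeps a private counter in this thread's slot.
extern "C" void *count_in_slot(void *arg)
{
    Job *job     = static_cast<Job *>(arg);
    int *counter = new int(0);

    if (pthread_setspecific(g_key, counter) != 0)
    {
        delete counter;
        return NULL;
    }
    for (int i = 0; i != job->iterations; ++i)
        ++*static_cast<int *>(pthread_getspecific(g_key));
    job->result = *static_cast<int *>(pthread_getspecific(g_key));
    return NULL;
}

// Creates the key, runs two threads against it, then deletes the key.
bool run_two_threads()
{
    const int before = g_cleanups;
    if (pthread_key_create(&g_key, destroy_counter) != 0)
        return false;

    Job       jobs[2] = { { 3, 0 }, { 5, 0 } };
    pthread_t threads[2];
    int       started = 0;
    while (started != 2 &&
           pthread_create(&threads[started], NULL, count_in_slot,
                          &jobs[started]) == 0)
    {
        ++started;
    }
    for (int i = 0; i != started; ++i)
        pthread_join(threads[i], NULL);

    assert(started == 2);
    pthread_key_delete(g_key);
    return started == 2 && jobs[0].result == 3 && jobs[1].result == 5 &&
           g_cleanups - before == 2;
}
```

The key is created before any worker thread starts and deleted only after all of them have been joined, which matches the usual pattern of creating keys in the main thread.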

Win32's TLS API has a similar quartet:



DWORD  TlsAlloc(void);
LPVOID TlsGetValue(DWORD dwTlsIndex);
BOOL   TlsSetValue(DWORD dwTlsIndex, LPVOID lpTlsValue);
BOOL   TlsFree(DWORD dwTlsIndex);

The normal way in which these TSS APIs are used is to create a key within the main thread, prior to the activation of any other threads and store the key in a common area (either in a global variable, or returned via a function). All threads then manipulate their own copies of the TSS data by storing to and retrieving from their own slots.

Unfortunately, there are several inadequacies in these models, especially with the Win32 version.

First, the number of keys provided by the APIs is limited. PTHREADS guarantees that there will be at least 128; Win32 64.^[19] In reality, one is very unlikely to need to break this limit, but given the increasingly multicomponent nature of software it is by no means impossible.

^[19] Windows 95 and NT 4 provide 64. Later operating systems provide more (Windows 98/ME: 80, Windows 2000/XP: 1088), but code that must be able to execute on any Win32 system must assume 64.
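If you are curious how many keys a given PTHREADS implementation will actually hand out, one rough way to find out is simply to ask for them. The sketch below assumes a POSIX system; count_available_keys() and its cap parameter are hypothetical, and the answer it gives applies only to the process and moment in which it runs, since other components may already hold keys.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>
#include <vector>

// Creates keys until the system refuses (or cap is reached), releases them
// all again, and reports how many were obtained.
std::size_t count_available_keys(std::size_t cap = 4096)
{
    std::vector<pthread_key_t> keys;
    pthread_key_t              key;

    assert(cap > 0);
    while (keys.size() < cap && pthread_key_create(&key, NULL) == 0)
        keys.push_back(key);

    for (std::size_t i = 0; i != keys.size(); ++i)
        pthread_key_delete(keys[i]);

    return keys.size();
}
```

A component that needs many per-thread values can sidestep the limit altogether by storing one pointer to a structure in a single slot, rather than one key per value.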

The second problem is that the Win32 API does not provide any ability to clean up the slot when a thread exits. This means that one has to somehow intercept the thread's exit and clean up the resources associated with the value in that thread's slot. Naturally, for C++ folks, this is a painful loss of the automatic destruction that the language provides for us, and can be next to impossible to work around in some scenarios.

The third problem is that, although PTHREADS provides a means for cleanup on thread termination, it still presents an incomplete mechanism for easy and correct resource handling. In essence, PTHREADS provides us with immutable RAII (see section 3.5.1). Although this is a great improvement on Win32's absence of any RAII, there are occasions when it would be desirable to be able to change the slot value for a given key. It's possible to manually clean up the previous values, but it'd be a lot better if that were done automatically for us.
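The manual cleanup on replacement might look something like the following sketch, which assumes a POSIX system. The helper replace_specific(), the cleanup_fn typedef, and the demonstration function are hypothetical; error handling for a failed store is omitted for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>

extern "C" typedef void (*cleanup_fn)(void *);

static int g_deleted = 0;   // cleanups performed (single-threaded demo only)

extern "C" void delete_int(void *value)
{
    delete static_cast<int *>(value);
    ++g_deleted;
}

// Hypothetical helper: stores newValue in the calling thread's slot and
// applies cleanup to whatever value it displaced.
int replace_specific(pthread_key_t key, void *newValue, cleanup_fn cleanup)
{
    void *oldValue = pthread_getspecific(key);
    int   rc       = pthread_setspecific(key, newValue);

    if (rc == 0 && oldValue != NULL && oldValue != newValue && cleanup != NULL)
        cleanup(oldValue);
    return rc;
}

// Replaces the calling thread's value twice, then clears it; returns the
// number of cleanups performed along the way.
int demo_replace()
{
    pthread_key_t key;

    g_deleted = 0;
    if (pthread_key_create(&key, delete_int) != 0)
        return -1;

    replace_specific(key, new int(1), delete_int);   // slot was empty: no cleanup
    replace_specific(key, new int(2), delete_int);   // cleans up int(1)
    replace_specific(key, NULL, delete_int);         // cleans up int(2)
    assert(pthread_getspecific(key) == NULL);

    pthread_key_delete(key);
    return g_deleted;
}
```

The weakness is plain: every call site must remember to go through the helper, and must pass the same cleanup function that was given to pthread_key_create().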

The fourth problem is that PTHREADS assumes that the cleanup function is callable at the cleanup epoch. If, at the time that any of the threads exit, an API has been uninitialized, then it may no longer be valid to call a cleanup function that may call that API directly or indirectly. Similarly, and even more likely in practice, if a cleanup function is in a dynamic library, the cleanup function may no longer exist in the process's memory, which means it will crash.

10.5.3 __declspec(thread) and TLS

Before we look at handling those challenges, I'd like to describe one TSS mechanism that is provided by most compilers on the Win32 platform in order to ease the verbosity of using the Win32 TLS functions. The compilers allow you to use the __declspec(thread) qualifier on variable definitions, as in:



__declspec(thread) int  x;

Now x will be thread specific; each thread will get its own copy. The compiler places any such variables in a .tls section, and the linker coalesces all of these into one. When the operating system loads the process, it looks for the .tls section and creates a thread-specific block to hold them. Each time a thread is created a corresponding block is created for the thread.
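For comparison, standard C++ has since gained a thread_local storage class specifier (C++11, which postdates this text) that gives the same per-thread-copy behavior in portable syntax. The sketch below assumes a C++11 compiler; bump() and copies_are_independent() are hypothetical names. It demonstrates only the language-level behavior, and says nothing about how a given platform lays out or loads the storage.

```cpp
#include <cassert>
#include <thread>

thread_local int x = 0;   // each thread gets its own copy, starting at 0

// Increments the calling thread's x n times and reports the final value.
void bump(int n, int *result)
{
    assert(result != nullptr);
    for (int i = 0; i != n; ++i)
        ++x;
    *result = x;
}

// Shows that the copies in three threads do not affect one another.
bool copies_are_independent()
{
    int a = 0;
    int b = 0;

    x = 100;                          // the calling thread's copy
    std::thread t1(bump, 3, &a);
    std::thread t2(bump, 5, &b);
    t1.join();
    t2.join();

    return a == 3 && b == 5 && x == 100;
}
```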

Unfortunately, despite being extremely efficient [Wils2003d], there's a massive drawback to this that makes it only suitable for use in executables, and not in dynamic libraries. It can be used in dynamic libraries that are implicitly linked, and therefore loaded at process load time, since the operating system can allocate the thread-specific block for all link units loading at application load time. The problem is what happens when a dynamic library containing a .tls section is later explicitly loaded; the operating system is unable to go back and increase the blocks for all the existing threads, so your library will fail to load.

I think it's best to avoid __declspec(thread) in any DLLs, even ones that you're sure will always be implicitly linked. In the modern component-based world, it's entirely possible that the DLL may be implicitly linked to a component that is explicitly loaded by an executable produced by another compiler, or in another language, and that does not already have your DLL loaded. Your DLL cannot be loaded, and therefore the component that depends on it cannot be loaded.

10.5.4 The Tss Library

Having been bitten too many times by the four problems associated with the TSS mechanisms of PTHREADS and Win32, I got on the ball and wrote a library that provides the functionality I needed. It consists of eight functions, and two helper classes. The main functions, which are compatible with C and C++, are shown in Listing 10.6:

Listing 10.6.



// MLTssStr.h - functions are declared extern "C"
int     Tss_Init(void);    /* Failed if < 0. */
void    Tss_Uninit(void);
void    Tss_ThreadAttach(void);
void    Tss_ThreadDetach(void);
HTssKey Tss_CreateKey( void (*pfnClose)()
                     , void (*pfnClientConnect)()
                     , void (*pfnClientDisconnect)()
                     , Boolean bCloseOnAssign);
void    Tss_CloseKey(     HTssKey  hEntry);
void    Tss_SetSlotValue( HTssKey  hEntry
                        , void     *value
                        , void     **pPrevValue /* = NULL */);
void    *Tss_GetSlotValue(HTssKey  hEntry);

Like all good APIs it has Init/Uninit^[20] methods, to ensure that the API is ready for any clients that need it. It also has two functions for attaching and detaching threads that I'll talk about in a moment.

^[20] A little tip for all those who use British spelling out there: you can avoid pointless arguments about Initialise vs Initialize with your U.S. friends by using a contraction.

Manipulating keys follows the convention in providing four functions. However, these functions offer more functionality. For providing cleanup at thread termination, the Tss_CreateKey() function provides the optional callback function pfnClose; specify NULL if you don't need it. If you want that cleanup function to be applied to slot values when they are overwritten, you specify true for the bCloseOnAssign parameter.

Preventing code from untimely disappearance is handled by the two optional callback function parameters pfnClientConnect and pfnClientDisconnect. These can be implemented to do whatever is appropriate to ensure that the function specified in pfnClose is in memory and callable when it is needed. In my use of the API I have had occasion to specify the Init/Uninit functions for other APIs, or to lock and unlock a dynamic library in memory, or a combination of the two, as necessary.

Tss_CloseKey() and Tss_GetSlotValue() have the expected semantics. Tss_SetSlotValue(), however, has an additional parameter, pPrevValue, over its PTHREADS/Win32 equivalents. If this parameter is NULL, then the previous value is overwritten, and subject to the cleanup as requested in the key creation. However, if this parameter is non-NULL, then any cleanup is skipped, and the previous value is returned to the caller. This allows a finer-grained control over the values, while providing the powerful cleanup semantics by default.
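The assignment rules are easier to see in a toy model of a single slot. The sketch below is not the library's implementation: SlotModel, Model_SetSlotValue(), close_int(), and demo_set_slot() are hypothetical, there is no threading, and skipping cleanup for a NULL or unchanged previous value is my assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of one thread's slot for one key.
struct SlotModel
{
    void (*pfnClose)(void *);   // optional cleanup; may be NULL
    bool bCloseOnAssign;        // apply pfnClose to overwritten values?
    void *value;                // the slot's current value
};

// Models the described semantics of Tss_SetSlotValue().
void Model_SetSlotValue(SlotModel &slot, void *value, void **pPrevValue)
{
    void *prev = slot.value;
    slot.value = value;

    if (pPrevValue != NULL)
    {
        *pPrevValue = prev;     // caller takes over the old value; no cleanup
    }
    else if (slot.bCloseOnAssign && slot.pfnClose != NULL &&
             prev != NULL && prev != value)
    {
        slot.pfnClose(prev);    // overwritten value is cleaned up
    }
}

static int g_closed = 0;        // number of cleanups performed

void close_int(void *value)
{
    delete static_cast<int *>(value);
    ++g_closed;
}

// Returns how many cleanups the three assignments themselves triggered.
int demo_set_slot()
{
    g_closed = 0;
    SlotModel slot = { close_int, true, NULL };

    Model_SetSlotValue(slot, new int(1), NULL);    // slot was empty
    Model_SetSlotValue(slot, new int(2), NULL);    // closes int(1)

    void *prev = NULL;
    Model_SetSlotValue(slot, new int(3), &prev);   // hands back int(2)
    assert(prev != NULL);
    const int closedByAssignment = g_closed;

    close_int(prev);          // now the caller's responsibility
    close_int(slot.value);    // stands in for cleanup at thread exit
    return closedByAssignment;
}
```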

Being a C API, the natural step is to encapsulate it within scoping class(es), and there are two provided. The first is the TssKey class. It's not particularly remarkable—it just simplifies the interface and applies RAII to close the key—so I'll show only the public interface:

Listing 10.7.



template <typename T>
class TssKey
{
public:
  TssKey( void (*pfnClose)(T )
        , void (*pfnClientConnect)()
        , void (*pfnClientDisconnect)()
        , Boolean bCloseOnAssign = true);
  ~TssKey();
public:
  void  SetSlotValue(T value, T *pPrevValue = NULL);
  T     GetSlotValue() const;
private:
  . . . Members; hide copy ctor and assignment operator
};

The implementation contains static assertions (see section 1.4.7) to ensure that sizeof(T) == sizeof(void*), to prevent any mistaken attempt to store large objects by value in the slot. The values are cast to the parameterizing type, to save you the effort in client code.
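A comparable wrapper over a raw PTHREADS key might be sketched as follows. This is not the TssKey implementation: TypedKey and typed_key_round_trip() are hypothetical, the cleanup and connect/disconnect callbacks are left out, and C++11's static_assert stands in for the static assertion technique of section 1.4.7.

```cpp
#include <cassert>
#include <cstddef>
#include <pthread.h>
#include <stdexcept>

// Hypothetical typed wrapper: RAII for the key, casts done once, here.
template <typename T>
class TypedKey
{
public:
    TypedKey()
    {
        // Reject types that cannot be stored by value in a slot.
        static_assert(sizeof(T) == sizeof(void *), "T must be pointer-sized");
        if (pthread_key_create(&m_key, NULL) != 0)
            throw std::runtime_error("pthread_key_create() failed");
    }
    ~TypedKey()
    {
        pthread_key_delete(m_key);
    }

    void SetSlotValue(T value)
    {
        pthread_setspecific(m_key, (void *)value);
    }
    T GetSlotValue() const
    {
        return (T)pthread_getspecific(m_key);
    }

private:
    TypedKey(TypedKey const &);              // not to be implemented
    TypedKey &operator=(TypedKey const &);   // not to be implemented

    pthread_key_t m_key;
};

// Stores and retrieves a typed pointer in the calling thread's slot.
bool typed_key_round_trip()
{
    TypedKey<int *> key;        // TypedKey<char> would fail to compile
    int             value = 42;

    const bool startsEmpty = (key.GetSlotValue() == NULL);
    key.SetSlotValue(&value);
    return startsEmpty && key.GetSlotValue() == &value &&
           *key.GetSlotValue() == 42;
}
```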

The next class is a fair bit more interesting. If your use of the slot value were to create a single entity and then to reuse it, you'd normally follow the pattern in Listing 10.8:

Listing 10.8.



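TssKey<OneThing*> key_func(. . .);
. . .
OneThing const &func(Another *another)
{
  OneThing *thing = (OneThing*)key_func.GetSlotValue();
  if(NULL == thing)
  {
    // First call on this thread: create the object and store it
    thing = new OneThing(another);
    key_func.SetSlotValue(thing);
  }
  else
  {
    // Subsequent calls: reuse the object held in this thread's slot
    thing->Method(another);
  }
  return *thing;
}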

However, if the function is more complex—and most are—then there may be several places where the slot value may be changed. Each one of these represents the possibility for a resource leak due to a premature return before the call to SetSlotValue(). For this reason the scoping class TssSlotScope, shown in Listing 10.9, is provided. I confess I have a perverse affection for this class, because it's a kind of inside-out RAII.

Listing 10.9.



template <typename T>
class TssSlotScope
{
public:
  typedef T value_type;
public:
  // Sets the external variable to the slot's current value
  TssSlotScope(HTssKey hKey, T &value)
    : m_hKey(hKey)
    , m_valueRef(value)
    , m_prevValue((value_type)Tss_GetSlotValue(m_hKey))
  {
    m_valueRef = m_prevValue;
  }
  TssSlotScope(TssKey<T> const &key, T &value);
  // Updates the slot only if the external variable has changed
  ~TssSlotScope()
  {
    if(m_valueRef != m_prevValue)
    {
      Tss_SetSlotValue(m_hKey, m_valueRef, NULL);
    }
  }
private:
  HTssKey m_hKey;
  T       &m_valueRef;
  T const m_prevValue;
// Not to be implemented
private:
  . . . Hide copy ctor and assignment operator
};

It is constructed from a TSS key (either TssKey<T>, or an HTssKey) and a reference to an external value variable. The constructor(s) then set the external variable to the slot's value, via a call to Tss_GetSlotValue().

In the destructor, the value of the external variable is tested against the original value of the slot, and the slot's value is updated via Tss_SetSlotValue() if it has changed. Now we can write client code much more simply, and rely on RAII to update the thread's slot value if necessary.

Listing 10.10.




OneThing const &func(Another *another)
{
  OneThing                *thing;
  TssSlotScope<OneThing*> scope(key_func, thing);

  if( . . . )
    thing = new OneThing(another);
  else if( . . . )
    thing = . . .;
  else
    . . .

  return *thing;
} // dtor of scope ensures Tss_SetSlotValue() is called
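Since the listings above elide the details, here is a self-contained sketch of the same inside-out RAII idea that can be compiled and run. It assumes a C++11 compiler and substitutes a single thread_local variable for the Tss store; GetSlot(), SetSlot(), SlotScope, and thread_counter() are hypothetical stand-ins, and cleanup of the stored object at thread exit is omitted.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-ins for Tss_GetSlotValue()/Tss_SetSlotValue(): one
// thread_local slot, plus a count of how often it has been written.
thread_local void *t_slot   = NULL;
thread_local int   t_writes = 0;

void *GetSlot()            { return t_slot; }
void  SetSlot(void *value) { t_slot = value; ++t_writes; }

template <typename T>
class SlotScope
{
public:
    // Loads the slot's current value into the external variable.
    explicit SlotScope(T &value)
        : m_valueRef(value)
        , m_prevValue((T)GetSlot())
    {
        m_valueRef = m_prevValue;
    }
    // Writes the external variable back only if it has changed.
    ~SlotScope()
    {
        if (m_valueRef != m_prevValue)
            SetSlot((void *)m_valueRef);
    }

private:
    SlotScope(SlotScope const &);              // not to be implemented
    SlotScope &operator=(SlotScope const &);   // not to be implemented

    T       &m_valueRef;
    T const m_prevValue;
};

// Returns the calling thread's counter, creating it on first use.
int &thread_counter()
{
    int              *counter;
    SlotScope<int *> scope(counter);

    if (counter != NULL)
        return *counter;       // early return: slot unchanged, no write

    counter = new int(0);
    return *counter;           // scope's destructor stores the new pointer
}
```

Whichever return statement is taken, the slot ends up holding the right pointer, and it is written at most once per thread.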

So we've seen how to use the Tss library, but how does it work? Well, I'm going to leave you to figure out the implementation,^[21] but we do need to have a look at how the thread notifications are handled. This involves the two functions I've so far not described: Tss_ThreadAttach() and Tss_ThreadDetach(). These two functions should be called when a thread commences and terminates execution, respectively. If possible, you can hook into your operating system or run time library infrastructure to achieve this. If not, then you will need to do it manually.

^[21] Or to take a peek on the CD, since I've included the source for the library. Take care, though, it's old code, and not that pretty! It's probably not that optimal, either, so it should be taken as a guide to the technique, rather than the zenith of TSS library implementations.

On Win32, all DLLs export the entry point DllMain() [Rich1997], which receives notifications when the process loads/unloads, and when threads commence/terminate. In the Synesis Win32 libraries, the base DLL (MMCMNBAS.DLL) calls Tss_ThreadAttach() when its DllMain() receives the DLL_THREAD_ATTACH notification, and calls Tss_ThreadDetach() when it receives DLL_THREAD_DETACH. Since it is a common DLL, all the other members of any executable can just use the Tss library, without being concerned with the underlying setup; it all just works.

Listing 10.11.



BOOL WINAPI DllMain(HINSTANCE, DWORD reason, void *)
{
  switch(reason)
  {
    case DLL_PROCESS_ATTACH:
      Tss_Init();
      break;
    case DLL_THREAD_ATTACH:
      Tss_ThreadAttach();
      break;
    case DLL_THREAD_DETACH:
      Tss_ThreadDetach();
      break;
    case DLL_PROCESS_DETACH:
      Tss_Uninit();
      break;
  }
  . . .

On UNIX, the library calls pthread_key_create() from within Tss_Init() to create a private, unused key whose only purpose is to ensure that the library receives a callback when each thread terminates, which then calls Tss_ThreadDetach(). Since there is no mechanism for a per-thread initialization function in PTHREADS, the Tss library is written to act benignly when asked for data for a nonexistent slot, and to create a slot where one does not exist when asked to store a slot value. Thus, Tss_ThreadAttach() can be thought of as a mechanism for efficiently expanding all active keys in response to a thread's commencement, rather than doing it piecemeal during thread processing.
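The private-key trick can be sketched as follows, assuming a POSIX system and a C++11 compiler. All the Lib_ names are hypothetical stand-ins rather than the Tss library's own code. One detail the sketch has to deal with is that PTHREADS only calls a key's cleanup function for threads whose value for that key is non-NULL, so each thread is marked with a dummy value on its first use of the library; how the real library arranges this is not shown in the text, and the marking here is my assumption.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <pthread.h>
#include <vector>

static pthread_key_t    g_detachKey;     // private key; carries no real data
static std::atomic<int> g_detached(0);   // detach notifications received

// Hypothetical stand-in for Tss_ThreadDetach().
void Lib_ThreadDetach() { ++g_detached; }

// Cleanup function for the private key: runs as each marked thread exits.
extern "C" void on_thread_exit(void *) { Lib_ThreadDetach(); }

int Lib_Init()
{
    return pthread_key_create(&g_detachKey, on_thread_exit) == 0 ? 0 : -1;
}
void Lib_Uninit()
{
    pthread_key_delete(g_detachKey);
}

// Called whenever a thread uses the library: gives the thread a non-NULL
// dummy value so that on_thread_exit() will fire for it.
void Lib_MarkThread()
{
    if (pthread_getspecific(g_detachKey) == NULL)
        pthread_setspecific(g_detachKey, &g_detachKey);
}

extern "C" void *use_library(void *)
{
    Lib_MarkThread();
    Lib_MarkThread();    // repeated use leaves the mark unchanged
    return NULL;
}

// Runs nThreads threads through the library and counts the notifications.
int count_detach_notifications(int nThreads)
{
    const int before = g_detached;
    if (Lib_Init() < 0)
        return -1;

    std::vector<pthread_t> threads;
    for (int i = 0; i != nThreads; ++i)
    {
        pthread_t thread;
        if (pthread_create(&thread, NULL, use_library, NULL) == 0)
            threads.push_back(thread);
    }
    for (std::size_t i = 0; i != threads.size(); ++i)
        pthread_join(threads[i], NULL);

    assert(threads.size() == static_cast<std::size_t>(nThreads));
    Lib_Uninit();
    return g_detached - before;
}
```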

If you're not using PTHREADS or Win32, or you're not happy to locate your library in a Win32 DLL, you should ensure that all threads call the attach/detach functions. However, even if you cannot or will not do this, the library is written such that when the final call to Tss_Uninit() is received, it performs the registered cleanup for all outstanding slots for all keys.

This is a powerful catchall mechanism, and the only problem you'll have relying on this (apart from tardy cleanup, that is) is if your cleanup function must be called from within the same thread that allocated the resource in order to deallocate it. If that's the case, and you can't ensure timely and comprehensive thread attach/detach notification, then you're out of luck. What do you want? We're only protoplasm and silicon!

10.5.5 TSS Performance

So far I've not mentioned performance. Naturally, the sophistication of the library, along with the fact that there is a mutex to serialize access to the store, means that it has a nontrivial cost when compared with, say, the Win32 TLS implementation, which is very fast indeed [Wils2003f]. There are two mitigating factors. First, if you need this functionality, then you're going to have to pay for it somehow. Second, the cost of a thread switch is considerable, potentially thousands of cycles [Bulk1999], so some expense in TSS will probably be a lesser concern. However, we cannot dismiss the issue. One measure you can take to minimize the costs of the Tss library, or of any TSS API, is to pass TSS data down through call chains, rather than have each layer retrieve it for itself and thereby risk precipitating context switches due to contention in the TSS infrastructure. Obviously this cannot be achieved with system library functions, or other very general or widely used functions, but it is possible within your own application implementations.
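The difference between the two calling styles is easy to show in miniature. In the sketch below, which assumes a C++11 compiler, get_thread_context() is a hypothetical stand-in for a TSS retrieval such as Tss_GetSlotValue(), CallContext is a made-up payload, and t_lookups merely counts how often the retrieval is made on the calling thread.

```cpp
#include <cassert>

struct CallContext { int depth; };   // hypothetical per-thread data

thread_local int t_lookups = 0;      // TSS retrievals made on this thread

// Hypothetical stand-in for a (possibly expensive) TSS retrieval.
CallContext &get_thread_context()
{
    thread_local CallContext context = { 0 };
    ++t_lookups;
    return context;
}

// Style 1: every layer retrieves the data for itself.
void inner_lookup()  { ++get_thread_context().depth; }
void middle_lookup() { ++get_thread_context().depth; inner_lookup(); }
void outer_lookup()  { ++get_thread_context().depth; middle_lookup(); }

// Style 2: the outermost layer retrieves once and passes the data down.
void inner_passed(CallContext &context)  { ++context.depth; }
void middle_passed(CallContext &context) { ++context.depth; inner_passed(context); }
void outer_passed()
{
    CallContext &context = get_thread_context();
    ++context.depth;
    middle_passed(context);
}
```

Both styles do the same work on the same data; the second simply makes one retrieval where the first makes three. The price is an extra parameter on every function in the chain.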

Further, for the Tss store it's good to use the TssSlotScope template, since it will only attempt to update the slot when the value needs to be changed.

As usual, the choice of strategies is yours, representing a trade-off between performance, robustness, ease of programming, and required functionality.