
Imperfect C++: Practical Solutions for Real-Life Programming
By Matthew Wilson
Chapter 32.  Memory


32.2. The Best of Both Worlds

Let's just put the minor problems of these memory schemes aside for a moment: stack expansion is handled for you by your compiler and operating system; heap allocation failure can often be left to an overarching processwide error handler; memory leaks can be avoided by using memory buffer management classes employing RAII (see section 3.5); and fragmentation can be minimized by using custom pools (see section 32.3). But there are two unavoidable issues when using memory as provided by C and C++. If you use stack memory, you'll have very high speed but will need to know the size of the memory required at compile time, and you will not be able to change it. Conversely, if you use heap memory you can determine the size, and change the size, at run time, but you will suffer from a performance hit that, with some heap managers, can be considerable.

To put it crudely:[3]

[3] And bearing in mind that most other languages either have the same problems or don't have the guts to give developers access to stack memory.

Imperfection: C and C++ make you choose between performance and flexibility when allocating memory.


Naturally, this conflict is not one that any developer relishes, nor one that they tend to accept. We'll look in this section at various mechanisms that have been developed to cheat the system.

32.2.1 alloca()

Since stack variable memory is "allocated" by adjustment of the stack pointer in accordance with the compiler's dictates, it seems reasonable to wonder whether the adjustment has to be of a fixed size. Although it is not possible on all architectures [Stev1993], an attempt to merge the speed of stack memory with the flexibility of heap memory is the alloca() function (and its analogues, e.g., Visual C++'s _alloca()). The function allocates memory, whose size is determined at run time, from the stack rather than from the heap, simply by adjusting the stack pointer. The memory is automatically "freed" when the enclosing scope exits. This is a very useful facility, and where available and applicable it provides the best solution in most cases of automatic buffers. Unfortunately, its mechanism brings with it a rather large set of restrictions and caveats, which severely limit its utility.
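To make this concrete, here is a minimal sketch of alloca() in use, assuming a Linux-style declaration in <alloca.h> (on Win32 you would include <malloc.h> and call _alloca() instead); print_upper() is an illustrative name, not a library function:


#include <alloca.h>   /* Linux; on Win32, <malloc.h> and _alloca() */
#include <ctype.h>
#include <stdio.h>
#include <string.h>

void print_upper(char const *s)
{
  size_t  len   = strlen(s);
  // "Allocated" by adjusting the stack pointer; released automatically
  // when print_upper() returns. Note that there is no portable way to
  // detect allocation failure here.
  char    *copy = static_cast<char*>(alloca(1 + len));

  for(size_t i = 0; i <= len; ++i)  // <= so the NUL is copied too
  {
    copy[i] = static_cast<char>(toupper(s[i]));
  }
  puts(copy);
}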

There are several minor disadvantages. First, in common with heap-memory allocation, the size of the block is not available at compile time, so a local variable is required to keep track of the memory size in circumstances where that is needed. Second, it cannot reallocate memory in the manner of realloc() (and its platform-specific analogs), so in that respect it is not a direct replacement for the heap memory APIs. Third, it is nonstandard, and so not guaranteed to be available on all platforms/compilers: code using it is not portable. Fourth, its failure behavior varies: on Linux it returns NULL if it fails to allocate the requested memory, whereas on Win32 a machine-dependent exception is generated. Fifth, in common with compile-time stack allocation, alloca() requires the use of stack-checking code (see section 32.1.2). Finally, there are implementation-dependent restrictions on the contexts in which it may be used. For example, the Win32/Microsoft C run time's _alloca() variant cannot be used from within certain exception-handling contexts, as it can cause the system to crash unpredictably.

The two major disadvantages are even more dissuasive. First, by virtue of its mechanism, alloca() cannot be used to hold memory outside the current execution context, so it cannot be used for allocating variable-sized blocks inside object instances, nor can it be wrapped in template or other functions; for example, it is not possible to write a local_strdup().[4] Second, and most important, because it adjusts the stack pointer within the current function context, it can easily cause stack exhaustion when used in a function that performs a number of allocation/deallocation cycles. Indeed, the functions used to test the relative performances (see section 32.2.6) in this chapter quickly failed on both Linux and Win32 platforms for that reason. If it is not possible or practicable to move the alloca() call and the processing of the memory it returns into a separate function called within the loop—which may hurt performance too much to be useful anyway—then alloca() should be avoided in such looping code.

[4] Other than by using macros, that is.
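To see why looping code is so dangerous, consider the following sketch (illustrative only). Since the stack pointer is not restored between iterations, every block remains "live" until the function returns:


#include <alloca.h>
#include <string.h>

void process_all(size_t n)
{
  for(size_t i = 0; i != n; ++i)
  {
    // The stack pointer moves down on every iteration; none of these
    // blocks is reclaimed until process_all() itself returns, so a
    // sufficiently large n will exhaust the stack.
    char *block = static_cast<char*>(alloca(1000));

    memset(block, 0, 1000);
  }
}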

32.2.2 VLAs

The C99 enhancements to the C-language specification included the new concept of Variable Length Arrays (VLAs) [Meye2001b], which (syntactically at least) addresses the issue of dynamically sized stack array variables. VLAs allow the dimensions of arrays to be determined at runtime, as in:



void func(int x, int y)
{
  int ar[x][y];
}



At this time VLAs are not a part of C++, although some compilers do support them with C++ as an extension. Digital Mars and GCC both provide them, as does Comeau, if used with a back-end compiler that supports them.

VLAs are likely to be implemented either via alloca() (or a similar technique) or via heap memory. Digital Mars uses alloca(). Comeau C++ currently uses whatever VLA mechanism its back-end compiler provides, although its developers are considering implementing VLAs directly in terms of alloca(); this would widen the support considerably. GCC uses a stack-based technique, but it does not appear to be alloca(), as we will see shortly. So although the language syntax is clearer than the current use of either alloca() or the heap, the implications for performance, robustness, and availability are largely the same.

Support for non-POD types is implementation dependent; of our compilers (Appendix A), only GCC currently provides it. VLA objects, for POD types at least, are uninitialized. The C standard (C-99: 6.2.4; 6) says that "the initial value of object[s] [of variable length array type] is indeterminate."

There are two disadvantages of VLAs. The first is that they are not (yet) part of standard C++, and the support for them is scant. Second, they are subject to the same performance and robustness characteristics as their underlying mechanisms, which can differ widely between implementations, as we will see.

Note that sizeof() can be applied to VLAs, but the result is evaluated at run time.
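The following fragment illustrates this, compiled as C99 (or with a compiler that supports VLAs as a C++ extension); show_size() is an illustrative name:


#include <stdio.h>

void show_size(int n)
{
  int ar[n];

  /* For a VLA the operand of sizeof is evaluated at run time, */
  /* yielding n * sizeof(int) here.                            */
  printf("%lu\n", (unsigned long)sizeof(ar));
}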

32.2.3 auto_buffer<>

The proposed solution for efficient, variable-sized automatic buffers is the auto_buffer template class[5]—so named because of its primary intended use as automatic variables—as shown in Listing 32.1.

[5] This is a part of the STLSoft libraries. The latest implementation, along with all test programs and full results, are available on the CD.

Listing 32.1.


template< typename  T
        , typename  A
        , size_t    SPACE = 256
        >
class auto_buffer
  : protected A
{
public:
  . . . // Usual member type declarations

// Construction
  explicit auto_buffer(size_type cItems);
  ~auto_buffer();

// Operations
  bool  resize(size_type cItems);
  void  swap(class_type &rhs);

// Operators
  operator      pointer ();
  pointer       data();
  const_pointer data() const;

// Iteration
  // const / non-const, (r)begin() + (r)end() methods

// Attributes
  size_type       size() const;
  bool            empty() const;
  allocator_type  get_allocator() const;

// Members
private:
  value_type  *m_buffer;          // Pointer to used buffer
  size_type   m_cItems;           // Number of items in buffer
  value_type  m_internal[SPACE];  // Internal storage

// Not to be implemented
private:
  . . . // Prevent copy construction / assignment
};



The purpose of the class is to emulate the syntax and semantics of a standard array variable for POD types, combining the flexibility of heap memory with the performance of stack memory. To achieve this it allocates from its internal buffer where possible, otherwise allocating from the heap via the allocator type.

To be as compatible as possible with normal raw array syntax, and to work in cases where array decay (see Chapter 14) is prevalent, it provides an implicit conversion operator rather than index operators (value_type &operator [](), and its const version). This supports both conversion to pointer and (implicit) array indexing, whereas the index operator approach on its own allows only (explicit) array indexing.[6]

[6] Though index operators would afford the possibility of some simple index validation (e.g., assert, or even throw an exception), usability won over principle in this case.
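To illustrate what the conversion operator buys, consider the following hypothetical fragment (assuming the auto_buffer, <memory>, and <string.h> headers are included, and that the allocation succeeds):


auto_buffer<char, std::allocator<char> >  buf(10);

buf[5]   = '!';          // conversion to pointer, then raw indexing
char  *p = buf;          // emulates array decay
strcpy(buf, "hello");    // passes directly to a C API expecting char*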

The implementation of the class is surprisingly simple. Almost all the action is in the constructor (see Listing 32.2). It takes a single argument: the requested number of elements in the array. This is tested against the size of the internal buffer, and if not larger, the m_buffer member is set to point to m_internal, the internal array. (This is fairly similar to the small string optimization [Meye2001].) If the requested size is larger, then a request is made to the allocator, setting m_buffer to the allocated block. (All accessor methods refer to m_buffer, in either case.)

Listing 32.2.


  . . .
  explicit auto_buffer(size_type cItems)
    : m_buffer((SPACE < cItems) ? alloc_(cItems) : &m_internal[0])
    , m_cItems((NULL != m_buffer) ? cItems : 0)
  {
    STATIC_ASSERT(SPACE != 0);
    STATIC_ASSERT( offsetof(class_type, m_buffer)
                 < offsetof(class_type, m_cItems));
    constraint_must_be_pod(value_type);
  }

  ~auto_buffer()
  {
    if(SPACE < m_cItems)
    {
      assert(NULL != m_buffer);
      dealloc_(m_buffer, m_cItems);
    }
  }
  . . .



In the latter case, it is possible for the allocation to fail. Because an important requirement for the class, and for the STLSoft libraries as a whole, is to be as widely compatible as possible, the constructor is written to work correctly both in situations where allocation failure results in an exception being thrown and in those where the allocate() method returns NULL. When an exception is thrown, it is propagated to the caller of the auto_buffer constructor, and the instance of the auto_buffer is not constructed. However, some allocators do not throw exceptions when they fail to secure enough memory for the requested allocation, returning NULL instead. Also, when creating small programs it may be undesirable to compile/link in exception-handling mechanisms, in which case one may deliberately plug in a NULL-on-failure allocator. In such circumstances it is prudent to leave the auto_buffer in a coherent state; therefore, the initializing condition for m_cItems discriminates on whether m_buffer is non-NULL. In the case where NULL is returned, the remaining construction of the auto_buffer instance initializes the m_cItems member to 0, thereby providing sensible and correct behavior for the use of this empty instance: begin() == end(), empty() returns true, and size() returns 0.

Note that this relies on the fragile technique of depending on the relative member (initialization) order of m_buffer and m_cItems. To that end there are assertions (static/compile time and dynamic/run time) in the constructor to guard against any maintenance edits that, unmindful of this requirement, might change the ordering, which would leave the m_buffer test reading garbage, with the classic undefined results (i.e., a crash). Such member-ordering dependencies are not generally a good idea, but I chose to use the technique here because it allows me to declare m_cItems as const, and the assertions ensure that all is well.[7]

[7] Practice wins out over theory here. In principle offsetof() is undefined for non-POD types, so applying it here is not strictly valid. However, for all compilers supported by STLSoft it does what's required, so it is used. If STLSoft is ported to a compiler that lays out classes such that this would not apply, then auto_buffer<>, or the macro, will be rewritten accordingly.

It is important to note that auto_buffer's constructor performs only the allocation of memory; in common with built-in arrays (of POD type) and VLAs, it does not construct any elements in place. Nor does its destructor destroy the elements. To ensure that it's not (mis)used with non-POD types, we use the constraint constraint_must_be_pod() (see section 1.2.4). While the m_internal array member would, in and of itself, prevent compilation of parameterizations of auto_buffer with classes that do not provide publicly accessible default constructors, default-constructible class types could still be used, and so the constraint is necessary. Note that there's also a compile-time assertion to prevent someone from perversely parameterizing it with an internal size of 0.

The constructor is explicit, in line with good practice (though it's hard to conceive of an implicit conversion scenario against which to guard). The destructor has a straightforward implementation: by testing m_cItems against the size of the internal buffer, it determines whether m_buffer points to m_internal and, if not, frees the heap memory by calling the allocator's deallocate() method.

The class also provides basic STL container methods empty(), size(), begin() and end(). begin() and end() are added purely as a programmatic convenience and should not be taken to mean that auto_buffer is a full STL container. It is not, since it does not (currently) work with instances of non-POD types.

32.2.4 Using the Template

Using the template requires parameterization of the element type, the allocator type and, optionally, the size of the internal array (m_internal). As mentioned earlier, the class is designed such that it does not prescribe a particular memory allocation scheme to support its semantics, as long as the allocator supports the STL Allocator concept [Aust1999, Muss2001]. Any compliant allocator may be used, providing client code with maximal flexibility, conforming to STL good practice. The size of m_internal, measured in number of elements rather than bytes, is given by the third parameter, which defaults to 256. Client code can specify any size here to best match the most common required array size for maximum performance benefit.



int some_func(char *s, size_t len)
{
  typedef auto_buffer<char, std::allocator<char> > buffer_t;

  buffer_t buf(1 + len);

  strncpy(&buf[0], s, buf.size());
  . . .



Note that the use of buf in the call to strncpy() uses the size() method, to handle the case where allocation fails and the allocator returns NULL rather than throwing an exception. In real code you'd have to do more than simply write nothing in such a case, unless you wanted to surprise the client code of some_func().
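For instance, a failure-aware version of some_func() might bail out when the buffer is empty. This is just a sketch; the -1 error code is purely illustrative:


int some_func(char *s, size_t len)
{
  typedef auto_buffer<char, std::allocator<char> > buffer_t;

  buffer_t buf(1 + len);

  if(buf.empty())   // the allocator returned NULL rather than throwing
  {
    return -1;      // report the failure to client code
  }
  strncpy(&buf[0], s, buf.size());
  . . .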

32.2.5 EBO, Where Art Thou?

For all compilers other than Borland, auto_buffer derives from the allocator, and takes advantage of EBO (see section 12.3) when appropriate. However, with Borland this causes such significant performance degradation (actually worse than the malloc()/free() scenario in all cases) that an allocator is instead shared between all instances, as shown in Listing 32.3:

Listing 32.3.


template< . . .
        >
class auto_buffer
#ifndef ACMELIB_COMPILER_IS_BORLAND
  : protected A
#endif /* !ACMELIB_COMPILER_IS_BORLAND */
{
  . . .
#ifdef ACMELIB_COMPILER_IS_BORLAND
  static allocator_type &get_allocator()
  {
    static allocator_type   s_allocator;
    return s_allocator;
  }
#else
  allocator_type        get_allocator() const
  {
    return *this;
  }
#endif /* ACMELIB_COMPILER_IS_BORLAND */
  . . .



Thankfully, the STL Allocator concept [Aust1999, Muss2001] dictates that allocators are not allowed to act as if they have per-instance data, and most actually do not. But if one did, there would be a slight possibility of a multithreading race condition here, which is not pleasant. If you wished to be ultrasafe, you could apply a spin mutex (see section 10.2.2).

32.2.6 Performance

The performance of auto_buffer was tested against the following memory allocation types: stack variables; heap using malloc()/free(); heap using operator new/delete; dynamic stack allocation using alloca()/_alloca(); VLAs; and std::vector. For each allocation type, the program allocates a block, accesses a byte within it (to prevent the compiler optimizing away the loop), and then deallocates it. The operation is repeated a number of times determined by the second program parameter. Since the program accepts the default of 256 for the internal buffer size, the two sizes tested lie below and above this, specifically 100 and 1,000 bytes, repeated 10 million times.[8]

[8] Tests of 10 and 100 bytes with a 64-byte buffer size yielded virtually identical relative results for all compilers.

Because auto_buffer does a test (comparing the size of its internal buffer against the requested buffer size), prior to making any heap allocation, in circumstances where it must allocate from the heap the performance will be less than going straight to the heap. Therefore the purpose of the performance test is to quantify the presumed superiority where it uses its internal buffer and the presumed inferiority where it allocates from the heap.

The test was carried out with our ten Win32 compilers (see Appendix A). The resultant times were normalized with respect to the time taken for malloc()/free() for each compiler, and are expressed in percentage terms.[9] Table 32.1 shows the performance for the internal allocation. The figures clearly show that a significant advantage can be gained when the size of the allocation is smaller than that of the internal buffer, the cost being between 1% and 54% of that of malloc()/free(). The average cost was 8.6%, and discounting the very poor performance of Borland it falls to 2.9%—very close to that of stack memory.

[9] The full results and test program are available on the CD.

Table 32.1. Performances (relative to malloc()) of schemes for 100-element allocations.

Memory          Slowest (Borland)   Fastest (Intel)   GCC       Average   Average (- Borland)
Automatic       2.2%                0.2%              0.3%      0.9%      0.7%
malloc()        100.0%              100.0%            100.0%    100.0%    100.0%
operator new    178.3%              98.6%             105.9%    109.3%    100.7%
vector<>        386.8%              335.3%            121.6%    153.2%    124.0%
auto_buffer<>   54.2%               1.2%              2.8%      8.6%      2.9%
alloca()        *                   *                 *         *         *
VLA             -                   -                 1.7%      -         -


For entries marked *, it proved impossible to obtain meaningful performance figures due to stack exhaustion crashes at any useful level of loop repeats. This was the case for all alloca() implementations, as well as for VLAs with Digital Mars (which is expected, given its implementation over alloca()). The interesting thing is that GCC's VLA did not suffer stack exhaustion (as its alloca() did), which implies that it uses a different technique. Though slower than stack memory, it was nearly twice as fast as auto_buffer. My suspicion, therefore, is that it adjusts the stack pointer back when the scope, rather than the function, is exited, thus avoiding the exhaustion.
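This would be consistent with C99 semantics, under which a VLA's lifetime ends as soon as execution leaves the scope of its declaration. The following sketch shows why that matters in a loop:


#include <stddef.h>

void loop_safely(size_t n, size_t len)
{
  for(size_t i = 0; i != n; ++i)
  {
    char ar[len];   /* the VLA's lifetime ends at the bottom of each  */
    ar[0] = '\0';   /* iteration, so the stack pointer can be wound   */
  }                 /* back there, avoiding exhaustion                */
}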

Table 32.2 shows the performance for allocation larger than the internal buffer size. In this case, performance ranges between 101% and 275%. The average is 123%, or just 104% if we remove Borland. Clearly, the wise programmer will be able to select a template size parameter such that the vast majority of allocations will fit within the internal size, which will result in significant net performance gains overall.

Table 32.2. Performances (relative to malloc()) of schemes for 1,000-element allocations.

Memory          Slowest (Borland)   Fastest (Digital Mars)   GCC       Average   Average (- Borland)
Automatic       2.0%                0.5%                     0.3%      0.8%      0.6%
malloc()        100.0%              100.0%                   100.0%    100.0%    100.0%
operator new    184.7%              101.8%                   105.6%    112.0%    102.9%
vector<>        642.7%              614.8%                   162.7%    608.5%    604.2%
auto_buffer<>   275.2%              101.4%                   107.1%    122.7%    103.7%
alloca()        *                   *                        *         *         *
VLA             -                   *                        1.6%      -         -


It's worth noting that the vector cost is up to 370% and up to 2,500% (Intel, with the Visual C++ standard library, not shown) for the two scenarios, so there's no doubt that auto_buffer is vastly more efficient than vector. But remember that auto_buffer does not support non-POD types, and it does not initialize the memory it acquires. Furthermore, it does not provide any of the insertion, removal, or other sophisticated operations of the Vector model (C++-98: 23.2.4). All of this is by design—I wanted one-time allocation,[10] and I wanted it fast—so it's fair to draw a comparison with vector only in these circumstances. auto_buffer is not a replacement for vector, and is not intended to be.

[10] As auto_buffer has become more established and widely used, it's been upgraded with resize() and swap() methods (see Listing 32.1), which makes it much more useful when writing exception safe classes (that use construct-and-swap [Sutt2000]). These enhancements don't affect its performance and fundamental simplicity.

Note also that most operating systems provide a variety of memory allocators, which may provide more efficient or thread-friendly memory allocation—for example, libumem on Solaris—so you may be able to get much better performance than malloc(). I doubt very much whether they'll provide performance comparable with auto_buffer and VLAs, but performance facts are only established by performance testing.

32.2.7 Heap vs. Stack vs. . . .

Including the quantitative results, we can draw up a table (Table 32.3) of the relative merits of the various schemes. Clearly, when used wisely, auto_buffer has the best mix of features of all the schemes for automatic arrays of fundamental and POD types.

Table 32.3. Feature summary of the memory allocation schemes.

                                 Stack     Heap   alloca()   VLAs   vector<>   auto_buffer<>
Size determinable at run time    No        Yes    Yes        Yes    Yes        Yes
Resizable                        No        Yes    No         No     Yes        Yes
Prone to stack exhaustion        Yes       No     Yes        No     No         Not if used correctly
Precipitates stack checking      Yes       No     Yes        No     No         Not if used correctly
Possible failure to allocate     No        Yes    Yes        Yes    Yes        Yes
All platforms/compilers          Yes       Yes    No         No     Yes        Yes
Efficiency                       Highest   Low    High       Low    Very low   High
Compiled-in memory size          Yes       No     No         No*    No**       No**
Can support non-POD class types  Yes       Yes    No         No     Yes        No
Usable in composition            Yes       Yes    No         No     Yes        Yes

*  but sizeof() is applied at run time
** but size() is inline


In addition, auto_buffer can be used in composition, although care must be exercised in such circumstances. While it can provide similar performance advantages there, this should be done only in cases where the most common allocation size equals, or is close to, the internal buffer size. If the typical size is larger than the internal buffer, then most instances of the class will not use their internal buffer at all. If the typical size is significantly less than the internal buffer, then most instances of the class will not use most of their buffer. In either case, when the containing class is on the heap, the auto_buffer and its internal buffer will also exist on the heap, potentially wasting a large amount of memory.

It's best to just stick to using it for auto variables. It is auto-buffer, after all!

32.2.8 pod_vector<>

Given the great performance savings but low-level nature of auto_buffer, the question arises[11] of whether we can apply the technique in a more general form. The answer is that we can, and it takes the form of the pod_vector template class (see Listing 32.4). pod_vector provides an implementation of the Vector model (C++-98: 23.2.4), restricted to POD types. Like auto_buffer, it uses an internal buffer of parameterizable size. In fact, it is implemented in terms of auto_buffer, requiring only a single additional member, m_cItems, to represent the size() of the vector; the size of the auto_buffer member represents the capacity() of the vector.

[11] Actually, Chris Newcombe posed the question and provided the impetus to develop pod_vector.

Listing 32.4.


template< typename  T
        , typename  A
        , size_t    SPACE = 64
        >
class pod_vector
{
// Typedefs
private:
  typedef auto_buffer<T, A, SPACE>               buffer_type;
public:
  typedef typename buffer_type::value_type       value_type;
  typedef typename buffer_type::allocator_type   allocator_type;
  typedef pod_vector<T, A, SPACE>                class_type;
  . . .

// Construction
  explicit pod_vector(size_type cItems = 0);
  pod_vector(size_type cItems, value_type const &value);
  . . .

// Operations
  . . .
  void    resize(size_type cItems);
  void    resize(size_type cItems, value_type const &value);
  . . .

// Members
private:
  size_type   m_cItems;
  buffer_type m_buffer;
};



pod_vector has two areas of potential performance advantage over standard library implementations of vector. First, it uses an auto_buffer to implement its storage. So long as it is parameterized and used appropriately, such that the net performance gains of auto_buffer are accessible, this will represent a significant optimization.

The other factor is that, because the vector contains only POD types, there is no requirement to destroy elements when they are removed from the vector. For example, the pop_back() method simply decrements m_cItems.
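A minimal sketch of what such a pop_back() might look like follows; this is illustrative, not the actual STLSoft implementation:


  void pop_back()
  {
    assert(m_cItems > 0);
    --m_cItems;   // POD elements need no destructor call; the memory
  }               // simply reverts to unused capacity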

Despite this great potential, the class will provide significant performance advantages only in certain circumstances, because misuse of auto_buffer incurs performance costs. Table 32.4 shows the results from a set of tests—which I do not suggest are in any way exhaustive—written to help me performance-tune the implementation. pod_vector<int, ..., 64> and std::vector<int> were each used in a series of operations, acting on 50 and 100 elements, to exercise the auto_buffer's internal threshold. As you can see, there are some significant performance gains to be had, even in some of the 100-element cases. However, what is equally clear is that the relative performance of the two containers is strongly dependent on the type of operation, and also on the compiler used. Even if it is not too surprising to see the Borland figures trailing a little, both CodeWarrior and GCC also have a few concerning data points. But it's also clear that there's great potential for performance gains.

Table 32.4. Performance of pod_vector<> relative to vector<>.

Test                        Borland   CodeWarrior   GCC      Intel    VC 7.1
insert(begin()) - 50        69.1%     93.6%         70.6%    73.4%    64.2%
insert(begin()) - 100       78.9%     172.9%        92.5%    105.6%   89.5%
push_back - 50              132.4%    34.1%         37.8%    23.8%    12.3%
push_back - 100             188.4%    56.1%         72.2%    36.9%    22.7%
push_back + reserve - 50    331.5%    90.5%         168.6%   50.5%    27.8%
push_back + reserve - 100   374.2%    117.1%        266.2%   66.6%    38.4%
erase(begin()) - 50         22.9%     24.9%         50.2%    49.1%    31.9%
erase(begin()) - 100        15.7%     25.1%         30.6%    38.2%    26.0%
erase(&back()) - 50         59.4%     30.3%         86.7%    38.0%    28.2%
erase(&back()) - 100        78.2%     67.5%         188.9%   76.5%    79.6%
erase block - 50            78.5%     62.4%         72.7%    40.4%    34.6%
erase block - 100           85.5%     80.6%         88.0%    48.4%    44.9%


My suggestion with a class such as pod_vector is to write your application using std::vector, and then plug in pod_vector during performance testing. Since it implements the STL Vector [Muss2001] model, as does std::vector, this will be as simple as changing a single line. This is one of the beauties of the STL.
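One way to keep the change to a single line is to hide the container behind a typedef and switch it when tuning. The USE_POD_VECTOR macro and container_t name here are purely illustrative:


#ifdef USE_POD_VECTOR
 typedef pod_vector<int, std::allocator<int>, 64>   container_t;
#else
 typedef std::vector<int>                           container_t;
#endif /* USE_POD_VECTOR */

container_t   ints;

ints.push_back(1);    // client code is unchanged either way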

