Section 10.5. Blitz++ and Expression Templates

10.5. Blitz++ and Expression Templates

Blitz++ [Veld95a], a library for high-performance array math, pioneered so many of the techniques and ideas used in this book that it would be hard to overestimate its influence on the world of C++ metaprogramming. It was the first C++ library to use explicit metaprogramming,^[5] and the first to implement a domain-specific embedded language. We can't possibly touch on all aspects of Blitz++, so we're going to look at the central innovation: expression templates [Veld95b].

^[5] By "explicit metaprogramming" we mean treating template instantiations as first-class compile-time programs. Explicit metaprogramming goes well beyond the sort of trivial type manipulations required for most generic programming, such as accessing the value_type of an iterator through std::iterator_traits. Although that could technically be seen as a metafunction invocation, most generic programmers don't think of it that way, and it's one's relationship to the code, as much as anything else, that defines metaprogramming.

10.5.1. The Problem

If we had to boil the problem solved by Blitz++ down to a single sentence, we'd say, "A naive implementation of array math is horribly inefficient for any interesting computation." To see what we mean, take the boring statement



    x = a + b + c;

where x, a, b, and c are all two-dimensional Arrays. The canonical implementation of Array's addition operator is:



    Array operator+(Array const& a, Array const& b)


    {


         std::size_t const n = a.size();


         Array result;





         for (std::size_t row = 0; row != n; ++row)


             for (std::size_t col = 0; col != n; ++col)


                 result[row][col] = a[row][col] + b[row][col];





         return result;


    }

To evaluate the expression a + b + c using that operator, we first compute a + b, resulting in a temporary Array (call it t), and then we evaluate t + c to produce the final result.

The problem is that temporary, t. The efficient way to perform this computation is to step through each position in all three input arrays at once, adding the three elements at that position and placing their sum in the result:



    for (std::size_t row = 0; row != n; ++row)


       for (std::size_t col = 0; col != n; ++col)


           result[row][col] = a[row][col] + b[row][col] + c[row][col];

The temporary not only costs an extra dynamic memory allocation for its element storage, but causes the CPU to make two complete traversals of that storage: one to write the result of a + b, and another to read the input for t + c. As anyone who has done high-performance numerics knows, these two traversals are the real killer, because they destroy cache locality. If all four of the named arrays nearly fill the cache, introducing t effectively pushes one of them out.

The problem here is that the operator+ signature above is just too greedy: It tries to evaluate a + b just as soon as it can, rather than waiting until the whole expression, including the addition of c, is available.

10.5.2. Expression Templates

In the expression's parse tree, evaluation starts at the leaves and proceeds upwards to the root. What's needed here is some way of delaying evaluation until the library has all of the expression's parts: that is, until the assignment operator is executed. The stratagem taken by Blitz++ is to build a replica of the compiler's parse tree for the whole expression, allowing it to manage evaluation from the top down (see Figure 10.1).

Figure 10.1. Parse tree for `x = a + b + c`

This can't be any ordinary parse tree, though: Since array expressions may involve other operations like multiplication, which require their own evaluation strategies, and since expressions can be arbitrarily large and nested, a parse tree built with nodes and pointers would have to be traversed at runtime by the Blitz++ evaluation engine to discover its structure, thereby limiting performance. Furthermore, Blitz++ would have to use some kind of runtime dispatching to handle the different combinations of operation types, again limiting performance.

Instead, Blitz++ builds a compile-time parse tree out of expression templates. Here's how it works in a nutshell: Instead of returning a newly computed Array, operators just package up references to their arguments in an Expression instance, labeled with the operation:



    // operation tags


    struct plus; struct minus;





    // expression tree node


    template <class L, class OpTag, class R>


    struct Expression


    {


        Expression(L const& l, R const& r)


          : l(l), r(r) {}





        float operator[](unsigned index) const;





        L const& l;


        R const& r;


    };





    // addition operator


    template <class L, class R>


    Expression<L,plus,R> operator+(L const& l, R const& r)


    {


        return Expression<L,plus,R>(l, r);


    }

Notice that when we write a + b, we still have all the information needed to do the computationit's encoded in the type Expression<Array,plus,Array>and the data is accessible through the expression's stored references. When we write a + b + c, we get a result of type:



    Expression<Expression<Array,plus,Array>,plus,Array>

and the data is still accessible through the nested references. The interesting thing about the Expression class template is that, just like an Array, it supports indexing via operator[]. But wait! Didn't we just tell you that operator+ computes nothing, and Expression just stores references to its arguments? Yes, we did. If the result of the operation isn't stored in the Expression, it must be computed lazily by operator[].

To see how it works, check out this simplified implementation for one-dimensional Arrays of floats. First, to associate the elementwise arithmetic logic with the operation tags, we'll nest some static member functions:



    // operation tags implement elementwise arithmetic


    struct plus


    {


      static float apply(float a, float b)


        { return a + b; }


    };





    struct minus


    {


      static float apply(float a, float b)


        { return a - b; }


    };

Next, we'll give the Expression an indexing operator that calls its tag's apply function to compute the appropriate element value:



    // expression tree node


    template <class L, class OpTag, class R>


    struct Expression


    {


        Expression(L const& l, R const& r)


          : l(l), r(r) {}





        float operator[](unsigned index) const


        {


          return OpTag::apply(l[index], r[index]);


        }


        L const& l;


        R const& r;


    };

This seems almost too simple, right? Amazingly, we now have fully lazy expression evaluation. To see it at work, let's walk through the evaluation of (a + b)[1]. Since the type of a + b is Expression<Array,plus,Array>, we have:



    (a + b)[1]


      == plus::apply(a[1], b[1])


      == a[1] + b[1]

Now consider what we'd go through to evaluate the same expression with a greedy strategy. That's right, we'd have to compute a temporary array, (a + b), only to throw out all but one element! The contrast in efficiency couldn't be more striking.

Naturally, (a + b + c)[1] is also computed without any temporary Arrays:



    (a + b + c)[1]


      == ((a + b) + c)[1]


      == plus::apply((a + b)[1], c[1])


      == plus::apply(plus::apply(a[1], b[1]), c[1])


      == plus::apply(a[1] + b[1], c[1])


      == (a[1] + b[1]) + c[1]

All that remains now is to implement Array's assignment operator. Since we can access any single element of the result Expression without ever creating a temporary Array, we can compute the whole result by accessing every element of the expression:



    template <class Expr>


    Array& Array::operator=(Expr const& x)


    {


        for (unsigned i = 0; i < this->size(); ++i)


           (*this)[i] = x[i];


        return *this;


    }

That's it! Naturally, there's a lot more to array math than addition and subtraction, and Blitz++ has to consider all kinds of things that are not handled by our simple example, from operations like multiplication to "tiling" large array operations so that they stay within the cache. The basic technique of delaying expression evaluation, however, is the tool that allows the library to do all these things with near-optimal efficiency.^[6]

^[6] If it seems to you that we've just demonstrated a way to abuse C++ operator overloading, we plead guilty! In fact, we're going to spend much of this chapter looking at creative ways to "abuse" the operators. We hope that by the end, you'll see these techniques as legitimate and well-founded programming paradigms.

As a DSL, this part of Blitz++ is deceptive in its smoothness: The syntax looks exactly as you'd expect it to in a naive implementation, but you can see that behind the syntax lives a highly specialized evaluation engine, tuned for the Blitz++ domain.

Intermediate Results

One drawback of expression templates is that they tend to encourage writing large, complicated expressions, because evaluation is only delayed until the assignment operator is invoked. If a programmer wants to reuse some intermediate result without evaluating it early, she may be forced to declare a complicated type like:

Expression< Expression<Array,plus,Array> , plus , Expression<Array,minus,Array> > intermediate = a + b + (c - d);

(or worse). Notice how this type not only exactly and redundantly reflects the structure of the computationand so would need to be maintained as the formula changesbut also overwhelms it? This is a long-standing problem for C++ DSELs. The usual workaround is to capture the expression using type erasure (see Chapter 9), but in that case one pays for dynamic dispatching.

There has been much discussion recently, spearheaded by Bjarne Stroustrup himself, about reusing the vestigial auto keyword to get type deduction in variable declarations, so that the above could be rewritten as:

auto intermediate = a + b + (c - d);

This feature would be a huge advantage to C++ DSEL authors and users alike.

10.5.3. More Blitz++ Magic

Just in case you find it hard to see the domain-specific language in what we've already covered, here are just a couple more of Blitz++'s syntactic innovations that we think you'll find more striking.

10.5.3.1 Array Initialization

Because Blitz++ Arrays are not what the C++ standard calls an "aggregate" (see section 8.5.1 of the Standard), we can't use the convenient syntax of listing initializers within braces, as we can with ordinary built-in arrays. Instead, Blitz++ overloads the comma operator to make a similar syntax possible:



    Array<float,2> A(3,3);


    A = 1, 2, 3,


        4, 5, 6,


        7, 8, 9;

10.5.3.2 SubArray Syntax

Blitz++ has a Range class that encapsulates a sequence of indices. When an Array is indexed using a Range, a lazy SubArray view is produced without copying any elements:^[7]

^[7] Note that Blitz++, like most array packages, uses operator() instead of operator[] for indexing, because operator() allows multiple arguments whereas operator[] does not.



    // add the first two rows and columns of A to B


    B += A(Range(0,2), Range(0,2))

The exciting thing about Blitz++'s Range objects is that you can also perform arithmetic on them directly, resulting in expressions that look remarkably like the body of a multiply-nested loop, using a technique pioneered by the Math.h++ library [KV89]. This example is taken from the online Blitz++ manual:



    // a three-dimensional stencil (used in solving PDEs)


    Range I(1,6), J(1,6), K(1,6);


    B = (A(I,J,K) + A(I+1,J,K) + A(I-1,J,K) + A(I,J+1,K)


     + A(I,J-1,K) + A(I,J+1,K) + A(I,J,K+1) + A(I,J,K-1)) / 7.0;

This sort of notational simplification has proven itself to be more than mere syntactic sugar. Similar techniques have been used to reduce the evaluation of complex tensor equations from unreadable and error-prone code resembling FORTRAN (called "C++tran" by Scott Haney) to one-liners resembling the equations in the original theory [Land01]. Projects that would be practically impossible to complete correctly using C++tran suddenly become tractable with a DSEL.