Section 10.7. The Boost Spirit Library

10.7. The Boost Spirit Library

Like YACC, Spirit is a framework for defining parsers. The main difference is that rather than compiling to intermediate C/C++ code, Spirit uses an embedded domain-specific language. Here's the meat of the expression grammar we implemented with YACC, using Boost.Spirit's embedded DSL syntax:



    group       = '(' >> expression >> ')';


    factor      = integer | group;


    term        = factor >> *(('*' >> factor) | ('/' >> factor));


    expression  = term >> *(('+' >> term) | ('-' >> term));

You'll notice that there are some differences from traditional EBNF. The most obvious is probably that, because sequences of consecutive values like



    '(' expression ')'

don't fit into the C++ grammar, the author of Spirit had to choose some operator to join consecutive grammar symbols. Following the example of the standard stream extractors (which do, after all, perform a crude kind of parsing), he chose operator>>. The next difference worth noting is that the Kleene star (*) and positive closure (+) operators, which are normally written after the expressions they modify, must be written as prefix operators instead, again because of limitations of the C++ grammar. These minor concessions aside, the Spirit grammar syntax comes remarkably close to the usual notation of the domain.

Spirit is actually a great example of the power of DSELs to interoperate with one another, because it really consists of a collection of little embedded languages. For example, the following complete program brings the above grammar together with semantic actions written between [...] using the Phoenix functional programming DSEL, and another DSEL idiom Spirit calls closures:



    #include <boost/spirit/core.hpp>


    #include <boost/spirit/attribute.hpp>


    #include <iostream>


    #include <string>





    using namespace boost::spirit;


    using namespace phoenix;


    // provides one named variable of type int...


    struct vars : boost::spirit::closure<vars, int>   // CRTP


    {


        member1 value; // ...called "value" in lazy expressions


    };


    // calculator is a grammar with attached int called "value"


    struct calculator


      : public grammar<calculator, vars::context_t>   // CRTP


    {


        template <class Tokenizer>


        struct definition


        {


          // all our rules have an attached int called "value," too...


            rule<Tokenizer, vars::context_t>


              expression, term, factor, group, integer;





          // ...except the top rule


            rule<Tokenizer> top;





          // build the grammar


            definition(calculator const& self)


            {


                top = expression[self.value = arg1];





                group = '(' >> expression[group.value = arg1] >> ')';





                factor = integer[factor.value = arg1]


                    | group[factor.value = arg1]


                    ;





                term = factor[term.value = arg1]


                       >> *(   ('*' >> factor[term.value *= arg1])


                             | ('/' >> factor[term.value /= arg1])


                           )


                    ;





                expression = term[expression.value = arg1]


                     >> *(   ('+' >> term[expression.value += arg1])


                           | ('-' >> term[expression.value -= arg1])


                         )


                     ;





                integer = int_p[integer.value = arg1];


            }


                     // tell Spirit to start parsing with "top"


                       rule<Tokenizer> const& start() const { return top; }


                   };


          };





          int main()


          {


              calculator calc;    // our grammar





              std::string str;


              while (std::getline(std::cin, str))


              {


                 int n = 0;


                 parse(str.c_str(), calc[var(n) = arg1], space_p);


                 std::cout << "result = " << n << std::endl;


              }


          }

10.7.1. Closures

We're going to describe closures at two levels: First we'll examine them from the point of view of the DSL user, asking you to put aside any consideration of how the magic works, and then we'll look at the implementation techniques.

10.7.1.1 The Abstraction

To understand the use of closures, it's important to know that Spirit grammars and rules are allyou guessed itfunction objects. When invoked with an appropriate pair of iterators over the input, rules and grammars attempt to parse it. This leads to top down or recursive descent parsing. For example, in order to parse its first symbol, the expression rule in turn invokes the term rule.

Closures provide a set of variables associated with each rule invocation, accessed as members of the rule itself. The value of the first member of the closure (in our example there is only one: value) becomes the rule's "return value" and when the rule is used on the right-hand-side of another rule, may be accessed in semantic actions attached to the rule by using the Phoenix placeholder arg1. So, for example, in



    term = factor[term.value = arg1]


           >> *(   ('*' >> factor[term.value *= arg1])


                 | ('/' >> factor[term.value /= arg1])


               )


        ;

the value associated with the first factor invocation is first moved into the value associated with the current term invocation. Then, as each member of the Kleene star repetition is parsed, the value associated with the current term invocation is modified accordingly.

The really interesting thing about closures is the way they enable yet another programming paradigm: dynamic scoping. In C++, unqualified names (those without a "::" prefix) usually refer to the innermost enclosing scope in which they're defined:



    #include <iostream>





    namespace foo


    {


      int x = 76;





      int g()


      {


         return x + 1;                // refers to foo::x


      }


    }


    int main()


    {


       int x = 42;


       std::cout << foo::g();         // prints 77


    }

In dynamic scoping systems, though, names refer to the nearest scope on the call stack in which they're defined. Therefore, in the same code, foo::g would see the value of x that was established in main(), and the program would print 43.

The fully qualified names of closure variables (rulename.membername) are dynamically scoped. That means, for example, that any semantic action in our grammar can refer to expression.value, and in doing so, can reach up the call stack to the value associated with the nearest invocation of the expression rule.

10.7.1.2 Implementation Details

Take a look at the declaration of our closure:



    struct vars : closure<vars, int>


    {


        member1 value;


    };

The first thing you'll probably notice is that closure uses the "Curiously Recurring Template Pattern" (covered in Chapter 9), so that it has access to the type of vars. More interesting, though, is the use of the special type member1 for value.

Clearly, if the library lets you write rulename.closure-variable, rules must contain public data members with the same names as closure variables. Actually, the limitations of the C++ language give us a big hint as to what's going on here: The only way to automatically allow the closure data members to be addressed as rule data members of the same name is to make the closure a public base of the rule class itself. There's simply no other way to generate identically named public data members in the rule.

Just as clearly, if something like expression.value += 1 is to work as expected, expression.value can't be of type int: that would cause an integer addition immediately, as our grammar is defined, instead of later, when its rules are invoked. Sound like a familiar issue? In fact, it is solved in a familiar way, with expression templates. Instead of performing an addition, expression.value += 1 creates an object that, when suitably invoked by the parser, adds 1 to the int variable created for value in the stack frame of the nearest enclosing expression invocation.

We're not going to go into the nitty-gritty details of how the dynamic scoping mechanism is implemented, as it's not directly related to the "DSEL-ness" of closureswe suggest you look at the Spirit and Phoenix source code if you're curious. The important thing to recognize is that, once again, expression templates and delayed evaluation have allowed us to use a programming paradigm not directly available in native C++.

10.7.2. Subrules

If you look closely at our calculator grammar, you can see that there must be some type erasure at work.^[12] Since the expression on the right side of each rule assignment builds an efficient function object, we can expect the types of these function objects to represent the structure of the expression. For example, leaving out the effect of the semantic actions, which further complicate things, the type of the right-hand-side of the factor rule is something like:

^[12] See Chapter 9 for more information on type erasure.



    alternative<


        rule<Tokenizer, vars::context_t> // for integer rule


      , rule<Tokenizer, vars::context_t> // for group rule


    >

and the right-hand-side of the group rule has a type something like this one:



    sequence<


        sequence<


            ch_p                             // for '(' parser


          , rule<Tokenizer, vars::context_t> // expression rule


        >


      , ch_p                                 // for ')' parser


    >

The factor and group rules themselves, though, have the same type. Clearly the compile time polymorphism generated by the expression templates is being transformed into runtime polymorphism: Rule objects of the same type must contain some function pointer, or virtual function, or something that allows one of them to parse groups and another to parse factors. Of course, making that choice of behavior at runtime comes with an attendant loss of efficiency. In a simple grammar like this one, the cost of dynamically dispatching for every rule invocation can really add up.

Joel de Guzman, the primary author of Spirit, has written [Guz03]:

... the virtual function is a totally opaque wall that blocks all meta type information from passing through. We can never get any of the type information of the RHS or a rule, to, say, re-factor the expression template tree to something else (e.g., do automatic left factoring, static node-type-traversal, static first-follow analysis, etc.).

Those operations are all specialized issues related to parsers, but the point is still universal: Type erasure is a kind of "lossy compression," and valuable information may disappear forever.

The Spirit designers could tell us to simply write out the full type of each rule's right-hand-side, but that idea is basically a DSEL-killer. As you can imagine, writing down a complicated type for each rule, and rewriting those types each time the rules change, would quickly become unmanageable. More importantly, we'd have to fill our grammars with information about rule types, which from a DSEL perspective is just noise: It has nothing to do with the underlying domain abstraction.

It's worth noting that even the auto language extension described earlier in this chapter wouldn't completely solve this problem for Spirit, since the grammar rules all reference one another, so the types on the right-hand-side of the first auto initialization can never be known to the compiler.

Spirit resolves this tension between efficiency and expressivity in a familiar way: by putting off work until the last possible moment. Just as Blitz++ achieves efficiency by delaying matrix arithmetic until the entire expression is available, Spirit uses subrules to delay the erasure of static type information until the entire grammar is available. The following rewrite of calculator's definition uses subrules to achieve near-optimal parsing efficiency:



    template <class Tokenizer>


    struct definition


    {


        subrule<0, vars::context_t> expression;


        subrule<1, vars::context_t> group;


        subrule<2, vars::context_t> factor;


        subrule<3, vars::context_t> term;


        subrule<4, vars::context_t> integer;





        rule<Tokenizer> top;





        definition(calculator const& self)


        {


            top = (


                expression =


                    term[expression.value = arg1]


                    >> *(  ('+' >> term[expression.value += arg1])


                          |('-' >> term[expression.value -= arg1]) )





              , group =


                    '(' >> expression[group.value = arg1] >> ')'





              , factor =


                    integer[factor.value = arg1]


                  | group  [factor.value = arg1]





              , term =


                    factor[term.value = arg1]


                    >> *(  ('*' >> factor[term.value *= arg1])


                          |('/' >> factor[term.value /= arg1]) )





              , integer =


                    int_p[integer.value = arg1]





            )[ self.value = arg1 ];


        }





        // tell Spirit to start parsing with "top"


        rule<Tokenizer> const& start() const { return top; }


    };

Two things are particularly worth noticing in this example. First, to achieve this delay without forcing users to write out messy types, the definition of all the subrules has to be done in a single expression. Type erasure doesn't occur until the assignment to top, the only full rule, occurs. At that point, a type even messier than that of any of the right-hand-sides, and containing the definition of all the subrules, is captured in a single virtual function. Once that single dynamic dispatch occurs, the parsing of an expression involves only normal static function calls, many of which can be inlined. The second item of note is that the transformation from dynamically dispatched rules to statically dispatched subrules hardly changed the grammar's representation at all. It is a particularly beautiful feature of Spirit that it offers us the ability to tune our position in the compile-time/runtime continuum so easily, and while staying so close to the fundamental EBNF domain language.