
10.2. ... Goes a Long Way

Jon Bentley, in his excellent article on DSLs, wrote that "programmers deal with microscopic languages every day" [Bent86]. Now that you are aware of their fundamental properties, it's easy to see that little languages are all around us.

In fact, the examples are so numerous that this book can't possibly discuss all of them (we estimate that thousands of DSLs are in common use today), but we can survey a few to give you some more perspective.

10.2.1. The Make Utility Language

Building software rapidly, reliably, and repeatably is crucial to the daily practice of software development. It also happens to be important to the deployment of reusable software and, increasingly in the age of open-source software, to end-user installation. A great many tools have cropped up over the years to address this problem, but they are nearly all variations of a single, powerful, build-description language: Make. As a C++ programmer, you're probably already at least a little familiar with Make, but we're going to go through a mini-review here with a focus on its "DSL-ness" and with an eye toward the design of your own domain-specific languages.

The principal domain abstraction of Make is built around three concepts.

Targets

Usually files that need to be built or sources that are read as inputs to parts of the build process, but also "fake" targets naming states of the build process that might not be associated with a single file.

Dependencies

Relationships between targets that allow Make to determine when a target is not up-to-date and therefore needs to be rebuilt.

Commands

The actions taken in order to build or update a target, typically commands in the native system's shell language.

The central Make language construct is called a rule, and is described with the following syntax in a "Makefile":

    dependent-target : source-targets
            commands

So, for example, a Makefile to build a program from C++ sources might look like this:

    my_program: a.cpp b.cpp c.cpp d.cpp
            c++ -o my_program a.cpp b.cpp c.cpp d.cpp

where c++ is the command that invokes the C++ compiler. These two lines demonstrate that Make allows a concise representation of its domain abstractions: targets (my_program and the .cpp files), their dependency relationships, and the command used to create dependent targets from their dependencies.

The designers of Make recognized that such rules include some boilerplate repetition of filenames, so they included support for variables as a secondary capability. Using a variable, the above "program" might become:

    SOURCES = a.cpp b.cpp c.cpp d.cpp

    my_program: $(SOURCES)
            c++ -o my_program $(SOURCES)

Unfortunately, this is not a very realistic example for most C/C++ programs, which contain dependencies on header files. To ensure minimal and rapid rebuilds once headers enter the picture, it becomes important to build separate object files and represent their individual dependencies on headers. Here's an example based on one from the GNU Make manual:

    OBJECTS = main.o kbd.o command.o display.o \
              insert.o search.o files.o utils.o

    edit : $(OBJECTS)
            c++ -o edit $(OBJECTS)

    main.o : main.cpp defs.h
            c++ -c main.cpp
    kbd.o : kbd.cpp defs.h command.h
            c++ -c kbd.cpp
    command.o : command.cpp defs.h command.h
            c++ -c command.cpp
    display.o : display.cpp defs.h buffer.h
            c++ -c display.cpp
    insert.o : insert.cpp defs.h buffer.h
            c++ -c insert.cpp
    search.o : search.cpp defs.h buffer.h
            c++ -c search.cpp
    files.o : files.cpp defs.h buffer.h command.h
            c++ -c files.cpp
    utils.o : utils.cpp defs.h
            c++ -c utils.cpp

Once again you can see some repeated boilerplate in the commands used to build each object file. That can be addressed with "implicit pattern rules," which describe how to build one kind of target from another:

    %.o: %.cpp
            c++ -c $(CFLAGS) $< -o $@

This rule uses pattern matching to describe how to construct a .o file from a .cpp file on which it depends, and the funny symbols $< and $@ represent the results of those matches: $< expands to the name of the matched .cpp file (the first dependency) and $@ to the name of the .o target being built. In fact, this particular rule is so commonly needed that it's probably built into your Make system, so the Makefile becomes:

    OBJECTS = main.o kbd.o command.o display.o \
              insert.o search.o files.o utils.o

    edit : $(OBJECTS)
            c++ -o edit $(OBJECTS)

    main.o : main.cpp defs.h
    kbd.o : kbd.cpp defs.h command.h
    command.o : command.cpp defs.h command.h
    display.o : display.cpp defs.h buffer.h
    insert.o : insert.cpp defs.h buffer.h
    search.o : search.cpp defs.h buffer.h
    files.o : files.cpp defs.h buffer.h command.h
    utils.o : utils.cpp defs.h

Enough review! Exploring all the features of Make could easily fill an entire book. The purpose of this exercise is to show that Make begins to approach the domain-specific language ideal of allowing a problem to be solved merely by describing it: in this case, by writing down the names of files and their relationships.

In fact, most of the other features of various Make variants are aimed at getting still closer to the ideal. GNU Make, for example, can automatically discover eligible source files in the working directory, explore their header dependencies, and synthesize the rules to build intermediate targets and the final executable. In a classic example of creolization [Veld04], GNU Make has sprouted so many features that it approaches the power of a general-purpose language, but such a clumsy one that for all practical purposes it is still domain-specific.
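
For instance, with GNU Make's wildcard function and a compiler that can emit dependency files, the example above can shrink to something like the following sketch (the -MMD and -MP flags assume a GCC- or Clang-compatible compiler; your compiler's spelling may differ):

    SOURCES := $(wildcard *.cpp)
    OBJECTS := $(SOURCES:.cpp=.o)

    edit : $(OBJECTS)
            c++ -o edit $(OBJECTS)

    %.o : %.cpp
            c++ -MMD -MP -c $< -o $@

    -include $(OBJECTS:.o=.d)   # header dependencies written by the compiler

As in the earlier examples, the command lines in a real Makefile must begin with a tab character.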

10.2.2. Backus-Naur Form

After all this discussion of metaprogramming, we're going to introduce the idea of a metasyntax. That's exactly what Backus-Naur Form (BNF) is: a little language for defining the syntax of formal languages.[2] The principal domain abstraction of BNF is called a "context-free grammar," and it is built around two concepts.

[2] BNF was actually first developed to specify the syntax of the programming language Algol-60.

Symbols

Abstract elements of the syntax. Symbols in the grammar for C++ include identifier, unary-operator, string-literal, new-expression, statement, and declaration. The first three are never composed of other symbols in the grammar and are called terminal symbols or tokens. The rest can be built from zero or more symbols and are called nonterminals.

Productions (or "rules")

The legal patterns for combining consecutive symbols to form nonterminal symbols. For example, in C++ a new-expression can be formed by combining the new keyword (a token) with a new-type-id (a nonterminal).

Productions are normally written according to the syntax:

    nonterminal -> symbols...

where the nonterminal symbol to the left of the arrow can be matched by any input sequence matching the sequence of symbols on the right.

Here is a grammar for simple arithmetic expressions, written in BNF; the terminal symbols (tokens) here are integer and the operator and parenthesis characters, and expression, term, factor, and group are nonterminals:

    expression -> term
    expression -> expression + term
    expression -> expression - term

    term -> factor
    term -> term * factor
    term -> term / factor

    factor -> integer
    factor -> group

    group -> ( expression )

That is, an expression is matched by a term, or by an expression followed by the + token and a term, or by an expression followed by the - token and a term. Similarly, a term is matched by a factor, or by a term followed by the * token and a factor, or by a term followed by the / token and a factor ... and so on.

This grammar not only encodes the allowed syntax of an expression (ultimately just one or more integers separated by +, -, *, or /), but, by grouping syntactic elements according to the operators' usual associativity and precedence rules, it also represents some important semantic information. For example, the structure of

    1 + 2 * 3 + 4

when parsed according to the above grammar, can be represented as:

    [1 + [2 * 3]] + 4

In other words, the subexpression 2 * 3 will be grouped into a single term and then combined with 1 to form a new (sub-) expression. There is no way to parse the expression so as to generate an incorrect grouping such as

    [[1 + 2] * 3] + 4

Try it yourself; the grammar simply doesn't allow the expression 1 + 2 to be followed by *. BNF is remarkably effective at encoding both the syntax and the structure of formal languages.

A few linguistic refinements are possible: For example, it's customary to group all productions that yield a given nonterminal, so the | symbol is sometimes used to separate the different right-hand-side alternatives without repeating the "nonterminal ->" boilerplate:

    expression -> term
                | term + expression
                | term - expression

Extended BNF (EBNF), another variant, adds parentheses for grouping, along with the Kleene star ("zero-or-more") and positive closure ("one-or-more") repetition operators that you may recognize from regular expressions. For example, all the rules for expression can be combined into the following EBNF:

    expression -> ( term + | term - )* term

That is, "an expression is matched by a sequence of zero or more repetitions of [a term and a + token or a term and a - token], followed by a term."

All grammars written in EBNF can be transformed into standard BNF with a few simple steps, so the fundamental expressive power is the same no matter which notation is used. It's really a question of emphasis: EBNF tends to clarify the allowable inputs at the cost of making the parse structure somewhat less apparent.
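
To sketch one such step: the Kleene star in the rule above can be eliminated by introducing a helper nonterminal, which we'll call term-list (a name of our own choosing). Its last production below has an empty right-hand side, so term-list matches zero or more repetitions of term + or term -:

    expression -> term-list term

    term-list -> term-list term +
    term-list -> term-list term -
    term-list ->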

10.2.3. YACC

As we mentioned in Chapter 1, YACC (Yet Another Compiler Compiler) is a tool for building parsers, interpreters, and compilers. YACC is a translator whose input language is a form of augmented BNF, and whose output is a C/C++ program that does the specified parsing and interpreting. Among computer language jocks, the process of interpreting some parsed input is known as semantic evaluation. YACC supports semantic evaluation by allowing us to associate some data (a semantic value) with each symbol and some C/C++ code (a semantic action) with each rule. The semantic action, enclosed in braces, computes the semantic value of the rule's left-hand-side nonterminal from those of its constituent symbols. A complete YACC program for parsing and evaluating arithmetic expressions follows:

    %{ // C++ code to be inserted in the generated source file
      #include <cstdio>
      typedef int YYSTYPE; // the type of all semantic values

      int yylex();                       // forward
      void yyerror(char const* msg);     // forward
    %}
    %token INTEGER     /* declare a symbolic multi-character token */
    %start lines       /* lines is the start symbol */

    %% /* grammar rules and actions */
    expression : term
               | expression '+' term { $$ = $1 + $3; }
               | expression '-' term { $$ = $1 - $3; }
               ;

    term : factor
         | term '*' factor { $$ = $1 * $3; }
         | term '/' factor { $$ = $1 / $3; }
         ;

    factor : INTEGER
           | group
           ;

    group : '(' expression ')'   { $$ = $2; }
          ;

    lines : lines expression
            {
              std::printf("= %d\n", $2);  // after every expression
              std::fflush(stdout);        // print its value
            }
            '\n'
          | /* empty */
          ;

    %% /* C++ code to be inserted in the generated source file */
    #include <cctype>

    int yylex() // tokenizer function
    {
      int c;

      // skip whitespace
      do { c = std::getchar(); }
      while (c == ' ' || c == '\t' || c == '\r');

      if (c == EOF)
        return 0;
      if (std::isdigit (c))
      {
          std::ungetc(c, stdin);
          std::scanf("%d", &yylval); // store semantic value
          return INTEGER;
      }
      return c;
    }

    // standard error handler
    void yyerror(char const* msg) { std::fprintf(stderr,msg); }

    int main() { int yyparse(); return yyparse(); }

As you can see, some of the C++ program fragments in curly braces are not quite C++: they contain these funny $$ and $n symbols (where n is an integer). When YACC translates these program fragments to C++, it replaces $$ with a reference to the semantic value for the rule's left-hand-side nonterminal, and $n with the semantic value for the nth right-hand-side symbol. The semantic actions above come out looking like this in the generated C++:

    yym = yylen[yyn];
    yyval = yyvsp[1-yym];
    switch (yyn)
    {
    case 1:
    { std::printf("= %d \n", yyvsp[0]); std::fflush(stdout); }
    break;
    case 8:
    { yyval = yyvsp[-2] * yyvsp[0]; }
    break;
    case 9:
    { yyval = yyvsp[-2] / yyvsp[0]; }
    break;
    case 11:
    { yyval = yyvsp[-2] + yyvsp[0]; }
    break;
    case 12:
    { yyval = yyvsp[-2] - yyvsp[0]; }
    break;
    }
    yyssp -= yym;
    ...

This code is just a fragment of a source file full of similar unreadable ugliness; in fact, the BNF part of the grammar is expressed in terms of large arrays of integers known as parse tables:

    const short yylhs[] = {
       -1,
        2,    0,    0,    3,    3,    4,    5,    5,    5,    1,
        1,    1,
    };
    const short yylen[] = {
        2,
        0,    4,    0,    1,    1,    3,    1,    3,    3,    1,
        3,    3,
    };
    const short yydefred[] = { ... };
    const short yydgoto[] = { ... };
    const short yysindex[] = { ... };
    const short yyrindex[] = { ... };
    const short yygindex[] = { ... };

You don't need to understand how the generated code works: It's the job of the DSL to protect us from all of those ugly details, allowing us to express the grammar in high-level terms.

10.2.4. DSL Summary

It should be clear at this point that DSLs can make code more concise and easier to write. The benefits of using little languages go well beyond rapid coding, though. Whereas expedient programming shortcuts can often make code harder to understand and maintain, a domain-specific language usually has the opposite effect due to its high-level abstractions. Just imagine trying to maintain the low-level parser program generated by YACC for our little expression parser: Unless we had the foresight to maintain a comment containing something very close to the YACC program itself, we'd have to reverse-engineer the BNF from the parse tables and match it up to the semantic actions. The maintainability effect becomes more extreme the closer the language gets to the domain abstraction. As we approach the ideal language, it's often possible to tell at a glance whether a program solves the problem it was designed for.

Imagine, for a moment, that you're writing control software for the Acme Clean-Burning Nuclear Fusion Reactor. The following formula from a scientific paper describes how to combine voltage levels from three sensors into a temperature reading:

    T = (a + 3.1)(b + 4.63)(c + 2×10^8)

You need to implement the computation as part of the reactor's failsafe mechanism. Naturally, using operator notation (C++'s domain-specific sublanguage for arithmetic) you'd write:

    T = ( a + 3.1 ) * ( b + 4.63 ) * ( c + 2E8 );

Now compare that to the code you'd have to write if C++ didn't include support for operators:

    T = mul(mul(add(a, 3.1), add(b, 4.63)), add(c, 2E8));

Which notation do you trust more to help prevent a meltdown? Which one is easier to match up with the formula from the paper? We think the answer is obvious. A quick glance at the code using operator notation shows that it implements the formula correctly. What we have here is a true example of something many claim to have seen, or even to have produced themselves, but that in reality is seldom encountered in the wild: self-documenting code.

Arithmetic notation evolved into the standard we use today because it clearly expresses both the intent and the structure of calculations with a minimum of extra syntax. Because mathematics is so important to the foundation of programming, most computer languages have built-in support for standard mathematical notation for operations on their primitive types. Many have sprouted support for operator overloading, allowing users to express calculations on user-defined types like vectors and matrices in a language that is similarly close to the native domain abstraction.
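
As a minimal sketch of what that buys you (the vec3 type and its operators below are our own illustration, not any particular library's interface), overloaded operators let calculations on a user-defined vector type read much like the underlying mathematics:

    #include <iostream>

    struct vec3 { double x, y, z; };

    vec3 operator+(vec3 a, vec3 b) { return { a.x + b.x, a.y + b.y, a.z + b.z }; }
    vec3 operator*(double s, vec3 v) { return { s * v.x, s * v.y, s * v.z }; }

    int main()
    {
        vec3 force = { 1.0, 0.0, 2.0 };
        vec3 drag  = { -0.1, 0.0, -0.2 };

        vec3 net = force + 0.5 * drag;   // reads like the formula, not like nested calls
        std::cout << net.x << ' ' << net.y << ' ' << net.z << '\n';
    }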

Because the system knows the problem domain, it can generate error reports at the same conceptual level the programmer uses. For example, YACC detects and reports on grammatical ambiguities, describing them in terms of grammar productions rather than dumping the details of its parse tables. Having domain knowledge can even enable some pretty impressive optimizations, as you'll see when we discuss the Blitz++ library later in this chapter.

Before moving on, we'd like to make a last observation about DSLs: It's probably no coincidence that both Make and BNF have a "rule" concept. That's because DSLs tend to be declarative rather than imperative languages. Informally, declarative languages describe rather than prescribe. A purely declarative program mentions only entities (e.g., symbols, targets) and their relationships (e.g., parse rules, dependencies); the processing or algorithmic part of the program is entirely encoded in the program that interprets the language. One way to think of a declarative program is as an immutable data structure, to be used by the language's conceptual execution engine.
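
To make that last idea concrete, here is a toy sketch of our own in C++: the "program" is nothing more than an immutable table of Make-like targets and dependencies, and all of the processing lives in a separate engine that walks the table:

    #include <iostream>
    #include <string>
    #include <vector>

    struct Rule { std::string target; std::vector<std::string> dependencies; };

    // The declarative "program": entities and relationships only.
    const std::vector<Rule> rules = {
        { "edit",   { "main.o", "kbd.o" } },
        { "main.o", { "main.cpp", "defs.h" } },
        { "kbd.o",  { "kbd.cpp", "defs.h", "command.h" } },
    };

    // The execution engine: the algorithm lives here, not in the rules.
    void update(const std::string& target)
    {
        for (const Rule& r : rules)
        {
            if (r.target != target) continue;
            for (const std::string& dep : r.dependencies)
                update(dep);                     // bring dependencies up-to-date first
            std::cout << "rebuild " << target << '\n';
            return;
        }
        // no rule found: treat the target as an existing source file
    }

    int main() { update("edit"); }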
