
Imperfect C++ Practical Solutions for Real-Life Programming
By Matthew Wilson
Chapter 15.  Values


15.4. Literals

We saw in the previous example how we can get into trouble by relying on variables having the precise value of literals. This section looks at the situation from the other way round. In other words, it deals with the nature of literals themselves and various inconsistencies and cautions.

15.4.1 Integers

"The type of an integer literal depends on its form, value and suffix" (C++-98: 2.13.1.2). That seems pretty plain. This section of the standard goes on to explain how the suffix affects the type of an integer literal. Basically, if it has no suffix then it will be int, or long int if its value would not fit into an int. If it has the suffix l or L, it will be long. If it has the suffix u or U, then it will be unsigned int, or unsigned long int if unsigned int is not large enough. If it has any combination of U (or u) and L (or l), then its type is unsigned long int.

As we mentioned in the item on NULL, determining which overloaded function the compiler will select can be quite a muddied business. Consider now that you may have written a set of overloaded functions/methods and client code such as the following:



void f(short i);
void f(long i);

int main()
{
  f(65536);
  . . .



If you'd written this code on a compiler where int was a 16-bit quantity, it would compile fine: 65536 exceeds the capacity of a 16-bit signed (or unsigned, for that matter) integer, and so would be interpreted as a long. The second overload takes a long argument, so the call resolves without issue. However, if you now attempt to compile this on a modern compiler, where the size of an int is 32 bits, you'll find that the compiler interprets 65536 as being of type int, and cannot then choose between the conversions to short and to long, so the call is ambiguous.

Imperfection: Specification of the size of integer types as implementation dependent in C/C++ reduces the portability of integer literals.


To be frank, I'm not exactly sure that this warrants being an imperfection. I understand, and largely agree with, the need for the sizes of at least some of the fundamental types to be implementation dependent. Being general-purpose languages, C and C++ need to run on all sorts of architectures, and fixing the sizes of the fundamental types would mean shooting for the lowest common denominator. Given that, one cannot fail to understand the cause of this issue. Nonetheless, it does present a problem, and we must be aware of it. There are two possible solutions.

The first solution is to eschew the use of literal integers in your code in favor of using constants. Since C++ constants should (and these days must) have an explicit type, the problem pretty much disappears, or at least moves to the constant definitions, where it is much more visible and likely to receive more "informed" maintenance attention than when lurking in bits of code. If you don't want to pollute the local namespace—a good instinct to have—and you are certain that the literal is of meaning to the current context only, then you can use a function-local constant.



void f(short i);
void f(long i);

const long THE_NUMBER = 65536; // Namespace-local constant, or

int main()
{
  const long THE_NUMBER = 65536; // ... function-local constant, or
  f(THE_NUMBER);
  if(. . .)
  {
    const long THE_NUMBER = 65536; // block-local constant
    . . .



In those rare circumstances where you feel you must have literals in your code, the second solution is to explicitly type-qualify your literals in situ. This may be done in any of the following forms:



f(long(65536));                // "Constructs" a long - looks elegant
f((long)65536);                // C-style cast - nasty!
f(static_cast<long>(65536));   // static cast - not pretty
f(literal_cast<long>(65536));  // Funky custom cast - good for code searches



For fundamental types, the first form, known as a function-style cast [Stro1997], is identical to the second, C-style, cast.[7] C-style casts are rightly deprecated in most circumstances (see Chapter 19, and [Meye1998, Meye1996, Sutt2000, Stro1994, Stro1997]; also see section 19.3 for the very limited cases where C-style casts are preferred). Even in this case, where their use (in either guise) is generally harmless, the possibility for error still exists in casting to a type that is too small for the given literal. However, the exact same problem exists when using static_cast. In either case, the solution is to raise your compiler's warning levels and to use more than one compiler. The static_cast form is preferred, because it is easier to search for, and because it is ugly [Stro1994], which reminds you that you probably shouldn't be using literal integers in your code in the first place.

[7] This is a bit of an imperfection in itself, since when used in templates it will do strong checking, à la static_cast<>, for most types, but C-style casting for fundamental types, with the consequent problems of potential conversion losses and inefficiencies.

If you want to make your code more amenable to automated code quantification, or you just like being flash, or both, you can take the ugliness to a higher level and implement a literal_cast operator (we learn about how to do such things in Chapter 19), but you may be pushing the bounds of reason (and modesty) if you do.

Whichever approach you choose, it is important to be mindful of the problem and as much of the minutiae of the rules that cause it as you can bear. This awareness will aid you no end when you come across code that has not taken care with its literals.

15.4.2 Suffixes

We've just seen how the type of integer literals is assessed. What I didn't mention was the ominous sentence that constitutes the next clause: "A program is ill-formed if one of the translation units contains an integer literal that cannot be represented by any of the allowed types" (C++-98: 2.13.1.3). In other words, literals for types larger than long are not a defined part of the language.

It's not surprising to learn, therefore, that 32-bit compilers equivocate on the manner in which they will understand the representation of 64-bit integer literals. Some need LL/ULL, others L/UL; still others accept both. Needless to say, this is quite a hassle when trying to write portable code.

Imperfection: C++ compilers do not agree on the syntax for suffixes of integer literals larger than those that will fit into (unsigned) long.


The solution to this is as mundane and unattractive as the problem: macros. The Synesis Software numeric limits header contains the following ugly macros:



#define __SYNSOFT_GEN_S8BIT_SUFFIX(i)   (i)
. . .
#define __SYNSOFT_GEN_U32BIT_SUFFIX(i)  (i ## UL)
#if ( __SYNSOFT_DVS_COMPILER == __SYNSOFT_VAL_COMPILER_DMC || \
      __SYNSOFT_DVS_COMPILER == __SYNSOFT_VAL_COMPILER_DECC || \
      . . .
      __SYNSOFT_DVS_COMPILER == __SYNSOFT_VAL_COMPILER_XLC)
# define __SYNSOFT_GEN_S64BIT_SUFFIX(i)   (i ## LL)
# define __SYNSOFT_GEN_U64BIT_SUFFIX(i)   (i ## ULL)
#else
# define __SYNSOFT_GEN_S64BIT_SUFFIX(i)   (i ## L)
# define __SYNSOFT_GEN_U64BIT_SUFFIX(i)   (i ## UL)
#endif /* compiler */



and the following symbol definitions:



/* 8-bit. */
#define __SYNSOFT_VAL_S8BIT_MAX   \
    (+ __SYNSOFT_GEN_S8BIT_SUFFIX(127))
. . .
#define __SYNSOFT_VAL_U32BIT_MAX  \
    (  __SYNSOFT_GEN_U32BIT_SUFFIX(0xffffffff))
#define __SYNSOFT_VAL_U32BIT_MIN  \
    (  __SYNSOFT_GEN_U32BIT_SUFFIX(0x00000000))
/* 64-bit. */
#define __SYNSOFT_VAL_S64BIT_MAX  \
    (+ __SYNSOFT_GEN_S64BIT_SUFFIX(9223372036854775807))
#define __SYNSOFT_VAL_S64BIT_MIN  \
    (- __SYNSOFT_GEN_S64BIT_SUFFIX(9223372036854775807) - 1)
#define __SYNSOFT_VAL_U64BIT_MAX  \
    (  __SYNSOFT_GEN_U64BIT_SUFFIX(0xffffffffffffffff))
#define __SYNSOFT_VAL_U64BIT_MIN  \
    (  __SYNSOFT_GEN_U64BIT_SUFFIX(0x0000000000000000))



Yuck! If you are confident that you can avoid name clashes with fewer characters in your macros (or you're more reckless than me), then feel free to define something like S64Literal() and U64Literal(), which would make it a lot more manageable. Whatever macro you use, you can achieve portability (though not beauty) in the following way:



int64_t   i = __SYNSOFT_GEN_S64BIT_SUFFIX(1234567891234567891);
uint64_t  u = U64Literal(0xDeadBeefDeadBeef);



15.4.3 Strings

Literal strings are expressed in C and C++ within two double quotes, as in "this is a literal" and L"so is this". A string literal is simply a null-terminated contiguous array of char, or of wchar_t when prefixed with L. Hence, the literal L"string" is an array of the seven characters (of type wchar_t) 's', 't', 'r', 'i', 'n', 'g', and 0. The null terminator enables the literals to be interpreted as C-style strings and passed around by a pointer. (Without null termination, a length would also have to be carried along.)

Some languages ensure that all equivalent literal strings are "folded" into the same storage, which means that meaningful comparisons can be made between the pointers (or the language equivalent to a pointer) to the strings, rather than having to compare the string contents. For eminently sensible practical reasons (see section 9.2), C++ does not do this, and so code such as that shown in Listing 15.4 is neither correct C++ syntax, nor semantically valid.

Listing 15.4.


enum Type
{
    abc
  , def
  , unknown
};

Type interpret(char const *s)
{
  switch(s)
  {
    case "abc":
    case "ABC":
      return abc;
    case "def":
    case "DEF":
      return def;
    default:
      return unknown;
  }
}



However, since some compilers can, and do, ensure that all identical strings within a link unit are "folded" into one instance, it is possible to do something such as shown in Listing 15.5.

Listing 15.5.


Type interpret(char const *s)
{
  if(s == "abc")
  {
    return abc;
  }
  else if(s == "def")
  {
    return def;
  }
  else
  {
    return unknown;
  }
}



However, this is still not a sensible thing to do, for two reasons. First, if the executing process consists of more than one link unit (see section 9.2), which is very common, then it is possible for s to be a pointer to "abc" but fail to match the "abc" in the first conditional subexpression. Second, programs often deal with character strings that are generated, copied, and built up from pieces of others. It is likely in many scenarios that s will point to a character string that is not a literal created by the compiler, but rather one generated within program execution. In either case, the plain pointer comparison fails to correctly identify logical equivalence of the strings to which they point.

Constraint: C and C++ do not guarantee to ensure that identical string literals will be folded within a single link unit, and cannot do so across separate link units.


The advice is, of course, to effect string comparison by value, using strcmp() or similar functions, or to use string objects and their operator ==() overloads for char/wchar_t const*. However, you should not completely discount pointer testing. Indeed, there are circumstances wherein a substantial number of the string pointers passed to interpret() will be literals (perhaps because they were handed out from a translation table). In such cases the following code may be appropriate.



enum Type
{
    abc
  , def
  , unknown
};

Type interpret(char const *s)
{
  if(s == "abc")        // Fast path: pointer identity with folded literals
  {
    return abc;
  }
  else if(s == "def")
  {
    return def;
  }
  else
  {
    if(0 == strcmp(s, "abc"))   // Fallback: comparison by value
    {
      return abc;
    }
    else if(0 == strcmp(s, "def"))
    {
      return def;
    }
  }
  return unknown;
}



Please bear in mind that the circumstances in which this technique is appropriate are few and far between, and you should only evaluate whether this "enhancement" is beneficial when you've determined through quantitative performance analysis that the interpret() function is a source of significant latency. In other words, listen to the legions of esteemed engineers [Kern1999, Sutt2000, Stro1997, Meye1996, Dewh2003, Broo1995] who warn of premature optimization!

We know that equivalent non-empty literal strings are not guaranteed to share storage, but I wondered whether the standard treats the empty string, "", as a special case. I could find nothing[8] in the standard that says empty strings are treated as one, so I conducted a simple test on a selection of compilers, which amounted to the following code:

[8] Of course, absence of evidence is not evidence of absence.



#include <stdio.h>

int main()
{
  char const  *p1 = "";
  char const  *p2 = "";
  printf("%sequal\n", (p1 == p2) ? "" : "not-");
  return 0;
}



It seems that with Borland, Intel (in debug mode), and Watcom, p1 and p2 are different. With CodeWarrior, Digital Mars, GCC, Intel (in release mode), and Visual C++ they are the same. I expect all of these compilers have flags to enforce folding (or "duplicate string merging," as it is also called) of literals, but despite that, there's enough variation to make us very wary of making assumptions.

You're probably wondering why I'm giving this issue such attention. Well, it could be convenient to use the empty string — "" — as a sentinel value. For example, you may implement a string class String and use the empty string for empty (e.g., default constructed) instances rather than allocate an array of one character (with the value '\0'). The String destructor would then compare with the empty string literal and skip deallocation when it was "containing" the empty string.

With such a technique instances would always be able to render a non-null pointer to a string—as decreed by the String model (C++-98: 21.3)—while eliminating the costs of allocating those single character arrays for each empty string class instance. There's an example of this in the string class discussed in section 2.3.1.

However, we've learned that this is not prescribed by the standard, is only partially provided by the compilers, and is practically impossible when dealing with dynamic libraries. So the simple advice is, don't ever rely on equivalent literal strings always having the same address, but you can optimize for it happening some of the time.

