
Imperfect C++: Practical Solutions for Real-Life Programming
By Matthew Wilson
Chapter 13.  Fundamental Types


13.1. May I Have a byte?

Computers use the byte as their fundamental unit of memory storage,[1] so it would seem natural to have a type with which to access and manipulate bytes. Most of the time in C++, and in C, we are interested in dealing with specific types, for example, char const*, Person&, and so on. However, there are times when we need to manipulate opaque blocks of memory, such as in data compression. In such cases, we are dealing with chunks of bytes, where the content of individual bytes has no meaning.

[1] Modern architectures allow individual bytes to be addressed (though there are caveats; see section 10.1), but there have been architectures where pointers of different sizes were used to address different types (e.g., 16-bit pointers for 16-bit words and 24-bit pointers for bytes).

Imperfection: C and C++ are missing a byte type.


Unfortunately, there is no byte type in C/C++, and so the common practice in such cases is to use char. This makes sense, since the size of char is always one byte. The (C++-98) standard does not say this directly, but it does say (C++-98: 5.3.3) that "the sizeof operator yields the number of bytes in the object representation of its operand" and "sizeof(char), sizeof(signed char) and sizeof(unsigned char) are 1." Clearly, then, sizeof(char) == sizeof(byte) must always be true. (Note that a byte is not necessarily 8 bits, just that it is "large enough to fit the basic character set" [C++-98: 3.9.1.1]. Irrespective of the bit size, however, sizeof(byte) == 1 is an axiom.)
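
Both facts are easy to check. The following trivial program (mine, not the book's) prints sizeof(char) alongside the number of bits per byte, using the standard CHAR_BIT macro from &lt;climits&gt;:

#include <climits>
#include <cstdio>

int main()
{
  // sizeof(char) is 1 by definition; the number of bits in a byte is CHAR_BIT
  std::printf( "sizeof(char) = %u, CHAR_BIT = %d\n"
             , static_cast<unsigned>(sizeof(char))
             , CHAR_BIT);
  return 0;
}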

However, there are problems with this approach.

13.1.1 Look for a Sign

The first problem is that the "signedness" of the char type, when not qualified with signed or unsigned, is implementation dependent. This leads to problems when the char-as-byte variable is used in arithmetic conversions where we need to interpret its meaning as a number. If char is signed, then assigning it to a larger type will sign-extend the value [Stro1997]. If an 8-bit char's value were 0xff, then when it is involved in an integer promotion, say to a 32-bit integer (whether signed or unsigned), the result would be 0xffffffff, which is –1 for signed or 4294967295 for unsigned. If char is unsigned, the same assignment will result in a value of 0x000000ff, which is 255 irrespective of integer sign.
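
To see the difference concretely, here is a small test program (an illustration of mine, assuming an 8-bit char and a 32-bit int):

#include <cstdio>

int main()
{
  signed char   sc = static_cast<signed char>(0xff);
  unsigned char uc = 0xff;

  // Conversion to a wider type: sc is sign-extended, uc is zero-extended
  unsigned int from_signed   = sc;  // 0xffffffff, i.e., the value -1
  unsigned int from_unsigned = uc;  // 0x000000ff, i.e., 255

  std::printf("%08x %08x\n", from_signed, from_unsigned); // ffffffff 000000ff
  return 0;
}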

The answer to this issue is very simple: always specify the sign. In my opinion the better option is to use unsigned to avoid the sign extension, since this fits my notion of a byte as a collection of bits, though I concede that this is not unequivocally superior to choosing signed; some argue that signed is preferable for finding underrun bugs. The important thing is to make an explicit choice and remove the ambiguity.

13.1.2 It's All in the Name

The second problem is that the names of types carry semantic significance: they suggest their intended use. Of course, the names of types don't matter in the least to a compiler, but they are extremely important to human writers and readers of code. Using (un)signed char for a byte is misleading.

Using a type char indicates to the reader that the variable represents a character; if it's a byte, it should be called byte, or byte_t or something similarly obvious. Given Robert L. Glass's [Glas2003] assertion that 60 percent of software engineering is maintenance, it seems prudent to keep your code's maintainer(s)—which may be you in 18 months' time—as informed as possible.

The answer here is to provide a suitable typedef for a byte type, as in:



typedef unsigned char byte_t;



Consider the following two variable declarations:



unsigned char   v1;
byte_t          v2;



v2 is unambiguously a byte, with all the semantic significance that conveys. v1 may or may not connote "byte" as opposed to character, depending on the experience and instincts of the developer reading this code. Unfortunately, the APIs for multibyte character set (MBCS) encoding schemes in the very widely used Visual C++ libraries use unsigned char (const) * for MBCS character strings, which pretty much nullifies these instincts for developers who use these libraries. Even when unsigned char always says "byte" to all the readers of such code, it is all too easy to leave off the sign qualifier, either in your code or in your mind.

Another benefit of the explicit stipulation of sign is that compilers will reject statements like the following:



signed byte_t   x;  // error: sign specifier cannot be applied to a typedef name
unsigned byte_t y;  // error: sign specifier cannot be applied to a typedef name



This further emphasizes the (logical) independence of bytes from sign.

13.1.3 Peering into the Void

An important use for a byte type is to be able to represent pointers to bytes. A common practice is to use void*, which certainly primes the reader to think in terms of "pure memory." Using void* has the advantage that the compiler will pick up on any attempts to use pointer arithmetic, which it otherwise performs automatically for us with other pointer types. However, it presents its own difficulties in that pointer arithmetic involves casts to and from byte-sized type pointers, with the consequent potential for mistakes.

Some APIs express pointers to bytes as unsigned char (const)*. The problem from a human point of view is that seeing a parameter of type char* inclines the reader to think of the recipient of a writable string buffer or of a single character, and a parameter of type char const* to think of a null-terminated string. Explicitly qualifying with (un)signed helps in this regard, of course, but it only takes a couple of slips of the sign qualifier to propagate throughout the code base and you're at 300kph on the autobahn to unmaintainability. In any case, we've seen that some libraries use unsigned char to represent a character.

A slightly better solution might be to use the C99 type uint8_t (const)* (see next item), but this is not (yet) part of standard C++, and it still says "integer" (number) rather than "byte" (opaque value).[2]

[2] And on those very rare architectures where a byte is not 8 bits, it would be very misleading!

Using byte_t (const)* means that we can express pointers to opaque bytes in a simple, readable way, and we don't have to resort to any casting with pointer arithmetic, so the code is immune to pointer-type mismatch and offset problems.
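
To make the contrast concrete, here are two hypothetical routines (xor_block_void() and xor_block_byte() are my own illustrative names, not part of any library discussed here) that apply a key over an opaque block of memory, written once in terms of void* and once in terms of byte_t*:

#include <cstddef>

typedef unsigned char byte_t;  // as defined earlier

// void* form: a cast is needed before any arithmetic or indexing can be done
void xor_block_void(void *block, std::size_t cb, byte_t key)
{
  byte_t *p = static_cast<byte_t*>(block);
  for(std::size_t i = 0; i < cb; ++i)
  {
    p[i] ^= key;
  }
}

// byte_t* form: no casts; indexing and pointer arithmetic just work
void xor_block_byte(byte_t *block, std::size_t cb, byte_t key)
{
  for(std::size_t i = 0; i < cb; ++i)
  {
    block[i] ^= key;
  }
}

The second form says what it means, and there is no cast to get wrong if the pointer type or offset handling changes later.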

13.1.4 Extra Safety

There's one more step that can be employed to eke out a little extra type safety. I've used the approach shown in Listing 13.1 in several libraries.

Listing 13.1.


#if defined(ACMELIB_COMPILER_IS_INTEL) || \
    defined(ACMELIB_COMPILER_IS_MSVC)
 typedef unsigned __int8        byte_t;
...
#else
 typedef unsigned char          byte_t;
#endif /* compiler */



The __int8 type is an 8-bit integer defined by the Intel and Visual C++ (and a few other) compilers. There is a feature (maybe it's a bug?) of the Visual C++ 6.0 compiler, which is emulated in the Intel compiler,[3] whereby __int8 and char are not considered simple aliases of each other (see section 18.3); rather, they represent distinct types.

[3] When used in Visual C++ 6.0 compatibility mode.

We can turn this to our own advantage by defining byte_t to be unsigned __int8 when compiling with these compilers, which allows us to write functions with greater type safety. (Naturally, the fact that we're defining byte in terms of a specifically 8-bit integer would run aground on a platform where a byte might contain 9 bits, but on such a platform we'd adjust the definition of byte_t accordingly.)

Consider the following class, which is written to work with memory but not characters.



class NoCharsPlease
{
public:
  NoCharsPlease(byte_t *);
// Not to be implemented
private:
#ifdef ACMELIB_CF_DISTINCT_BYTE_SUPPORT
  NoCharsPlease(char *);
  NoCharsPlease(signed char *);
  NoCharsPlease(unsigned char *);
#endif /* ACMELIB_CF_DISTINCT_BYTE_SUPPORT */
};



Now if you try to pass a pointer to a char-based block of memory, the compiler will reject it.



byte_t        *pc   = new byte_t[10];
unsigned char *puc  = new unsigned char[10];

NoCharsPlease  ncp1(pc);  // Ok
NoCharsPlease  ncp2(puc); // Compile error



Since this only works on a small (and aging) subset of compilers, and relies on nonstandard compiler features at that, it is arguable whether you would want to use this technique. However, it can be helpful when you are using several compilers in order to maximize the information you get from them (see Appendix C); since I tend to do that, I use this technique.

A portable and future-proof alternative, for C++ compilation only, is to use the True Typedef technique (see chapter 18), thereby making conversions between the byte type and any other integral type invalid.
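
By way of illustration only, the general shape of such a strong typedef is sketched below. This is my own minimal approximation, not the True Typedef component itself (which is developed fully in Chapter 18), and the names are purely illustrative:

// Minimal sketch of a "strong" byte type: no implicit conversions to or
// from char, int, and friends, so mixing it with other integral types
// fails to compile unless the raw value is requested explicitly.
class byte_tt
{
public:
  explicit byte_tt(unsigned char value = 0)
    : m_value(value)
  {}

  unsigned char value() const
  {
    return m_value;
  }

private:
  unsigned char m_value;
};

byte_tt        b(0xff);           // Ok: explicit construction
//byte_tt      c = 'a';           // Compile error: no implicit conversion
//int          n = b;             // Compile error: no implicit conversion
unsigned char  raw = b.value();   // Ok: explicit access to the raw byte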

