CHAPTER 3: ARRAYS

An array is an aggregate object, which is composed of other objects, its elements. The dimension of an array is the number of objects that it contains. Its size is given by multiplying the dimension by the size of each element.

In C, arrays are used for storing character strings as well as for their more universal uses. As we shall see, this makes a reliable treatment of "properties of an array" somewhat more complex than in languages in which strings are objects in their own right. As always, our concern is for rules of programming which will ensure the reliable use of arrays.

3.1 Array Data

Topics assumed from previous book: arrays and subscripting [3.14], array arguments [5.4].

Just as we did for scalars, we will consider three aspects of array data: type, properties, and allowable operations.

Type

Properties

Allowable operations

Type

In C, the declaration of one variable, such as line in

static char line[10] = "msg";

consists of five components:

static    storage class (optional)

char      type specifier

line[10]  declarator

= "msg"   initialization (optional)

;         semicolon (of course)

Strictly speaking, the storage class and the type specifier can appear in any order, but declarations are clearer in a conventional order:

Rule 3-1: Storage class (if any) should precede the type specifier.

The examples will henceforth assume this rule.

For another variation, several variables can be declared in one declaration:

static char line_a[10], line_b[10];

declares two variables, line_a and line_b. This is in every way equivalent to two separate declarations:

static char line_a[10];

static char line_b[10];

It is a common error to assume that the following initialization applies to both line_a and line_b:

static char line_a[10], line_b[10] = "msg"; misleading

Rule 3-2: If a variable has an initialization, its declaration should have a source line to itself.

Setting aside the above complications, we will consider a declaration to consist of an optional storage class, a type specifier, a declarator, and an optional initialization. Now let us restrict our consideration to the non-optional parts, the type specifier and the declarator. For the rest of this section, when we refer to a "declaration," we will be referring only to the mandatory parts.

So declarations can be sliced into type specifier and declarator. But these syntactic categories are not of much practical importance to programmer; they are merely the syntactic terms. The important way to slice a declaration is into a variable name and a type. In a very simple declaration such as

short i;

the declarator consists only of the variable name (i), and the type specifier consists only of the type (short). But in a more complex declaration such as

short m[2];

the declarator m[2] contains the variable name m plus a part of the type. The type consists of what is left when the variable name is "scratched out" of the declaration, namely short [2]. In this sense, we can say that a declaration is a "sandwich" composed of a variable name and a type. The type is the "bread" wrapped around the variable name "meat" in the middle:

short m[2]  is the declaration

m     is the name

short  [2]  is the type

In C, the representation of an array consists of a sequence of contiguous objects (the elements of the array), running from low addresses to high addresses.

The meaning of an array declaration depends upon context. A parameter (typically declared with no stated dimension) is understood by C to be a pointer rather than an array (see Section 3.4). And a non-defining declaration of an external array, such as

extern short em[];

is an allowable form which declares an array of unknown size. (The size is determined by the defining instance, but a separately compiled function has no built-in way of determining that size.)

For an array whose size is known from the declaration (externally defined, or declared to be auto or static), we can define a useful macro for the dimension:

#define DIM(a) (sizeof(a)/sizeof(a[0]))

In other words, the dimension is the size of the array divided by the size of one element. Since both sizes are known at compile time, DIM(a) is effectively a constant, and can be used as such. One caution: the macro DIM should not be used with a pointer argument. The expression sizeof(p) always produces the size of one pointer, not the size of an array.

Properties

Reliability is often enhanced by maintaining a clear intuitive meaning for each data object. A data object, after all, reflects something about the world outside the program: a person's address, an amount of money, the pressure on an automobile brake pad, the temperature of a blast furnace, etc. Such interpretations are clearer if each object has a defining property which is maintained throughout a computation.

Rule 3-3: Document the defining property of a data object with a comment on its declaration. Ensure that this defining property remains invariant (unchanging) as much as possible throughout the computation, and document any exceptions.

This book's system of properties uses different names for aggregate properties than for scalar properties; this will allow us to say some simpler things about pointers later. An array is either complete (all scalar elements contained within the array are defined) or incomplete (one or more elements are not defined).

An array can become complete either by assignment to all its elements, or by an initializer on its declaration. (Any initializer causes the initialization of the entire array.)

Rule 3-4: A program is easier to write correctly and to understand if all arrays are made complete before the array is used.

Some arrays have other properties that are more important than "complete" or "incomplete." A "string," for example, requires defined characters only up to a nul (i.e., '\0') terminator. The characters after the nul terminator can be total garbage, and the array still has its defining property satisfied, namely being a string.

Rule 3-5: If an array's defining property can be true even if not all elements are defined, indicate the property on the array's declaration. For example,

char s[10];   /* : string */

Lacking such mention, it is assumed that the array must be complete (all elements defined) before its value is used.

For any array, there are boundary values to be considered at the lowest and highest subscripts. These will always be important in determining valid access ("avoiding out-of-bounds reference"). Every array algorithm requires verification of the proper handling of the subscript boundaries.

A second issue concerns the "one-too-far" value required to terminate an ordinary for loop that walks over an array. As discussed in Section 2.7, if subscripts can be unsigned integers, there is always a valid "one-too-far" subscript value at the high end, but none at the low end.

In many cases, the defining property for an array is more restrictive than mere completeness. Let us now consider an array as an object to be sorted. In doing so, we will need a notation for a subarray, i.e. a contiguous sequence of elements of an array. The notation a[low:high] will be used to mean the sequence of elements from a[low] to a[hi]. And the notation a[*] will mean "the entire array a." There are, of course, no such notations in C itself, but the notations will help us to speak more clearly about array algorithms.

To focus for now upon the general problem of sorting the elements of an array, the basic invariant property of such an array is that it contains individual-element values, the set of which remains unchanged from the start to the finish of the sorting process. In other words, the set of values remains invariant, although the positions of those values will be altered as the sort progresses. This is analogous to the sorting of a deck of cards; the values written on the cards do not change, but the ordering does change. To be precise, the array is a "multiset," which is like a set but individual values can occur more than one time. A deck of bridge cards is a proper set; each card appears only once (considering suits as distinct, as of course they are in bridge). A deck of Canasta cards is a multiset; each card appears several times in the deck.

We need to define one more property of an array. An array a is said to be sorted in ascending order if, for each i from 1 to DIM(a)-1, a[i] is greater than (or equal to) each element of a[0:i-1]. (An array containing only one element will be considered to be sorted, trivially.) More generally, a[low:high] is sorted (in ascending order) if, for each i from low+1 to high, a[i] is greater than (or equal to) each element of a[low:i-1].

The property of being sorted is not invariant during the operation of sorting an array. Being sorted is the goal property, not the invariant.

Allowable operations

C provides no built-in operators that act upon all elements of an array, but operations can be created from a succession of operations upon the individual elements. As we examine different data structures in this book, one common theme will be the creation of C functions and macros which perform specific operations upon a data structure. Designing the interface for a function is a subject in itself, and we will discuss the issue throughout the book. For now, our packaging criterion is minimal: we will employ the simplest C function that will encapsulate the data structure and algorithm that are our focus of attention.

The only allowable operations upon multiset arrays are those that perform a re-ordering without altering the actual values in the array. We have seen one such operation, in the form of the SWAP macro. Now we will examine the operation of sorting an array. For simplicity, we will restrict ourselves to an array of short integers. The "sorting" operation will take the form of a function with two arguments: an array parameter a designating the array to be sorted, and an integer n designating the number of elements in the array to be sorted. The function is supposed to arrange the n elements of a such that a[0:n-1] is sorted.

3.2 Sorting an Array

When we design the operations to be performed upon data, there are two more aspects of importance: loop invariants and assertions.

Loop invariants

Assertions and run-time validation

Loop invariants

Assertions and run-time validation

Loop invariants

In designing an algorithm to accomplish an operation such as sorting, the key design principle is to formulate a loop invariant which is true at each step of the computation, and which guarantees that the loop has attained its goal when the loop terminates.

The first sorting algorithm we will consider is called insertion sort. Using i as the loop subscript, the invariant of the insertion sort is "the subarray a[0:i-1] is sorted." The loop terminates when i reaches the "one-too-far" value, n. When it does, the invariant becomes "the subarray a[0:n-1] is sorted," which is the goal of the loop.

We will use the notation "name => property" to mean "name now has this property." Describing the loop invariants in program comments is often very useful documentation; we could, for example, write

for (i = 1; i < n; ++i)

/* a[0:i-1] => sorted */

insert a[i] into its correct place

/* a[0:n-1] => sorted */

To insert a[i] into the correct place, we copy it into a temporary (t), thus freeing up the location a[i]. Then, for each element of a[0:i-1] which is greater than t, we move it one position to the right, then insert t into its ordered position. (This is the insertion sort algorithm from Bentley [1984a].) Implemented as a function, this is what our algorithm looks like:

isort:

/* isort - sort array a (dimension n) using insertion sort

*/

#include "local.h"

void isort(a, n)

short a[];  /* array to be sorted; a[0:n-1]: multiset */

index_t n;  /* dimension of a */

index_t i;

index_t j;

short t;

for (i = 1; i < n; ++i)

/* a[0:i-1] => sorted */

t = a[i];

for (j = i; j > 0 && a[j-1] > t; --j)

a[j] = a[j-1];  /* a[0:n-1] => !multiset */

a[j] = t;  /* a[0:n-1] => multiset */

/* a[0:n-1] => sorted */

During the inner loop, the array momentarily loses its "multiset" property, because the value a[i] has been removed into the temporary t. This is indicated by the comment in the inner loop. As soon as t is assigned into its proper place, the array once again becomes a multiset.

Assertions and run-time validation

We have placed comments in our code that indicate the intended invariant conditions, but unfortunately in real life, errors in creating and maintaining the program may cause such comments to be false and misleading. A more reliable way of stating such properties is to insert executable assertions in the code.

There are trade-offs to consider: The complexity of the assertion tests can exceed the complexity of the code being protected by the assertions, in which case the assertions can become a source of more, not fewer, errors. And the time to execute the assertions can swamp the execution time of the code itself. We do not mind if the assertions are more time-consuming than the program specification allows for the final production program; the assertions are all disabled by compilation with the NDEBUG flag.

Rule 3-6: Use executable assertions whenever they are simpler than the code being protected, and when the time to execute the assertions is not much greater than the time required to execute the code.

In the case of a sorting function, we could provide an executable method of asserting that a given array or subarray is sorted, something like this:

tst_sort:

/* tst_sort - returns YES if array a is sorted

* specialized "short" version

*/

#include "local.h"

bool tst_sort(a, n)

short a[];

index_t n;

index_t i;

if (n <= 0)

return (NO);

for (i = 1; i < n; ++i)

/* a[0:i-1] => sorted */

if (a[i] < a[i-1])      /* compare adjacent elements */

return (NO);

/* a[0:n-1] => sorted */

return (YES);

According to our guidelines, we would not test this assertion inside the loop of isort. The time to test the assertions would become many times greater than the time to execute the code itself. But we could replace the final comment in isort.c

/* a[0:n-1] => sorted */

with this executable assertion:

asserts(tst_sort(a, n-1), "a is sorted");

The assertion is not required in the production version of the function; it checks the logic of the function itself, not any data provided from outside the function. Compiling with NDEBUG would turn off code generation for the assertion. We are using the executable assertions for documentation and for assistance during development, not for run-time checking in the final product.

Now we will consider the same underlying data structure, with a new algorithm, "quicksort."

For consistency, we will create a function with a calling sequence similar to that above:

void qksort(a, n)

short a[];

index_t n;

Loop invariants

Given an array or subarray, we pick the middle element and call it the pivot. Then the array is rearranged: those elements less than the pivot are moved to the left of the pivot, and those greater than the pivot are moved to its right. For convenience, we start by swapping the pivot with the initial element. The lowest subscript is lo, the highest subscript is hi, the subscript of the rightmost "less-than-pivot" element is lastlo, and the subscript of the element currently being tested is i. (This quicksort is somewhat simpler than most; it is adapted from Bentley [1984a].) During the loop of the sort, the invariant looks like this:

When the loop terminates, this is the picture:

Swapping a[lo] with a[lastlo] gives

Thus we end up with two smaller subarrays, a[lo:lastlo-1] (all of whose elements are strictly less than a[lastlo]) and a[lastlo+1:hi] (all of whose elements are greater than or equal to a[lastlo]). If the array is trivial (only one element), then both subarrays are empty; there is no work to do, and the function returns immediately. If there is more than one element, at least one of the subarrays is non-empty. The function calls itself recursively for both subarrays. When it returns, the array a[lo:hi] is sorted.

Assertions and run-time validation

There is little to add here except to assert tst_sort after completion of the function. The function is so short that assertions during the algorithm would be needlessly complex in comparison.

Here is the qksort function:

qksort:

/* qksort - sort array a (dimension n) using quicksort */

#include "local.h"

/* iqksort - internal function to partition a[lo:hi] */

static void iqksort(a, lo, hi)

short a[];   /* array to be sorted; a[lo:hi]: multiset */

index_t lo;  /* lowest subscript */

index_t hi;  /* highest subscript */

index_t mid = lo + (hi-lo)/2;    /* : {lo:hi} */

index_t i;                       /* : {lo+1:hi+1} */

index_t lastlo;                  /* : {lo:hi-1} */

short tmp;

SWAP(a[lo], a[mid], tmp);

lastlo = lo;

for (i = lo + 1; i <= hi; ++i)

if (a[lo] > a[i])

++lastlo;

SWAP(a[lastlo], a[i], tmp);

SWAP(a[lo], a[lastlo], tmp);

if (lastlo != 0 && lo < lastlo - 1)

iqksort(a, lo, lastlo - 1);

if (lastlo + 1 < hi)

iqksort(a, lastlo + 1, hi);

/* qksort - external entry point */

void qksort(a, n)

short a[]; /* array to be sorted; a[0:n-1]: multiset */

index_t n; /* number of elements */

if (n > 1)

iqksort(a, 0, n - 1);

asserts(tst_sort(a, n), "a is sorted");

The computation of mid is chosen to avoid a possible overflow. In the simpler expression

(lo + hi) / 2

the sum is susceptible to overflow for large values.

In Plum and Brodie [1985], optimization of this function is discussed at length. It is important to note that this recursive version of quicksort is vulnerable to some array values that cause the depth of the recursion stack to almost reach the number of elements in the array.

3.3 Strings

Topics assumed from previous book: string [2.4].

In C, a string is an array of char objects, containing a nul terminator '\0' that marks the end of the current string value. A string constant is written with double-quotes, like

"hello, world\n"

C does not require that every array of char'S must be a string; indeed, the only built-in knowledge of "string" semantics is that string constants receive a nul terminator. Nonetheless, a char array is so frequently used to hold a string that we will identify a special property to signify a nul terminated string: an array has the string property if all the characters up to the first nul terminator are defined.

If a char array has an initializer, the initializer will determine whether the array has the string property initially.

char a1[2] = {'a', 'b'};

is not a string, because there is no nul terminator within a1. On the other hand,

char a2[2] = "a";  and

char a3[2] = {'a'};

are both strings, according to the properties of string constants and of static array initializers.

It is often a useful style to declare an array of char'S with an explicit +1 in its bound, to emphasize that space is reserved for the nul terminator. All other things being equal, algorithms involving char arrays are simpler if all the arrays involved remain proper nul-terminated strings during the execution.

3.4 Pointers as Array Parameters

Topics assumed from first book: pointers as function parameters [7.3], pointers and arrays [7.4].

In this book, we will follow the usage of declaring pointer parameters to scalars with the "asterisk" notation and declaring pointer parameters to arrays with the "empty brackets" notation. Thus the function definition

fn(a, b)

char *a;

char b[];

indicates that a points to a scalar char, and that b points to the initial element of an array of charS.

No strong semantics can be attached to the distinction, because the distinction is unavailable for non-parameter pointers. The distinction still seems worth keeping, because parameter pointers are so often used. This style leads to a more graceful notation for two-dimensional parameters:

fn(table)

char table[] [20];

looks more natural than

fn(table)

char (*table)[20];

Relying upon the foregoing context, we will henceforth refer to array parameters when talking about pointer parameters which point to the initial elements of arrays. Similarly, string parameters are pointer parameters which point to the initial char in a string.

Further generalizing, we will use the term array pointer for any pointer used to point to array elements. (The term "pointer into array" might be more precise, but the brevity of "array pointer" is worthwhile because the term will be used so often in discussing array algorithms.) Remember, it would be inaccurate to call such a pointer a "pointer to array" -- an array pointer points to single elements, not to an entire array.

Array pointers (including array parameters) are different from array objects. The pointer itself can be undefined, or defined. It can be NULL, or it can point to valid storage. The array object being accessed through the pointer has its own properties: It could be complete or incomplete. It could be a multiset array (either sorted or unsorted). It could be a string. Or it could have any other property that we have defined for arrays.

In order to talk sensibly about the object being referenced we will need some new terminology. Because of the way C references arrays through pointers, the array being accessed is not simply the "indirect object" of the pointer (*p), because *p is just the individual array element pointed to by the pointer. We define the target of an array pointer to be the array object being accessed through the pointer. We will use the notation p[*] to mean "the entire array to which the pointer provides access."

In defining a meaning for p[*], we have reached the heart of an important reliability issue in the use of C: given an array pointer, exactly what object (i.e., what range of addresses) can we reliably access using that pointer? In the freest and least reliable usages, a pointer can be used to give access to the entire data space of the program:

int i;

int *p = &i;

printf("%d\n", p[12000]);   wild and dangerous usage

shows an access to whatever object is 12,000 elements further into the memory. But certainly this is not what we have in mind in using a pointer to access an array; the target of the pointer is meant to be only that array into which it is pointing.

In other words, *p ("indirect p") is only one array element; p[*] is the entire array accessed through p. In this example

short a[10];

short *pa = a;

the indirect object of p (*p) is the element a[0]. The target of (p[*]) is the entire array a.

For single-dimension arrays, there is no ambiguity; in the example

short a[10];

short *pa = a;

it is clear that a is intended to be the target of pa. But what of this example:

short b[10] [10];

short *pb2 = b[2];

Now we must make a choice and be consistent about it. Did we mean that pb2 is meant only for providing access to b[2], the second row of b, or that it is meant to provide access to the entire array b? Our choice is that when an array pointer is given a value, by assignment or passing an argument, the actual right-hand-side expression being assigned will be taken to indicate the target of the array. Thus, in the second example, the target of pb2 is the array b[2] only.

One concession to common usage of C: The assignment

p = &a[n];

is taken to mean the same thing as

p = a + n;

which means that the entire array a is the target of the pointer.

One further notational convenience: The properties of the target can be attributed to the pointer itself. For example, "pointer p is a string" means that the target of p is an array with the "string" property. This allows more conventional and convenient documentation:

/* p: string */

means the same thing as

/* p[*]: string */

What are the practical implications of all these distinctions? There are two: determining the properties of the target of the pointer, and providing the possibility of range-checking the use of the pointer (i.e., avoiding "run-away pointers").

"Properties of the target" is the simpler of the two considerations. All that is meant here is that, for example, when an array of char'S is passed as the second argument to strcpy, that array should be a string. If we wish to state this constraint explicitly in documenting a function such as strcpy, we can use a notation like this:

char *strcpy(s1, s2)

char s1[];      /* s1:!NULL, sizeof(s1[*]) > strlen(s2) */

char s2[];      /* s2:string */

which means that the pointer s1 is not NULL, that the array to which s1 provides access is large enough to hold the result, and that the array to which s2 provides access is a string.

Regarding range-checking in reliable programming, we desire not to access any objects that are outside the extent of the pointer's target. (This is why the extent of the target must be unambiguous.) We will therefore adopt the convention that every pointer value has a range associated with it, which may actually be monitored by an interpreter or may just be a conceptualization guiding the programmer towards reliable code.

Section 2.6 discussed how scalar data objects could have a range associated with their declaration; the range lasts throughout the lifetime of the variable. For pointers, the situation is different; a pointer can be used to point into several different arrays as the computation progresses. Therefore, we will consider that the range of a pointer variable is dynamically set by each operation upon the pointer.

When a pointer is assigned an expression value involving an array target, the address range of the target determines the range of the pointer. When a pointer is assigned an expression value involving a pointer, the left-hand-side pointer acquires the range of the right-hand-side pointer. (By the rules of C, the right-hand-side can have only one array or pointer operand, so the source of range information is unambiguous.)

Before we say exactly what the range of the pointer is, we have to consider the possible "one-too-far" value of an array pointer. In a loop like

for (p = a; p < a + DIM(a); ++p)

process *p;

we are counting upon p being eventually incremented to the address which is one byte greater than the last byte of the array a. The validity of the "one-too-far" value was agreed upon by the ANSI C committee in March 1985; it is vital to many existing C programs, which would be needlessly complicated otherwise. We will therefore always include the "one-too-far" value in the range of a pointer, with the understanding that this value is only good for arithmetic and comparisons, not for indirection.

If a pointer receives a value outside its range, we will consider that pointer to be undefined.

Now it is time for some examples. The comments will indicate the hypothetical addresses of the array elements:

static char m[10][10] = {0};  /* hypothetical address range: 2000:2099 */

char *p;

p = m[0];       /* p == 2000; range {2000:2010} */

/* p can access all of m[0], plus "one-too-far" */

p = (char *)m;  /* p == 2000; range {2000:2100} */

/* p can access all of m, plus "one-too-far" */

p = m[2];       /* p == 2020; range {2020:2030} */

/* p can access all of m[2], plus "one-too-far" */

p = p + 1;      /* p == 2021; range {2020:2030} */

/* value of p changes, but not its range */

++p;            /* p == 2022; range {2020:2030} */

/* again, value changes, but not range */

p += 8;         /* p == 2030; range {2020:2030} */

/* p is now "one-too-far"; valid value, */

/* but cannot be used for indirection */

++p;            /* p is now out-of-range; "undefined," strictly */

These rules for the range of pointers are, at the least, useful guidelines for the writing of reliable programs. They also can be enforced by compilers and interpreters which do dynamic bounds-checking [3-1]. They reflect the actual practice of working C programmers who write reliable programs that do not make out-of-bounds references in their use of pointers. The existence of precise rules such as these means that C is a language which can have both the expressive freedom of pointers plus the reliability of bounds-checking.

These concerns for the reliable uses of array parameters motivate the creation of a new general-purpose string-copying function. We will describe here a new function named strfit, which ensures that the target string is both bounds-checked and nul-terminated. If insufficient space was provided, the target is truncated but still nul-terminated. The function returns a bool value: a return of YES indicates that there was sufficient space for a complete copy.

strfit:

/* strfit - copy s2 to s1, chopping to n chars

* return YES if sufficient space, no if not

*/

#include "local.h"

bool strfit(s1, s2, n)

char s1[];

char s2[];

size_t n;

size_t i;

for (i = 0; i = n-1 && s2[i] != '\0'; ++i)

s1[i] = s2[i];

s1[i] = '\0';

return (s2[i] == '\0');

3.5 Arrays of Strings

An array of strings can be a handy format for the storage of simple tables. Consider

static char meas_codes[][6] =

{"each", "box", "lb");

This defines an array meas_codes with three rows of six char'S each, which looks something like this:

If we have a string to look up in this table, we can use a simple loop like this:

for (i = 0; i = DIM(meas_codes); ++i)

if (strcmp(s, meas_codes[i]) == 0)

found a match

We are passing the expression meas_code[i] to the strcmp function. According to the rules above, this means that we expect strcmp to be able to examine all the characters of the i-th row of meas_codes, but not to walk around in any other storage.

The strings are all self-contained in the array meas_codes; the table does not itself contain any pointers. For that variety of table, we must cover pointers in more detail. The details of pointers are the subject of the next chapter.

Go to Chapter 4 Return to Table of Contents