Discussion

This Item illustrates one significant balance point between Items 8 and 9, "don't optimize prematurely" and "don't pessimize prematurely." That makes this a tough Item to write, lest it be misconstrued as "premature optimization." It is not that.

Here's the background and motivation: Memory and disk capacity continue to grow exponentially; for example, from 1988 to 2004 disk capacity grew by about 112% per year (nearly 1,900-fold growth per decade), whereas even Moore's Law is just 59% per year (100-fold per decade). One clear consequence is that whatever your code does today, it may be asked to do tomorrow against more data, much more data. A bad (worse than linear) asymptotic behavior of an algorithm will sooner or later bring the most powerful system to its knees: Just throw enough data at it.

Defending against that likely future means we want to avoid "designing in" what will become performance pits in the face of larger files, larger databases, more pixels, more windows, more processes, more bits sent over the wire. One of the big success factors in future-proofing the C++ standard library has been its performance complexity guarantees for the STL container operations and algorithms.

Here's the balance: It would clearly be wrong to optimize prematurely by using a less clear algorithm in anticipation of large data volumes that may never materialize. But it would equally clearly be wrong to pessimize prematurely by turning a blind eye to algorithmic complexity, a.k.a. "big-Oh" complexity, namely the cost of the computation as a function of the number of elements of data being worked on.

There are two parts to this advice. First, even before knowing whether data volumes will be large enough to be an issue for a particular computation, by default avoid using algorithms that work on user data (which could grow) but that don't scale well with that data, unless there is a clear clarity and readability benefit to using a less scalable algorithm (see Item 6). All too often we get surprised: We write ten pieces of code thinking they'll never have to operate on huge data sets, and we turn out to be perfectly right nine times out of ten. The tenth time, we fall into a performance pit. We know it has happened to us, and we know it has happened or will happen to you. Sure, we go fix it and ship the fix to the customer, but it would be better to avoid such embarrassment and rework. So, all things being equal (including clarity and readability), do the following up front:

  • Use flexible, dynamically-allocated data instead of fixed-size arrays: Arrays "larger than the largest I'll ever need" are a terrible correctness and security fallacy. (See Item 77, and the first sketch following this list.) Arrays are acceptable when sizes really are fixed at compile time.

  • Know your algorithm's actual complexity: Beware of subtle traps like linear-seeming algorithms that actually call other linear operations, making the whole algorithm quadratic. (See Item 81 for an example.)

  • Prefer to use linear algorithms or faster wherever possible: Constant-time complexity, such as push_back and hash table lookup, is perfect (see Items 76 and 80). O(log N) logarithmic complexity, such as set/map operations and lower_bound and upper_bound with random-access iterators, is good (see Items 76, 85, and 86). O(N) linear complexity, such as vector::insert and for_each, is acceptable (see Items 76, 81, and 84).

  • Try to avoid worse-than-linear algorithms where reasonable: For example, by default spend some effort on finding a replacement if you're facing an O(N log N) or O(N²) algorithm, so that your code won't fall into a disproportionately deep performance pit in the event that data volumes grow significantly. For example, this is a major reason why Item 81 advises preferring range member functions (which are generally linear) over repeated calls of their single-element counterparts (which easily become quadratic as one linear operation invokes another linear operation; see Example 1 of Item 81 and the second sketch following this list).

  • Never use an exponential algorithm unless your back is against the wall and you really have no other option: Search hard for an alternative before settling for an exponential algorithm, where even a modest increase in data volume means falling off a performance cliff.
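
To make the first and third bullets concrete, here is a small sketch (ours, for illustration only; the names and sizes are invented): it grows a std::vector with the input rather than guessing a "big enough" fixed array size, and it contrasts an O(log N) set lookup with an O(N) linear scan.

    #include <algorithm>
    #include <iostream>
    #include <set>
    #include <string>
    #include <vector>

    int main() {
        // The vector grows with the input; no "surely large enough" fixed
        // buffer such as std::string lines[10000] to overflow or outgrow.
        std::vector<std::string> lines;
        for (std::string line; std::getline(std::cin, line); ) {
            lines.push_back(line);                    // amortized constant time
        }

        // Pick the lookup structure that matches the complexity you need.
        std::set<std::string> index(lines.begin(), lines.end());
        const std::string needle = "example";

        bool found1 = index.count(needle) > 0;        // O(log N) tree lookup
        bool found2 = std::find(lines.begin(),        // O(N) linear scan
                                lines.end(), needle) != lines.end();

        std::cout << found1 << ' ' << found2 << '\n';
    }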
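
And the quadratic trap that the second and fourth bullets warn about looks like this in practice (again a sketch of ours, not the code from Item 81): a single-element insert at the front of a vector is linear, so a loop of such inserts is quadratic overall, whereas the range member function does the same job in one linear pass.

    #include <vector>

    // Quadratic: every single-element insert at the front shifts all existing
    // elements, so the loop as a whole is O(N^2).
    std::vector<int> prepend_one_at_a_time(std::vector<int> dest,
                                           const std::vector<int>& src) {
        for (auto it = src.rbegin(); it != src.rend(); ++it) {
            dest.insert(dest.begin(), *it);           // O(N) on every iteration
        }
        return dest;
    }

    // Linear: the range member function shifts the existing elements once and
    // copies the new ones in a single pass, producing the same result.
    std::vector<int> prepend_all_at_once(std::vector<int> dest,
                                         const std::vector<int>& src) {
        dest.insert(dest.begin(), src.begin(), src.end());
        return dest;
    }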

Second, after measurements show that optimization is necessary and important, and especially if it's because data volumes are growing, focus on improving big-Oh complexity rather than on micro-optimizations like saving that one extra addition.
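
For instance (an illustrative sketch of ours, not code from any Item), replacing an O(N²) nested-loop duplicate check with a sort followed by an adjacent scan improves the asymptotic bound to O(N log N), a far bigger win on growing data than shaving an addition out of the inner loop:

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // O(N^2): compares every element against every other element.
    bool has_duplicates_quadratic(const std::vector<int>& v) {
        for (std::size_t i = 0; i < v.size(); ++i)
            for (std::size_t j = i + 1; j < v.size(); ++j)
                if (v[i] == v[j]) return true;
        return false;
    }

    // O(N log N): sort a copy (taken by value), then any duplicates are adjacent.
    bool has_duplicates_sorted(std::vector<int> v) {
        std::sort(v.begin(), v.end());
        return std::adjacent_find(v.begin(), v.end()) != v.end();
    }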

In sum: Prefer to use linear (or better) algorithms wherever possible. Avoid polynomial algorithms where reasonable. Avoid exponential algorithms with all your might.
