7.2 Performance Tuning

Writing efficient code is something of an art. Efficiency is not about rewriting your application in assembly code or anything that drastic. Efficiency is about writing software that meets whatever design criteria you set for it. If a design specification states that an application should "run as fast as possible," you need to rewrite the specification. It is far better to say that particular operations need to execute in a specific amount of time (on a specific platform). For many applications, especially image processing ones, performance is a very important issue. However, it is surprising how many design documents do not address performance in a concrete way.

Let's look at it another way. If you want to train to become a sprinter, you know that you need to run very fast for a relatively short period of time. If you were to write a list of goals for yourself, would you include a statement that says you should "run as fast as possible"? Of course you wouldn't. You would probably say that you need to be able to run a certain distance in a certain amount of time. And based on this goal, you can set up a training program to meet it.

Writing software is no different. You need to have plausible goals and then you can design a plan to reach them. Your goals can be difficult, but not impossible, to achieve. Sometimes a goal may seem impossible, but further investigation reveals it is actually possible, though extremely challenging. Having a goal that is well defined is absolutely essential for getting the performance you need from your application. See [Bulka99] for useful information on writing efficient C++ code.

7.2.1 General Guidelines

It is not always easy to decide when you need to worry about performance. Our recommendation is to assume that a piece of software does not have any special performance criteria unless you know this statement to be false. Avoid writing a highly optimized piece of software to solve a timing problem that may not even be a problem. It is better to design a reasonable solution first, and then discover that it must run faster. This iterative approach to design helps you reduce development time and wasted effort. It is true that you can have cases where your product does not meet the expectations of its testers on the first iteration, but this is what an iterative design approach is all about.

The product design specification needs to be as clear as possible regarding the desired functionality and performance of the application. For example, let us look at how a customer interacts with a graphical user interface (GUI). We see that the overhead of the framework and the way you write your code often have little effect on performance. The customer communicates with the software by making various requests in the form of mouse or other pointer manipulation or keyboard input. For complicated interfaces, these requests occur at no more than one request per second. The steps that the software takes to process such a request can be listed as a series of events:

1. Receive the event and invoke the appropriate event handler.

2. Process the event.

3. Update the user interface.
If the customer generates events at the rate of one request per second, then this sequence of events, including updating the user interface, must happen in no more than half that time. Where did this number, .5 seconds, come from? It is simply a guess based upon our perception of how customers operate such systems.

Now, without worrying about specific operating systems or GUI implementations, we can make some assumptions about how long it takes to handle each of the above steps. Receiving and invoking an event handler is a fast operation, even when written in a general-purpose C++ framework. This step comprises a number of table lookups, as well as one or more virtual function calls to locate the owner of an event. It certainly does not consume a lot of time. Processing the event is strictly an application-specific task, as is updating the user interface. The amount of overhead that can be tolerated in this example is fairly large. The customer will have a very hard time distinguishing between one millisecond and 50 milliseconds.

As a contrasting example, let's look at a real-time system that has hard performance requirements. As opposed to the previous example, where there is a person waiting to see results updated on the screen, these results are needed in a certain period of time, or else the information is useless. The framework's overhead, as well as how the processing is written, is important. However, it is not so obvious how much overhead is acceptable. We have found that dealing with percentages makes it easier to gauge how much overhead can be tolerated. If your processing function does not spend 98 to 99 percent of its time doing actual work, you should examine your design more closely. For example, for very fast processing cycles, say five milliseconds, the framework overhead should be kept to less than 100 microseconds. For slower real-time systems that require about 50 milliseconds to execute, overhead should be less than one millisecond. The design of each of these systems will be very different.

To measure overhead for an image processing system, or any other system that performs a repeated calculation on a number of samples, it is customary to compute the overhead on a per-row or per-pixel basis. Let us assume that we must perform one or more image processing steps on a 512x512 pixel image. If we want to keep the total overhead to one millisecond, the per-row overhead can be no more than two microseconds. Two microseconds is quite a bit of time for modern processors, and this permits numerous pointer manipulations or iterator updates. We are not so lucky if we have a goal of only two-tenths of a microsecond per row. In this case, you should consider optimizing your code from the onset of the design.

If you find this calculation too simplistic for your taste, you can write some simple prototypes to measure the actual performance on your platform and decide if any code optimization is needed. Since many image processing functions are easy to write, you can get a sense for how much time the operation takes. Our unit test framework is a convenient starting point, since the framework computes and reports the execution time of each unit test function. To get started, you would need to write at least two functions. The first function would contain the complete operation you want to test. The second function would contain just the overhead components.
The execution time of the first test function tells us how long the image processing takes, while the second tells us if the overhead itself is significant. If our unit test framework were more complicated than it is, we would also need a third function to measure the overhead of the unit test framework itself. But, since the framework is so simple, its overhead is negligible.

7.2.2 Thirteen Ways to Improve Performance

Without getting specific about the dynamics of your application, we offer thirteen techniques that you can use to improve the overall performance of your software.

Figure 7.3. Performance Checklist
Each of these techniques is described in more detail in this section.
7.2.3 Image-Specific Improvements

Image processing algorithms can be optimized in many ways because of their symmetry. To demonstrate this, we'll start with a simple function that computes the sum of all pixels in an image and optimize it in a number of iterations.
The easiest way to write this, and probably the slowest, is:

UTFUNC(slow)
{
  setDescription ("Slow version");
  apRect boundary (0, 0, 512, 512);
  apImage<Pel8> byteImage (boundary);
  byteImage.set (1);

  unsigned int sum = 0;
  for (unsigned int y=0; y<byteImage.height(); y++) {
    for (unsigned int x=0; x<byteImage.width(); x++) {
      sum += byteImage.getPixel (x, y);
    }
  }

  VERIFY (true);  // So this test will pass
}

We can rewrite this to take advantage of our iterators. Because this version runs so quickly, we have to run it multiple times to get reported execution times greater than zero.
The rewritten function is as follows:

UTFUNC(original)
{
  setDescription ("Original version, sum all pixels");
  apRect boundary (0, 0, 512, 512);
  apImage<Pel8> byteImage (boundary);
  byteImage.set (1);

  // Run many times to get an accurate measurement
  for (int iterations=0; iterations<1000; iterations++) {
    unsigned int sum = 0;
    apImage<Pel8>::row_iterator i;
    for (i=byteImage.row_begin(); i != byteImage.row_end(); i++) {
      const Pel8* p = i->p;
      for (unsigned int x=0; x<byteImage.width(); x++) {
        sum += *p++;
      }
    }
  }

  VERIFY (true);  // So this test will pass
}
Now, we rewrite the original function, so that it has the same setup, but performs no image processing, in order to measure overhead. Note that we perform one trivial operation per row, to make sure that the compiler doesn't optimize the loop for us.

UTFUNC(overhead)
{
  setDescription ("Overhead version, sum all pixels");
  apRect boundary (0, 0, 512, 512);
  apImage<Pel8> byteImage (boundary);
  byteImage.set (1);

  for (int iterations=0; iterations<1000; iterations++) {
    unsigned int dummy = 0;
    apImage<Pel8>::row_iterator i;
    for (i=byteImage.row_begin(); i != byteImage.row_end(); i++) {
      const Pel8* p = i->p;
      dummy += *p++;  // Prevents compiler optimization of loop
    }
  }

  VERIFY (true);  // So this test will pass
}

The execution time of original includes the execution time of the unit test framework, as well as the actual image processing operation. The execution time of overhead also includes the unit test framework overhead, but only includes the overhead portion of the image processing operation, and not the actual time required for the operation. If the overhead of the unit test framework is significant, then we need a third function that measures the overhead of the unit test framework. Because our framework has very low overhead, we can skip this third test.
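Turning the reported test times into a per-iteration or per-row figure is simple arithmetic: subtract the overhead test from the full test and divide by the iteration count and the number of rows. The following is a minimal sketch of that calculation only; it is not part of the image framework, and the timing values are hypothetical placeholders for whatever numbers the unit test framework reports on your platform.

#include <iostream>

int main ()
{
  // Hypothetical numbers; substitute the times reported by your unit tests
  const double totalMsec    = 150.0;  // 'original' test, 1000 iterations
  const double overheadMsec = 30.0;   // 'overhead' test, 1000 iterations
  const int    iterations   = 1000;
  const int    rows         = 512;

  // Time for one pass over the image, excluding loop and framework overhead
  double processingMsec = (totalMsec - overheadMsec) / iterations;

  // Loop overhead for a single row, in microseconds
  double overheadUsecPerRow = overheadMsec * 1000. / iterations / rows;

  std::cout << "Processing per image: " << processingMsec << " ms\n"
            << "Overhead per row: " << overheadUsecPerRow << " usec\n";
  return 0;
}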
Let's further optimize the function by removing width() from the loop and placing it into a local variable, as shown.

UTFUNC(original_width)
{
  setDescription ("Original version with width()");
  apRect boundary (0, 0, 512, 512);
  apImage<Pel8> byteImage (boundary);
  byteImage.set (1);

  for (int iterations=0; iterations<1000; iterations++) {
    unsigned int sum = 0;
    unsigned int w = byteImage.width();
    apImage<Pel8>::row_iterator i;
    for (i=byteImage.row_begin(); i!=byteImage.row_end(); i++) {
      Pel8* p = i->p;
      for (unsigned int x=0; x<w; x++) {
        sum += *p++;
      }
    }
  }

  VERIFY (true);  // So this test will pass
}

Let's do one more optimization.
We can do some basic loop unrolling by expanding the inner loop, so that multiple pixels are handled in each iteration, as shown.

UTFUNC(original_unrolled)
{
  setDescription ("Original version with unrolling optimization");
  apRect boundary (0, 0, 512, 512);
  apImage<Pel8> byteImage (boundary);
  byteImage.set (1);

  for (int iterations=0; iterations<1000; iterations++) {
    unsigned int sum = 0;
    unsigned int w = byteImage.width() / 8;
    apImage<Pel8>::row_iterator i;
    for (i=byteImage.row_begin(); i!=byteImage.row_end(); i++) {
      Pel8* p = i->p;
      for (unsigned int x=0; x<w; x++) {
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
        sum += *p++;
      }
    }
  }

  VERIFY (true);  // So this test will pass
}

As written, this version works only for images whose width is a multiple of eight pixels. Extending it to handle an arbitrary width would not be difficult.

Execution Results

We ran these tests using Microsoft Visual Studio 7.0 on an Intel Pentium 4 processor-based machine, running at 2.0 GHz. Table 7.1 shows the times for a single iteration. The actual times are lower than this, because we shut off optimization to prevent the compiler from optimizing the loops.
Comparing the original version to original_unrolled shows over a 20 percent decrease in execution time. This is a significant gain for such a simple function. However, the inner loop for most image processing functions is more complicated. We measured the overhead of executing the loops and found it to be 0.03 milliseconds, or about 3 percent of our most optimized example.

Let's look at some of the execution times for our existing image processing functions. Table 7.2 shows the execution times for a Laplacian filter running on a 1024x1024 8-bit monochrome image.
convolve() is our general-purpose convolution routine that contains a number of nested loops and fetches the kernel values from an array, resulting in the longest execution time. laplacian3x3() only has two loops, because it uses hard-coded statements to compute the output value for each pixel in the image, and this reduces the execution time significantly. The Intel IPP routine directly uses the power of the Intel Pentium 4 processor to achieve the fastest execution time.

7.2.4 A Note About Timing Your Code

Measuring the execution time of critical sections of your code is not always a simple task, especially if you want to establish the worst-case execution time. In our previous examples, we measured the average time, because we ran a number of iterations, and then divided the total execution time by the number of iterations. Here are some of the reasons why this value will not be the worst case:

- The first iteration often runs more slowly than later ones, because the processor caches have not yet been loaded with the image data; averaging hides this cost.

- The operating system can preempt the test at any time to service interrupts or run other processes and threads, and only some iterations are affected.

- The resolution of the timer itself limits how precisely any single iteration can be measured.
The timing routines defined in time.h provide standard, but very coarse, measurements of elapsed time. Most platforms include higher resolution facilities to measure time, and if you are designing for a single platform, we encourage you to take advantage of them. For example, under Microsoft Windows, you can take advantage of timers that provide sub-microsecond resolution. Instead of using this functionality directly (QueryPerformanceCounter), you should create a generic wrapper object and reuse the interface on other platforms, as shown.

#include <windows.h>   // QueryPerformanceCounter(), LARGE_INTEGER

class apHiResElapsedTime
{
public:
  apHiResElapsedTime ();   // Record the current time

  double usec () const;    // Return elapsed time, in microseconds
  double msec () const;    // Return elapsed time, in milliseconds
  double sec  () const;    // Return elapsed time, in seconds

  void reset ();           // Reset the current time

private:
  LONGLONG starting_;      // Starting time
};

apHiResElapsedTime::apHiResElapsedTime () : starting_ (0)
{ reset ();}

double apHiResElapsedTime::sec () const
{
  LARGE_INTEGER t, freq;
  QueryPerformanceCounter (&t);
  QueryPerformanceFrequency (&freq);
  return (double(t.QuadPart - starting_)) / freq.QuadPart;
}

void apHiResElapsedTime::reset ()
{
  LARGE_INTEGER t;
  QueryPerformanceCounter (&t);
  starting_ = t.QuadPart;
}
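On platforms other than Windows, the same interface can be built on whatever high-resolution timer is available. As one illustration only (this class is not part of the image framework, its name is ours, and it assumes a C++11 or later compiler), here is a minimal sketch that uses std::chrono, followed by a usage fragment that records the worst-case time of an operation rather than the average.

#include <algorithm>
#include <chrono>

// Hypothetical portable counterpart of apHiResElapsedTime, built on
// std::chrono::steady_clock. A sketch only; the class above is the
// Windows-specific implementation from the text.
class apChronoElapsedTime
{
public:
  apChronoElapsedTime () { reset (); }             // Record the current time

  double usec () const { return sec () * 1.e6; }   // Elapsed microseconds
  double msec () const { return sec () * 1.e3; }   // Elapsed milliseconds
  double sec  () const
  {
    std::chrono::duration<double> d = clock::now () - starting_;
    return d.count ();                             // Elapsed seconds
  }

  void reset () { starting_ = clock::now (); }     // Reset the current time

private:
  typedef std::chrono::steady_clock clock;
  clock::time_point starting_;                     // Starting time
};

// Usage sketch: time each iteration separately and keep the maximum,
// which approaches the worst-case time instead of the average.
double worstCaseMsec (int iterations)
{
  apChronoElapsedTime timer;
  double worst = 0.;
  for (int i = 0; i < iterations; i++) {
    timer.reset ();
    // ... operation being measured goes here ...
    worst = std::max (worst, timer.msec ());
  }
  return worst;
}

steady_clock is used in the sketch because, unlike system_clock, it is monotonic and is not affected by adjustments to the system time while a measurement is in progress.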