
4) Never Assume the Problem Is a Bug in the Hardware, API, or OS Without Reasonable Proof

Have you ever spent hours staring at a bug, and finally declared, "This has to be a bug in the compiler/operating system/API"? What percentage of those times were you right? It does sometimes happen. Compilers sometimes generate incorrect code. APIs don't always work as documented. And as of this writing, .NET is still a v1.0 product, so you know it's not perfect yet. But the majority of the bugs will be your own fault. The .NET CLR went through a heck of a lot of testing before release—probably far more testing than your product gets. Of course, bugs do occasionally slip through. It's perfectly OK to send your vendor a code sample proving the bug is in their product and demanding a fix ASAP. But first you have to prove the bug is in their code, not yours. And by "proof," I don't mean looking at your code for a few minutes, not seeing any obvious error, and automatically assuming it must be someone else's bug.

So What Was the Problem?

The bug I mentioned in the preceding text turned out to involve some hard-core knowledge of Windows. Warning: Geeky, low-level details follow.

The LoadLibrary function has a successor: LoadLibraryEx, which allows you to specify several options. One of these options is the flag DONT_RESOLVE_DLL_REFERENCES. This tells LoadLibraryEx to merely load the DLL, but not to invoke the DllMain initialization routine. Why would anyone want that? One reason might be if you merely wanted to see whether the DLL could be loaded, without paying the price of doing process and thread initialization. But here's the real kicker—if you load the library in this uninitialized state, then subsequent calls to LoadLibrary where you do want the DLL's initialization to run will be ignored. It works in reverse, too—if you've already loaded the library with initialization, then attempts to load the library without initialization will have no effect, either.
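Here's a minimal C# sketch of that interaction, using P/Invoke for the Win32 declarations; the DLL name is just a placeholder:

using System;
using System.Runtime.InteropServices;

class LoadLibraryDemo
{
    // Win32 declarations. DONT_RESOLVE_DLL_REFERENCES is 0x1 in winbase.h.
    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern IntPtr LoadLibraryEx(string fileName, IntPtr reservedNull, uint flags);

    [DllImport("kernel32.dll", CharSet = CharSet.Auto, SetLastError = true)]
    static extern IntPtr LoadLibrary(string fileName);

    const uint DONT_RESOLVE_DLL_REFERENCES = 0x00000001;

    static void Main()
    {
        // Load the DLL without running DllMain, just to prove it exists.
        IntPtr uninitialized = LoadLibraryEx("SomeLegacy.dll", IntPtr.Zero,
                                             DONT_RESOLVE_DLL_REFERENCES);

        // Later, ask for a normal, initialized load. Because the DLL is already
        // in memory in its uninitialized state, this call only bumps the
        // reference count; DllMain still never runs, and any code that depends
        // on that initialization will fail.
        IntPtr initializedRequest = LoadLibrary("SomeLegacy.dll");

        Console.WriteLine("Same module handle: {0}", uninitialized == initializedRequest);
    }
}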

This bug happened because my first feature quickly verified that the DLL existed by using LoadLibraryEx to load it without initialization. Then it ran the code we discussed earlier, which crashed because the DLL was still uninitialized. My second feature, however, didn't need to verify the DLL's existence, so it initialized the DLL by calling LoadLibrary. That worked. Since I wasn't unloading the DLL with the FreeLibrary function, the initialized DLL stayed in memory, which meant subsequent uses of the first feature worked fine. Fixing the problem just meant initializing the DLL in both places.

Because I knew the function worked in the second feature but not in the first, I was able to focus on the differences by comparing the two code paths. That tipped me off to the solution. But if I hadn't been given that hint about the differences, this bug might have taken far longer to resolve.

By proof, I mean carefully rereading the documentation for that API twice to make sure you're using it correctly. Then writing the smallest possible test driver that does nothing except call the function that is failing. Then doing a web search for references to the API to see if anyone has experienced similar behavior before and already found a workaround. Then asking a coworker to look over your shoulder and double-check the logic of the code to make sure you did everything correctly. Once you've done all that, then you can assume the bug might be in the tools. But until you've gone through these steps, chances are that most bugs will be in your code. Assign blame for bugs the same way a jury determines guilt or innocence—a conviction requires "proof beyond a reasonable doubt."
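For example, if you suspected a function exported by a vendor's DLL, the test driver needn't be more than a few lines. The DLL and function names below are placeholders for whatever you're actually calling:

using System;
using System.Runtime.InteropServices;

// The smallest possible test driver: no application logic, no UI, nothing but
// the single call that appears to fail, with the simplest arguments possible.
class TestDriver
{
    // Placeholder declaration; substitute the real API you suspect.
    [DllImport("SomeVendor.dll")]
    static extern int SuspectFunction(int arg);

    static void Main()
    {
        int result = SuspectFunction(42);
        Console.WriteLine("SuspectFunction returned {0}", result);
    }
}

If this driver works but your application doesn't, the bug is almost certainly in the surrounding code, not in the API.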

The Burden of Proof

The beauty of this method is that you aren't wasting time either way. If the bug is yours, then writing the smallest possible test driver should help you figure out the bug sooner: If it works in your test driver but not in your application, then you just have to figure out what's different between the two. Just comment out more and more code from your application until it starts to look just like your test driver and you find the difference. But suppose the bug really is in the OS or the compiler or the third-party component. Well, you still haven't wasted any time because if you were to demand a fix from the compiler vendor, the very first thing the vendor would ask is, "Can you send us the smallest possible test driver that duplicates the problem so we can reproduce the bug here?" You would have had to do all that work of writing a test driver anyway, so you may as well do it up front.

What's wrong with falsely assuming the bug is in the compiler or the OS or the .NET runtime? The main thing is that once you decide the issue is caused by someone else's code, you often unconsciously stop trying to fix it. If the bug is minor, most programmers will defer it since it's a third-party issue and there's nothing they can do about it. If the bug is major, then most programmers will call their third-party vendor demanding a workaround, and several days will be wasted explaining the issue to the vendor and waiting for them to investigate what turns out to be your problem after all. And of course, if you get a reputation with that vendor for filing lots of user-error bug reports, then they may be less responsive to the next issue you report. And maybe that next issue really will be a bug on their end, and what will you do then?

There's a certain pleasant feeling when you convince yourself the bug is not yours—first you get the ego boost of having written bug-free code, and second you can tell yourself that this bug is out of your control and you therefore don't have to worry about it anymore. I hope you understand why those attitudes are dangerous, but even leaving that aside, you should always be hoping for the bug to be in your code anyway. Think about it this way—if the bug or the performance problem turns out to be in the operating system, there's nothing you can do about it. But if the bug is in your code, then nothing in the world is stopping you from fixing it.

When confronted with a performance problem that seems to involve the network, it's best to maintain the attitude, "Well, I think the problem has to do with network factors beyond my control; but just to be sure, I'll get out a code profiler and check." This way we can see if 80 percent of the time is actually being spent making network calls or if it's spent inside our own code. And even if the time is being spent in network calls or in APIs, that doesn't mean there's nothing we can do about it. We can reexamine the code to see whether some of that network traffic could be eliminated—maybe this loop contains three remote LDAP calls that could be batched into one call, or better yet, moved outside of the loop entirely.
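As a sketch of that kind of fix, assuming the lookups went through Active Directory via System.DirectoryServices (the account names and filter here are made up), the per-iteration queries could become one query issued up front:

using System;
using System.DirectoryServices;  // requires a reference to System.DirectoryServices.dll

class BatchedLookup
{
    static void Main()
    {
        // Slow version (sketched in comments): one LDAP round trip per user.
        //   foreach (string id in userIds)
        //   {
        //       DirectorySearcher perUser =
        //           new DirectorySearcher("(sAMAccountName=" + id + ")");
        //       SearchResult one = perUser.FindOne();
        //       ...
        //   }

        // Faster version: combine the lookups into a single OR filter, make one
        // round trip, and then work against the results in memory.
        string filter = "(|(sAMAccountName=alice)(sAMAccountName=bob)(sAMAccountName=carol))";
        DirectorySearcher searcher = new DirectorySearcher(filter);

        foreach (SearchResult result in searcher.FindAll())
        {
            Console.WriteLine(result.Path);
        }
    }
}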

Maybe that time-consuming OS function call does more work than we actually need, and a different (faster) OS function would be sufficient. Maybe we can use psychological tricks to make the program look faster than it actually is. Ever notice how every Microsoft product displays a brightly colored splash screen ("This is Microsoft Office XP!") for a second or two when you start the program? This isn't just to remind you what program you're using. It's primarily there because the program takes several seconds to load, but if a pretty picture is immediately displayed for a second or two while loading continues in the background, then the user gets the impression that the program started up more quickly than it actually did. (What, you thought those splash screens were just for marketing reasons?)
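A bare-bones version of that trick in Windows Forms might look like the following sketch. The splash here is just a plain form with a label, the startup work is simulated with a sleep, and a real product would usually keep loading on a background thread rather than blocking as this version does:

using System;
using System.Threading;
using System.Windows.Forms;

class SplashDemo
{
    [STAThread]
    static void Main()
    {
        // Put something on screen immediately so the user perceives a fast start.
        Form splash = new Form();
        splash.FormBorderStyle = FormBorderStyle.None;
        splash.StartPosition = FormStartPosition.CenterScreen;

        Label message = new Label();
        message.Text = "This is MyProduct 1.0!";
        message.Dock = DockStyle.Fill;
        splash.Controls.Add(message);

        splash.Show();
        splash.Refresh();   // force the splash to paint before the slow work begins

        DoExpensiveStartupWork();

        splash.Close();
        Application.Run(new Form());   // the real main window goes here
    }

    static void DoExpensiveStartupWork()
    {
        Thread.Sleep(3000); // stand-in for loading assemblies, configuration, data, etc.
    }
}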

Most of the time, the odds are that the bug or the performance issue is your problem, not the compiler's or the OS's or the third-party component's. Know these odds, and don't bet against them until you have reasonable proof.

