3) When Something Works One Way but Fails in a Different Way, Focus on the Differences

What developer hasn't been in the situation where code works perfectly fine on his computer but fails on someone else's machine? For most programmers, it's probably the number one most frustrating situation to debug. "But my code clearly works (on my machine)! See, I can prove it! How can I fix a problem when there is no problem (on my machine)?" Many developers immediately go into denial by assuming the failure must be due to user error, and in fairness, it often is. But once user error is ruled out, a feeling of vague helplessness often sets in.

Less common but even more frustrating is code that works in one way but not in another, even on the same machine. Have you ever seen code that crashes when you do X, but if you first do Y (which is not related to X in any way, shape, or form), then X runs fine? Or have you ever encountered a common code library that works fine when called from one function but fails when called from another, even though you pass the library exactly the same arguments both times? That isn't too terrible when you have the source code for the library and can debug into it, but once you get in this situation with an API call to a closed-source operating system, you'll feel the pain.

Most developers dread situations like this because this type of symptom usually indicates a low-level bug that will require deep system knowledge to track down. After all, the code is clearly on the right track since it does work at least some of the time. If you're using an unsafe language like C++, then you can always hypothesize that the inconsistent behavior is due to uninitialized memory; but lots of development tools (and compiler warnings) exist for automatically detecting that. The new .NET languages reduce the possibility of that kind of bug anyway, so the problem in this situation is usually a configuration issue.

Maybe your project's compiler settings are configured wrong, or maybe some code that ran prior to your code has messed something up. Maybe there's something about one computer that's different from the other (a smaller hard disk, a Windows password policy, a different service pack version, etc.). Whatever it is, those kinds of configuration details often make these bugs painful. But as painful as these bugs are, they'd be ten times more painful if you didn't know that the code sometimes works and sometimes doesn't. The fact that it works in one place but not in another provides a huge clue about where to look. All you have to do is compare the place where it works with the place where it fails.

Focus on the Differences

Don't focus on the similarities between working and nonworking features. Focus on the differences! For each possible explanation of the bug you can think of, ask yourself, "How does that explain why the code works one way but fails the other way?" If your theory can't explain this, your theory is wrong. Throw it away and investigate something else instead. Does the nonworking machine have a different set of system libraries installed than the working machine? What is different about the two machines? Once you figure that out, you've nailed the bug. Then it should be easy to adjust the code to take care of the problem.
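One way to make "what is different about the two machines?" concrete is to write down the same facts on both machines and compare them mechanically. The following is only a minimal sketch of that idea: the Snapshot type, the diffSnapshots function, and every setting name are hypothetical, and gathering the real values from each machine is left out.

```cpp
#include <map>
#include <string>
#include <vector>

// A snapshot maps a setting name to its value on one machine.
// The type and all setting names used with it are hypothetical.
using Snapshot = std::map<std::string, std::string>;

struct Difference {
    std::string key;
    std::string working;  // value where the code works, or "<missing>"
    std::string failing;  // value where the code fails, or "<missing>"
};

// Return every setting whose value differs between the two snapshots,
// including settings that exist on only one of the machines.
std::vector<Difference> diffSnapshots(const Snapshot& working,
                                      const Snapshot& failing) {
    const std::string missing = "<missing>";
    std::vector<Difference> result;

    // Settings known on the working machine: changed or absent elsewhere?
    for (const auto& entry : working) {
        auto it = failing.find(entry.first);
        if (it == failing.end()) {
            result.push_back({entry.first, entry.second, missing});
        } else if (it->second != entry.second) {
            result.push_back({entry.first, entry.second, it->second});
        }
    }

    // Settings that appear only on the failing machine.
    for (const auto& entry : failing) {
        if (working.find(entry.first) == working.end()) {
            result.push_back({entry.first, missing, entry.second});
        }
    }
    return result;
}
```

Settings that match on both machines never appear in the result, which is the point: the similarities drop out and only the candidates worth investigating remain.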

Does focusing on the differences sound like what most developers do, though? Unfortunately not. When code works for the developer but not for the tester, most developers will step through the code on their own machine anyway, even though that code doesn't exhibit the bug there. Actually, this is not a bad first attempt. Even though you won't be able to reproduce the bug on your machine, by stepping through the code you may still notice a possible difference worth investigating. Maybe you'll notice, "Aha, here the code assumes the latest version of Microsoft Internet Explorer is installed, which is true for my machine, but maybe that's the problem with the other guy's computer." So stepping through the code, even on a computer that doesn't exhibit the problem, doesn't hurt and may well help.

But it's important that you don't stop there. If stepping through the code doesn't locate the problem, then the next round of debugging needs to search for the differences. It might be an option to install debugging tools on the remote machine and debug there (but see rule 5: "Keep a Few Computers Where Debugging Tools Are Never Installed"). If so, try that. If not, make a list of all the ways the machine in question might differ from yours that could explain the problem. Then you can try installing a version of your product with increased logging to help pinpoint the problem.
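What might a build with increased logging look like? Here is a rough sketch, assuming a hypothetical DiagnosticLog class and a made-up exportReport feature. In the normal mode it writes only progress messages; in the diagnostic mode it also records the machine-specific facts the code quietly depends on, which are exactly the values you want to compare between the working and the failing machine.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical logging levels: Normal for shipping builds,
// Diagnostic for the special build sent to the failing machine.
enum class LogLevel { Normal, Diagnostic };

class DiagnosticLog {
public:
    DiagnosticLog(std::ostream& out, LogLevel level)
        : out_(out), level_(level) {}

    // Ordinary progress messages, written at every level.
    void info(const std::string& message) {
        out_ << "[info] " << message << '\n';
    }

    // Machine-specific facts, written only at the Diagnostic level.
    void fact(const std::string& name, const std::string& value) {
        if (level_ == LogLevel::Diagnostic) {
            out_ << "[fact] " << name << " = " << value << '\n';
        }
    }

private:
    std::ostream& out_;
    LogLevel level_;
};

// Hypothetical feature that depends on two environment details.
// The real work is omitted; only the logging calls are shown.
void exportReport(DiagnosticLog& log, const std::string& tempDir,
                  long freeDiskMb) {
    log.info("exportReport: starting");
    log.fact("tempDir", tempDir);
    log.fact("freeDiskMb", std::to_string(freeDiskMb));
    log.info("exportReport: finished");
}
```

Run the diagnostic version on both machines and compare the "[fact]" lines. Any line that differs is a lead.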

When Something Works One Way…

In Microsoft Windows, you can dynamically load a DLL function at runtime. The main reason to do this is to gracefully handle cases where the correct version of a library is not installed. The downside, though, is that even in C# or VB .NET, the syntax is a little odd, and in legacy, unmanaged C++, the syntax is downright atrocious:

#include <windows.h>

// Error checking omitted here for brevity
// Load the library and dynamically find the function
HINSTANCE libHandle = LoadLibrary(TEXT("SomeLibrary.dll"));
typedef int (APIENTRY *ApiFunction)(void* lpVoid);
ApiFunction myFunctionPointer =
   (ApiFunction)GetProcAddress(libHandle, "FunctionName");
// This next line invokes the function we care about
int retVal = (*myFunctionPointer)(NULL);
...
FreeLibrary(libHandle);

Yeech. But the power of dynamic linking is great, so hardcore Windows programmers use it often.

I once had a bug in which the preceding code crashed on the call to myFunctionPointer whenever I ran a certain feature. I knew this exact same code was also called from a second feature, so I restarted the program, ran a test of the second feature, and the program worked fine. Then I ran the first feature again, and it didn't crash either. Huh? A pattern emerged—running the first feature would always crash, unless I ran the second feature first. Once I ran the second feature, then the first feature would work fine until I restarted the program. But the second feature used the exact same code as the first feature! How could the exact same code consistently crash when called from one place, but consistently work when called from another place? Talk about frustration.

But such situations are frustrating only for Mr. The-Glass-Is-Half-Empty. Think about it from a positive viewpoint. The function does work (at least some of the time), so that instantly eliminates a hundred possible theories. One explanation might be that the DLL was not installed, or was corrupted, or was the wrong version. But since the code does work some of the time, you can rule out those possibilities: Clearly, the DLL is valid. Another possibility is that you misread the documentation for the function you're dynamically invoking. But no, that can't be it either because the code does work some of the time. In fact, there are a tremendous number of possibilities that you don't even need to consider at all because the fact that the code does sometimes work rules them out. This translates into a huge timesaver for you.

What would you do here? Ask yourself questions: Does the working code define preprocessor symbols that the nonworking code doesn't? Does the nonworking code only get called after running some other function that the working code never sees? Is the working code using different project settings than those of the nonworking code? The code example I listed earlier uses the same function arguments for both the working and nonworking code, but if you were seeing this behavior with your own code, could you be certain that both paths used the same arguments?
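If you can't answer those questions by inspection, you can have the shared code record the context of every call and then compare the records from the working and nonworking paths. The sketch below is purely hypothetical and is not the cause of the bug described above: g_subsystemInitialized stands in for any piece of process-wide state, and sharedCode returns false where real code might crash.

```cpp
#include <string>
#include <vector>

// One record per call into the shared code: who called it, with what
// argument, and what the process-wide state looked like at that moment.
struct CallRecord {
    std::string caller;
    int argument;
    bool subsystemInitialized;
};

// Hypothetical process-wide state that another code path may change.
bool g_subsystemInitialized = false;

// Every call into sharedCode appends one record here.
std::vector<CallRecord> g_callLog;

// Stand-in for the code both features share. It succeeds only when the
// hidden state has been set up, which mimics an order dependency.
bool sharedCode(const std::string& caller, int argument) {
    g_callLog.push_back({caller, argument, g_subsystemInitialized});
    return g_subsystemInitialized;
}

// The feature that fails when it runs first.
bool featureOne() {
    return sharedCode("featureOne", 0);
}

// The feature that always works, because of a side effect that
// featureOne never performs.
bool featureTwo() {
    g_subsystemInitialized = true;
    return sharedCode("featureTwo", 0);
}
```

Call featureOne, then featureTwo, then featureOne again, and the log shows identical arguments in all three records but a different subsystemInitialized value in the first one. The arguments are ruled out, and the one field that differs tells you where to look next.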

Don't be afraid when you see situations like this. "Well, it may work on the developer's computer, but it doesn't work on mine" is not as scary as it sounds as long as you remember to focus on the differences.