Cargo-cult debugging

May 11, 2012 6:23 am

Programming

I’ve been coding full-time for only a few weeks, and already I’m going somewhat insane by people engaging in what I’d call cargo-cult debugging. Cargo cults were religions that developed when primitive societies, who’d had little exposure to any technology, were suddenly confronted with top-of-the-line modernism in the form of World War II military machines. When the armies disappeared at the conclusion of festivities, they took all of their modern marvels with them. The locals, believing that they’d observed gods bringing them promised goods, attempted to make the gods provide more cargo by building crude imitations of what they had seen—bamboo airplanes, fake landing strips, wooden radar dishes, and the like—without really having any proper idea what any of the things they were building actually were.

Cargo-cult debugging is when, having seen effective debugging, you imitate the motions, but without having any actual clue what you’re doing or why.

Let’s take a look at some cargo-cult debugging in action.

Problem Description

Two problems, likely unrelated, but including both for completeness since they started at about the same time: car tilts to the left and slightly backwards, and there is a grinding noise coming from the rear-left wheel-well.

Current Hypothesis

Engine belt is likely applying too much torque, resulting in car listing to one side.

Solutions Attempted

Tried turning on left blinker, since car tends to lean right during left-hand turns.

Tried having extremely fat man sit in passenger’s seat and drive car via strings attached to steering wheel to balance the frame.

Attempted to drive car backwards. (Note: visibility extremely hampered in this mode; file bug upstream with manufacturer.)

Tried covering wheel-well with thick cloth to muffle sound, thereby evening out the frame.

Tried driving with windows down to change airflow. (Note: test cut short due to sparks flying from rear-left wheel-well and a small resulting fire, which also prevented further testing.)

The structure of a good debugging session is there—the facts, the possible solutions, and a hypothesis—but even being very generous to the tester, their ideas and hypothesis make no sense. You’re going to solve the problem with exactly as much speed and precision as if you played Tetris for an hour and then changed five lines of code at random.

I’ve seen every developer engage in this behavior sometimes, and that includes me. Not only that; every developer I’ve seen who does this does it for one of two very closely related reasons:

They do not understand the problem.
They do understand the problem, but the most likely solution involves rewriting some very hard-to-understand code, or some code that they’re personally very attached to.

I’ve got a bug right now that’s right in between these two, and guess what? I’m guilty.

Our internal build of Kiln doesn’t run on 32-bit Windows 2003 systems, even though the 64-bit build works fine, and the 32-bit build works fine on all other Windows platforms we support; just not Windows 2003. The error message Windows generates is beyond unhelpful, and googling it indicates that the problem likely has to do with a bad manifest—something I barely know anything about. They go in resource forks, I modified one in Copilot once to get theming in Windows XP, and I seem to recall XML being involved. That’s it.

Here is the right way to address this bug:

Learn more about manifests, so I know what a good one looks like.
Take a look at the one we’re generating for Kiln; see if anything obvious screams out.
If so, dive into the build system [blech] and have it fix up the manifest, or generate a better one, or whatever’s involved here. This part’s a second black box to me, since the Kiln Storage Service is just a py2exe executable, meaning that we might be hitting a bug in py2exe, not our build system.
If not, burn a Microsoft support ticket so I can learn how to get some more debugging info out of the error message.

Here’s the first thing I actually did:

Look at the executable using a dependency checker to see what DLLs it was using, then make sure they were present on Windows 2003.

This is not the behavior of a rational man. But it’s the behavior of a man who’s flirting with the edge of his knowledge and has no desire to screw with the build system: fixing DLL dependencies involves tweaking a line or two in setup.py, which is way easier than learning a bunch of new stuff about manifests, diving into how py2exe makes its particular brand of sausage, and then patching it upstream or mucking about on the build server to add post- or pre-build steps as appropriate.

So here’s my call to action: do not engage in cargo-cult debugging. Whenever you’re about to try a debugging session, force yourself to answer the question, Why do I believe this is the most likely solution to the problem? If the answer is, “I have no idea,” or even worse, “It’s not, but it’s easier to check than this other more likely solution,” then keep looking. Is your app slow even though the CPU is basically idle? Increasing the size of the thread pool probably won’t help. Web service occasionally timing out? Your Redis server is unlikely to hold the answer. Occasionally getting your app’s network messages out-of-order? Attacking layer 4 of your networking stack with Wireshark is probably premature.

We developers pride ourselves on being “lazy” in the sense of “developing solutions that minimize the work we have to do.” If you engage in cargo-cult debugging, you’re being intellectually lazy, which is entirely different. Intellectual laziness turns you from a great developer into a mediocre one.

Be a great developer. Don’t engage in cargo-cult debugging.