What was your most difficult bug hunt, and how did you find it and kill it?

bug experience

This is a "Share the Knowledge" question. I am interested in learning from your successes and/or failures.

Information that might be helpful…

Background:

  • Context: Language, Application, Environment, etc.
  • How was the bug identified?
  • Who or what identified the bug?
  • How complex was reproducing the bug?

The Hunting.

  • What was your plan?
  • What difficulties did you encounter?
  • How was the offending code finally found?

The Killing.

  • How complex was the fix?
  • How did you determine the scope of the fix?
  • How much code was involved in the fix?

Postmortem.

  • What was the root cause technically? (Buffer overrun, etc.)
  • What was the root cause from 30,000 ft?
  • How long did the process ultimately take?
  • Were there any features adversely affected by the fix?
  • What methods, tools, motivations did you find particularly helpful? …horribly useless?
  • If you could do it all again…?

These examples are general, not applicable in every situation and possibly useless. Please season as needed.

Best Answer

It was actually in a third-party image viewer sub-component of our application.

We found that two or three users of our application would frequently have the image viewer component throw an exception and die horribly. However, we had dozens of other users who never saw the issue despite using the application for the same task for most of the work day. There was also one user in particular who hit it a lot more frequently than the rest.

We tried the usual steps:

(1) Had them switch computers with another user who never had the problem to rule out the computer/configuration. - The problem followed them.

(2) Had them log into the application and work as a user that never saw the problem. - The problem STILL followed them.

(3) Had the user report which image they were viewing and set up a test harness to repeat viewing that image thousands of times in quick succession. The problem did not present itself in the harness.

(4) Had a developer sit with the users and watch them all day. He saw the errors, but didn't notice the users doing anything out of the ordinary to cause them.

We struggled with this for weeks trying to figure out what the "Error Users" had in common that the other users didn't. I have no idea how, but the developer in step (4) had a eureka moment on the drive in to work one day worthy of Encyclopedia Brown.

He realized that all the "Error Users" were left-handed, and confirmed this fact. Only left-handed users got the errors, never right-handed ones. But how could being left-handed cause a bug?

We had him sit down and watch the left-handers again, specifically paying attention to anything they might be doing differently, and that's how we found it.

It turned out that the bug only happened if you moved the mouse to the rightmost column of pixels in the image viewer while it was loading a new image (an overflow error, because the vendor had an off-by-one calculation in the mouseover event handling).
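We never saw the vendor's source, so the following is only a guess at the general shape of such a bug, not their actual code. It is a minimal Python sketch that assumes (hypothetically) that the mouse event reports columns numbered 1 to width, while the pixel buffer is indexed from 0. All names here are made up for illustration.

```python
def make_columns(width):
    """Hypothetical pixel buffer: one entry per column, indexed 0 .. width - 1."""
    return list(range(width))


def pixel_under_mouse_buggy(columns, mouse_column):
    # Off-by-one: uses the 1-based mouse column directly as a 0-based index.
    # Every column but the last silently reads its right-hand neighbour;
    # the rightmost column reads one past the end and raises IndexError.
    return columns[mouse_column]


def pixel_under_mouse_fixed(columns, mouse_column):
    # Convert the 1-based mouse column to a 0-based index and bounds-check it.
    index = mouse_column - 1
    if 0 <= index < len(columns):
        return columns[index]
    return None  # mouse is outside the image


if __name__ == "__main__":
    columns = make_columns(4)
    print(pixel_under_mouse_fixed(columns, 4))  # rightmost column, handled
    try:
        pixel_under_mouse_buggy(columns, 4)
    except IndexError as exc:
        print("viewer dies horribly:", exc)
```

A bug of this shape is invisible to a harness that only reloads images, because nothing goes wrong until the mouse actually sits on that last column.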

Apparently, while waiting for the next image to load, the left-handed users all naturally moved their hand (and thus the mouse) towards the keyboard, which for them meant to the right.

The one user who got the error most frequently was one of those ADD types who compulsively moved her mouse around impatiently while waiting for the next page to load. She was moving the mouse to the right much more quickly, and so hit the timing just right, reaching the edge exactly when the load event happened. Until we got a fix from the vendor, we told her to just let go of the mouse after clicking "next document" and not touch it until the image had loaded.

It was henceforth known in legend on the dev team as "The Left Handed Bug".