“Works on my machine” – How to fix non-reproducible bugs

Very occasionally, despite all testing efforts, I get hit with a bug report from a customer that I simply can't reproduce in the office.

[Image: "Works on my machine" syndrome]
(Apologies to Jeff for the 'borrowing' of the badge)

I have a few "tools" that I can use to try to locate and fix these, but it always feels a bit like I'm knife-and-forking it:

Asking the customer for more and more context (e.g. systeminfo output)
Log files from our application
Ad-hoc tests with the customer to attempt to change the behaviour
Providing the customer with a new build with additional diagnostics
Thinking about the problem in the bath…
Site visit (assuming customer is somewhere warm and sunny)
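
To make the "more context" and "additional diagnostics" steps less ad hoc, a diagnostic build could write an environment snapshot that the customer sends back with the log files. The following is only a minimal sketch in Python: the function names, the JSON layout, and the APP_ environment-variable prefix are all assumptions for illustration, not part of any existing application.

```python
import json
import locale
import os
import platform
import sys
from datetime import datetime, timezone


def collect_environment_snapshot(env_prefixes=("APP_",)):
    """Return a dict describing the machine and runtime.

    The keys and the APP_ prefix are hypothetical; adjust them to whatever
    the application actually depends on.
    """
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version,
        "cwd": os.getcwd(),
        "locale": locale.getlocale(),
        # Only copy variables with a known prefix, to avoid leaking secrets.
        "env": {k: v for k, v in os.environ.items() if k.startswith(env_prefixes)},
    }


def write_snapshot(path):
    """Write the snapshot as JSON so it can be attached to a bug report."""
    snapshot = collect_environment_snapshot()
    with open(path, "w", encoding="utf-8") as f:
        json.dump(snapshot, f, indent=2, default=str)
    return snapshot
```

Comparing a snapshot from the customer's machine with one from a machine in the office gives a concrete list of differences to investigate.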

Are there set procedures, or other techniques that anyone uses, to resolve problems like this?

Best Answer

One of the attributes of good debuggers, I think, is that they always have a lot of weapons in their toolkit. They never seem to get "stuck" for too long and there is always something else for them to try. Some of the things I've been known to do:

ask for memory dumps
install a remote debugger on a client machine
add tracing code to builds
add logging code for debugging purposes
add performance counters
add configuration parameters to various bits of suspicious code so I can turn on and off features
rewrite and refactor suspicious code
try to replicate the issue locally on a different OS or machine
use debugging tools such as Application Verifier
use 3rd party load generation tools
write simulation tools in-house for load generation when the above failed
use tools like GlowCode to analyse memory leaks and performance issues
reinstall the client machine from scratch
get registry dumps and apply them locally
use registry and file watcher tools
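
As a rough illustration of two items from the list above (adding logging for debugging purposes, and adding configuration parameters so suspicious code can be switched on and off), here is a minimal Python sketch. Everything named in it is hypothetical: the APP_FEATURE_<NAME> environment variables, the "diagnostics" logger, and the cached load_records function stand in for whatever the real suspicious code happens to be.

```python
import logging
import os

log = logging.getLogger("diagnostics")


def feature_enabled(name, default=True):
    """Read a hypothetical APP_FEATURE_<NAME> environment variable as a switch."""
    raw = os.environ.get(f"APP_FEATURE_{name.upper()}")
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "on")


def load_records(fetch, cache):
    """A suspicious code path (a cache) that can be bypassed in the field."""
    use_cache = feature_enabled("cache")
    log.debug("load_records: cache %s", "enabled" if use_cache else "disabled")
    if use_cache and "records" in cache:
        return cache["records"]
    records = fetch()
    if use_cache:
        cache["records"] = records
    return records
```

With something like this in a build, the customer can set APP_FEATURE_CACHE=off and retry. If the problem disappears, the cache is implicated without needing another build.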

Eventually, I find the bug just gives up out of some kind of awe at my persistence. Or the client realises that it's probably a machine, client-side install, or configuration issue.