C++ – Debugging memory corruption


First off, I do realize this is not a perfect Q&A style question with an absolute answer, but I can't think of any wording to make it work better. I don't think there is an absolute solution to this and this is one of the reasons why I'm posting it here instead of Stack Overflow.

Over the last month I have been rewriting a fairly old piece of server code (MMORPG) to be more modern and easier to extend/mod.
I started with the network portion and integrated a third-party library (libevent) to handle stuff for me.
With all the refactoring and code changes I introduced memory corruption somewhere, and I have been struggling to find out where it happens.

I can't seem to reliably reproduce it in my dev/test environment. Even with primitive bots simulating some load I no longer get crashes (I fixed a libevent issue which caused some of them).

I have tried so far:

Valgrinding the hell out of it: no invalid writes are reported until the thing crashes (which may take a day or more in production, or just an hour). That really baffles me. Surely at some point it would access invalid memory rather than only overwrite valid memory by chance? (Is there a way to "spread out" the address range?)

Code analysis tools, namely Coverity and cppcheck. While they did point out some nastiness and edge cases in the code, there was nothing serious.

Recording the process until it crashes with gdb (via undodb) and then working my way backwards. This /sounds/ like it should be doable, but I either end up crashing gdb by using the auto-complete feature or I end up in some internal libevent structure where I get lost since there are too many possible branches (one corruption causing another and so on).
I guess it would be nice if I could see what a pointer originally belonged to or where it was allocated; that would eliminate most of the branching issues. I can't run valgrind with undodb though, and the normal gdb record is unusably slow (if it even works in combination with valgrind).

Code review! By myself (thoroughly) and by having some friends look over my code, though I doubt it was thorough enough. I was thinking about maybe hiring a dev to do some code review/debugging with me, but I can't afford to put too much money into it, and I wouldn't know where to look for someone qualified, let alone someone willing to work for little to no money if they don't find the issue.

I should also note: I usually get consistent backtraces.
There are a few places where the crash happens, mostly related to the socket class becoming corrupted somehow. Either an invalid pointer points to something which isn't a socket, or the socket class itself gets overwritten (partially?) with gibberish. I suspect it crashes there the most only because that's one of the most used parts, so it's the first corrupted memory which gets used.

All in all this issue has kept me busy for nearly two months (on and off, it's more of a hobby project) and is really frustrating me to the point where I become grumpy IRL and think about just giving up.
I just can't think about what else I am supposed to do to find the issue.

Are there any useful techniques I missed?
How do you deal with that? (It can't be that common since there isn't much information about this.. or I'm just really blind?)

Edit:

Some specs in case it matters:

Using C++11 with GCC 4.7 (the version supplied by Debian wheezy)

The codebase is around 150k lines

Edit in response to david.pfx post: (sorry for the slow response)

Are you keeping careful records of crashes, to look for patterns?

Yes, I still have dumps of the recent crashes lying around

Are the few places really similar? In what way?

Well, in the most recent version (they seem to change whenever I add/remove code or change related structures) it would always get caught in an item timer method. Basically an item has a specific time after which it expires and it sends updated info to the client.
The invalid socket pointer would be in the (still valid as far as I can tell) Player class, mostly related to that.
I am also experiencing loads of crashes in the cleanup phase, after the normal shutdown where it's destroying all the static classes that haven't been explicitly destroyed (__run_exit_handlers in the backtrace). Mostly involving std::map of one class, guessing that's just the first thing that comes up though.

What does the corrupt data look like? Zeros? Ascii? Patterns?

I haven't found any patterns yet, seems somewhat random to me. It's hard to tell since I don't know where the corruption started.

Is it heap-related?

It's entirely heap-related (I enabled gcc's stack guard and that didn't catch anything).

Does the corruption happen after a free()?

You're going to have to elaborate a bit on that one.
Do you mean having pointers of already free'd objects lying around?
I'm setting every reference to null once the object gets destroyed, so unless I missed something somewhere, no. That should show up in valgrind though, and it didn't.

Is there something distinctive about the network traffic (buffer size, recovery cycle)?

The network traffic consists of raw data: char arrays, (u)intX_t, or packed (to remove padding) structs for more complex things. Each packet has a header consisting of an id and the packet size itself, which is validated against the expected size.
Packets are around 10-60 bytes, with the biggest (an internal 'bootup' packet, fired once at startup) having a size of a few MB.
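A minimal sketch of the header check described above, to make clear where a size mismatch would be caught. The field names and widths, the `expected_size` table and the packet ids are assumptions for illustration, not the actual protocol.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical wire header: field names and widths are assumptions.
#pragma pack(push, 1)
struct PacketHeader {
    std::uint16_t id;    // packet type
    std::uint16_t size;  // total packet size in bytes, header included
};
#pragma pack(pop)

enum class ParseResult { Ok, NeedMoreData, BadSize };

// Hypothetical table of the size each packet id is expected to have.
// Returns 0 for unknown ids.
inline std::size_t expected_size(std::uint16_t id) {
    switch (id) {
        case 1: return sizeof(PacketHeader) + 8;   // made-up small packet
        case 2: return sizeof(PacketHeader) + 32;  // made-up larger packet
        default: return 0;
    }
}

// Validates the packet at the front of buf without reading past len.
inline ParseResult check_packet(const unsigned char* buf, std::size_t len) {
    if (len < sizeof(PacketHeader)) return ParseResult::NeedMoreData;

    PacketHeader hdr;
    std::memcpy(&hdr, buf, sizeof hdr);  // copy instead of casting the buffer

    const std::size_t want = expected_size(hdr.id);
    if (want == 0 || hdr.size != want) return ParseResult::BadSize;
    if (len < want) return ParseResult::NeedMoreData;
    return ParseResult::Ok;
}
```

The point of the sketch is the order of the checks: the declared size is compared with the expected size before any payload byte is touched, and the payload is only used once the buffer is known to hold all of it.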

Lots and lots of production asserts. Crash early and predictably before the damage propagates.

I once had a crash related to std::map corruption. Each entity has a map of its "view": each entity that can see it, and vice versa, is in that map.
I added a 200-byte buffer in front of and after it, filled it with 0x33 and checked it before each access. The corruption just magically vanished; I must have moved something around which made it corrupt something else.
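For reference, the guard approach described above could look roughly like this. The 200-byte size and the 0x33 fill come from the description; the `Guarded` name and its interface are made up for this sketch.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <map>

// Wraps a value between two guard regions filled with a known byte.
template <typename T, std::size_t GuardBytes = 200, unsigned char Fill = 0x33>
class Guarded {
public:
    Guarded() {
        std::memset(front_, Fill, GuardBytes);
        std::memset(back_, Fill, GuardBytes);
    }

    // True while both guard regions still hold the fill byte.
    bool intact() const { return clean(front_) && clean(back_); }

    // Checked access: abort at the first use after a guard was damaged.
    T& get(const char* where) {
        if (!intact()) {
            std::fprintf(stderr, "guard damaged, detected at %s\n", where);
            std::abort();
        }
        return value_;
    }

    // Exposed only so a test can simulate an overrun.
    unsigned char* back_guard_for_testing() { return back_; }

private:
    static bool clean(const unsigned char* g) {
        for (std::size_t i = 0; i < GuardBytes; ++i)
            if (g[i] != Fill) return false;
        return true;
    }

    unsigned char front_[GuardBytes];
    T value_;
    unsigned char back_[GuardBytes];
};
```

A member such as `Guarded<std::map<int, int> > view;` would then be accessed through `view.get("item timer")`. As observed above, adding guards changes the memory layout, so guards that stay intact only show that this particular object is no longer the one being hit.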

Strategic logging, so you know accurately what was happening just before. Add to the logging as you get closer to an answer.

It works... to an extent.

In desperation, can you save state and auto-restart? I can think of a few pieces of production software that do that.

I somewhat do that. The software consists of a main "cache" process and some other worker processes which all access the cache to get and save stuff. So per crash I don't lose much progress, but it still disconnects all the users and so on, so it's definitely not a solution.

Concurrency: threading, race conditions, etc

There's a MySQL thread to do "async" queries. That's all untouched though, and it only shares information with the database class through functions which all lock.

Interrupts

There's an interrupt timer to prevent it from locking up that just aborts if it didn't complete a cycle for 30 seconds, that code should be safe though:

if (!tics) {
    abort();    // no cycle completed since the last timer check
} else {
    tics = 0;
}

tics is declared as volatile int tics = 0; and is incremented each time a cycle is completed. This is old code too.
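One detail worth checking in that handler: if it runs as a signal handler, the type the C++ standard guarantees can be shared with it safely is `volatile std::sig_atomic_t`, not `volatile int` (on common Linux targets the two are usually the same type, so this is about correctness on paper more than a likely culprit). A minimal sketch of the same watchdog under that assumption follows. The function names are made up, and how the timer is installed is deliberately left out since the original does not show it.

```cpp
#include <csignal>
#include <cstdlib>

// Written by the main loop, read and reset by the handler.
volatile std::sig_atomic_t tics = 0;

// Decision step, split out so it can be exercised without signals:
// returns true when no cycle completed since the previous check.
inline bool watchdog_stalled() {
    if (tics == 0) return true;
    tics = 0;
    return false;
}

// Handler for the periodic timer signal (which signal is an assumption).
extern "C" void watchdog_handler(int) {
    if (watchdog_stalled()) std::abort();
}

// Called by the main loop at the end of every cycle. Setting a flag
// instead of incrementing avoids a read-modify-write that the handler
// could interrupt, and it can never overflow.
inline void cycle_completed() { tics = 1; }
```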

events/callbacks/exceptions: corrupting state or the stack unpredictably

Lots of callbacks are being used (async network I/O, timers), but they shouldn't do anything bad.

Unusual data: unusual input data/timing/state

I've had a few edge cases related to that. Disconnecting a socket while packets are still being processed resulted in accessing a nullptr and such, but those have been easy to spot so far, since every reference gets cleaned up right after telling the class itself it's done. (Destruction itself is handled by a loop which deletes all the destroyed objects each cycle.)
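To illustrate the deferred-destruction scheme described above, here is a rough sketch in which holders keep a `std::weak_ptr` instead of a raw pointer, so a reference that was never reset expires by itself rather than dangling. This is a swapped-in variant, not the existing code: `Socket`, `Player` and `SocketRegistry` are stand-ins.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

struct Socket {                    // stand-in for the real socket class
    int fd;
    explicit Socket(int f) : fd(f) {}
};

struct Player {                    // stand-in for the real Player class
    std::weak_ptr<Socket> socket;  // non-owning, expires on its own
};

class SocketRegistry {
public:
    // The registry is the only owner; callers get a non-owning handle.
    std::weak_ptr<Socket> open(int fd) {
        live_.push_back(std::make_shared<Socket>(fd));
        return live_.back();
    }

    // Marks a socket as finished; the object stays alive until reap().
    void close(const std::weak_ptr<Socket>& s) {
        std::shared_ptr<Socket> p = s.lock();
        if (!p) return;
        for (std::size_t i = 0; i < live_.size(); ++i) {
            if (live_[i] == p) {
                doomed_.push_back(live_[i]);
                live_.erase(live_.begin() + i);
                return;
            }
        }
    }

    // Called once per main-loop cycle: actually destroys closed sockets.
    void reap() { doomed_.clear(); }

private:
    std::vector<std::shared_ptr<Socket> > live_;
    std::vector<std::shared_ptr<Socket> > doomed_;
};
```

Code that uses the socket would call `lock()` and bail out when it returns an empty pointer. A packet that is still being processed when the disconnect arrives keeps working until the end of the cycle, and any later access fails in a defined way instead of reading freed memory.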

Dependency on an asynchronous external process.

Care to elaborate? This is somewhat the case, the cache process mentioned above.
Only thing I could imagine off the top of my head would be it not finishing quick enough and using garbage data, but that's not the case since that's using network too. Same packet model.

Best Answer

It's a challenging problem but I suspect there are a lot more clues to be found in the crashes you've already seen.

  • Are you keeping careful records of crashes, to look for patterns?
  • Are the few places really similar? In what way?
  • What does the corrupt data look like? Zeros? Ascii? Patterns?
  • Is there any multi-threading involved? Could it be a race condition?
  • Is it heap-related? Does the corruption happen after a free()?
  • Is it stack-related? Does the stack get corrupted?
  • Is a dangling reference a possibility? A data value that mysteriously changed?
  • Is there something distinctive about the network traffic (buffer size, recovery cycle)?

Things we have used in similar situations.

  • Lots and lots of production asserts. Crash early and predictably before the damage propagates.
  • Lots and lots of guards. Extra data items before and after local variables, objects and malloc() blocks, set to a known value and then checked often.
  • Strategic logging, so you know accurately what was happening just before. Add to the logging as you get closer to an answer.
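As a rough illustration of the first point in the list above, a production assert is a check that stays enabled in release builds (unlike `assert()`, which `-DNDEBUG` removes) and aborts while the evidence is still fresh. The `VERIFY` macro, the magic value and the `Socket` stand-in are all made up for this sketch.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstdlib>

// Always-on check that logs the failed condition and aborts.
#define VERIFY(cond, msg)                                              \
    do {                                                               \
        if (!(cond)) {                                                 \
            std::fprintf(stderr, "VERIFY failed: %s (%s) at %s:%d\n",  \
                         #cond, msg, __FILE__, __LINE__);              \
            std::abort(); /* core dump close to the cause */           \
        }                                                              \
    } while (0)

const std::uint32_t kSocketMagic = 0x534F434Bu;  // arbitrary tag value

struct Socket {                    // stand-in for the real socket class
    // volatile so the compiler keeps the store in the destructor
    volatile std::uint32_t magic;
    int fd;
    explicit Socket(int f) : magic(kSocketMagic), fd(f) {}
    ~Socket() { magic = 0xDEADDEADu; }  // marks the object as destroyed
};

// Cheap plausibility test to run before every use of a socket pointer.
inline bool looks_valid(const Socket* s) {
    return s != nullptr && s->magic == kSocketMagic;
}
```

A call site would then read `VERIFY(looks_valid(player_socket), "item timer");`. A stale pointer may still pass by accident if the freed memory has not been reused, so this narrows the window rather than closing it.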

In desperation, can you save state and auto-restart? I can think of a few pieces of production software that do that.

Feel free to add details if we can help at all.


Can I just add that seriously indeterminate bugs like this are not all that common, and there are not many things that can (usually) cause them. They include:

  • Concurrency: threading, race conditions, etc
  • Interrupts/events/callbacks/exceptions: corrupting state or the stack unpredictably
  • Unusual data: unusual input data/timing/state
  • Dependency on an asynchronous external process.

These are the parts of the code to focus on.
