Garbage Collection Algorithms – How to Prevent Full Memory Scans

algorithmsgarbage-collection

Some (at least Mono's and .NET's) garbage collectors have a short term memory area which they scan often, and a secondary memory area which they scan less often. Mono calls this a nursery.

To find out which objects can be disposed of, they scan all objects starting from roots, the stack and the registers and dispose all objects that aren't being referenced anymore.

My question is how they prevent all in use memory from being scanned on every collect? In principle, the only way to find out what objects aren't in use anymore is to scan all objects and all their references. However, this would prevent the OS from swapping out memory even though it isn't in use by the application and feels like a huge amount of work that needs to be done, also for "Nursery Collection". It doesn't feel like they're winning much by using a nursery.

Am I missing something or is the garbage collector actually scanning every object and every reference every time it does a collection?

Best Answer

The fundamental observations which allow generational garbage-collection to avoid having to scan all older-generation objects are:

  1. After a collection, all objects that still exist will be of some minimum generation (e.g. in .net, after a Gen0 collection, all objects are Gen1 or Gen2; after a Gen1 or Gen2 collection, all objects are Gen2).
  2. An object, or portion thereof, which has not been written since a collection that promoted everything to generation N or higher cannot contain any references to objects of lower generations.
  3. If an object has reached a certain generation, it need not be identified as reachable to ensure its retention when collecting lower generations.

In many GC frameworks, it's possible for the garbage collector to flag objects or portions thereof in such a way that the first attempt to write to them will trigger special code to record the fact that they have been modified. An object or portion thereof which has been modified, regardless of its generation, must be scanned in the next collection, since it may contain references to newer objects. On the other hand, it's very common for there to be a lot of older objects that do not get modified between collections. The fact that lower-generation scans can ignore such objects can allow such scans to complete much more quickly than they otherwise would.

Note, btw, that even if one cannot detect when objects are modified and would have to scan everything on each GC pass, generational garbage collection could still improve the "sweep" stage performance of a compacting collector. In some embedded environments (especially those where there is little or no difference in speed between sequential and random memory accesses), moving blocks of memory around is relatively expensive compared to tagging references. Consequently, even if the "mark" phase can't be sped up using a generational collector, speeding up the "sweep" phase may be worthwhile.