Internal Mutation of Persistent Data Structures in Functional Programming

data-structures, functional-programming, immutability, multithreading

To clarify, when I use the terms persistent and immutable for a data structure, I mean that:

  1. The state of the data structure remains unchanged for its lifetime. It always holds the same data, and the same operations always produce the same results.
  2. The data structure provides Add, Remove, and similar methods that return new objects of its kind, modified as instructed, which may or may not share some of the original object's data.
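
In API terms, the contract might look like the following minimal sketch (the interface and member names are my own, purely for illustration):

    public interface IPersistentVector<T>
    {
        // Reads never change observable state.
        T Get(int index);

        // Updates return a new vector and leave the receiver untouched;
        // the result may or may not share storage with the original.
        IPersistentVector<T> Add(T item);
        IPersistentVector<T> Set(int index, T item);
        IPersistentVector<T> Remove(int index);
    }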

However, while a data structure may appear persistent to its user, it may do other things under the hood. To be sure, all data structures are, at some level, built on mutable storage.

If I were to base a persistent vector on an array and copy it whenever Add is invoked, it would still be persistent, as long as I modified only locally created arrays.

Sometimes, however, you can greatly increase performance by mutating a data structure under the hood in more insidious, dangerous, and destructive ways: ways that leave the abstraction untouched, never letting the user know anything has changed, but that are critical at the implementation level.

For example, say we have a class called ArrayVector implemented using an array. Whenever you invoke Add, you get an ArrayVector built on top of a newly allocated array that holds one additional item. A sequence of n such updates will involve n array copies and allocations.

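As an illustration, here is a minimal copy-on-Add sketch of such an ArrayVector (a naive sketch, not production code):

    using System;

    public sealed class ArrayVector<T>
    {
        private readonly T[] items; // never mutated after construction

        public ArrayVector() : this(new T[0]) { }
        private ArrayVector(T[] items) { this.items = items; }

        public int Count => items.Length;
        public T Get(int index) => items[index];

        // Each Add allocates and copies a whole new array: O(n) per update.
        public ArrayVector<T> Add(T item)
        {
            var copy = new T[items.Length + 1];
            Array.Copy(items, copy, items.Length);
            copy[items.Length] = item;
            return new ArrayVector<T>(copy); // the original vector is untouched
        }
    }
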
However, let's say we implement a lazy mechanism that stores updates such as Add, Set, and so on in a queue. In this case, each update requires constant time (appending an item to the queue), and no array allocation is involved.

When the user tries to get an item, all the queued modifications are applied under the hood, requiring a single array allocation and copy (since we know exactly what data the final array will hold and how big it will be). Future get operations are performed against an empty queue, so they take a single operation. But to implement this, we need to 'switch', that is, mutate, the internal array to the new one and empty the queue: a very dangerous action.
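
Here is one way that might look, as a sketch under simplifying assumptions: Add is the only queued update, the pending updates live in an append-only log shared between versions, and a lock stands in for whatever synchronization you would actually choose. Get flushes the log by mutating private fields:

    using System;
    using System.Collections.Generic;

    // Observably persistent, but mutates its own private state on flush.
    public sealed class LazyArrayVector<T>
    {
        private T[] items;            // swapped for a bigger array on flush
        private int start;            // first log entry not yet folded into items
        private readonly int end;     // this version's length marker in the log
        private readonly List<T> log; // append-only update log, shared by versions
        private readonly object gate; // guards the log and the flush mutation

        public LazyArrayVector()
            : this(new T[0], 0, 0, new List<T>(), new object()) { }

        private LazyArrayVector(T[] items, int start, int end, List<T> log, object gate)
        { this.items = items; this.start = start; this.end = end; this.log = log; this.gate = gate; }

        public int Count { get { lock (gate) return items.Length + (end - start); } }

        // O(1) when called on the newest version: just append to the shared log.
        public LazyArrayVector<T> Add(T item)
        {
            lock (gate)
            {
                if (end == log.Count)
                {
                    log.Add(item);
                    return new LazyArrayVector<T>(items, start, end + 1, log, gate);
                }
                // An older version is being extended: fork its pending entries.
                var fork = log.GetRange(start, end - start);
                fork.Add(item);
                return new LazyArrayVector<T>(items, 0, fork.Count, fork, new object());
            }
        }

        public T Get(int index)
        {
            lock (gate) // exclusive access while we mutate internals
            {
                if (start < end)
                {
                    // One allocation covering all pending updates.
                    var flushed = new T[items.Length + (end - start)];
                    Array.Copy(items, flushed, items.Length);
                    log.CopyTo(start, flushed, items.Length, end - start);
                    items = flushed; // the 'dangerous' internal mutation...
                    start = end;     // ...after which gets read the array directly
                }
                return items[index];
            }
        }
    }

Note that the observable state never changes: a version's contents are always its array plus its slice of the log, and the flush merely re-expresses the same contents in a single array.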

However, considering that in many circumstances this can save a lot of time and memory (most updates are going to occur in sequence, after all), it might be worth it; you will need to ensure exclusive access to the internal state, of course.

This isn't a question about the efficacy of such a data structure. It's a more general question. Is it ever acceptable to mutate the internal state of a supposedly persistent or immutable object in destructive and dangerous ways? Does performance justify it? Would you still be able to call it immutable?

Oh, and could you implement this sort of laziness without mutating the data structure in the specified fashion?

Best Answer

I would be very loath to call a data structure "immutable" unless, once it has been exposed to the outside world, all changes made to its internal state continue at all times to leave the object in the same observable state, and unless the state of the object would be valid with any arbitrary combination of those changes either happening or not happening.

An example of a reasonably good "immutable" object which obeys this principle is the Java string type. It includes a hash-code field which is initially zero, but which is used to memoize the result of querying the hash code. The state of a string whose hash-code field is zero is semantically the same as that of a string whose hash-code field is filled in. If two threads simultaneously attempt to query the hash code, both may end up performing the computation and storing the result, but it doesn't matter, because neither store affects the observable state of the object. If a third thread comes along and queries the hash code, it might or might not see the stores from the first two, but the returned hash code will be the same regardless.
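
In C# terms, the pattern looks roughly like this (a sketch of the idea, not Java's actual implementation):

    public sealed class ImmutableText
    {
        private readonly string value;
        private int hash; // 0 means "not computed yet", as in Java's String

        public ImmutableText(string value) { this.value = value; }

        public override int GetHashCode()
        {
            int h = hash;
            if (h == 0)
            {
                foreach (char c in value)
                    h = 31 * h + c; // the polynomial hash Java's String uses
                // Racy store, but every writer stores the same value, so the
                // object's observable state is the same no matter who wins.
                hash = h;
            }
            return h;
        }
    }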

(BTW, my one quibble with Java's string-hashing method is that it's possible for the hash function to return zero for a non-empty string. I would have thought it better to have the hash function test for zero and substitute something else. For example, if the hashing step is such that appending a single character to a string whose hash is zero will always yield a non-zero hash, one could simply return the hash of the string minus its last character. Otherwise, the worst-case time to hash a long string thousands of times may be much worse than the normal-case time.)
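
Applied to the previous sketch, the guard might look like this; for simplicity it substitutes a fixed non-zero value rather than the prefix-hash trick suggested above:

    public override int GetHashCode()
    {
        int h = hash;
        if (h == 0)
        {
            foreach (char c in value)
                h = 31 * h + c;
            if (h == 0)
                h = 1; // never cache 0, so a string whose real hash is
                       // zero is not recomputed on every single call
            hash = h;
        }
        return h;
    }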

The big things to beware of are (1) sequences of operations which change the state of an object and then change it back, and (2) substitution of objects that appear to have the same state but don't. Ironically, Microsoft's fixing of what it regarded as a bug in its default Struct.Equals method makes #2 harder to protect against. If one has a number of immutable objects which hold references to what appear to be identical immutable objects, replacing all those references with references to one of those immutable objects should be safe. Unfortunately, the Equals overrides for the system types Decimal, Double, and Float will sometimes report true even when comparing slightly different values. It used to be that wrapping one of those types in a struct and calling Equals on that struct would test for true equivalence, but Microsoft changed things so that a struct reports Equals as true if its members do, even when those members hold non-equivalent values of the aforementioned types.
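
The member-level part of that hazard is easy to demonstrate (a sketch; the struct-wrapping behavior has varied across runtime versions, so this sticks to what Double.Equals and Decimal.Equals themselves report):

    using System;

    static class EqualsQuirks
    {
        static void Main()
        {
            double z = 0.0, nz = -0.0;
            Console.WriteLine(z.Equals(nz)); // True, yet the values are distinguishable:
            Console.WriteLine(1.0 / z);      // positive infinity
            Console.WriteLine(1.0 / nz);     // negative infinity

            decimal a = 1.0m, b = 1.00m;
            Console.WriteLine(a.Equals(b));  // True, yet:
            Console.WriteLine(a);            // prints "1.0"
            Console.WriteLine(b);            // prints "1.00"
            // Substituting one for the other changes observable output.
        }
    }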
