Java Object-Graphs – Storing with Class-Evolution and Transformation

file-storage, java, serialization, storage, versioning

Abstract

A common problem is storing objects (together with their object graph) and loading them back. This is easy as long as the stored object representation matches the executing code. But as time goes by, requirements change and the stored objects no longer match the code. The stored objects should not lose their data, and the client code should work with the latest object model.

So a transformation has to occur somewhere between loading the data and returning an object to the client.

I know that there are libraries such as XStream, Gson, protobuf and Avro. They can load older objects, but as far as I know they just ignore data that no longer matches the fields in the class (maybe I missed something).
(When I talk about storing and serialization I do not mean Java's built-in serialization mechanism.)

Question

So what is the question? I have been researching this for some time, and there seems to be no library that addresses this issue (class evolution) without data loss. I hope to find another working solution here, or an idea for how to implement it myself.

I have some requirements:

  • File-based – I want to be able to store the serialized object on disk
  • Appendable – I want to append multiple objects to one file without loading the whole file in memory again and again
  • Support for multiple versions in one file – A file could contain objects with different versions (only of the same type)
  • Transformation – Data should be accessible by using the same type, even when changed in between.
  • Generic – The mechanism itself has to be generic, so I could use it for different objects (different objects do not get mixed in one file, only different versions of one type).

It would be nice if the storing format is human-readable.

Already stored objects cannot be updated (at least not without huge effort). Think of a long-term archive.
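The file-based, appendable, multi-version and human-readable requirements above could be met by a line-oriented file where every record carries its own version tag. The following is only a minimal sketch under that assumption: `VersionedLog`, `Entry`, and the `<version><TAB><payload>` line layout are hypothetical names and choices, and the payload is assumed to be a single-line text form such as JSON.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical append-only store: one record per line, "<version>\t<payload>". */
final class VersionedLog {

    /** A stored record: the version it was written with, plus its raw payload. */
    static final class Entry {
        final int version;
        final String payload;

        Entry(int version, String payload) {
            this.version = version;
            this.payload = payload;
        }
    }

    /** Appends one record to the end of the file without reading existing content. */
    static void append(Path file, int version, String payload) throws IOException {
        if (payload.indexOf('\n') >= 0 || payload.indexOf('\r') >= 0) {
            throw new IllegalArgumentException("payload must fit on one line");
        }
        String line = version + "\t" + payload + "\n";
        Files.write(file, line.getBytes(StandardCharsets.UTF_8),
                StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    /** Reads the records line by line, keeping each record's version. */
    static List<Entry> readAll(Path file) throws IOException {
        List<Entry> entries = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                int version = Integer.parseInt(line.substring(0, tab));
                entries.add(new Entry(version, line.substring(tab + 1)));
            }
        }
        return entries;
    }
}
```

Because the version travels with each line, old records never have to be rewritten; the reader can decide per record which transformation to apply.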

Example

Here is an example for better illustration. Let's suppose we have two POJOs that we want to serialize.

public class MyPojo {
    String text;
    Long number;
    Integer[] values;
    SubPojo pojo;
}

public class SubPojo {
    List<String> items;
}

In the next version we might have renamed a field (text -> content), changed a type (Integer[] -> List<Integer>), and turned a single field into a list (SubPojo -> List<SubPojo>), where the previous value becomes the first element of the new list (not losing data, just transforming it into the new representation).

public class MyPojo {
    String content;
    Long number;
    List<Integer> values;
    List<SubPojo> pojo;
}

public class SubPojo {
    List<String> items;
}

Some pseudocode showing how the client might make use of this:

// Write
Serializer ser = new Serializer();
MyPojo pojo = new MyPojo();
pojo.xxx = ...; // set fields
ser.store(pojo, file, append);

// Read (a version later)
Serializer ser = new Serializer();
ser.registerTransformer(new TransformerV1ToV2());
List<MyPojo> pojos = ser.load(file);

This approach has some drawbacks:

  • The transformers have to work on some kind of intermediate format (this could be the underlying storage format such as JSON or XML)
  • You don't know what the class format looked like at a given point in time, since each transformer only works relative to the previous version and the result is mapped to the final class, which makes errors hard to track down
  • Performance (depending on how the transformation happens)
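To make the transformer idea concrete, here is a minimal sketch of chained transformers that operate on an intermediate representation. Everything in it is an assumption for illustration: the `Transformer` interface, the `Migrator` class, and the choice of `Map<String, Object>` holding plain Java values as the intermediate format are hypothetical, not part of any existing library.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical step that upgrades a record from one version to the next. */
interface Transformer {
    int fromVersion();

    Map<String, Object> apply(Map<String, Object> old);
}

/** Upgrades the MyPojo example from version 1 to version 2. */
final class TransformerV1ToV2 implements Transformer {
    public int fromVersion() {
        return 1;
    }

    public Map<String, Object> apply(Map<String, Object> old) {
        Map<String, Object> next = new HashMap<>(old);
        // Rename: text -> content
        next.put("content", next.remove("text"));
        // Type change: Integer[] -> List<Integer>
        Integer[] values = (Integer[]) old.get("values");
        if (values != null) {
            next.put("values", new ArrayList<>(Arrays.asList(values)));
        }
        // Single SubPojo -> List<SubPojo>; the old value becomes the first element
        List<Object> subs = new ArrayList<>();
        Object sub = old.get("pojo");
        if (sub != null) {
            subs.add(sub);
        }
        next.put("pojo", subs);
        return next;
    }
}

/** Applies the registered steps one after another until the target version is reached. */
final class Migrator {
    private final Map<Integer, Transformer> steps = new HashMap<>();

    void register(Transformer t) {
        steps.put(t.fromVersion(), t);
    }

    Map<String, Object> migrate(Map<String, Object> data, int version, int target) {
        Map<String, Object> current = data;
        for (int v = version; v < target; v++) {
            Transformer step = steps.get(v);
            if (step == null) {
                throw new IllegalStateException("no transformer from version " + v);
            }
            current = step.apply(current);
        }
        return current;
    }
}
```

Each step only knows its predecessor's layout, which is exactly the second drawback listed above: to see what version 2 looked like, you have to read the V1-to-V2 step rather than a class definition.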

Best Answer

I use the Externalizable interface to solve this problem. It doesn't really meet your Appendable requirement, but it might get you started.

Externalizable lets you write each object out yourself. What I do is include a version number for each class. Then, when I change the class, I adjust the readExternal method so it can read the new format as well as the old ones, and change the writeExternal method to write the new format with the new version number. The code that reads old formats may need to change to fill new fields (and to stop filling removed fields). Then I'm ready to go.

There's usually some sort of object hierarchy in the graph, so if a level 2 class is seriously modified, the new write/read for that class can output/input something completely different from the old one, with completely new lower-level classes.

Some care is needed because sometimes there isn't a hierarchy. You have to remember that, while reading, you can't set a field by pulling data from objects in other fields, because those objects may not be fully read yet. (Sometimes I add a setUp method to all my classes that gets called after the top-level readObject call, so the objects in the graph can get themselves organized after the whole graph has been read.)
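The pitfall and the setUp workaround can be sketched with a small cyclic graph. This is an illustration only: `Parent`, `Child`, and `GraphIo` are hypothetical classes, and the sketch assumes `name` is never null when writing.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

final class Parent implements Externalizable {
    Child child;
    String name;

    public Parent() {}   // Externalizable needs a public no-arg constructor

    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeObject(child);   // the child is written before the name
        out.writeUTF(name);
    }

    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        child = (Child) in.readObject();
        name = in.readUTF();
    }

    /** Called once after the whole graph has been read. */
    void setUp() {
        if (child != null) {
            child.setUp();
        }
    }
}

final class Child implements Externalizable {
    Parent parent;
    String label;   // derived from parent.name, never written

    public Child() {}

    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeObject(parent);   // back-reference to the parent
    }

    public void readExternal(ObjectInput in) throws IOException, ClassNotFoundException {
        parent = (Parent) in.readObject();
        // parent.name is still null here: the parent has only been partly read,
        // so the label cannot be computed yet.
    }

    void setUp() {
        label = "child of " + parent.name;
    }
}

final class GraphIo {
    static byte[] save(Parent root) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(buffer)) {
            out.writeObject(root);
        }
        return buffer.toByteArray();
    }

    static Parent load(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            Parent root = (Parent) in.readObject();
            root.setUp();   // the graph is complete, derived fields can be filled now
            return root;
        }
    }
}
```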

And it sometimes helps to write out objects manually using a method that just writes out plain bytes. (But on the read you have to know precisely what you are reading, whereas readExternal will read a complete object of any class.)
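As a rough sketch of the plain-bytes variant, the class below writes a version number followed by its fields in a fixed order using `DataOutputStream`, and reads them back in exactly that order. `Reading` and its fields are hypothetical, as is the assumption that `unit` was added in version 2 and defaults to "unknown" for older records.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Hypothetical record stored as plain bytes: version first, then fields in fixed order. */
final class Reading {
    static final short CURRENT_VERSION = 2;

    long timestamp;
    String unit;   // assumed to have been added in version 2

    byte[] toBytes() throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeShort(CURRENT_VERSION);   // always write the newest format
            out.writeLong(timestamp);
            out.writeUTF(unit);
        }
        return buffer.toByteArray();
    }

    static Reading fromBytes(byte[] bytes) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            short version = in.readShort();
            if (version < 1 || version > CURRENT_VERSION) {
                throw new IOException("unsupported version " + version);
            }
            Reading r = new Reading();
            // No class information is stored, so the reader must know order and types.
            r.timestamp = in.readLong();
            r.unit = version >= 2 ? in.readUTF() : "unknown";   // default for old records
            return r;
        }
    }
}
```

The format is compact, but nothing in the bytes describes itself, so the version number is the only thing telling the reader which layout follows.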

Code looks something like:

// version 3 -- November 16, 2013
// version 2 -- March 22, 2013
// version 1 -- April 1, 2012
public void writeExternal( ObjectOutput out )  throws IOException  {
    out.writeShort( 3 );   // always write the newest version number first
    out.writeLong( longData );
    out.writeObject( something );
    // ....
}

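public void readExternal( ObjectInput in )  throws IOException,
             ClassNotFoundException  {
    short version = in.readShort();
    if (version > 3)  {
        // Admit program is too old to read file.
        throw new IOException( "Unsupported version: " + version );
    }
    else if (version == 3)  {
        longData = in.readLong();
        something = in.readObject();   // cast to the field's declared type if needed
    }
    else if (version == 2)  {
        longData = 5;                  // not read from the stream for version 2; use a default
        something = in.readObject();
    }
    else
       ...
}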