Java Object-Graphs – Storing with Class-Evolution and Transformation

Tags: file-storage, java, serialization, storage, versioning

Abstract

A common problem is to store objects (with their graph) and to load them back. This is easy as long as the stored object representation matches the executing code. But as time goes by, requirements change and the stored objects no longer match the code. The stored objects should not lose their data, and the client code should work with the latest object model.

So a transformation has to occur somewhere between loading the data and returning an object to the client.

I know that there are libraries such as XStream, gson, protobuf and avro. They can load older objects, but as far as I know they simply ignore data that no longer matches the fields in the class (maybe I missed something).
(When I talk about storing and serialization, I do not mean Java's built-in serialization mechanism.)

Question

So what is the question? I have been researching this for some time, and there seems to be no library that addresses this issue (class evolution) without data loss. I hope to find a working solution here, or an idea of how to implement it myself.

I have some requirements:

File-based – I want to be able to store the serialized object on disk
Appendable – I want to append multiple objects to one file without loading the whole file in memory again and again
Support for multiple versions in one file – A file could contain objects of different versions (but only of the same type)
Transformation – Data should be accessible by using the same type, even when changed in between.
Generic – The mechanism itself has to be generic, so I could use it for different objects (different objects do not get mixed in one file, only different versions of one type).

It would be nice if the storage format were human-readable.

Objects that are already stored cannot be updated (at least not without huge effort). Think of a long-term archive.

Example

Here is an example for better illustration. Let's suppose we have two POJOs that we want to serialize.

public class MyPojo {
    String text;
    Long number;
    Integer[] values;
    SubPojo pojo;
}

public class SubPojo {
    List<String> items;
}

In the next version we might have renamed a field (text -> content), changed a type (Integer[] -> List<Integer>), and turned a single field into a list (SubPojo -> List<SubPojo>), where the previous value becomes the first element of the new list (not losing data, just transforming it to the new representation).

public class MyPojo {
    String content;
    Long number;
    List<Integer> values;
    List<SubPojo> pojo;
}

public class SubPojo {
    List<String> items;
}

Some pseudocode showing how the client might make use of this:

// Write
Serializer ser = new Serializer();
MyPojo pojo = new MyPojo();
pojo.xxx = ...; // set fields
ser.store(pojo, file, append);

// Read (a version later)
Serializer ser = new Serializer();
ser.registerTransformer(new TransformerV1ToV2());
List<MyPojo> pojos = ser.load(file);

This approach has some drawbacks:

The transformers have to work on some kind of intermediate format (this could be the underlying storage format, such as JSON or XML)
You don't know what the class looked like at a given point in time, since each transformer only works relative to the previous version and the result is mapped to the final class, which makes tracking down errors hard
Performance (depending on how the transformation happens)
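
To make the first drawback concrete, here is a minimal sketch of a transformer chain that works on an intermediate representation. It assumes each stored record has already been parsed into a Map<String, Object> (for example by a JSON library) and carries a "version" entry. The names RecordTransformer, V1ToV2 and Migrations are hypothetical and do not come from any of the libraries mentioned above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical interface: upgrades a parsed record by exactly one version.
// apply() is expected to return a new, mutable map.
interface RecordTransformer {
    int fromVersion();
    Map<String, Object> apply(Map<String, Object> record);
}

// Upgrades MyPojo records from version 1 to version 2 without dropping data.
class V1ToV2 implements RecordTransformer {
    @Override
    public int fromVersion() {
        return 1;
    }

    @Override
    public Map<String, Object> apply(Map<String, Object> v1) {
        Map<String, Object> v2 = new LinkedHashMap<>(v1);
        // Renamed field: text -> content.
        v2.put("content", v2.remove("text"));
        // Single SubPojo -> List<SubPojo>; the old value becomes the first element.
        Object oldPojo = v2.remove("pojo");
        List<Object> pojos = new ArrayList<>();
        if (oldPojo != null) {
            pojos.add(oldPojo);
        }
        v2.put("pojo", pojos);
        // "values" needs no change here: an array and a list look the same
        // in a JSON-like intermediate format.
        return v2;
    }
}

// Applies registered transformers one after another until the target version is reached.
class Migrations {
    private final Map<Integer, RecordTransformer> byVersion = new LinkedHashMap<>();

    void register(RecordTransformer transformer) {
        byVersion.put(transformer.fromVersion(), transformer);
    }

    Map<String, Object> upgrade(Map<String, Object> record, int targetVersion) {
        Map<String, Object> current = new LinkedHashMap<>(record);
        int version = (Integer) current.get("version");
        while (version < targetVersion) {
            RecordTransformer step = byVersion.get(version);
            if (step == null) {
                throw new IllegalStateException("No transformer from version " + version);
            }
            current = step.apply(current);
            version++;
            current.put("version", version);
        }
        return current;
    }
}
```

After the upgrade, the map would still have to be bound to the current MyPojo class, which is where the second drawback shows up: only the chain of transformers documents what older versions looked like.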

Best Answer

I use the Externalizable interface to solve this problem. It doesn't really meet your Appendable requirement, but it might get you started.

Externalizable lets you write each object out yourself. What I do is include a version number for each class. Then, when I change the class, I adjust the readExternal method so it can read the new format as well as the old ones, and change the writeExternal method to write the new format with the new version number. The code that reads old formats may need to change to fill new fields (and to stop filling removed fields). Then I'm ready to go.

There's usually some sort of object hierarchy in the graph, so if a level 2 class is seriously modified, the new write/read for that class can output/input something completely different from the old one, with completely new lower-level classes.

Some care is needed because sometimes there isn't a hierarchy. Remember that while an object is being read, you can't set one of its fields by pulling data from objects held in its other fields, because those objects may not be fully read yet. (Sometimes I add a setUp method to all my classes that gets called after the top-level readObject returns, so the objects in the graph can get themselves organized after the whole graph has been read.)
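
A minimal sketch of such a post-load hook, assuming a hypothetical PostLoad interface and two illustrative classes (Order and Item) that are not part of the answer. Derived state such as back references and totals is rebuilt once the whole graph exists, instead of inside the read methods.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical hook: called on each object after the whole graph has been loaded.
interface PostLoad {
    void setUp();
}

class Item implements PostLoad {
    long price;
    Order owner; // back reference, rebuilt in Order.setUp() rather than stored

    Item(long price) {
        this.price = price;
    }

    @Override
    public void setUp() {
        // Nothing derived at this level in this sketch.
    }
}

class Order implements PostLoad {
    final List<Item> items = new ArrayList<>();
    long total; // derived from items, so it is recomputed instead of stored

    @Override
    public void setUp() {
        total = 0;
        for (Item item : items) {
            item.owner = this; // safe now: every item has been fully read
            item.setUp();
            total += item.price;
        }
    }
}
```

The loading code would call setUp() on the top-level object once, right after the read returns, and each object passes the call down to its children.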

And it sometimes helps to write out objects manually using a method that just writes out plain bytes. (But on the read side you then have to know precisely what you are reading, whereas readObject will read a complete object of any class.)
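
One possible reading of "plain bytes", sketched under assumptions: a hypothetical Note class with a hand-written record layout in which every record starts with its own version number. Because records are self-describing, a new record can be appended to a file without reading the existing content, which may also help with the Appendable requirement from the question. The layouts for versions 1 and 2 are invented for illustration.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.ArrayList;
import java.util.List;

// Hypothetical record type with a hand-written, versioned byte layout.
class Note {
    static final short CURRENT_VERSION = 2;

    final String content;
    final long number;

    Note(String content, long number) {
        this.content = content;
        this.number = number;
    }

    // Version 2 layout: short version, int text length, UTF-8 text, long number.
    static byte[] encode(Note note) {
        byte[] text = note.content.getBytes(StandardCharsets.UTF_8);
        ByteBuffer buf = ByteBuffer.allocate(Short.BYTES + Integer.BYTES + text.length + Long.BYTES);
        buf.putShort(CURRENT_VERSION).putInt(text.length).put(text).putLong(note.number);
        return buf.array();
    }

    // Appends one record without reading or rewriting the existing file content.
    static void append(Path file, Note note) throws IOException {
        Files.write(file, encode(note), StandardOpenOption.CREATE, StandardOpenOption.APPEND);
    }

    // Decodes all records; each record may have been written by a different version.
    static List<Note> decodeAll(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        List<Note> notes = new ArrayList<>();
        while (buf.hasRemaining()) {
            short version = buf.getShort();
            if (version == 1) {
                // Assumed version 1 layout: text only, so number gets a default.
                notes.add(new Note(readText(buf), 0L));
            } else if (version == 2) {
                String content = readText(buf);
                notes.add(new Note(content, buf.getLong()));
            } else {
                throw new IllegalStateException("Unsupported record version " + version);
            }
        }
        return notes;
    }

    private static String readText(ByteBuffer buf) {
        byte[] text = new byte[buf.getInt()];
        buf.get(text);
        return new String(text, StandardCharsets.UTF_8);
    }
}
```

Reading a file back could be as simple as Note.decodeAll(Files.readAllBytes(file)); for large archives a streaming reader would be the better choice, since this variant loads the whole file for reading.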

Code looks something like:

// version 3 -- November 16, 2013
// version 2 -- March 22, 2013
// version 1 -- April 1, 2012
public void writeExternal( ObjectOutput out )  throws IOException  {
    out.writeShort( 3 );
    out.writeLong( longData );
    out.writeObject( something );
    ....
}

public void readExternal( ObjectInput in )  throws IOException,
             ClassNotFoundException  {
    short version = in.readShort();
    if (version > 3)  {
        // Admit program is too old to read file.
        throw new IOException( "Unsupported format version " + version );
    }
    else if (version == 3)  {
        longData = in.readLong();
        something = in.readObject();
    }
    else if (version == 2)  {
        // Older format: longData is set to a default instead of being read.
        longData = 5;
        something = in.readObject();
    }
    else
       ...
}
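
For completeness, here is a small self-contained variant of the same pattern that can be compiled and run, including a round trip through ObjectOutputStream and ObjectInputStream. The Settings class, its fields, the default timeout and the SettingsDemo helper are hypothetical; only the version-number technique is taken from the answer.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.Externalizable;
import java.io.IOException;
import java.io.ObjectInput;
import java.io.ObjectInputStream;
import java.io.ObjectOutput;
import java.io.ObjectOutputStream;

// Hypothetical class: version 1 stored only the name, version 2 added timeoutMillis.
class Settings implements Externalizable {
    private static final short CURRENT_VERSION = 2;

    String name = "";
    long timeoutMillis;

    // Externalizable requires a public no-argument constructor.
    public Settings() {
    }

    @Override
    public void writeExternal(ObjectOutput out) throws IOException {
        out.writeShort(CURRENT_VERSION);
        out.writeUTF(name);
        out.writeLong(timeoutMillis);
    }

    @Override
    public void readExternal(ObjectInput in) throws IOException {
        short version = in.readShort();
        if (version < 1 || version > CURRENT_VERSION) {
            throw new IOException("Unsupported Settings version " + version);
        }
        name = in.readUTF();
        // Version 1 had no timeout, so fall back to an assumed default.
        timeoutMillis = (version >= 2) ? in.readLong() : 5_000L;
    }
}

class SettingsDemo {
    // Serializes the object and reads it back, as a later program run would.
    static Settings roundTrip(Settings original) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(original);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (Settings) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Note that this still goes through Java's object streams, so it inherits their file format; the version number only controls what each class writes inside that format.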

Related Solutions

Java – SQL RDBMS : one query or multiple calls

You should put correctness first. Create your data structures so that they model the domain in question in a correct and effective way that makes it easy for your code to work with.

Beyond that, try to minimize database calls, especially if the database is not local (residing on the same machine as the program calling it). Network latency is a real consideration here, and it can be non-trivial.

Let's say you have an operation that requires 10 database calls. If your network latency is 100 ms, this operation will take 1 second of pure overhead just communicating with the server, in addition to whatever amount of time it takes to actually do the work involved. If your latency is 1 second, it will waste 10 seconds on network latency alone. But if you get that down from 10 calls to 1, suddenly even in really ugly latency conditions, you're not wasting much time on network overhead.
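
The arithmetic above can be written down as a tiny sketch. It assumes the calls are issued one after another and ignores the time the queries themselves take; LatencyEstimate is a made-up name.

```java
class LatencyEstimate {
    // Network overhead of sequential round trips, ignoring query execution time.
    static long overheadMillis(int calls, long latencyMillis) {
        return calls * latencyMillis;
    }

    public static void main(String[] args) {
        System.out.println(overheadMillis(10, 100));  // 1000 ms: 10 calls at 100 ms each
        System.out.println(overheadMillis(10, 1000)); // 10000 ms: 10 calls at 1 s each
        System.out.println(overheadMillis(1, 1000));  // 1000 ms: one combined call at 1 s
    }
}
```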

As a general rule of thumb, if you're just retrieving data simply (and not doing heavy processing of the data inside the database server or on the client), the biggest bottleneck by far in a system with a non-local database will be network latency. So if you can reduce the number of calls, even if it means you need to do extra work on the client side once you've retrieved it, you'll still probably come out ahead.

As always, remember the most important rule of optimization: measure first! Optimize by hard data, not by rules of thumb like the one I just described, or you could easily end up doing a lot of hard work that slows things down! But in general, keeping the number of queries down is usually the best route.

Reviewing Object-Oriented Parser Models in Java

Data is now (after parsing) stored in just a list, so searching for a specific piece of data is not very well optimized. Are there other structures I could use, which I can search using file (string) - level (integer) - sub-level (integer), so I can quickly get a specific data object?

Assuming that you want to search by file, level and sub-level only, that description already gives you a clear hierarchical structure, so you can split the single ArrayList into multiple steps. You could even make each hierarchy level a class of its own to make it clearer, for example:

class SomeFileData {
   List<LevelData> levels;
}

class LevelData {
   List<SubLevelData> sublevels;
}

class SubLevelData {
   // Probably similar to your current `FileData` implementation
}

And if you need to save multiple files at once, you can use a Map with the file name as key.

class LotsOfFileData {
    Map<String, SomeFileData> files; 
}

Using this hierarchy, you can easily get file "mydata.dat", level 3, sublevel 4.

LotsOfFileData allData;
allData.getFile("mydata.dat").getLevel(3).getSublevel(4);

Each step in the get process looks at the current class' list or map to retrieve the data.
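
A compact sketch of how those getters might look, assuming zero-based list indexes for level and sub-level. The add/put methods and the payload field are hypothetical additions for illustration.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SubLevelData {
    final String payload; // placeholder for the real parsed data

    SubLevelData(String payload) {
        this.payload = payload;
    }
}

class LevelData {
    private final List<SubLevelData> sublevels = new ArrayList<>();

    void addSublevel(SubLevelData data) {
        sublevels.add(data);
    }

    SubLevelData getSublevel(int index) {
        return sublevels.get(index);
    }
}

class SomeFileData {
    private final List<LevelData> levels = new ArrayList<>();

    void addLevel(LevelData level) {
        levels.add(level);
    }

    LevelData getLevel(int index) {
        return levels.get(index);
    }
}

class LotsOfFileData {
    private final Map<String, SomeFileData> files = new HashMap<>();

    void putFile(String name, SomeFileData data) {
        files.put(name, data);
    }

    // Returns null if no file with that name has been stored.
    SomeFileData getFile(String name) {
        return files.get(name);
    }
}
```

If level numbers are sparse or do not start at zero, a Map<Integer, LevelData> would be a better fit than a List at each step.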

By the way, declare variables according to their interface, not their implementation. Declare your ArrayLists as List to allow for easy change of implementation.