Data Versioning – Ensuring Readability of Different Binary Data Format Versions

binary, configuration, data, storage, versioning

On our project we have a binary data format that we use to process and record data. Our application has recently changed, and many of the format's parameters are now obsolete. We currently receive these data "packets" over the network (UDP or TCP) and read them from saved binary files.

We want to create a new, more space-efficient format that removes the things we don't need. Each format is divided into a header and a payload; the header holds things like timestamp information and a description of the payload.

To ensure that we can support multiple versions of a format, we decided that every format we make should start with some sort of format version ID. Unfortunately, the previous format (created by people who are no longer on our team) does not follow this convention: at some point the decision was made to put the format version ID in the middle of the header, in between fields that are now useless junk data.

Reading this older format is an issue because we currently have gigabytes of data in that format, collected in the field, which we use as test data for our application.

How do we ensure that our application can still read both the old format, which does not put the version ID first, and the future format versions that we create?

We've considered the following:

  • Just moving on to the next format and ignoring old data. Not responsible, and prohibitively expensive.

  • Having the user somehow specify which format is which (formats that can be identified immediately from the header vs. old format types). Annoying, and hard on people who are not devs on this project but also contribute (of which there are many).

  • Having new format versions follow the old version's layout up to the version ID field. This negates many of the benefits of moving to the new version and requires careful planning of where to place header bytes so that the version ID stays in the same location (harder on developers).

  • Converting old-format data to a header with the version ID first. This requires new tooling and maintenance of a version converter, and everyone else's files would have to be updated as well. These recorded files are held by people who are not devs and aren't using version control either, so it will be difficult to make sure already-recorded data can be used correctly by everyone.
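To give a sense of how small the converter in the last option could be, here is a minimal sketch. It assumes the 13-field legacy header listed further down, little-endian unsigned 64-bit fields (the byte order is not stated anywhere in this post), and a made-up version-first layout for the new header. `convert_header` and `new_version` are hypothetical names, and the `size` field is copied as-is even though it may need recomputing if it counts header bytes.

```python
import struct

# Assumed encodings; neither byte order nor the new layout is given in the post.
LEGACY = struct.Struct("<13Q")  # legacy header: 13 fields x 8 bytes = 104 bytes
NEW = struct.Struct("<7Q")      # hypothetical new header: version first, 56 bytes

# Indices of the legacy fields that are NOT marked for removal
# (size, 2x payload metadata, sequence number, short range time, payload metadata).
KEEP = (0, 1, 2, 8, 9, 10)


def convert_header(old: bytes, new_version: int) -> bytes:
    """Repack a legacy header into the hypothetical version-first layout."""
    if len(old) < LEGACY.size:
        raise ValueError("buffer is shorter than a legacy header")
    fields = LEGACY.unpack_from(old, 0)
    kept = [fields[i] for i in KEEP]
    # The converted record gets the new format's version number, written first.
    return NEW.pack(new_version, *kept)
```

The hard part of this option is not the repacking itself but distributing and running the tool on everyone's recorded files.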

Here is an example of what the current header looks like:

* = marked for removal

size: 8 bytes
payload metadata: 8 bytes
payload metadata: 8 bytes
* non-standard timeformat: 8 bytes 
* non-standard timeformat: 8 bytes
* legacy undocumented data: 8 bytes
version number: 8 bytes
* source metadata: 8 bytes // may not want this all the time
sequence number: 8 bytes
short range time: 8 bytes
payload metadata: 8 bytes
* size data?: 8 bytes
* spare data: 8 bytes
payload: N bytes
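For reference, the layout above puts the version number at byte offset 48 of a 104-byte header. A minimal sketch of reading it, assuming every field is an unsigned 64-bit little-endian integer (the post does not state the encoding or byte order) and using hypothetical field names:

```python
import struct

# Hypothetical labels for the 13 header fields, in the order listed above.
LEGACY_FIELDS = [
    "size", "payload_meta_0", "payload_meta_1",
    "time_a", "time_b", "legacy_undocumented",
    "version", "source_meta", "sequence", "short_range_time",
    "payload_meta_2", "size_data", "spare",
]

# Assumption: unsigned 64-bit, little-endian. Adjust "<" to ">" for big-endian.
LEGACY_HEADER = struct.Struct("<13Q")                  # 13 x 8 = 104 bytes
VERSION_OFFSET = LEGACY_FIELDS.index("version") * 8    # 6 x 8 = 48


def parse_legacy_header(buf: bytes) -> dict:
    """Unpack the legacy header into a name -> value mapping."""
    if len(buf) < LEGACY_HEADER.size:
        raise ValueError("buffer is shorter than a legacy header")
    values = LEGACY_HEADER.unpack_from(buf, 0)
    return dict(zip(LEGACY_FIELDS, values))
```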

Best Answer

It seems to me that the simplest solution is to make your version header unambiguous, so that the old format can never look like it starts with one, and then simply look for it. If it's not there, assume the data is in the old style and read the version from the middle. There may also be things at the beginning of the old format that can clue you in.

The key here is that you need to find some sort of scheme for your version preamble that the old format cannot produce. For example, let's assume the old version never starts with a 0 byte. You could start your preamble with 0x00 0x00 0x00 0x00. Then, when you start reading the data, you read in the first 4 bytes, and if there's any non-zero value, you are looking at an old version (or a bad request). UTF-8 does something similar to stay backwards compatible with ASCII: ASCII never sets the high bit of a byte, so UTF-8 reserves bytes with the high bit set for everything else.
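A minimal sketch of that check, under several assumptions that are not in the question: the new format starts with the four zero bytes followed by a 4-byte version, the legacy version number sits at byte offset 48 as an 8-byte integer, and everything is little-endian. `detect_version` is a hypothetical helper name.

```python
import struct

MAGIC = b"\x00\x00\x00\x00"    # example preamble; the old format must never start with it
LEGACY_VERSION_OFFSET = 48     # 7th 8-byte field of the legacy header


def detect_version(buf: bytes):
    """Return ("new", version) or ("legacy", version) for a raw record."""
    # New style: preamble, then an assumed 4-byte little-endian version number.
    if len(buf) >= 8 and buf[:4] == MAGIC:
        (version,) = struct.unpack_from("<I", buf, 4)
        return "new", version
    # Otherwise fall back to the legacy layout and read the version mid-header.
    if len(buf) >= LEGACY_VERSION_OFFSET + 8:
        (version,) = struct.unpack_from("<Q", buf, LEGACY_VERSION_OFFSET)
        return "legacy", version
    raise ValueError("buffer too short to identify a format version")
```

Whether four zero bytes are actually safe depends on the real data. The legacy header starts with an 8-byte size field, so if that field were stored big-endian, any size below 4 GiB would begin with four zero bytes and collide with this preamble. Check the existing recordings before settling on a preamble.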
