XML – Are XML Schemas Suitable for Evolving File Formats?

database-design java schema xml xslt

I'm struggling with a client-server project where I have Java apps out on the Internet that store data to a backend server. The format of this data is well-defined, but the project is constantly evolving, so the definition keeps changing! To cope with the change I defined a simple REST interface on the server that offers only key-value storage. Clients can store or retrieve a chunk of data by referencing a unique key. This is nice because I don't have to modify the server interface (or the backend database) when the data format changes. To the server, it's just a bunch of opaque blobs.

Of course, the issue then becomes, "What goes inside the blob?" For that I wrote an XML Schema that defines the content of a blob. At first it was great, since the Schema gives a bunch of nice things "for free": A formal yet human-readable spec of the file format, automatic validation of its contents, marshalling/unmarshalling to a stream, and auto-generated Java classes for programmatic access to the data.

But then change happened! The Schema had to be altered, and naturally I ran into forward- and backward-compatibility issues. To deal with the constantly changing Schema, I came up with a solution that embeds a version number into the XML namespace, and I apply a series of XSL Stylesheets to "upgrade" any given blob to the latest version. For example, I'm now on version 1.3 of my Schema, so when I unmarshal a blob, I run it through a 1.0-to-1.1 XSLT, then a 1.1-to-1.2 XSLT, and finally a 1.2-to-1.3 XSLT. This works, but it's not sustainable because the chain keeps getting longer, which lowers performance and sucks up memory, plus I have to keep writing new Stylesheets, which takes time and isn't fun.
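For concreteness, the upgrade chain described above could look roughly like the following in Java. This is only a sketch under assumptions: the namespace prefix `urn:example:blob:`, the element names (`name`, `title`, `legacy`), the two upgrade steps, and the class `BlobUpgrader` are all made up for illustration. It reads the version from the root namespace and applies only the remaining steps. The stylesheets are compiled once into `Templates` objects, which may reduce some of the per-blob cost, though the chain itself still grows with every release.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamReader;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Templates;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

class BlobUpgrader {
    // Hypothetical namespace prefix; the schema version is appended to it.
    static final String NS = "urn:example:blob:";

    // One compiled stylesheet per step, keyed by the version it upgrades FROM.
    private static final Map<String, Templates> STEPS = new HashMap<>();

    static {
        // 1.0 -> 1.1: hypothetical change, <name> is renamed to <title>.
        STEPS.put("1.0", compile("1.0", "1.1",
            "<xsl:template match='old:name'>"
          + "<xsl:element name='title' namespace='" + NS + "1.1'>"
          + "<xsl:apply-templates/></xsl:element></xsl:template>"));
        // 1.1 -> 1.2: hypothetical change, <legacy> is dropped.
        STEPS.put("1.1", compile("1.1", "1.2", "<xsl:template match='old:legacy'/>"));
    }

    // Builds a stylesheet that moves every element into the next namespace,
    // plus whatever step-specific templates are passed in.
    private static Templates compile(String from, String to, String extraTemplates) {
        String xslt =
            "<xsl:stylesheet version='1.0'"
          + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'"
          + " xmlns:old='" + NS + from + "'>"
          + "<xsl:template match='old:*'>"
          + "<xsl:element name='{local-name()}' namespace='" + NS + to + "'>"
          + "<xsl:copy-of select='@*'/><xsl:apply-templates/>"
          + "</xsl:element></xsl:template>"
          + extraTemplates
          + "</xsl:stylesheet>";
        try {
            return TransformerFactory.newInstance()
                .newTemplates(new StreamSource(new StringReader(xslt)));
        } catch (Exception e) {
            throw new IllegalStateException("Bad stylesheet " + from + " -> " + to, e);
        }
    }

    // Reads the version from the namespace of the root element.
    static String versionOf(String xml) {
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            r.nextTag(); // advance to the root element
            String ns = r.getNamespaceURI();
            r.close();
            if (ns == null || !ns.startsWith(NS)) {
                throw new IllegalArgumentException("Not a blob document");
            }
            return ns.substring(NS.length());
        } catch (javax.xml.stream.XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
    }

    // Applies the remaining upgrade steps until no further step is registered.
    static String upgrade(String xml) {
        try {
            String version = versionOf(xml);
            while (STEPS.containsKey(version)) {
                Transformer t = STEPS.get(version).newTransformer();
                t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
                StringWriter out = new StringWriter();
                t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
                xml = out.toString();
                version = versionOf(xml);
            }
            return xml;
        } catch (javax.xml.transform.TransformerException e) {
            throw new IllegalStateException("Upgrade failed", e);
        }
    }
}
```

A blob that is already at the latest version passes through untouched; an older one walks the remaining steps in order.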

Now here's the funny thing… In addition to the Java clients, the project also has iOS apps as clients, and iOS has none of the nice enterprise-y features associated with XML Schemas. There's no validation of the stream, no auto-generation of Objective-C classes, etc., just a low-level event-driven XML parser. But ironically I'm finding this so much easier! For example, if the XML gets a new element, I just add a new if clause. If an element goes away, I remove its clause. Basically, I do a "best effort" at interpreting the XML stream, silently ignoring any unrecognized elements. I don't need to think about what version the file format is or whether it's valid. Plus this is much faster because there's no XSLT chaining, and it saves a lot of my time because I don't have to write any XSLT code.
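The same "best effort" style is available in Java through the streaming StAX API that ships with the JDK. The sketch below is an assumption about how that might look, not the actual iOS code: the element names `title` and `author` and the class `LenientBlobReader` are invented for illustration. It matches on local names only, so the version in the namespace is never consulted, and any unrecognized element is skipped along with its whole subtree.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

class LenientBlobReader {
    // Collects the fields this client understands and ignores everything else.
    static Map<String, String> read(String xml) {
        Map<String, String> fields = new LinkedHashMap<>();
        try {
            XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
            r.nextTag(); // root element, whatever its name or namespace
            // Walk the direct children of the root until the root closes.
            while (r.next() != XMLStreamConstants.END_ELEMENT) {
                if (r.getEventType() != XMLStreamConstants.START_ELEMENT) {
                    continue; // whitespace, comments, and so on
                }
                switch (r.getLocalName()) {
                    case "title":  // hypothetical field
                    case "author": // hypothetical field
                        String name = r.getLocalName();
                        fields.put(name, r.getElementText());
                        break;
                    default:
                        skip(r); // unknown element: silently ignored
                }
            }
            r.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed XML", e);
        }
        return fields;
    }

    // Consumes events until the element that was just opened is closed again.
    private static void skip(XMLStreamReader r) throws XMLStreamException {
        int depth = 1;
        while (depth > 0) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
    }
}
```

Adding support for a new element means adding a `case`; removing one means deleting it. Nothing here checks that the document is valid or which version produced it.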

So far this approach has worked out great, and I've not missed having an XML Schema on the iOS side. I'm now wondering if a Schema, despite its nice feature set, is totally the wrong technology for a file format that often changes. I'm thinking about ditching my XML Schema altogether and using the same "best effort" low-level approach in Java that I'm doing in iOS.

So is my negative assessment of XML Schemas correct? Or is there something I've missed? Perhaps I need to rethink the server interface? Or maybe I shouldn't have been using XML in the first place? I'm open to all suggestions. Thanks for reading!

Best Answer

I think you are really asking a broader question: "Is having a strict definition of a file format a good thing for a rapidly evolving project?"

To answer your immediate question, though: yes, they are. An XML Schema gives you a strict definition of the format, answers a lot of questions about validity, provides great documentation, and lets you know with confidence that a specific version of the document has a specific form.

They are not the be-all and end-all, though: they define structure, not semantics, so you can still change the "meaning" of a tag between versions of the document without having to change the schema. That causes just as much trouble.
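To make the "strict definition" point concrete, here is a minimal sketch of schema validation using the JDK's `javax.xml.validation` API. The inline schema, the namespace `urn:example:blob:1.3`, and the class `BlobValidator` are assumptions for illustration only; your real schema will look different. It shows what a schema does catch (wrong version namespace, unexpected elements) and, by omission, what it cannot: a `title` whose meaning has quietly changed still validates.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

class BlobValidator {
    // Hypothetical schema for version 1.3: a <blob> holding exactly one <title>.
    static final String XSD =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'"
      + " targetNamespace='urn:example:blob:1.3' elementFormDefault='qualified'>"
      + "<xs:element name='blob'><xs:complexType><xs:sequence>"
      + "<xs:element name='title' type='xs:string'/>"
      + "</xs:sequence></xs:complexType></xs:element>"
      + "</xs:schema>";

    // Compiled once; Schema objects are safe to share between threads.
    private static final Schema SCHEMA = load();

    private static Schema load() {
        try {
            return SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                .newSchema(new StreamSource(new StringReader(XSD)));
        } catch (SAXException e) {
            throw new IllegalStateException("Schema does not compile", e);
        }
    }

    // Returns true only if the document matches the 1.3 structure exactly.
    static boolean isValid(String xml) {
        try {
            SCHEMA.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // structural violation or malformed XML
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A document in the 1.3 namespace with a single `title` passes; the same document in an older namespace, or with an extra element, is rejected.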

To answer the question I think you are asking: yes, the XML schema is a good thing.

It is forcing you to address a painful fact, which is that your data exchange is constantly changing versions, and that means you have to adapt your system to account for that.

If you only had the iOS model, where you take a "rough guess" at what this version means, you open the door to all sorts of trouble in the long run. For example, it becomes trivial for someone to assume that "element foo being present means version 1.2, so tag bar means ...".

That is great, right until version 2.0 adds back the foo element, with a different meaning, and tag bar isn't even there. Welcome to "inconsistent behaviour" city.

If you use XML without schemas, or JSON, or something else that doesn't impose that cost on you, then only a tiny bit of the problem goes away. You still have to deal with all four versions of the input, but you have fewer tools to help you out.

You should, in my opinion, generally prefer to make the pain of changes proportional to their real, long term cost. Changing the data exchange format has a high long term cost - you have to deal with compatibility, with data upgrades, and that sort of thing.

If that costs little, you will be tempted to do it a lot, and you will pay the maintenance cost tomorrow. If it costs more now, you might think harder: can you do more than one thing in this change? Can you get away without it? Can you do it smarter?

In short, I think your real problem is that your file format changes often, not that you used (or didn't use) XML schemas.

(Also, are your users really happy if the server or client randomly drops content that is represented in the newer version? I would be surprised if that remains true forever - another of those long term costs you need to recognise somewhere...)
