Best practices to access schema-less data


I am toying with RDF, and in particular with how to access information stored in an RDF store. The huge difference from a traditional relational database is the lack of a predefined schema: in a relational database, you know that a table has certain columns, and you can technically map each row to an instance of a class. The class has well-defined methods and well-defined attributes.

In a schema-less system, you don't know what data is associated with a given resource. It's like having a database table with an arbitrary, non-predefined number of columns, where every row can have data in any number of those columns.

Similar to Object-Relational Mappers, there are Object-RDF mappers. RDFAlchemy and SuRF are the two I am playing with right now. Basically, they provide you with a Resource object whose methods and attributes are generated dynamically. It kind of makes sense… however, it's not that easy. In many cases, you would rather have a well-defined interface, and more control over what goes on when you set and get data on your model object. Having such generic access makes things difficult, in some sense.
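To make the trade-off concrete, here is a hypothetical sketch (not the actual API of either library) of what such generic access amounts to: every property is a string key, so nothing is checked at compile time.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a generic resource: properties are string keys,
// values are untyped, and the compiler cannot verify any of it.
class GenericResource {
    private final Map<String, List<Object>> properties = new HashMap<>();

    public void set(String predicate, Object value) {
        properties.computeIfAbsent(predicate, k -> new ArrayList<>()).add(value);
    }

    public List<Object> get(String predicate) {
        return properties.getOrDefault(predicate, Collections.emptyList());
    }
}

A typo in a predicate name, or a value of an unexpected type, only surfaces at runtime; that is exactly the loss of control I mean.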

Another thing I noted (and the most important one) is that, even if schema-less data is in general expected to carry arbitrary information about a resource, in practice you more or less know the "classes of information" that tend to go together. Of course, you cannot exclude the presence of additional info, but in some cases that is the exception rather than the norm, although the exception is frequent enough to be too disruptive for a strict schema. In an RDF representation of an article (e.g. as in RSS/Atom feeds) you know the terms of the described resources, and you can map them to a well-defined object. If additional information is provided, you can define an extended object (inheriting from the base one) that provides accessors to the enhanced information. So, in a sense, you deal with schema-less data by means of "schema-oriented objects" that you can extend when you want to see specific additional information you are interested in.
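For illustration, a minimal sketch of what I mean by "schema-oriented objects"; the class names are hypothetical, and dc: and geo: stand for the Dublin Core and W3C geo vocabularies:

import java.util.Map;

// A "schema-oriented object" exposing the agreed-upon core terms of an
// article, extended by inheritance when richer data is expected.
class Article {
    protected final Map<String, Object> terms; // predicate -> value

    Article(Map<String, Object> terms) { this.terms = terms; }

    public String getTitle()   { return (String) terms.get("dc:title"); }
    public String getCreator() { return (String) terms.get("dc:creator"); }
}

// Accessors for the enhanced information, without touching the base class.
class GeoTaggedArticle extends Article {
    GeoTaggedArticle(Map<String, Object> terms) { super(terms); }

    public Double getLatitude()  { return (Double) terms.get("geo:lat"); }
    public Double getLongitude() { return (Double) terms.get("geo:long"); }
}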

My question relates to your experience with real-world usage of schema-less data storage. How does it map to the object-oriented world so that you can use it proficiently, without getting too close to the "bare metal" of the schema-less store? (In relational-database terms: without writing too much SQL and directly messing with the table structure.)

Is access doomed to be very generic (e.g. with SuRF's "plugged-in attributes" being the highest, most specialized level of access you can get), or is having specialized classes for specific, agreed-upon schemas also a good approach, despite the risk of a proliferation of classes to access new and unexpected associated data?

Best Answer

I guess my short answer would be "don't". I'm a bit of a greybeard, and have done a lot of mapping XML data into relational databases. If you do decide to use such a database, you're going to have to validate your data constantly. You'll also need very strict discipline in order to avoid having databases with little commonality. Using a schema helps here, as most XML schemas are object-oriented and thus extensible, easing the need for analysis to keep from creating similar data with dissimilar names, which will cause anyone who has to access your database to think evil thoughts about you.
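For instance, if your data comes with an XSD, the JDK's standard javax.xml.validation API lets you validate constantly; a minimal sketch, with hypothetical file names:

import java.io.File;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;

public class ValidateAgainstSchema {
    public static void main(String[] args) throws Exception {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        // "article.xsd" and "article.xml" are placeholder names.
        Schema schema = factory.newSchema(new File("article.xsd"));
        Validator validator = schema.newValidator();
        validator.validate(new StreamSource(new File("article.xml"))); // throws SAXException if invalid
    }
}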

In my personal experience, if you're doing the sorts of things where a networked database makes sense, go for it. If not, you lose all the other things relational databases can do, like integrity checking, transactions, and set-based selection. However, since most people use a relational database as an object store anyway, I guess the point is moot.

As for how to access that data, just put it in a Hashtable. Seriously. If there is no schema anywhere, you'll never know what is in there. If you have a schema, you can use it to generate accessor objects, but you gain little: you lose all the flexibility of the underlying store while simultaneously gaining the inflexibility of a DAO (Data Access Object).
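A minimal sketch of that approach with the JDK's own DOM API (nothing library-specific; repeated tags would need a List, this version just keeps the last value):

import java.util.Hashtable;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Walk an element's children and keep whatever happens to be there,
// without committing to any schema.
class FreeformLoader {
    static Hashtable<String, Object> load(Element element) {
        Hashtable<String, Object> values = new Hashtable<>();
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() != Node.ELEMENT_NODE) continue;
            Element e = (Element) child;
            if (e.getElementsByTagName("*").getLength() == 0) {
                values.put(e.getTagName(), e.getTextContent()); // leaf: store the text
            } else {
                values.put(e.getTagName(), load(e));            // sublayer: nested table
            }
        }
        return values;
    }
}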

For instance, if you use a Hashtable, getting the values out of an XML parser is often fairly easy. You define the storage types you're going to use, then you walk the XML tree and put the values into the storage types, storing them in either a Hashtable or a List as appropriate. If, however, you use a DAO, you lose the ability to trivially extend the data object (one of the strengths of XML), and you have to create getters and setters for the object that do something like this:

// Assuming JDOM's Element; NoSuchElementException is java.util's unchecked exception.
public void setName(Element e) throws NoSuchElementException {
    try {
        this.name = e.getChild("Name").getValue();
    } catch (Exception ex) {
        throw new NoSuchElementException("Element not found for Name: " + ex.getMessage());
    }
}

Except, of course, you have to do that for every single value in that schema layer, including loaders and definitions for sublayers. And, of course, you end up with a much bigger mess if you use the faster parsers that employ callbacks, as you now have to track which object you're in as you produce the resulting tree.
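With SAX, for instance, that tracking looks roughly like this sketch, where an explicit stack of element names stands in for "which object I'm in":

import java.util.ArrayDeque;
import java.util.Deque;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

class TrackingHandler extends DefaultHandler {
    private final Deque<String> path = new ArrayDeque<>(); // root ... current element
    private final StringBuilder text = new StringBuilder();

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        path.addLast(qName);
        text.setLength(0);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // The accumulated path tells you which object the value belongs to.
        System.out.println(String.join("/", path) + " = " + text.toString().trim());
        path.removeLast();
    }
}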

I've done all this, although I normally construct a validator, then an adapter that provides the mapping between the XML and the data class, then a reconciliation process to reconcile it with the database. Almost all of that code ends up being generated, though: if you have the DTD, you can generate most of the Java code needed to access it, and do so with reasonable performance.

In the end, I'd simply keep freeform, networked or hierarchical data as freeform, networked or hierarchical data.
