R – schema-less data warehouse and reporting

business-intelligencereportingschemaless

We have a system that generates many events as the result of a phone call/web request/sms/email etc, each of these events need to be able to be stored and be available for reporting (for MI/BI etc) on, each of these events have many variables and does not fit any one specific scheme.

The structure of the event document is a key-value pair list (cdr= 1&name=Paul&duration=123&postcode=l21). Currently we have a SQL Server system using dynamically generated sparse columns to store our (flat) document, of which we have reports that run against the data, for many different reasons I am looking at other solutions.

I am looking for suggestions of a system (open or closed) that allows us to push these events in (regardless of the schema) and provide reporting and anlytics on top of it.

I have seen Pentaho and Jasper, but most of the seem to connect to a system to get the data out of it to then report on it. I really just want to be able to push a document in and have it available to be reported on.

As much as I love CouchDB, I am looking for a system that allows schema-less submitting of data and reporting on top of it (much like Pentaho, Jasper, SQL Reporting/Analytics Server etc)

Best Answer

I don't think there is any DBMS that will do what you want and allow an off-the-shelf reporting tool to be used. Low-latency analytic systems are not quick and easy to build. Low-latency on unstructured data is quite ambitious.

You are going to have to persist the data in some sort of database, though.

I think you may have to take a closer look at your problem domain. Are you trying to run low-latency analytical reports, or an operational report that prompts some action within the business when certain events occur? For low-latency systems you need to be quite ruthless about what constitutes operational reporting and what constitutes analytics.

Edit: Discourage the 'potentially both' mindset unless the business are prepared to pay. Investment banks and hedge funds spend big bucks and purchase supercomputers to do 'real-time analytics'. It's not a trivial undertaking. It's even less trivial when you try to do such a system and build it for high uptimes.

Even on apps like premium-rate SMS services and .com applications the business often backs down when you do a realistic scope and cost analysis of the problem. I can't say this enough. Be really, really ruthless about 'realtime' requirements.

If the business really, really need realtime analytics then you can make hybrid OLAP architectures where you have a marching lead partition on the fact table. This is an architecture where the fact table or cube is fully indexed for historical data but has a small leading partition that is not indexed and thus relatively quick to insert data into.

Analytic queries will table scan the relatively small leading data partition and use more efficient methods on the other partitions. This gives you low latency data and the ability to run efficient analytic queries over the historical data.

Run a process nightly that rolls over to a new leading partition and consolidates/indexes the previous lead partition.

This works well where you have items such as bitmap indexes (on databases) or materialised aggregations (on cubes) that are expensive on inserts. The lead partition is relatively small and cheap to table scan but efficient to trickle insert into. The roll-over process incrementally consolidates this lead partition into the indexed historical data which allows it to be queried efficiently for reports.

Edit 2: The common fields might be candidates to set up as dimensions on a fact table (e.g. caller, time). The less common fields are (presumably) coding. For an efficient schema you could move the optional coding into one or more 'junk' dimensions..

Briefly, a junk dimension is one that represents every existing combination of two or more codes. A row on the table doesn't relate to a single system entity but to a unique combination of coding. Each row on the dimension table corresponds to a distinct combination that occurs in the raw data.

In order to have any analytic value you are still going to have to organise the data so that the columns in the junk dimension contain something consistently meaningful. This goes back to some requirements work to make sure that the mappings from the source data make sense. You can deal with items that are not always recorded by using a placeholder value such as a zero-length string (''), which is probably better than nulls.

Related Solutions

R – Test Reporting

You could always just use SQL Server Reporting Services and Report Builder (MS's web based designer) or Report Designer (component of Visual studio). It's pretty easy to get this set up too.

Report Builder: http://msdn.microsoft.com/en-us/library/ms155933.aspx
Report Designer: http://msdn.microsoft.com/en-us/library/ms157166.aspx
Tutorial: http://www.simple-talk.com/sql/learn-sql-server/beginning-sql-server-2005-reporting-services-part-1/
How to add Reporting Services to an existing SQL Server: http://www.mssqltips.com/tip.asp?tip=1444

There are a few end user reporting solutions around as well that make it easier to dynamically create reports, if you're willing to invest a bit of cash.

My company produce one: http://www.rsinteract.com has a very cheap standard edition with a limited number of reports (30 day free trial). It reports directly off SQL server with Reporting Services installed. It won best of TechEd 2006 - http://windowsitpro.com/article/articleid/53944/best-of-tech-ed-2006-winners.html

We actually use ours to analyse the support requests from clients i.e. which component is failing most, who reports the most bugs etc. Not tried it on test data.

There's also Proclarity, ApexSQL Report, and Tableau all of which are good.

R – Best practices to access schema-less data

I guess my short answer would be "don't". I'm a bit of a greybeard, and have done a lot of mapping XML data into relational databases. If you do decide to use such a database, you're going to have to validate your data constantly. You'll also need very strict discipline in order to avoid having databases with little commonality. Using a schema helps here, as most XML schemas are object-oriented and thus extensible, easing the need for analysis to keep from creating similar data with dissimilar names, which will cause anyone who has to access your database to think evil thoughts about you.

In my personal experience, if you're doing the sorts of things where a networked database makes sense, go for it. If not, you lose all the other things relational databases can do, like integrity checking, transactions and set selecting. However, since most people use a relational database as an object store anyway, I guess the point is moot.

As for how to access that data, just put it in a Hashtable. Seriously. If there is no schema anywhere, then you'll never know what is in there. If you have a schema, you can use that to generate accessor objects, but you gain little, as you lose all the flexibility of the underlying store while simultaneously gaining the inflexibility of a DAO (Data Access Object).

For instance, if you have a Hashtable, getting the values out of an XML parser is often fairly easy. You define the storage types you're going to use, then you walk the XML tree and put the values in the storage types, storing the types in either a Hashtable or List as appropriate. If, however, you use a DAO, you end up not being able to trivially extend the data object, one of the strengths of XML, and you have to create getters and setters for the object that do

public void setter(Element e) throws NoSuchElementException {
    try {
        this.Name = e.getChild("Name").getValue();
    } catch (Exception ex) {
        throw new NoSuchElementException("Element not found for Name: "+ex.getMessage());
    }
}

Except, of course, you have to do it for every single value in that schema layer, including loaders and definitions for sublayers. And, of course, you end up with a much bigger mess if you use the faster parsers that employ callbacks, as you now have to track which object your'e in as you produce the resultant tree.

I've done all this, although I normally construct a validator, then an adapter that provides the match between the XML and the data class, then a reconcile process to reconcile it to the database. Almost all the code ends up being generated, though. If you have the DTD, you can generate most of the Java code to access it, and do so with reasonable performance.

In the end, I'd simply keep freeform, networked or hierarchical data as freeform, networked or hierarchical data.

Best Answer

Related Solutions

R – Test Reporting

R – Best practices to access schema-less data

Related Topic