You could always just use SQL Server Reporting Services with Report Builder (MS's web-based designer) or Report Designer (a component of Visual Studio). It's pretty easy to get this set up, too.
There are also a few end-user reporting solutions around that make it easier to create reports dynamically, if you're willing to invest a bit of cash.
My company produces one: http://www.rsinteract.com has a very cheap Standard edition with a limited number of reports (30-day free trial). It reports directly off SQL Server with Reporting Services installed. It won Best of TechEd 2006 - http://windowsitpro.com/article/articleid/53944/best-of-tech-ed-2006-winners.html
We actually use ours to analyse the support requests from clients, e.g. which component fails most often, who reports the most bugs, etc. We haven't tried it on test data.
There are also ProClarity, ApexSQL Report, and Tableau, all of which are good.
I guess my short answer would be "don't". I'm a bit of a greybeard, and I have done a lot of mapping of XML data into relational databases. If you do decide to use such a database, you're going to have to validate your data constantly. You'll also need very strict discipline to avoid ending up with databases that have little commonality. Using a schema helps here: most XML schemas are object-oriented and thus extensible, which reduces the analysis needed to keep from creating similar data under dissimilar names, something that will make anyone who has to access your database think evil thoughts about you.
In my personal experience, if you're doing the sort of work where a networked database genuinely makes sense, go for it. If it doesn't, you give up the other things relational databases do well, like integrity checking, transactions and set-based selection. However, since most people use a relational database as an object store anyway, I guess the point is moot.
As for how to access that data, just put it in a Hashtable. Seriously. If there is no schema anywhere, then you'll never know what is in there. If you have a schema, you can use that to generate accessor objects, but you gain little, as you lose all the flexibility of the underlying store while simultaneously gaining the inflexibility of a DAO (Data Access Object).
For instance, if you use a Hashtable, getting the values out of an XML parser is often fairly easy: you define the storage types you're going to use, walk the XML tree, and put the values into those types, storing them in either a Hashtable or a List as appropriate (there is a sketch of this after the setter example below). If, however, you use a DAO, you lose the ability to trivially extend the data object, which is one of the strengths of XML, and you have to write getters and setters for the object that look like this:
public void setter(Element e) throws NoSuchElementException {
    try {
        this.Name = e.getChild("Name").getValue();
    } catch (Exception ex) {
        throw new NoSuchElementException("Element not found for Name: " + ex.getMessage());
    }
}
Except, of course, you have to do it for every single value in that schema layer, including loaders and definitions for sublayers. And you end up with a much bigger mess if you use the faster parsers that employ callbacks, because you then have to track which object you're in as you produce the resultant tree.
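By contrast, here is a minimal sketch of the map-based approach described above. It uses the JDK's built-in DOM parser rather than any particular library, and the class name and sample XML are invented purely for illustration:

import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class XmlToMap {
    // Walk an element's children and store leaf text in a Map; repeated
    // element names accumulate into a List so nothing is silently dropped.
    static Map<String, Object> toMap(Element parent) {
        Map<String, Object> record = new HashMap<>();
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n.getNodeType() != Node.ELEMENT_NODE) continue;
            Element child = (Element) n;
            Object value = child.getElementsByTagName("*").getLength() == 0
                    ? child.getTextContent()
                    : toMap(child);          // nested structure -> nested map
            record.merge(child.getTagName(), value, (oldV, newV) -> {
                List<Object> all = oldV instanceof List
                        ? (List<Object>) oldV
                        : new ArrayList<>(List.of(oldV));
                all.add(newV);
                return all;
            });
        }
        return record;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<Ticket><Name>Widget fails</Name><Tag>ui</Tag><Tag>crash</Tag></Ticket>";
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // Prints something like {Name=Widget fails, Tag=[ui, crash]};
        // unknown elements simply show up as new keys instead of breaking accessors.
        System.out.println(toMap(doc.getDocumentElement()));
    }
}

The point is that an element the schema never mentioned just becomes another key, instead of another setter to write and another place to throw NoSuchElementException.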
I've done all this, although I normally construct a validator, then an adapter that provides the match between the XML and the data class, then a reconcile process to reconcile it to the database. Almost all the code ends up being generated, though. If you have the DTD, you can generate most of the Java code to access it, and do so with reasonable performance.
In the end, I'd simply keep freeform, networked or hierarchical data as freeform, networked or hierarchical data.
Best Answer
I don't think there is any DBMS that will do what you want and allow an off-the-shelf reporting tool to be used. Low-latency analytic systems are not quick and easy to build. Low-latency on unstructured data is quite ambitious.
You are going to have to persist the data in some sort of database, though.
I think you may have to take a closer look at your problem domain. Are you trying to run low-latency analytical reports, or an operational report that prompts some action within the business when certain events occur? For low-latency systems you need to be quite ruthless about what constitutes operational reporting and what constitutes analytics.
Edit: Discourage the 'potentially both' mindset unless the business is prepared to pay. Investment banks and hedge funds spend big bucks and buy supercomputers to do 'real-time analytics'. It's not a trivial undertaking, and it's even less trivial when you also have to build such a system for high uptime.
Even on apps like premium-rate SMS services and .com applications, the business often backs down once you do a realistic scope and cost analysis of the problem. I can't say this enough: be really, really ruthless about 'realtime' requirements.
If the business really, really needs realtime analytics, you can build a hybrid OLAP architecture with a marching lead partition on the fact table. In this architecture the fact table or cube is fully indexed for historical data but has a small leading partition that is not indexed and is therefore relatively quick to insert data into.
Analytic queries will table scan the relatively small leading data partition and use more efficient methods on the other partitions. This gives you low latency data and the ability to run efficient analytic queries over the historical data.
Run a process nightly that rolls over to a new leading partition and consolidates/indexes the previous lead partition.
This works well where you have items such as bitmap indexes (on databases) or materialised aggregations (on cubes) that are expensive on inserts. The lead partition is relatively small and cheap to table scan but efficient to trickle insert into. The roll-over process incrementally consolidates this lead partition into the indexed historical data which allows it to be queried efficiently for reports.
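To make the shape of this concrete, here is a deliberately simplified, in-memory Java sketch of the same idea. The class and method names are invented, and in a real system the lead partition and the rollover would be partition and index operations inside the database rather than application code:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Toy model of a hybrid fact table: a small unindexed lead partition that is
// cheap to append to, plus "indexed" historical data that is cheap to query.
public class HybridFactTable {
    record Fact(String caller, long timestamp, String payload) {}

    private final List<Fact> leadPartition = new ArrayList<>();                 // trickle inserts land here
    private final Map<String, List<Fact>> historicalByCaller = new HashMap<>(); // stands in for the indexed partitions

    // Low-latency insert: a plain append, no index maintenance.
    public void insert(Fact f) {
        leadPartition.add(f);
    }

    // Analytic query: indexed lookup over history, plus a scan of the small lead partition.
    public List<Fact> factsForCaller(String caller, Predicate<Fact> filter) {
        List<Fact> result = new ArrayList<>();
        for (Fact f : historicalByCaller.getOrDefault(caller, List.of())) {
            if (filter.test(f)) result.add(f);
        }
        for (Fact f : leadPartition) {   // "table scan" of the fresh, unconsolidated rows
            if (f.caller().equals(caller) && filter.test(f)) result.add(f);
        }
        return result;
    }

    // Nightly rollover: consolidate and "index" the lead partition, then start a fresh one.
    public void rollover() {
        for (Fact f : leadPartition) {
            historicalByCaller.computeIfAbsent(f.caller(), k -> new ArrayList<>()).add(f);
        }
        leadPartition.clear();
    }

    public static void main(String[] args) {
        HybridFactTable facts = new HybridFactTable();
        facts.insert(new Fact("alice", 1L, "call dropped"));
        facts.insert(new Fact("alice", 2L, "billing query"));
        System.out.println(facts.factsForCaller("alice", f -> true).size()); // 2, read from the lead partition
        facts.rollover();                                                    // nightly consolidation
        facts.insert(new Fact("alice", 3L, "follow-up"));
        System.out.println(facts.factsForCaller("alice", f -> true).size()); // 3: two historical + one fresh
    }
}

The inserts stay cheap because nothing but an append happens on the write path; the cost of organising the data for querying is paid once, in the rollover.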
Edit 2: The common fields might be candidates to set up as dimensions on a fact table (e.g. caller, time). The less common fields are (presumably) coding. For an efficient schema you could move the optional coding into one or more 'junk' dimensions.
Briefly, a junk dimension is one that represents every existing combination of two or more codes. A row on the table doesn't relate to a single system entity but to a unique combination of coding, so each row on the dimension table corresponds to a distinct combination that occurs in the raw data.
In order to have any analytic value you are still going to have to organise the data so that the columns in the junk dimension contain something consistently meaningful. This goes back to some requirements work to make sure that the mappings from the source data make sense. You can deal with items that are not always recorded by using a placeholder value such as a zero-length string (''), which is probably better than nulls.
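As a rough illustration of that idea, the sketch below assigns one surrogate key per distinct combination of optional codes, with the empty string standing in for values that were never recorded. The field names (region, priority, channel) are made up for the example:

import java.util.HashMap;
import java.util.Map;

// Toy junk dimension: one row (surrogate key) per distinct combination of
// low-cardinality optional codes, so the fact table stores a single key.
public class JunkDimension {
    // The combination of codes; missing values use '' rather than null.
    record CodeCombo(String region, String priority, String channel) {
        static CodeCombo of(String region, String priority, String channel) {
            return new CodeCombo(nz(region), nz(priority), nz(channel));
        }
        private static String nz(String s) { return s == null ? "" : s; }
    }

    private final Map<CodeCombo, Integer> keyByCombo = new HashMap<>();
    private int nextKey = 1;

    // Look up the surrogate key for a combination, adding a new dimension row if needed.
    public int keyFor(String region, String priority, String channel) {
        return keyByCombo.computeIfAbsent(CodeCombo.of(region, priority, channel), c -> nextKey++);
    }

    public static void main(String[] args) {
        JunkDimension dim = new JunkDimension();
        System.out.println(dim.keyFor("EMEA", "P1", null));   // 1 -- missing channel stored as ''
        System.out.println(dim.keyFor("EMEA", "P1", ""));     // 1 -- same combination, same row
        System.out.println(dim.keyFor("APAC", "P2", "web"));  // 2 -- new combination, new row
    }
}

The fact table then only ever needs one integer column for the whole set of optional codes, which keeps the rarely-populated fields out of the fact table without losing the ability to slice by them.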