Big Data Architecture – Starting an Analytics System for Large Data Processing

Architecture · big data · data · data-warehouse

Imagine you're writing some sort of web analytics system – you're recording raw page hits along with some extra details like tagging cookies, etc., and then producing stats such as:

  • Which pages got the most traffic over a time period
  • Which referrers sent the most traffic
  • Goals completed (a goal being a view of a particular page)
  • And more advanced things, like which referrers sent the most visitors who later hit a goal (a rough query sketch follows this list).
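
To make that last report concrete, the naive relational version of it might look something like the sketch below. The `hits` table, its columns and the `/signup/thanks` goal page are all hypothetical, purely for illustration:

```python
import sqlite3

# Hypothetical raw table: one row per page view.
conn = sqlite3.connect("analytics.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS hits (
        visitor_id TEXT,
        referrer   TEXT,
        page       TEXT,
        hit_time   TEXT
    )
""")

# Which referrers appear on hits whose visitor later viewed the goal page?
rows = conn.execute("""
    SELECT h.referrer, COUNT(DISTINCT h.visitor_id) AS converted_visitors
    FROM hits AS h
    JOIN hits AS g
      ON g.visitor_id = h.visitor_id
     AND g.page = '/signup/thanks'
     AND g.hit_time >= h.hit_time
    GROUP BY h.referrer
    ORDER BY converted_visitors DESC
""").fetchall()

for referrer, converted in rows:
    print(referrer, converted)
```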

The naive way of approaching this would be to throw it all in a relational database and run queries over it – but that won't scale.

You could pre-calculate everything (have a queue of incoming 'hits' and use it to update report tables) – but what if you later change a goal? How could you efficiently re-calculate just the data that would be affected?
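
For instance, a minimal sketch of that queue-driven pre-calculation (the hit format is made up for illustration) might be:

```python
from collections import Counter, deque

# Drain a queue of incoming hits into in-memory "report tables".
hit_queue = deque([
    {"page": "/home", "referrer": "google.com"},
    {"page": "/pricing", "referrer": "google.com"},
    {"page": "/home", "referrer": "news.ycombinator.com"},
])

page_views = Counter()      # report: traffic per page
referrer_views = Counter()  # report: traffic per referrer

while hit_queue:
    hit = hit_queue.popleft()
    page_views[hit["page"]] += 1
    referrer_views[hit["referrer"]] += 1

print(page_views.most_common())
print(referrer_views.most_common())
```

The problem is exactly the one above: once a goal definition changes, running totals like these can't be corrected without going back to the raw hits.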

Obviously this has been done before 😉 so any tips on where to start, methods and examples, architecture, technologies, etc. would be appreciated.

Best Answer

First, don't assume that an RDBMS isn't going to scale. It might not be the right solution, but saying it won't scale doesn't make sense unless you've considered how your data is going to come into the system, how it's going to be queried, and what you eventually want to see from those queries.

Recording raw page hits may or may not be a large dataset. If you do this in a simple, naive way, recording a row for every single hit, it may not scale, but this isn't necessarily the smartest way to record these things. You are likely going to be working from server logs, which you will then distill into an aggregated form.
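
As a rough sketch of that distillation step (the regex assumes something like the common Apache access log format, which may not match what your servers actually emit):

```python
import re
from collections import Counter

# Distill raw web server logs into daily per-page hit counts.
LOG_LINE = re.compile(r'\S+ \S+ \S+ \[(\d{2}/\w{3}/\d{4}):[^\]]+\] "GET (\S+)')

daily_page_hits = Counter()

with open("access.log") as log:
    for line in log:
        match = LOG_LINE.match(line)
        if match:
            day, path = match.groups()
            daily_page_hits[(day, path)] += 1

# It's these aggregated counts, not the raw lines, that get loaded into the database.
for (day, path), count in daily_page_hits.most_common(10):
    print(day, path, count)
```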

Path tracking is likely to be the largest dataset here, since you'll need the breadcrumbs from each individual user, but the querying side is the important part. To do this in a sophisticated way, you'll likely be using some application logic, not a raw query.

Unless you have a large number of users, a single RDBMS should be able to handle these queries. The general idea is to keep the aggregate data and the raw, fine-grained data in different tables. The aggregates provide fast queries with indexes, etc., and the fine-grained data can be used to build new metrics.
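
A minimal sketch of that two-table layout, assuming a hypothetical `hits` table and a daily `page_hits_daily` aggregate (the names are illustrative, not a prescribed schema):

```python
import sqlite3

conn = sqlite3.connect("analytics.db")
conn.executescript("""
-- fine-grained data: one row per hit, kept for rebuilding metrics later
CREATE TABLE IF NOT EXISTS hits (
    visitor_id TEXT,
    referrer   TEXT,
    page       TEXT,
    hit_date   TEXT            -- e.g. '2024-05-01'
);

-- aggregate data: small, indexed, and what the reports actually query
CREATE TABLE IF NOT EXISTS page_hits_daily (
    hit_date TEXT,
    page     TEXT,
    views    INTEGER,
    PRIMARY KEY (hit_date, page)
);
""")

# Reports hit the compact aggregate table...
top_pages = conn.execute("""
    SELECT page, SUM(views) AS total
    FROM page_hits_daily
    WHERE hit_date BETWEEN '2024-05-01' AND '2024-05-31'
    GROUP BY page
    ORDER BY total DESC
    LIMIT 10
""").fetchall()
print(top_pages)

# ...while new or redefined metrics can always be rebuilt from `hits`.
```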

Some databases and some BI solutions provide automated ways to do this. Oracle has aggregate persistence, for example, but in my own work I've found myself writing batch jobs to build aggregates.
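
Such a batch job can be as simple as the sketch below (same hypothetical schema as the earlier sketch); rebuilding a whole day at a time keeps reruns idempotent:

```python
import sqlite3

def rebuild_daily_aggregate(conn: sqlite3.Connection, day: str) -> None:
    """Recompute one day's rows in page_hits_daily from the raw hits table."""
    with conn:  # one transaction per day, so the job can safely be rerun
        conn.execute("DELETE FROM page_hits_daily WHERE hit_date = ?", (day,))
        conn.execute("""
            INSERT INTO page_hits_daily (hit_date, page, views)
            SELECT hit_date, page, COUNT(*)
            FROM hits
            WHERE hit_date = ?
            GROUP BY hit_date, page
        """, (day,))

conn = sqlite3.connect("analytics.db")
rebuild_daily_aggregate(conn, "2024-05-01")
```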

Longer term, you'll want to learn about modeling your data dimensionally rather than relationally. Dimensional models and star schemas are more extensible than a relational model replicated from a production system, and they provide a better way to manage the granularity of the data and cached aggregates.
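
For a sense of what that looks like, here's a toy star-schema layout for page hits: one fact table surrounded by date, page and referrer dimensions (names and columns are illustrative only):

```python
import sqlite3

conn = sqlite3.connect("analytics_dw.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS dim_date (
    date_key  INTEGER PRIMARY KEY,   -- e.g. 20240501
    full_date TEXT,
    month     TEXT,
    year      INTEGER
);

CREATE TABLE IF NOT EXISTS dim_page (
    page_key INTEGER PRIMARY KEY,
    path     TEXT
);

CREATE TABLE IF NOT EXISTS dim_referrer (
    referrer_key INTEGER PRIMARY KEY,
    domain       TEXT
);

-- fact table: one row per hit, or per whatever pre-aggregated grain you choose
CREATE TABLE IF NOT EXISTS fact_page_hit (
    date_key     INTEGER REFERENCES dim_date (date_key),
    page_key     INTEGER REFERENCES dim_page (page_key),
    referrer_key INTEGER REFERENCES dim_referrer (referrer_key),
    hits         INTEGER
);
""")
```

Reports then roll the fact table up along whichever dimensions they need, and a new report usually means joining in another dimension rather than reshaping the raw tables.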

If you have really large datasets, then you'll need to start thinking about distributed processing, map/reduce, etc. But you'll save yourself a lot of time if you can manage to use a traditional database in an efficient manner. Performing complex analytics (i.e., more than simple aggregates such as SUM or AVG) requires a lot more thinking and expertise in the map/reduce framework than in SQL.
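
If you do end up there, the shape of it is roughly the toy sketch below: map each chunk of log lines to partial counts, then reduce the partial results. Real deployments would use Hadoop, Spark or similar, and the line format here is made up:

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool

def map_chunk(lines):
    """Map step: partial per-page counts for one chunk of log lines."""
    counts = Counter()
    for line in lines:
        parts = line.split()          # assumed format: "<visitor> <page> <referrer>"
        if len(parts) >= 2:
            counts[parts[1]] += 1
    return counts

def reduce_counts(total, partial):
    """Reduce step: merge partial counts."""
    total.update(partial)
    return total

if __name__ == "__main__":
    chunks = [
        ["v1 /home google.com", "v2 /pricing google.com"],
        ["v3 /home bing.com", "v1 /pricing bing.com"],
    ]
    with Pool() as pool:
        partials = pool.map(map_chunk, chunks)
    print(reduce(reduce_counts, partials, Counter()).most_common())
```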
