As always, knowing the structure of the underlying transaction data--the atomic components used to build a DW--is the first and biggest step.
There are essentially two options, based on how you retrieve the data. One of these, already mentioned in a prior answer to this question, is to access your GA data via the GA API. The data you get this way is pretty close to the form in which it appears in the GA Report, rather than to the underlying transactional data. The advantage of using this as your data source is that your "ETL" is very simple: parsing the data from the XML container is about all that's needed.
The second option involves grabbing the data much closer to the source.
Nothing complicated; still, a few lines of background are perhaps helpful here.
The GA Web Dashboard is created by parsing/filtering a GA transaction log (the container that holds the GA data that corresponds to one Profile in one Account).
Each line in this log represents a single transaction and is delivered to the GA server in the form of an HTTP Request from the client.
Appended to that Request (which is nominally for a single-pixel GIF) is a single string that contains all of the data returned from the _trackPageview function call plus data from the client DOM, GA cookies set for this client, and the contents of the Browser's location bar (http://www....).
Though this Request is from the client, it is invoked by the GA script (which resides on the client) immediately after execution of GA's primary data-collecting function (_trackPageview).
So working directly with this transaction data is probably the most natural way to build a Data Warehouse; another advantage is that you avoid the additional overhead of an intermediate API.
The individual lines of the GA log are not normally available to GA users. Still, it's simple to get them. These two steps should suffice:
Modify the GA tracking code on each page of your Site so that it sends a copy of each GIF Request (one line in the GA logfile) to your own server. Specifically, immediately before the call to _trackPageview(), add this line:
pageTracker._setLocalRemoteServerMode();
Next, just put a single-pixel GIF image in your document root and call it "__utm.gif".
So now your server activity log will contain these individual transaction lines, again built from a string appended to an HTTP Request for the GA tracking pixel as well as from other data in the Request (e.g., the User Agent string). The former string is just a concatenation of key-value pairs; each key begins with the letters "utm" (probably for "Urchin Tracker"). Not every utm parameter appears in every GIF Request; several of them, for instance, are used only for e-commerce transactions--it depends on the transaction.
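As a rough sketch of pulling those lines back out of your server log, the snippet below picks the __utm.gif requests out of access-log lines and keeps the query string and the User Agent. It assumes the Apache "combined" log format, and the sample line, the function name extractGifRequest, and the IP address are all made up for illustration; adjust the pattern to whatever your server actually writes.

```javascript
// Hypothetical access-log line in Apache "combined" format (an assumption).
const sampleLine =
  '203.0.113.7 - - [19/May/2010:08:00:51 +0000] ' +
  '"GET /__utm.gif?utmwv=1&utmn=1669045322&utmcs=UTF-8 HTTP/1.1" 200 35 ' +
  '"http://lindenlab.com/employment" "Mozilla/5.0 (X11; Linux x86_64)"';

// Returns the query string and User Agent of a __utm.gif hit,
// or null for any line that is not a tracking-pixel request.
function extractGifRequest(logLine) {
  const match = logLine.match(
    /"GET \/__utm\.gif\?(\S*) HTTP\/[\d.]+" \d{3} \S+ "[^"]*" "([^"]*)"/
  );
  if (!match) return null;
  return { query: match[1], userAgent: match[2] };
}

const hit = extractGifRequest(sampleLine);
console.log(hit.query);
console.log(hit.userAgent);
```

Lines for ordinary page requests return null, so filtering a whole log file is a matter of mapping this function over its lines and dropping the nulls.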
Here's an actual GIF Request (account ID has been sanitized, otherwise it's intact):
http://www.google-analytics.com/__utm.gif?utmwv=1&utmn=1669045322&utmcs=UTF-8&utmsr=1280x800&utmsc=24-bit&utmul=en-us&utmje=1&utmfl=10.0%20r45&utmcn=1&utmdt=Position%20Listings%20%7C%20Linden%20Lab&utmhn=lindenlab.hrmdirect.com&utmr=http://lindenlab.com/employment&utmp=/employment/openings.php?sort=da&&utmac=UA-XXXXXX-X&utmcc=__utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B__utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B
As you can see, this string consists of a set of key-value pairs, each separated by an "&". Two trivial steps make this much easier to read: (i) splitting the string on the ampersand; and (ii) replacing each GIF parameter (key) with a short descriptive phrase:
gatc_version 1
GIF_req_unique_id 1669045322
language_encoding UTF-8
screen_resolution 1280x800
screen_color_depth 24-bit
browser_language en-us
java_enabled 1
flash_version 10.0%20r45
campaign_session_new 1
page_title Position%20Listings%20%7C%20Linden%20Lab
host_name lindenlab.hrmdirect.com
referral_url http://lindenlab.com/employment
page_request /employment/openings.php?sort=da
account_string UA-XXXXXX-X
cookies __utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B__utmz%3D87045125.1274256051.1.1.utmccn%3D(referral)%7Cutmcsr%3Dlindenlab.com%7Cutmcct%3D%2Femployment%7Cutmcmd%3Dreferral%3B%2B
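Those two steps can be sketched in a few lines of JavaScript. The labels come from the listing above; the function name parseGifRequest and the shortened sample URL are just for illustration. The sketch also URL-decodes each value, which goes one step beyond the listing (where "%20" is left as-is).

```javascript
// Descriptive labels from the listing above; parameters missing from this map
// keep their raw "utm..." name.
const LABELS = {
  utmwv: 'gatc_version', utmn: 'GIF_req_unique_id', utmcs: 'language_encoding',
  utmsr: 'screen_resolution', utmsc: 'screen_color_depth', utmul: 'browser_language',
  utmje: 'java_enabled', utmfl: 'flash_version', utmcn: 'campaign_session_new',
  utmdt: 'page_title', utmhn: 'host_name', utmr: 'referral_url',
  utmp: 'page_request', utmac: 'account_string', utmcc: 'cookies',
};

function parseGifRequest(url) {
  // Everything after the first "?" is the parameter string.
  const query = url.slice(url.indexOf('?') + 1);
  const result = {};
  for (const pair of query.split('&')) {
    if (pair === '') continue; // the sample Request contains an empty "&&" segment
    // Split on the first "=" only, since values such as utmp can contain "=".
    const eq = pair.indexOf('=');
    const key = eq === -1 ? pair : pair.slice(0, eq);
    const value = eq === -1 ? '' : pair.slice(eq + 1);
    // Extra step (not in the listing): decode %XX escapes in the value.
    result[LABELS[key] || key] = decodeURIComponent(value);
  }
  return result;
}

// Shortened version of the GIF Request shown earlier.
const sampleUrl =
  'http://www.google-analytics.com/__utm.gif?utmwv=1&utmn=1669045322' +
  '&utmsr=1280x800&utmfl=10.0%20r45&utmp=/employment/openings.php?sort=da' +
  '&&utmac=UA-XXXXXX-X';

console.log(parseGifRequest(sampleUrl));
```

Note that decodeURIComponent throws on a malformed escape sequence, so a production version would probably want a try/catch around it.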
The cookies are also simple to parse (see Google's concise description here): for instance,
__utma is the unique-visitor cookie,
__utmb, __utmc are session cookies, and
__utmz is the referral type.
The GA cookies store the majority of the data that record each interaction by a user (e.g., clicking a tagged download link, clicking a link to another page on the Site, a subsequent visit the next day, etc.). So for instance, the __utma cookie consists of groups of integers, each group separated by a "."; the last group is the visit count for that user (a "1" in this case).
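A minimal sketch of that cookie parsing, using part of the cookie string from the Request above. It assumes the cookies are separated by ";+" once the value is URL-decoded (which is what the sample shows), and the function names parseCookies and parseUtma are made up for illustration. Only the meaning of the last __utma group (the visit count) is taken from the description here; the other groups are returned unlabeled.

```javascript
// First three cookies from the sample Request, still URL-encoded.
const rawCookies =
  '__utma%3D87045125.1669045322.1274256051.1274256051.1274256051.1%3B%2B' +
  '__utmb%3D87045125%3B%2B__utmc%3D87045125%3B%2B';

// Decode the string and split it into a { cookieName: value } map.
function parseCookies(raw) {
  const cookies = {};
  for (const entry of decodeURIComponent(raw).split(';+')) {
    // Split on the first "=" only: __utmz values contain further "=" signs.
    const eq = entry.indexOf('=');
    if (eq === -1) continue; // skips the empty piece after the final ";+"
    cookies[entry.slice(0, eq)] = entry.slice(eq + 1);
  }
  return cookies;
}

// Split __utma into its dot-separated integer groups.
function parseUtma(value) {
  const groups = value.split('.').map(Number);
  return { groups, visitCount: groups[groups.length - 1] };
}

const cookies = parseCookies(rawCookies);
const utma = parseUtma(cookies.__utma);
console.log(utma.visitCount);
```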
If you're using ng-view in your Angular app, you can listen for the $viewContentLoaded event and push a tracking event to Google Analytics.
The example below assumes you've set up your tracking code in your main index.html file with a name of var _gaq, and that MyCtrl is the controller you've defined in the ng-controller directive.
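function MyCtrl($scope, $location, $window) {
  // Fires each time ng-view finishes loading a new view.
  $scope.$on('$viewContentLoaded', function(event) {
    // The command name is case-sensitive: '_trackPageview', not '_trackPageView'.
    $window._gaq.push(['_trackPageview', $location.url()]);
  });
}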
UPDATE:
For the new version of Google Analytics, use this one instead:
function MyCtrl($scope, $location, $window) {
  $scope.$on('$viewContentLoaded', function(event) {
    $window.ga('send', 'pageview', { page: $location.url() });
  });
}
Alex, unfortunately, there is nothing you can do about the historical data.
However, you can use a simple filter to exclude the pages you don't want to see (the filter field above the report table, not the filters related to accounts/profiles) -- see the attached screenshot below.
Make sure you select exclude and then pick the Page dimension. The easiest way would be to use regular expressions, like:
This one would remove any pages that contain either "a", "b", or "c".
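To try out an exclude pattern before typing it into the filter field, a quick local check can help. The sketch below is only an illustration: the pattern a|b|c is one expression that fits the description above (an assumption, since other patterns would match the same pages), the page paths are made up, and JavaScript's regex engine stands in for GA's, which may differ in details.

```javascript
// One pattern matching any page that contains "a", "b", or "c" (an assumption).
const excludePattern = /a|b|c/;

// Hypothetical page paths as they might appear in the Page dimension.
const pages = ['/index', '/about', '/help', '/cart'];

// An "exclude" filter keeps only the rows that do NOT match the pattern.
const remaining = pages.filter((page) => !excludePattern.test(page));

console.log(remaining); // [ '/index', '/help' ]
```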
The expression would probably be a bit more complicated in your case, and I suggest using a tool like RegEx Hero (free, online). I am not sure if the pages you would like to remove from the reports have anything in common, but regular expressions can do quite a lot :).
One last thing -- be aware there is a slight difference between segments and (table) filters. If you use a segment on the Page dimension, you end up with ALL the pages that were seen during any visit that included the page you set in the segment. It might be a bit confusing, but see this article for a detailed explanation.