Ignore utm_* values with varnish

cache performance query-string varnish

Can I 'ignore' query string variables before pulling matching objects from the cache, but not actually remove them from the URL to the end-user?

For example, the marketing parameters utm_source, utm_campaign, and other utm_* values don't change the content of the page; they just vary a lot from campaign to campaign and are used by all of our client-side tracking.

So this also means that the URL can't change on the client side, but it should somehow be 'normalized' in the cache.

Essentially I want all of these…

http://site.com/page/?utm_source=google

http://site.com/page/?utm_source=facebook&utm_content=123

http://site.com/page/?utm_campaign=usa

… to all HIT the cache for http://site.com/page/

However, this URL should not match that cache entry (because variation is not a utm_* param):

http://site.com/page/?utm_source=google&variation=5

Instead, it should be served from the cache entry for:

http://site.com/page/?variation=5

Also, keeping in mind that the URL the user sees must remain the same, I can't redirect to something without params or any kind of solution like that.

Best Answer

Yes, but to do this you must override the default vcl_hash. This is dangerous only because people forget how Varnish works: the default logic is appended to whatever you provide, and it runs unless your subroutine returns first. Therefore, if you want to change something like this, you must replicate the default logic in its entirety, modify it to your liking, and then prevent the default logic from running by returning at the end.

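Here's the default vcl_hash from a version I have handy. The logic has been the same for a long time, but the syntax has changed between major versions (Varnish 3.x ends with return (hash), and 2.x used set req.hash += ... instead of hash_data()), so check the default.vcl or builtin.vcl shipped with your version to be sure.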

sub vcl_hash {
    hash_data(req.url);
    if (req.http.host) {
        hash_data(req.http.host);
    } else {
        hash_data(server.ip);
    }
    return (lookup);
}

That's pretty straightforward: objects are differentiated by their URL and either their Host header or the IP address to which the client connected.
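As a rough mental model (not Varnish's actual implementation, whose digest differs in detail), the default key can be pictured as a hash over the URL plus the Host header, or the server IP when there is no Host. The function name cache_key and the fallback address 192.0.2.1 are made up for illustration:

```python
import hashlib

def cache_key(url, host=None, server_ip="192.0.2.1"):
    """Toy model of the default vcl_hash: URL, then Host or server IP."""
    digest = hashlib.sha256()
    digest.update(url.encode() + b"\0")
    # Fall back to the server IP when the request carries no Host header.
    digest.update((host if host else server_ip).encode() + b"\0")
    return digest.hexdigest()

# Every distinct query string yields a distinct key, so each utm_* variation misses.
a = cache_key("/page/?utm_source=google", "site.com")
b = cache_key("/page/?utm_campaign=usa", "site.com")
print(a == b)
```

Because the full query string feeds the digest, the only way to make the utm_* variations share an object is to change what gets hashed.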

What you'd want to do is just replace the first line (hash_data(req.url)) with something like this (pseudocode):

set myurl = req.url minus utm bits;
hash_data(myurl);

However, you can't do just this in your own vcl_hash, because without a return the built-in vcl_hash runs next and hashes the whole URL as well! Remember, default VCL always runs unless you return first. So, we have to replace it all:

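sub vcl_hash {
    # VCL has no local variables, so a scratch header (the name is arbitrary) holds the URL.
    set req.http.X-Hash-Url = regsuball(req.url, "([?&])utm_[^&]*", "\1");
    # Now we potentially have foo.php?bar=baz&&&&thing=true
    set req.http.X-Hash-Url = regsuball(req.http.X-Hash-Url, "&&+", "&");
    # Fix foo.php?&bar=baz -> foo.php?bar=baz
    set req.http.X-Hash-Url = regsub(req.http.X-Hash-Url, "\?&", "?");
    # Lastly, let's fix foo.php?utm_foo=bar -> foo.php? (and any trailing &)
    set req.http.X-Hash-Url = regsub(req.http.X-Hash-Url, "[?&]$", "");
    hash_data(req.http.X-Hash-Url);
    # Drop the scratch header so it is not forwarded to the backend.
    unset req.http.X-Hash-Url;
    if (req.http.host) {
        hash_data(req.http.host);
    } else {
        hash_data(server.ip);
    }
    return (lookup);
}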
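One way to sanity-check a substitution chain like this without a running Varnish is to replay it in Python against the URLs from the question. This is only a sketch under assumptions: Python's re module is not the PCRE engine Varnish uses (the two should agree for patterns this simple), the helper name hash_url is made up, and the chain includes two cleanup steps for a leftover "?&" and a trailing "&":

```python
import re

def hash_url(url: str) -> str:
    """Return the URL that would be hashed after stripping utm_* parameters."""
    url = re.sub(r"([?&])utm_[^&]*", r"\1", url)  # drop utm_* pairs, keep the separator
    url = re.sub(r"&&+", "&", url)                # collapse runs of &
    url = re.sub(r"\?&", "?", url, count=1)       # "?&x=1" -> "?x=1"
    url = re.sub(r"[?&]$", "", url, count=1)      # drop a dangling ? or &
    return url

for u in [
    "/page/?utm_source=google",
    "/page/?utm_source=facebook&utm_content=123",
    "/page/?utm_campaign=usa",
    "/page/?utm_source=google&variation=5",
]:
    print(u, "->", hash_url(u))
```

The first three URLs should normalize to /page/ and the last to /page/?variation=5, which is the behavior the question asks for. req.url in Varnish holds only the path and query string, so the host is left out here.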

One final caveat: Please note, this is UNTESTED. But it should at least unambiguously communicate the idea. Inform me of errors if you find any and I'll gladly fix the code.