Force request to miss cache but still store the response

cache varnish

I have a slow web app that I've placed Varnish in front of. All of the pages are static (they don't vary for a different user), but they need to be updated every 5 minutes so they contain recent data.

I have a simple script (wget --mirror) that crawls the entire website every 15 minutes. Each crawl takes about 5 minutes. The point of the crawl is to update every page in the Varnish cache so that a user never has to wait for the page to generate (since all pages have been generated recently thanks to the spider).

The timeline looks like this:

  • 00:00:00: Cache flushed
  • 00:00:00: Spider starts crawling to update cache with new pages
  • 00:05:00: Spider finishes crawling, all pages are updated until 00:15:00

A request that comes in between 00:00:00 and 00:05:00 might hit a page that hasn't been regenerated yet, and will be forced to wait a few seconds for a response. This isn't acceptable.

What I'd like to do is, perhaps using some VCL magic, always forward requests from the spider to the backend, but still store the response in the cache. This way, a user will never have to wait for a page to generate, since there is no 5-minute window in which parts of the cache are empty (except perhaps at server startup).

How can I do this?

Best Answer

req.hash_always_miss should do the trick.

Don't do a full cache flush at the start of the spider run. Instead, just set the spider to work - and in your vcl_recv, set the spider's requests to always miss the cache lookup; they'll fetch a new copy from the backend.

acl spider {
  "127.0.0.1";
  /* or wherever the spider comes from */
}

sub vcl_recv {
  if (client.ip ~ spider) {
    set req.hash_always_miss = true;
  }
  /* ... and continue as normal with the rest of the config */
}

While the spider is running, and until each new response has been stored, clients will continue to be served the older cached copy seamlessly (as long as it's still within its TTL).
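
The TTL condition matters: a cached page has to survive until the spider's next visit, otherwise a user can still hit an expired object. A minimal sketch of one way to ensure that, assuming Varnish 4 or later (in Varnish 3 the same assignment would go in vcl_fetch instead) and that the backend doesn't already send suitable caching headers. The 25m value is an assumption, chosen to exceed the 15-minute crawl interval plus the roughly 5-minute crawl time:

```vcl
sub vcl_backend_response {
  /* Assumed value: longer than one full spider cycle, so the old
     object is still valid when the spider replaces it. */
  set beresp.ttl = 25m;
}
```

Since the spider overwrites each object on every run, the TTL only acts as a safety net; pages are still refreshed on the crawl schedule rather than when the TTL runs out.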
