NGINX cache returns MISS on the first request from each client (Chrome, curl, wget) for the same URL

cache http http-caching nginx web-server

I have an nginx cache proxy that fetches content from an Apache origin server.

I make requests from curl, wget and Chrome to verify the cache response. The problem is that, for the same URL, I always get a MISS the first time from each separate client.

I would expect that after one request from any client, the other clients would get a HIT, but they get a MISS.

I only get a HIT when repeating the request from the exact same client.

It feels like the cache key depends on the user agent, but it does not:

proxy_cache_key $scheme://$host$request_uri;

To rule out differences in HTTP version and user agent, I specified them explicitly in the requests (wget uses HTTP/1.1 by default). Both requests show up as GET in the logs, so they are not HEAD requests.

wget --server-response --user-agent "foo" 'https://www.example.com/x.php?124'

HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Server: nginx/1.16.1
  Date: Tue, 03 Mar 2020 19:53:53 GMT
  Content-Type: text/html; charset=UTF-8
  Transfer-Encoding: chunked
  Connection: keep-alive
  X-Powered-By: PHP/5.4.16
  X-Accel-Expires: 3600
  Vary: Accept-Encoding
  X-Cache: MISS <<<<<<<<<<<<<<<<<<<<<<<<<< there

# repeating the same request with wget gets a HIT

wget --server-response --user-agent "foo" 'https://www.example.com/x.php?124'

HTTP request sent, awaiting response...
  HTTP/1.1 200 OK
  Server: nginx/1.16.1
  Date: Tue, 03 Mar 2020 19:55:21 GMT
  Content-Type: text/html; charset=UTF-8
  Transfer-Encoding: chunked
  Connection: keep-alive
  X-Powered-By: PHP/5.4.16
  X-Accel-Expires: 3600
  Vary: Accept-Encoding
  X-Cache: HIT <<<<<<<<<<<<<<<<<<<<<<<<<<< there

# now that the response should be cached, a curl request to the same URL gets a MISS again

curl -L -i --http1.1 --user-agent "foo" 'https://www.example.com/x.php?124'
HTTP/1.1 200 OK
Server: nginx/1.16.1
Date: Tue, 03 Mar 2020 19:56:37 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: keep-alive
X-Powered-By: PHP/5.4.16
X-Accel-Expires: 3600
Vary: Accept-Encoding
X-Cache: MISS <<<<<<<<<<<<<<<<<<<<<<<<<<< there

My config

http {

    sendfile            on;
    tcp_nopush          on;
    tcp_nodelay         on;
    keepalive_timeout   65;
    types_hash_max_size 2048;

    include             /etc/nginx/mime.types;
    default_type        application/octet-stream;

    include /etc/nginx/conf.d/*.conf;

    # lower value might show error: "upstream sent too big header"
    proxy_buffer_size   128k;
    proxy_buffers   8 256k;
    proxy_busy_buffers_size   256k;

    # fixes error request entity too large when uploading files
    client_max_body_size 256M;

    # main cache for images and some of the html pages
    proxy_cache_path /nginx_cache levels=1:2 keys_zone=nginx_cache:512m max_size=50g
                     inactive=90d use_temp_path=off;

    # deliver a cached copy in case of error at source server
    proxy_cache_background_update on;
    proxy_cache_use_stale updating error timeout http_500 http_502 http_503 http_504;
    proxy_cache_key $scheme://$host$request_uri;

    # set http version between nginx and origin server, you can check the version in origin server log
    proxy_http_version 1.1;

    # enable gzip after we forced plain-text between cache and origin with Accept "" in some vhosts
    gzip_types text/plain text/css text/xml text/javascript application/javascript application/x-javascript application/xml image/jpeg image/png image/webp image/gif image/x-icon image/svg;
    gzip on;

    # security headers, iframe block, etc
    add_header X-Frame-Options sameorigin;
    add_header X-Content-Type-Options nosniff;
    add_header Strict-Transport-Security max-age=2678400;

    # default server(s) that don't match any specified hosts
    server {
        server_name _;
        listen 80 default_server;
        listen 443 ssl http2 default_server;
        root /var/www/html;
    }

    # include all our custom vhosts
    include /etc/nginx/adr_vhosts/*.conf;

} # end of http

My vhost config

server {
    listen       443 ssl http2;
    server_name  www.example.com;
    root         /usr/share/nginx/html;

    location / {

            # using the alt port to bypass the other nginx cache at source server (and X-Real-IP overwrite)
            proxy_pass       http://xx.xx.xx.xx:81; 

            proxy_cache             nginx_cache;

            # ask directly for the right host (including www), to avoid mismatches, additional redirects
            proxy_set_header Host      $host;

            proxy_set_header X-Real-IP $remote_addr;
            proxy_set_header X-Forwarded-For  $proxy_add_x_forwarded_for;

            # sub_filter only works on plain text, disable gzip communication with origin server
            proxy_set_header Accept-Encoding "";

            # was it a hit or a miss
            add_header X-Cache $upstream_cache_status;

            # keep the x-accel header for debugging purposes
            proxy_pass_header "X-Accel-Expires";

    }

}

I disabled gzip compression between the cache server and the origin server with proxy_set_header Accept-Encoding ""; in order to use sub_filter in some locations.
Then I re-enabled gzip for responses sent to clients with gzip_types and gzip on.

Inside /nginx_cache there is a separate cache file saved for every client. The two shown here are for nginx and wget; the animation switches between the two files to show that they are almost identical, except for the binary (or gzip?!) data above:

[animated image comparing the two cache files]

Edit: I get a HIT with all clients if I specify Accept-Encoding: gzip in the request! I will look into that…

Edit 2: wget sends the request header Accept-Encoding: identity, curl by default sends none at all, and Chrome sends Accept-Encoding: gzip, deflate, br. The cache correctly returns a HIT if I force this header to any value, as long as it is the same across clients. Is that a misconfiguration on my end, or is it normal behavior? It acts as if Accept-Encoding were part of the cache key.

Best Answer

I am answering my own question to condense the long details above and to post a partial solution.

I found that different clients (curl vs wget vs Chrome) each get a MISS, one after another, for the exact same URL because of the Vary: Accept-Encoding header in the response. With that header, the cache stores a separate variant for each distinct Accept-Encoding request value.

Chrome: Accept-Encoding: gzip, deflate, br 

Wget: Accept-Encoding: identity

Curl: n/a
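
One way to see these values side by side is to log the cache status together with the request's Accept-Encoding header. The following is only a sketch: the format name cache_debug and the log file path are made-up names for illustration, while the variables themselves are standard nginx variables.

```nginx
# in the http {} block: log the cache status next to the request headers
# that might influence which cached variant is used
# ("cache_debug" is a hypothetical format name)
log_format cache_debug '$remote_addr "$request" '
                       'cache=$upstream_cache_status '
                       'ae="$http_accept_encoding" ua="$http_user_agent"';

# in the server {} or location {} block being debugged
# (the file path is an assumption, adjust as needed)
access_log /var/log/nginx/cache_debug.log cache_debug;
```

With this in place, a MISS that coincides with a new ae="..." value for an already cached URL points at Vary rather than at the cache key.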

The Vary: Accept-Encoding seems to come from my origin server and I confirmed that the cache always returns a HIT if I add:

proxy_ignore_headers Vary;
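
For context, here is a minimal sketch of where that directive could sit in the vhost from the question. It assumes, as in that vhost, that the origin is always asked for an uncompressed body via proxy_set_header Accept-Encoding "", so every client would share the same cached copy. It is not meant as a statement that ignoring Vary is safe in general.

```nginx
location / {
    proxy_pass       http://xx.xx.xx.xx:81;
    proxy_cache      nginx_cache;

    # the origin is always asked for an uncompressed body (already set in the vhost)
    proxy_set_header Accept-Encoding "";

    # do not create a separate cache variant per client Accept-Encoding value
    proxy_ignore_headers Vary;

    # was it a hit or a miss
    add_header X-Cache $upstream_cache_status;
}
```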

I am just not sure whether this is safe to do; I will open another question for that.