Nginx – Blocking ‘good’ bots in nginx with multiple conditions for certain off-limits URLs where humans can go

nginx · web-crawler

After two days of searching, trying, and failing, I decided to post this here: I haven't found any example of someone doing the same thing, and nothing I tried seems to work. I'm trying to send a 403 to bots that don't respect the robots.txt file (even after downloading it several times), specifically Googlebot. It should enforce the following robots.txt definition:

User-agent: *
Disallow: /*/*/page/

The intent is to allow Google to crawl whatever it can find on the site, but to return a 403 for the following type of request. Googlebot seems to keep nesting these links endlessly, adding paging block after paging block:

my_domain.com:80 - 66.x.67.x - - [25/Apr/2012:11:13:54 +0200] "GET /2011/06/page/3/?/page/2//page/3//page/2//page/3//page/2//page/2//page/4//page/4//page/1/&wpmp_switcher=desktop HTTP/1.1" 403 135 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

It's a WordPress site, by the way. I don't want those pages to show up; after the robots.txt info got through, they stopped for a while, only to begin crawling again later. It just never stops… I do want real people to see these pages. As you can see, Googlebot gets a 403, but when I try the same request myself in a browser I get a 404 back. I want browsers to pass.

root@my_domain:# nginx -V
nginx version: nginx/1.2.0

I tried different approaches, using a map and plain old no-no if blocks, and they both act the same:

(under the http section)

map $http_user_agent $is_bot {
    default 0;
    ~crawl|Googlebot|Slurp|spider|bingbot|tracker|click|parser 1;
}

(under the server section)

location ~ /(\d+)/(\d+)/page/ {
    if ($is_bot) {
        return 403; # Please respect the robots.txt file!
    }
}
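
For completeness, this is roughly the shape I'm aiming for, so that real visitors still get served instead of a 404. It's only a sketch on my part and assumes the usual WordPress try_files fallback to /index.php, which may not match every setup:

(under the server section, as a variant of the block above)

location ~ /(\d+)/(\d+)/page/ {
    # bots that ignore robots.txt get refused outright
    if ($is_bot) {
        return 403;
    }
    # real visitors should still fall through to WordPress
    try_files $uri $uri/ /index.php?$args;
}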

I recently had to polish up my Apache skills for a client, where I did roughly the same thing, like this:

# Block real Engines , not respecting robots.txt but allowing correct calls to pass
# Google
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Googlebot/2\.[01];\ \+http://www\.google\.com/bot\.html\)$ [NC,OR]
# Bing
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ bingbot/2\.[01];\ \+http://www\.bing\.com/bingbot\.htm\)$ [NC,OR]
# msnbot
RewriteCond %{HTTP_USER_AGENT} ^msnbot-media/1\.[01]\ \(\+http://search\.msn\.com/msnbot\.htm\)$ [NC,OR]
# Slurp
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/5\.0\ \(compatible;\ Yahoo!\ Slurp;\ http://help\.yahoo\.com/help/us/ysearch/slurp\)$ [NC]

# block all page searches, the rest may pass
RewriteCond %{REQUEST_URI} ^(/[0-9]{4}/[0-9]{2}/page/) [OR]

# or with the wpmp_switcher=mobile parameter set
RewriteCond %{QUERY_STRING} wpmp_switcher=mobile

# ISSUE 403 / SERVE ERRORDOCUMENT
RewriteRule .* - [F,L]
# End if match

This does a bit more than I asked nginx to do, but it's about the same principle; I'm having a hard time figuring this out for nginx.

So my question would be: why does nginx serve my browser a 404? Why isn't the request passing through? The regex doesn't match my UA:

"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.30 Safari/536.5"

There are tons of examples of blocking based on UA alone, and that's easy. It also looks like the matching location is final, i.e. it's not 'falling through' for regular users, and I'm pretty certain that has some correlation with the 404 I get in the browser.

As a cherry on top, I also want Google to disregard the parameter wpmp_switcher=mobile; wpmp_switcher=desktop is fine, but I just don't want the same content being crawled multiple times.
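
I imagine the two conditions could be combined with another map, something along these lines (untested on my part; it builds on the $is_bot map above and nginx's built-in $arg_wpmp_switcher variable):

(under the http section)

# flag requests that are both from a bot and carry wpmp_switcher=mobile
map "$is_bot:$arg_wpmp_switcher" $block_mobile_bot {
    default      0;
    "1:mobile"   1;
}

(under the server section)

if ($block_mobile_bot) {
    return 403;
}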

Even though I ended up adding wpmp_switcher=mobile via the Google Webmaster Tools pages (which required me to sign up…), that also stopped for a while, but today they are back spidering the mobile sections.

So in short, I need to find a way for nginx to enforce the robots.txt definitions. Can someone spare a few minutes of their life and push me in the right direction, please?

I really appreciate ANY response that makes me think harder 😉

Best Answer

I think the best solution for this problem is going to involve multiple things. None of them involve blocking bots.

  1. Prevent WordPress from generating the invalid URLs in the first place.

    Figure out what caused those URLs to be generated and fix the problem.

  2. Determine if the URLs can be rewritten sanely. If so, have WordPress send a 301 redirect.

    For some of these URLs you may be able to send a 301 to redirect to the canonical URL. For others, though, it won't be so easy since the URL makes no sense whatsoever. (A rough nginx-level sketch of this idea follows this list.)

    While recent versions of WordPress send 301 redirects for some pages, plugins like Permalink Redirect can help with covering the things that WordPress doesn't. (This plugin may need an update or some customization; test carefully first.)

  3. For senseless URLs, serve a 410.

    The 410 Gone HTTP response tells the requester that the URL doesn't exist and is never coming back, so stop asking for it. Search engines can use this data to remove invalid URLs from their indexes.

    A sample configuration that should do it is (test this first!):

    location ~ "/page/\d+/page/" {
        return 410;
    }
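
As a rough illustration of point 2: if the nested /page/ segments show up in the URL path itself and the junk is always appended after an otherwise sane /YYYY/MM/page/N/ prefix, the redirect could even be issued at the nginx level instead of inside WordPress. This is only a sketch under those assumptions; adapt the pattern to your URLs and test it first:

location ~ "^(/\d{4}/\d{2}/page/\d+/)page/" {
    # collapse the nested /page/ garbage back to the first sane prefix
    return 301 $1;
}

Note that nginx checks regex locations in the order they appear in the configuration, so if you combine this with the 410 block above, the redirect location has to come first; otherwise the broader /page/\d+/page/ pattern will win.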
    