AH01797: client denied by server configuration: /usr/share/doc

apache-2.4 googlebot

For quite a while (over a month now) I have been seeing lines like the following in the Apache logs:

180.76.15.138 - - [24/Jun/2015:16:13:34 -0400] "GET /manual/de/mod/module-dict.html HTTP/1.1" 403 396 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
180.76.15.159 - - [24/Jun/2015:16:28:34 -0400] "GET /manual/es/mod/mod_cache_disk.html HTTP/1.1" 403 399 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"
66.249.75.86 - - [24/Jun/2015:16:18:01 -0400] "GET /manual/es/programs/apachectl.html HTTP/1.1" 403 436 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
[Wed Jun 24 16:13:34.430884 2015] [access_compat:error] [pid 5059] [client 180.76.15.138:58811] AH01797: client denied by server configuration: /usr/share/doc/apache2-doc/manual/de/mod/module-dict.html
[Wed Jun 24 16:18:01.037146 2015] [access_compat:error] [pid 2791] [client 66.249.75.86:56362] AH01797: client denied by server configuration: /usr/share/doc/apache2-doc/manual/es/programs/apachectl.html
[Wed Jun 24 16:28:34.461298 2015] [access_compat:error] [pid 2791] [client 180.76.15.159:25833] AH01797: client denied by server configuration: /usr/share/doc/apache2-doc/manual/es/mod/mod_cache_disk.html

The requests seem to really come from Baiduspider and Googlebot (checked using reverse DNS as explained here):

user@server:~$ host 66.249.75.86
86.75.249.66.in-addr.arpa domain name pointer crawl-66-249-75-86.googlebot.com.
user@server:~$ host crawl-66-249-75-86.googlebot.com
crawl-66-249-75-86.googlebot.com has address 66.249.75.86

I have read similar questions on this topic, like this and this, but in those cases the errors were actually preventing the site from working correctly. In my case, the HTML pages that the bots try to access do not exist, so a 403 is the expected behaviour of Apache. The only annoyance is that Google seems slow at indexing my site, although Google Webmaster Tools does not show any errors.

I am using Apache version 2.4.7 with the following vhost configuration:

<VirtualHost *:80>
    ServerName example.com
    ServerAlias www.example.com

    DocumentRoot "/var/www/example.com/public"
    <Directory />
        Options None
        AllowOverride None
        Order Deny,Allow
        Deny from all
        Require all denied
    </Directory>
    <Directory "/var/www/example.com/public">
        Options None
        AllowOverride FileInfo Limit Options=FollowSymLinks
        Order Allow,Deny
        Allow from all
        Require all granted
    </Directory>

    ErrorLog /var/log/apache2/example.com/error.log
    CustomLog /var/log/apache2/example.com/access.log combined
</VirtualHost>

My questions are therefore:

  1. Why are Baiduspider and Googlebot repeatedly trying to access content on my site that is not there and is not referenced by any links on the site?
  2. How do requests like GET /manual/de/mod/... get mapped to /usr/share/doc/apache2-doc/manual/de/mod/... when, to my understanding, they should go to /var/www/example.com/public/manual/de/mod/...?
  3. In general: should I worry about those lines as a sign of misconfiguration, or is there an explanation for them?

Best Answer

In 2.2, access control based on client hostname, IP address, and other characteristics of client requests was done using the directives Order, Allow, Deny, and Satisfy.

In 2.4, such access control is done in the same way as other authorization checks, using the new module mod_authz_host. The old access control idioms should be replaced by the new authentication mechanisms, although for compatibility with old configurations, the new module mod_access_compat is provided.

It looks like you have already set the new Require directive, so just remove the deprecated access directives (Order, Allow, and Deny) and run sudo service apache2 reload.
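
As a rough sketch of what that cleanup could look like, here is the vhost from the question with only the Order, Allow, and Deny lines removed and everything else left as posted. It assumes nothing else in the server configuration depends on mod_access_compat, so treat it as a starting point rather than a verified configuration:

```apache
<VirtualHost *:80>
    ServerName example.com
    ServerAlias www.example.com

    DocumentRoot "/var/www/example.com/public"
    # Default: deny access to the whole filesystem (2.4-style authorization only)
    <Directory />
        Options None
        AllowOverride None
        Require all denied
    </Directory>
    # Explicitly grant access to the document root
    <Directory "/var/www/example.com/public">
        Options None
        AllowOverride FileInfo Limit Options=FollowSymLinks
        Require all granted
    </Directory>

    ErrorLog /var/log/apache2/example.com/error.log
    CustomLog /var/log/apache2/example.com/access.log combined
</VirtualHost>
```

Running sudo apache2ctl configtest before the reload should catch any syntax mistakes introduced while editing.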