Need Advice for Optimizing Robots.txt File


I've noticed my indexed pages are dropping from Google like a rock, so I'm reviewing everything that changed in the last month. In Google Webmaster Tools I noticed some inaccessible pages; to compensate, I blocked them with my robots.txt file, see below:

User-agent: Baiduspider
Disallow: /
User-agent: * 
Disallow: /index.php/ 
Disallow: /*?
Disallow: /*.js$
Disallow: /*.css$ 
Disallow: /checkout/ 
Disallow: /tag/ 
Disallow: /catalogsearch/ 
Disallow: /review/ 
Disallow: /app/ 
Disallow: /downloader/ 
Disallow: /js/ 
Disallow: /lib/ 
Disallow: /media/ 
Disallow: /*.php$ 
Disallow: /pkginfo/ 
Disallow: /report/ 
Disallow: /skin/ 
Disallow: /var/ 
Disallow: /customer/ 
Disallow: /productdata/
Disallow: /productscripts/
Disallow: /includes/
Disallow: /wishlist/
Disallow: /shop/
Disallow: /supplier/
Disallow: /eng/catalog/gallery/
Disallow: /fr/catalog/gallery
Disallow: /eng/catalog/product/gallery/
Disallow: /fr/catalog/product/gallery
Disallow: /eng/cms/index/noCookies
Disallow: /fr/cms/index/noCookies
Disallow: /eng/catalog/view/_ignore_category/
Disallow: /fr/catalog/view/_ignore_category/
Disallow: /eng/customer/
Disallow: /fr/customer/
Disallow: /catalog/category/view/
Disallow: /catalog/product/view/
Disallow: /*?dir*
Disallow: /*?dir=desc
Disallow: /*?dir=asc
Disallow: /*?limit=all
Disallow: /*?mode*

Is my robots.txt file too restrictive?

Best Answer

The best, most complete answer to your question is available in Google Webmaster Tools itself, under Crawl > Blocked URLs. Depending on the number of results on your category pages and your pagination configuration, the

Disallow: /*?

line will block Googlebot from crawling anything but the first page of a category list, because every subsequent page is reached through a query string (e.g. ?p=2).
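To see what that pattern actually catches, here is a minimal sketch of Googlebot-style wildcard matching, assuming the documented semantics ('*' matches any run of characters, a trailing '$' anchors the end of the URL). The store paths are hypothetical examples, and this is an illustration, not a full robots.txt parser:

import re

def rule_matches(rule, url_path):
    # Translate a robots.txt rule into a regex: escape everything,
    # then let '*' match any run of characters and honor a trailing
    # '$' as an end-of-URL anchor.
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.match(pattern, url_path) is not None

# Hypothetical store paths -- only the bare category URL survives /*?
for path in ("/shop/dresses", "/shop/dresses?p=2", "/shop/dresses?dir=asc"):
    print(path, "->", "blocked" if rule_matches("/*?", path) else "allowed")

Running this shows page 2 and every sorted view blocked, which is exactly how a category's deeper pages fall out of the crawl.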

Also, I would not block access to /media/, or at least not to /media/catalog/ or /media/wysiwyg/. Images have value too!
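If you want to keep the broad /media/ block for the rest of that directory, one option is to re-open just the image folders with Allow rules. Googlebot applies the most specific (longest) matching rule, so the longer Allow paths win over the shorter Disallow; the exact folders depend on your Magento setup:

User-agent: *
Allow: /media/catalog/
Allow: /media/wysiwyg/
Disallow: /media/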