Web Crawler Creating Many Sessions – Solutions and Fixes

Tags: cache, crawler, robots.txt, session

I recently noticed that the core_session table is growing very fast (+400 MB). I'm quite sure this is abnormal.

  • I truncated it, but after a couple of days it was very big again.
  • If I switch to the file system backend, the problem persists.
  • I found that I get a new session about every second.

Test:

Sessions behave correctly on our staging server (protected by .htaccess), where I ran some tests:

  • Everything works as expected: one visit produces one session.

Cause:
On the live site, the Online Customers grid in the Magento backend shows about 500 customers (I think each row represents an open session).

Eureka: bots and crawlers do not store cookies, so every time they access the site they produce a new session.
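
The effect can be reproduced with a small simulation. This is only a simplified model of cookie-based session lookup, not Magento's or PHP's actual code; the function name resolveSessionId() is made up for illustration, and the cookie name frontend is assumed to be the store's session cookie. It needs PHP 7 or later for random_bytes().

```php
<?php
// Simplified model: how a server picks the session for an incoming request.

/**
 * Return the session ID used for a request.
 * $cookies stands in for the cookies the client sent back.
 */
function resolveSessionId(array $cookies): string
{
    if (isset($cookies['frontend'])) {
        return $cookies['frontend'];   // returning client: reuse its session
    }
    return bin2hex(random_bytes(16));  // no cookie: a brand-new session
}

// A browser stores the cookie and sends it back on the next request.
$first  = resolveSessionId([]);
$second = resolveSessionId(['frontend' => $first]);

// A crawler never sends the cookie back, so every hit creates a session.
$botHits = [resolveSessionId([]), resolveSessionId([]), resolveSessionId([])];

var_dump($first === $second);              // bool(true): one session for the browser
var_dump(count(array_unique($botHits)));   // int(3): three hits, three sessions
```

A crawler that requests a few hundred pages therefore leaves a few hundred rows in core_session.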

Questions:

  • Is it normal for each of those bots to make 100 to 200 requests, and so create as many sessions?
  • How can I limit the resulting session growth?

Best Answer

BOT & SESSION
I'm not sure bots need sessions at all, and the answer may differ from bot to bot. What is certain is that they don't store cookies, so an old session cannot be reused by matching the session ID stored in a client cookie.

How other e-commerce platforms fix this
The Prestashop folks resolved the problem this way:

  • Assign bots their old session, using the IP address to find it.

Unfortunately, Magento doesn't store the IP address along with the session ID. I guess this is because Magento supports several session backends (files, DB, Redis).

How I fixed it
There is a similar, older question: "Magento generating aprox 20 session files per minute".
I used the information from that question to build a solution that applies the Prestashop approach while staying compatible with every Magento session backend.

All the Magento session magic happens here: app/code/core/Mage/Core/Model/Session/Abstract/Varien.php
This is an abstract class that all the other session models extend, so you cannot simply rewrite it; you can only copy it to app/code/local/Mage/Core/Model/Session/Abstract/Varien.php.

The code below does three things:

  1. It adds a more comprehensive list of bots.
  2. It checks whether the client is a bot.
  3. If so, it derives the session ID from the client's IP address and forces that ID.
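
The three steps above can be tried outside Magento with a minimal standalone sketch. The function names looksLikeBot() and botSessionIdFor(), the shortened pattern, and the example IP address are illustrative assumptions, not part of Magento; the ID format follows the one used in this answer. It needs PHP 7.1 or later for the nullable type.

```php
<?php
// Standalone sketch of the three steps; names and the short pattern are illustrative only.

/** Steps 1 and 2: match the user agent against a (shortened) bot list. */
function looksLikeBot(?string $userAgent): bool
{
    // A missing user agent is treated as a bot as well.
    if ($userAgent === null || $userAgent === '') {
        return true;
    }
    return preg_match('/googlebot|bingbot|YandexBot|AhrefsBot/i', $userAgent) === 1;
}

/** Step 3: derive a stable session ID from the client IP. */
function botSessionIdFor(string $ip, bool $secure): string
{
    return 'bot' . (int) $secure . sha1($ip);
}

$ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
$ip = '203.0.113.7'; // example address from a documentation range

// Two separate requests from the same bot and IP.
$idA = looksLikeBot($ua) ? botSessionIdFor($ip, false) : null;
$idB = looksLikeBot($ua) ? botSessionIdFor($ip, false) : null;

var_dump($idA === $idB);   // bool(true): both requests share one session ID
var_dump(looksLikeBot('Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0')); // bool(false)
```

Because the ID depends only on the IP address and the secure flag, it works with any session backend: nothing extra has to be stored to find the old session again.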

In short
Add the following methods to your copy of Mage_Core_Model_Session_Abstract_Varien:

public function isBot()
{
    // Case-insensitive list of known crawler user-agent fragments.
    $bot_regex = '/BotLink|bingbot|AhrefsBot|ahoy|AlkalineBOT|anthill|appie|arale|araneo|AraybOt|ariadne|arks|ATN_Worldwide|Atomz|bbot|Bjaaland|Ukonline|borg\-bot\/0\.9|boxseabot|bspider|calif|christcrawler|CMC\/0\.01|combine|confuzzledbot|CoolBot|cosmos|Internet Cruiser Robot|cusco|cyberspyder|cydralspider|desertrealm, desert realm|digger|DIIbot|grabber|downloadexpress|DragonBot|dwcp|ecollector|ebiness|elfinbot|esculapio|esther|fastcrawler|FDSE|FELIX IDE|ESI|fido|Hämähäkki|KIT\-Fireball|fouineur|Freecrawl|gammaSpider|gazz|gcreep|golem|googlebot|griffon|Gromit|gulliver|gulper|hambot|havIndex|hotwired|htdig|iajabot|INGRID\/0\.1|Informant|InfoSpiders|inspectorwww|irobot|Iron33|JBot|jcrawler|Teoma|Jeeves|jobo|image\.kapsi\.net|KDD\-Explorer|ko_yappo_robot|label\-grabber|larbin|legs|Linkidator|linkwalker|Lockon|logo_gif_crawler|marvin|mattie|mediafox|MerzScope|NEC\-MeshExplorer|MindCrawler|udmsearch|moget|Motor|msnbot|muncher|muninn|MuscatFerret|MwdSearch|sharp\-info\-agent|WebMechanic|NetScoop|newscan\-online|ObjectsSearch|Occam|Orbsearch\/1\.0|packrat|pageboy|ParaSite|patric|pegasus|perlcrawler|phpdig|piltdownman|Pimptrain|pjspider|PlumtreeWebAccessor|PortalBSpider|psbot|Getterrobo\-Plus|Raven|RHCS|RixBot|roadrunner|Robbie|robi|RoboCrawl|robofox|Scooter|Search\-AU|searchprocess|Senrigan|Shagseeker|sift|SimBot|Site Valet|skymob|SLCrawler\/2\.0|slurp|ESI|snooper|solbot|speedy|spider_monkey|SpiderBot\/1\.0|spiderline|nil|suke|http:\/\/www\.sygol\.com|tach_bw|TechBOT|templeton|titin|topiclink|UdmSearch|urlck|Valkyrie libwww\-perl|verticrawl|Victoria|void\-bot|Voyager|VWbot_K|crawlpaper|wapspider|WebBandit\/1\.0|webcatcher|T\-H\-U\-N\-D\-E\-R\-S\-T\-O\-N\-E|WebMoose|webquest|webreaper|webs|webspider|WebWalker|wget|winona|whowhere|wlm|WOLP|WWWC|none|XGET|Nederland\.zoek|AISearchBot|woriobot|NetSeer|Nutch|YandexBot|YandexMobileBot|SemrushBot|FatBot|MJ12bot|DotBot|AddThis|baiduspider|m2e/i';
    // Requests without a user agent are treated as bots too.
    $userAgent = empty($_SERVER['HTTP_USER_AGENT']) ? false : $_SERVER['HTTP_USER_AGENT'];

    return !$userAgent || preg_match($bot_regex, $userAgent);
}

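public function getBotSessionId()
{
    // Keep HTTP and HTTPS sessions apart: 0 for plain requests, 1 for secure ones.
    $s = (int)Mage::app()->getFrontController()->getRequest()->isSecure();
    $botIp = $_SERVER['REMOTE_ADDR'];
    // Same IP, same ID: repeated hits from one bot reuse one session.
    return 'bot' . $s . sha1($botIp);
}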

Then change setSessionId() in this way:

public function setSessionId($id = null)
{
    if (!is_null($id) && preg_match('#^[0-9a-zA-Z,-]+$#', $id)) {
        session_id($id);
    } elseif ($this->isBot()) {
        // No valid ID was passed in and the client is a bot: force the IP-based ID.
        $this->setSessionId($this->getBotSessionId());
    }
    return $this;
}

You may also want to change what happens when a bot arrives over HTTPS. Inside start(), find the call to session_regenerate_id(false) and wrap it so that bots keep their fixed ID:

if (!$this->isBot()) {
    session_regenerate_id(false);
}

Note
The bot session ID will be a little longer than a default one because of the sha1() function, whose

returned value is a 40-character hexadecimal number

This should not be a problem, since PHP allows session IDs of up to 128 characters.
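
The length is easy to check. The sketch below builds an ID in the same format as getBotSessionId() for a non-secure request, using an example IP address, and also confirms that it passes the character check in setSessionId().

```php
<?php
// Length check for a bot session ID; the IP is an example value.
// 'bot' (3 characters) + secure flag (1) + SHA-1 hex digest (40) = 44 characters.
$id = 'bot' . '0' . sha1('203.0.113.7');

var_dump(strlen($id));                                  // int(44)
var_dump(preg_match('#^[0-9a-zA-Z,-]+$#', $id) === 1);  // bool(true)
```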