Magento – How to solve 502 Bad Gateway with SOAP calls on nginx

nginx, server-setup, soap, web-services

A Magento store, running on AWS with nginx as the webserver, has been working in production for a number of weeks. We have an integration with a shipping provider who uses SOAP for the standard fulfillment integration "stuff". This has also been working for two weeks.

Yesterday, our client moved the site from AWS to Rackspace. This was a full move of the site, including the database. Everything seems to be working via web browser (able to browse, checkout, log in to backend, etc). However, now when SOAP calls are made to the server a 502 Bad Gateway error happens. The exact symptoms are as follows:

It happens regardless of whether the v1 or v2 API is used.
When making a login request, $client->login('user','blah');, the service takes a very long time to respond.
Eventually (say, after 15-30 seconds?) a 502 Bad Gateway response comes back.

I've looked at the logs, and see a number of things:

First, from nginx access logs:

<ip omitted> - - [20/May/2014:16:38:46 +0000] "POST /index.php/api/soap/index/ HTTP/1.1" 502 166 "-" "PHP-SOAP/5.4.10" "-"

Then, from the nginx error logs:

2014/05/20 17:04:44 [error] 3297#0: *17 recv() failed (104: Connection reset by peer) while reading response header from upstream, client: <ip omitted>, server: store.mydomain.com, request: "POST /index.php/api/index/index/ HTTP/1.1", upstream: "fastcgi://127.0.0.1:9000", host: "store.mydomain.com"

And finally, from the fastcgi log:

[20-May-2014 17:02:27] WARNING: [pool mydomain.com] child 23853 exited on signal 11 (SIGSEGV - core dumped) after 13474.896102 seconds from start
[20-May-2014 17:02:27] NOTICE: [pool mydomain.com] child 3321 started
[20-May-2014 17:03:58] WARNING: [pool mydomain.com] child 23854 exited on signal 11 (SIGSEGV - core dumped) after 13565.467026 seconds from start
[20-May-2014 17:03:58] NOTICE: [pool mydomain.com] child 3327 started
[20-May-2014 17:04:44] WARNING: [pool mydomain.com] child 23856 exited on signal 11 (SIGSEGV - core dumped) after 13611.667319 seconds from start
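As a rough aid for reading a log like this one, here is a minimal sketch that counts SIGSEGV child exits per pool in php-fpm log text. The function name countSegfaultsByPool and the regular expression are illustrative assumptions based only on the line format shown here; other php-fpm versions may word these lines differently.

```php
<?php
/**
 * Count php-fpm children that exited on SIGSEGV, grouped by pool name.
 * Hypothetical helper: the pattern is derived from the log lines quoted
 * in this question, not from any php-fpm specification.
 *
 * @param string $logText Raw log text (one or many entries)
 * @return array Map of pool name => number of SIGSEGV exits
 */
function countSegfaultsByPool($logText)
{
    $counts = array();
    $pattern = '/\[pool ([^\]]+)\] child \d+ exited on signal 11 \(SIGSEGV/';

    if (preg_match_all($pattern, $logText, $matches)) {
        foreach ($matches[1] as $pool) {
            $counts[$pool] = isset($counts[$pool]) ? $counts[$pool] + 1 : 1;
        }
    }

    return $counts;
}

// Example using two entries copied from the log in this question.
$sample = "[20-May-2014 17:02:27] WARNING: [pool mydomain.com] child 23853 exited on signal 11 (SIGSEGV - core dumped) after 13474.896102 seconds from start\n"
        . "[20-May-2014 17:02:27] NOTICE: [pool mydomain.com] child 3321 started\n";
print_r(countSegfaultsByPool($sample));
```

A count that keeps rising while SOAP requests are being made would suggest that the PHP worker is crashing on those requests rather than merely timing out.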

I've done a fair amount of googling, but haven't found anything concrete. I'm not enough of an nginx/php/fastcgi/whatever guy to be able to look at these logs and say "oh, that's it!", but I'm hoping somebody else might be able to.

Note that I can log in to the admin fine, and again, in general the site works as expected.

Thanks for any help!

Best Answer

Well, in this case the problem turned out to be one of name resolution: the server was unable to resolve "itself." Sadly, I didn't find this result when searching, but Alan Storm (the god of Magento himself) mentioned this as a problem.

So, in short, read that post for more info, but we were able to fix this issue by adding a host entry on the server:

<server's IP address> storedomainname.com

After this, things were working once again.
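To check whether such an entry is already present, something like the following sketch may help. It parses hosts-file content and returns the IP address mapped to a hostname. The function name findHostsEntry is hypothetical, and the sample address 203.0.113.10 is a documentation placeholder, not the real server address.

```php
<?php
/**
 * Look up a hostname in hosts-file content.
 * Hypothetical helper for illustration; it handles comments, tabs or
 * spaces between fields, and several names on one line.
 *
 * @param string $hostsContent Text in /etc/hosts format
 * @param string $hostname     Name to look for (compared case-insensitively)
 * @return string|null The mapped IP address, or null if there is no entry
 */
function findHostsEntry($hostsContent, $hostname)
{
    foreach (preg_split('/\R/', $hostsContent) as $line) {
        // Drop trailing comments and surrounding whitespace.
        $line = trim(preg_replace('/#.*$/', '', $line));
        if ($line === '') {
            continue;
        }

        // First field is the address, the remaining fields are names.
        $fields = preg_split('/\s+/', $line);
        $ip = array_shift($fields);
        foreach ($fields as $name) {
            if (strcasecmp($name, $hostname) === 0) {
                return $ip;
            }
        }
    }

    return null;
}

// Example with placeholder content; 203.0.113.10 is not a real server address.
$sampleHosts = "127.0.0.1 localhost\n203.0.113.10 storedomainname.com\n";
var_dump(findHostsEntry($sampleHosts, 'storedomainname.com'));
```

On the server itself the content would typically come from file_get_contents('/etc/hosts'), assuming the usual Linux location. As a second check through the system resolver, gethostbyname($hostname) returns the hostname unchanged when resolution fails.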

Related Solutions

Magento Broken: Fix Call to getCode() on Boolean & No 404 CMS Page Found

Replication

Check out my repository: https://github.com/convenient/magento-ce-ee-config-corruption-bug

It contains a file, 100-router-script.php, which should help you replicate the error when you run it in the root of your website.

The Patch

Magento have released a patch (SUPEE-4755) based on my write up.

PATCH_SUPEE-4755_EE_1.13.1.0_v1.sh

The patch is identical to my fix, which lends the fix some validity.

This patch isn't publicly listed anywhere, because for some reason Magento only list security patches. I know the patch file says EE_1.13.1.0, but I tested it on Community Edition 1.9 and it applied fine.

You can grab a copy of the patch here: https://github.com/convenient/magento-ce-ee-config-corruption-bug/blob/master/PATCH_SUPEE-4755_EE_1.13.1.0_v1.sh

And for historical reasons, or in case my GitHub account ever gets deleted, here is the fix:

/**
 * Initialization of core configuration
 *
 * @return Mage_Core_Model_Config
 */
public function init($options=array())
{
    $this->setCacheChecksum(null);
    $this->_cacheLoadedSections = array();
    $this->setOptions($options);
    $this->loadBase();

    $cacheLoad = $this->loadModulesCache();
    if ($cacheLoad) {
        return $this;
    }
    //100 Router Fix Start
    $this->_useCache = false;
    //100 Router Fix End
    $this->loadModules();
    $this->loadDb();
    $this->saveCache();
    return $this;
}

If this doesn't fix your Magento instance

This patch file will fix all reasonably vanilla Magento instances (anything that reinitialises the configuration cache using the defined init or reinit methods).

If you have any code which calls the loadDb function, like n98-magerun, then you're likely to have a bad time and should probably consider refactoring your code to call reinit or init on Mage_Core_Model_Config.

If you still have trouble, I recommend you read my write-up and:

Implement the logic mentioned in the Debugging the Issue section.
Wait for some log data to appear.
Contact me!

A theory on why SUPEE-6788 makes the error worse

I've just posted this in the chat room and thought I'd pop it here for posterity.

Something to remember is that Magento has ALWAYS had this bug in its background, and that SUPEE-6788 has just made it more likely to happen.

I just had a look at the SUPEE-6788 patch for EE 1.14. This is my working theory.

Mage_Core_Controller_Varien_Router_Admin was patched to override the addModule function.

Now, I can't quite recall the order in which things occur during Magento's request initialisation, but this modification adds two calls to Mage::getConfig()->getNode() on every single request, quite early on in the flow. If there is an object cache failure involving the useCache parameter during either of these two calls, it could potentially trigger an incomplete configuration being written to the cache. (getNode can trigger a reinit: it tries to load values from the cache, and on a failed load it will reinitialise the entire config cache and re-save it via the reinit() function.)

So perhaps just adding these two extra getNode calls to a very early part of the request, when the full configuration may not yet be loaded, is what made this error more likely. Either way, I wouldn't sweat a full understanding of the issue. It'd be a lot of work for not a lot of gain, and looking into Mage_Core_Model_Config::loadDb (with a focus on what effect the useCache flag may have on it) could drive a man insane.

Magento – 502 bad gateway nginx server

Please try the following:

Clean the whole HHVM compilation cache (I don't know exactly how it works, but as far as I know the HHVM magic is based on compilation).
If it still happens, look for any layout XML (especially local.xml in your theme), *.phtml file, config.xml or attribute using that helper.
If you can, remove the entry with the path advanced/modules_disable_output/[that_module] from the core_config_data table.

If the problem persists, turn developer mode on or look at exception.log to find the complete error stack.
