Nginx – TTFB Longer on Google Cloud than AWS with same config

amazon-web-services google-cloud-platform nginx performance ubuntu-14.04

Long time lurker, first time poster on the StackExchange network.

I have been racking my brain on this for the past couple of days and have read a lot of threads and tried a lot of things, but have seen absolutely no movement. Here's the situation:

I set up a free tier EC2 box as a staging machine on Ubuntu 14.04 with LEMP, specifically Nginx and PHP7 with WordPress. We've built the site and everything works fine. It has a TTFB of 505 milliseconds.

I set up a free tier Cloud Compute box as a production machine with the same configuration, but the TTFB is 14 seconds. Note that the specs on the Google box are slightly better than those on EC2: g1-small (1 vCPU, 1.7 GB memory) vs. t2.micro (1 vCPU, 1 GB memory). SSDs on both sides.

I've tried a lot of things, including many of the methods for diagnosing TTFB issues that are in this answer: https://serverfault.com/a/350422. I've also tried upping the memory for PHP and WordPress, and installing packages and plugins that people have suggested for performance optimization. I'm specifically struggling with the fact that my staging instance needed none of this. Before you ask "why don't you just do it all on AWS," know that using Google Cloud Compute for production is a requirement for this project.

In the discussions I've had with people, someone said that Google Cloud doesn't have output flushing to serve content before the response is completed. Can anyone confirm?

Thanks in advance for any insight you can offer in pointing me in the right direction to solve this. Also, let me know any other details I can offer to make this easier to solve.


EDIT ONE: Answering questions below. Thank you so much for offering to help.

Are the two servers in the same geographic location? How big is the difference?

The EC2 instance is in US East (Virginia).
The GCE instance is in us-west1-a, which I believe is in Oregon.
I am in New York City, so not enough of a geographic difference to justify a 14 second TTFB. Also, tools that I've used have geographic positions such as Dallas (a reasonable mid-point between the two cities) and report the 14s TTFB as well.

> First thing to work out is where the delay is. Please edit your question to include a curl of the Google website, the matching Nginx access log entry, any matching Nginx error log entry, and the PHP access / error logs. PHP access logs need to be enabled. Also watch "top" while the curl is happening, and take a representative screenshot. Finally, please share webpagetest.org tests of both environments to demonstrate the problem, obfuscated if your domain names are secret – web crawlers find all domains anyway. – Tim 3 hours ago

  • cURL

    time_namelookup: 0.000
    time_connect: 0.078
    time_appconnect: 0.000
    time_pretransfer: 0.078
    time_redirect: 0.000
    time_starttransfer: 13.469
    time_total: 13.469
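
Reading these numbers: the connection is ready after 0.078 s and the first byte arrives at 13.469 s, so nearly all of the delay is spent waiting on the server rather than on DNS or the network. A minimal Python sketch of that subtraction follows. The helper names `parse_timings` and `breakdown` are made up for illustration; the field names are curl's `--write-out` variables, and the values are the ones reported above.

```python
import re

# Timings as reported by curl's --write-out variables, in seconds.
CURL_OUTPUT = """
time_namelookup: 0.000
time_connect: 0.078
time_appconnect: 0.000
time_pretransfer: 0.078
time_redirect: 0.000
time_starttransfer: 13.469
time_total: 13.469
"""

def parse_timings(text):
    """Return {name: seconds} for every 'time_xxx: value' line.

    Trailing non-numeric characters after the value are ignored.
    """
    pattern = re.compile(r"^\s*(time_\w+):\s*([0-9.]+)", re.MULTILINE)
    return {name: float(value) for name, value in pattern.findall(text)}

def breakdown(t):
    """Split the cumulative curl timestamps into per-phase durations."""
    return {
        "dns": t["time_namelookup"],
        "tcp_connect": t["time_connect"] - t["time_namelookup"],
        # Time between the request being ready to send and the first byte.
        "server_wait": t["time_starttransfer"] - t["time_pretransfer"],
        "download": t["time_total"] - t["time_starttransfer"],
    }

if __name__ == "__main__":
    for phase, seconds in breakdown(parse_timings(CURL_OUTPUT)).items():
        print(f"{phase:12s} {seconds:7.3f} s")
```

The `server_wait` phase covers everything the server does before it starts responding, so it is the number to compare between the two instances.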

I don't have enough reputation points to post an image or a link to another image, but when I run top, php-fpm7.0 and mysqld show up at the top of the list:

      PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
    24625 www-data  20   0  374680  43352  30376 S   0.7  2.5   0:00.90 php-fpm7.0
    21244 mysql     20   0  870388  71088  11012 S   0.3  4.1   0:10.57 mysqld

  • Nginx Access log entry

    [07/May/2017:06:01:58 +0000] "GET / HTTP/1.1" 200 58528 "-" "curl/7.35.0"

  • Nginx Error log entry

Nothing new here with that request

  • PHP Access log entry

Nothing in the PHP-FPM logs.

  • PHP Error log entry

Nothing from this request, but I did do a browser request prior to it and here's what is in the slow logging (I put the window at 5 seconds):

[07-May-2017 06:01:48]  [pool www] pid 24625
script_filename = /var/www/html/wordpress/index.php
[0x00007fad55e12810] mysqli_real_connect() /var/www/html/wordpress/wp-includes/wp-db.php:1540
[0x00007fad55e126f0] db_connect() /var/www/html/wordpress/wp-includes/wp-db.php:658
[0x00007fad55e12620] __construct() /var/www/html/wordpress/wp-content/themes/xxxx/inc/artist-products.php:7
[0x00007fad55e12590] edb_db_init() /var/www/html/wordpress/wp-content/themes/xxxx/inc/db/items.php:258
[0x00007fad55e124d0] edb_get_product_link() /var/www/html/wordpress/wp-content/themes/xxxx/inc/artist-products.php:23
[0x00007fad55e123c0] edb_display_frontpage_items() /var/www/html/wordpress/wp-content/themes/xxxx/page-templates/home.php:95
[0x00007fad55e121e0] [INCLUDE_OR_EVAL]() /var/www/html/wordpress/wp-includes/template-loader.php:74
[0x00007fad55e12140] [INCLUDE_OR_EVAL]() /var/www/html/wordpress/wp-blog-header.php:19
[0x00007fad55e120a0] [INCLUDE_OR_EVAL]() /var/www/html/wordpress/index.php:17

  • EC2 Timeline

https://www.screencast.com/t/vxHTCcyf

  • GCE Timeline

https://www.screencast.com/t/22OXgA7T

> The most obvious thing to try would be to write a barebones hello world PHP script and measure the TTFB of that, if you haven't. If it's anywhere near 14s then you have a problem that has nothing to do with WordPress or the database. "Google Cloud doesn't have output flushing" applies to Google App Engine HTTP responses, which are returned en bloc, so this is not applicable to compute instances.

Yep, I've done this. Hello World and other static files respond quickly. PHPInfo also responds quickly. If I recall correctly, all simple and static files were around 700 ms TTFB. Thanks for the clarification on the output flushing only being relevant to App Engine. At least I know there is a simple solution that I'm just missing.

The only thing I've been seeing that gives me anything to work with is the PHP slow logging.

[07-May-2017 00:56:39]  [pool www] pid 24793
script_filename = /var/www/html/wordpress/index.php
[0x00007fad55e14810] mysqli_real_connect() /var/www/html/wordpress/wp-includes/wp-db.php:1540
[0x00007fad55e146f0] db_connect() /var/www/html/wordpress/wp-includes/wp-db.php:658
[0x00007fad55e14620] __construct() /var/www/html/wordpress/wp-content/themes/xxxx/inc/artist-products.php:7
[0x00007fad55e14590] edb_db_init() /var/www/html/wordpress/wp-content/themes/xxxx/inc/db/items.php:258
[0x00007fad55e144d0] edb_get_product_link() /var/www/html/wordpress/wp-content/themes/xxxx/inc/artist-products$
[0x00007fad55e143c0] edb_display_frontpage_items() /var/www/html/wordpress/wp-content/themes/xxxx/page-templat$
[0x00007fad55e141e0] [INCLUDE_OR_EVAL]() /var/www/html/wordpress/wp-includes/template-loader.php:74
[0x00007fad55e14140] [INCLUDE_OR_EVAL]() /var/www/html/wordpress/wp-blog-header.php:19
[0x00007fad55e140a0] [INCLUDE_OR_EVAL]() /var/www/html/wordpress/index.php:1

This made me think that the database queries are slow, but I installed the query monitoring plugin for WP and all of the queries show as executing quickly as well.
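
One way to read the slow log: the first frame is the call that was executing when PHP-FPM took the snapshot. Here that is `mysqli_real_connect()`, which opens a database connection rather than running a query, and it is reached from the theme's `artist-products.php`. That might be why a query monitor reports nothing slow. Below is a minimal Python sketch for pulling those two facts out of a trace, assuming the frame layout shown in the log above; `parse_frames` and `first_theme_frame` are hypothetical helper names.

```python
import re

# The complete trace from the slow log entry at 06:01:48.
SLOWLOG = """\
[07-May-2017 06:01:48]  [pool www] pid 24625
script_filename = /var/www/html/wordpress/index.php
[0x00007fad55e12810] mysqli_real_connect() /var/www/html/wordpress/wp-includes/wp-db.php:1540
[0x00007fad55e126f0] db_connect() /var/www/html/wordpress/wp-includes/wp-db.php:658
[0x00007fad55e12620] __construct() /var/www/html/wordpress/wp-content/themes/xxxx/inc/artist-products.php:7
[0x00007fad55e12590] edb_db_init() /var/www/html/wordpress/wp-content/themes/xxxx/inc/db/items.php:258
[0x00007fad55e124d0] edb_get_product_link() /var/www/html/wordpress/wp-content/themes/xxxx/inc/artist-products.php:23
[0x00007fad55e123c0] edb_display_frontpage_items() /var/www/html/wordpress/wp-content/themes/xxxx/page-templates/home.php:95
[0x00007fad55e121e0] [INCLUDE_OR_EVAL]() /var/www/html/wordpress/wp-includes/template-loader.php:74
[0x00007fad55e12140] [INCLUDE_OR_EVAL]() /var/www/html/wordpress/wp-blog-header.php:19
[0x00007fad55e120a0] [INCLUDE_OR_EVAL]() /var/www/html/wordpress/index.php:17
"""

# One stack frame: "[0xADDR] function() /path/to/file.php:LINE"
FRAME = re.compile(r"^\[0x[0-9a-f]+\]\s+(\S+)\(\)\s+(\S+):(\d+)$")

def parse_frames(text):
    """Return (function, file, line) tuples, innermost call first."""
    frames = []
    for line in text.splitlines():
        match = FRAME.match(line.strip())
        if match:
            frames.append((match.group(1), match.group(2), int(match.group(3))))
    return frames

def first_theme_frame(frames):
    """Return the innermost frame that lives in theme code, or None."""
    for frame in frames:
        if "/wp-content/themes/" in frame[1]:
            return frame
    return None

if __name__ == "__main__":
    frames = parse_frames(SLOWLOG)
    print("stalled in:", frames[0])
    print("called from theme:", first_theme_frame(frames))
```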

Best Answer

The most likely cause for the delay is the underlying infrastructure difference between AWS and GCP - in the cloud world, spec-sheets aren't always a true reflection of real-world performance, and using the cheapest set of instance types is never a good idea if you want to benchmark performance.

Almost every performance angle in GCP is linked to the number of CPUs - 2 Gbps of network throughput per CPU core, for example. Also, on Google Cloud, a g1-small is a shared-core machine. From Google's documentation:

Shared-core machine types provide one virtual CPU that is allowed to run for a portion of the time on a single hardware hyper-thread on the host CPU running your instance.

On AWS, a t2.micro is classed as burstable - which on the face of it seems similar to Google's shared-core model but is subtly different:

T2 instances’ baseline performance and ability to burst are governed by CPU Credits. Each T2 instance receives CPU Credits continuously, the rate of which depends on the instance size. T2 instances accrue CPU Credits when they are idle, and use CPU credits when they are active. A CPU Credit provides the performance of a full CPU core for one minute.
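
To make the credit model concrete, here is a rough Python sketch of the arithmetic implied by "one credit = one full core for one minute". The earn rate of 6 credits per hour for a t2.micro and the starting balance of 30 credits are assumptions based on figures AWS has published for that instance type, not something stated in the quote above, so check the current documentation before relying on them. The function names are made up for illustration.

```python
# Assumed figures for a t2.micro; verify against current AWS documentation.
T2_MICRO_CREDITS_PER_HOUR = 6
ASSUMED_STARTING_BALANCE = 30

def baseline_cpu_fraction(credits_per_hour):
    """Fraction of one core that can be sustained indefinitely.

    One credit buys one core-minute, so N credits per hour
    sustain N/60 of a core.
    """
    return credits_per_hour / 60.0

def burst_minutes(credit_balance, cpu_fraction=1.0, credits_per_hour=0.0):
    """Minutes a balance lasts at a given utilisation while credits accrue."""
    spend_per_minute = cpu_fraction
    earn_per_minute = credits_per_hour / 60.0
    net_spend = spend_per_minute - earn_per_minute
    if net_spend <= 0:
        # At or below baseline the balance never runs out.
        return float("inf")
    return credit_balance / net_spend

if __name__ == "__main__":
    print("baseline:", baseline_cpu_fraction(T2_MICRO_CREDITS_PER_HOUR))
    print("full-core burst (min):",
          burst_minutes(ASSUMED_STARTING_BALANCE, 1.0, T2_MICRO_CREDITS_PER_HOUR))
```

Under these assumptions, a lightly loaded staging box can run short requests at full core speed for a long time, whereas a shared-core instance is only ever scheduled for a portion of the time.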

In order to perform a true like-for-like comparison, I would strongly suggest using higher-performing instance types on both sides to guarantee resource availability, for example an Amazon m3.medium and Google n1-standard-1 - both providing 1 CPU and 3.75GB of RAM.

To remove geography from your list of possible culprits, I would also locate your AWS instance in us-west-2, which is also in Oregon - it may be further from your location, but at least you'll be testing two instances which are a similar distance away, rather than one near and one far.

I would also try re-creating the instances multiple times to guard against "noisy neighbours" and hardware quirks - if you're using configuration management (and if you're running in the cloud, you should be) this should be a doddle.