AWS ELB Latency issue

I have two c3.2xlarge EC2 instances running Ubuntu, both in the us-west-2a AZ. Both run the same code against a MySQL database on AWS RDS (db.r3.2xlarge). Both instances are attached to an ELB, and each has one cron job scheduled to run twice a day.

The ELB alarm is configured to fire once the threshold crosses 5.0. CPU utilization on both instances averages 30 to 50%. At peak hours it hits 100% for a minute or two and then returns to normal. Yet the ELB alarm fires three times a day. When it does, both instances show:

CPU     - ~50%
Memory  - total - 14979
          used  - ~6000
          free  - ~9000
RDS CPU - ~30%
          Connections - 200 to 300 /5,000

According to https://aws.amazon.com/premiumsupport/knowledge-center/elb-latency-troubleshooting/ I could find nothing wrong with the instances. Still, latency peaks and both instances fail to respond.

So far, my workaround is to remove one instance from the load balancer, restart Apache, add it back, and then do the same for the other instance. This works, and the instances and the ELB behave well for the next 6-10 hours. But it is not acceptable, since someone has to look after the servers and restart them two or three times every day.

I need to know whether anything is wrong and what steps I can take to resolve this problem.

Apache server-status contains many entries like this one (~200/250 processes):

7-0 23176   1/2373/5118 C   30.95   3986    0   0.0 7.01    15.78   127.0.0.1   ip-xxx-xxx-xxx-xxx.us-west-2.comp   OPTIONS * HTTP/1.0
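For reading output like the row above: in mod_status the "C" in the mode column means the worker is closing a connection, and `OPTIONS * HTTP/1.0` from 127.0.0.1 is the internal dummy request Apache sends to itself to wake up child processes. A quick way to see how many workers sit in each state is to tally the scoreboard. The following is only a sketch, assuming mod_status is enabled and you have the text of the machine-readable `server-status?auto` page; the function name and the sample text are made up for illustration.

```python
from collections import Counter

# Scoreboard key as documented for Apache mod_status.
SCOREBOARD_KEY = {
    "_": "waiting for connection",
    "S": "starting up",
    "R": "reading request",
    "W": "sending reply",
    "K": "keepalive (read)",
    "D": "DNS lookup",
    "C": "closing connection",
    "L": "logging",
    "G": "gracefully finishing",
    "I": "idle cleanup of worker",
    ".": "open slot with no current process",
}


def tally_scoreboard(status_text):
    """Count worker states from the text of a server-status?auto page."""
    for line in status_text.splitlines():
        if line.startswith("Scoreboard:"):
            board = line.split(":", 1)[1].strip()
            counts = Counter(board)
            # Unknown characters are kept under their raw symbol.
            return {SCOREBOARD_KEY.get(ch, ch): n for ch, n in counts.items()}
    return {}


# Hypothetical sample, NOT taken from the servers in the question.
sample = "BusyWorkers: 5\nIdleWorkers: 2\nScoreboard: CCCW_K_..."
```

Calling `tally_scoreboard(sample)` returns a dictionary of state names to counts. If most slots are stuck in one state when latency spikes, that state is usually the place to start looking.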

net.core.wmem_max
net.core.rmem_max
net.core.netdev_max_backlog
net.core.somaxconn
net.ipv4.tcp_rmem
net.ipv4.tcp_wmem
net.ipv4.tcp_no_metrics_save
net.ipv4.tcp_timestamps
net.ipv4.tcp_fin_timeout
net.ipv4.tcp_max_tw_buckets
net.ipv4.tcp_tw_recycle
net.ipv4.tcp_synack_retries
net.ipv4.tcp_keepalive_time
net.netfilter.nf_conntrack_acct
net.netfilter.nf_conntrack_generic_timeout
net.netfilter.nf_conntrack_tcp_timeout_syn_sent
net.netfilter.nf_conntrack_tcp_timeout_syn_recv
net.netfilter.nf_conntrack_tcp_timeout_established
net.netfilter.nf_conntrack_tcp_timeout_fin_wait
net.netfilter.nf_conntrack_tcp_timeout_close_wait
net.netfilter.nf_conntrack_tcp_timeout_last_ack
net.netfilter.nf_conntrack_tcp_timeout_time_wait
net.netfilter.nf_conntrack_tcp_timeout_close
net.netfilter.nf_conntrack_tcp_timeout_max_retrans
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged
net.netfilter.nf_conntrack_icmp_timeout
net.netfilter.nf_conntrack_events_retry_timeout
net.ipv4.netfilter.ip_conntrack_generic_timeout
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent2
net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_recv
net.ipv4.netfilter.ip_conntrack_tcp_timeout_fin_wait
net.ipv4.netfilter.ip_conntrack_tcp_timeout_close_wait
net.ipv4.netfilter.ip_conntrack_tcp_timeout_last_ack
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait
net.ipv4.netfilter.ip_conntrack_tcp_timeout_close
net.ipv4.netfilter.ip_conntrack_tcp_timeout_max_retrans
net.ipv4.netfilter.ip_conntrack_icmp_timeout
net.netfilter.nf_conntrack_tcp_loose
net.netfilter.nf_conntrack_max
net.nf_conntrack_max
net.netfilter.nf_conntrack_count

Best Answer

CPU utilization (%) is not the key metric; what matters is the CPU load average (run queue), networking metrics, Apache metrics, buffers, etc. Load balancers are very simple devices. When a load balancer is part of the architecture, problems are usually not caused by the ELB itself but by how the rest of the stack behaves.

To find where the problem is, you must go through the following steps:

Check if Apache responds to local requests. If it does not, the problem is NOT the ELB.
Check the states of the Apache workers (e.g. with mod_status) and tune the MPM settings accordingly.
Check the CPU load average. If the load average grows above the CPU count and iowait grows too, you have trouble with IO.
Check if connection persistence is enabled and whether it is really required, i.e. whether you use sessions on the web servers that need access to the same web instance.
Check the keepalive settings for Apache; disable keepalive or set a very low timeout value.
Check if iptables is enabled on the instance and, if so, whether nf_conntrack_max is set high enough for the nf_conntrack_count you observe. If you don't need iptables, disable it and do not load the modules at all.
Stress test single instances with HTTP requests (hint: ab, jmeter).
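The load average step in the list above can be sketched in a few lines. This is an illustration only, assuming a Unix host where `os.getloadavg()` is available; the function names are invented here and the 1.0 threshold is just the "load above CPU count" rule of thumb from the list, not an AWS or Apache limit.

```python
import os


def load_per_cpu(load_average, cpu_count):
    """Run-queue length per CPU; above 1.0 means tasks are queuing."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_average / cpu_count


def is_overloaded(load_average, cpu_count):
    """True when the load average exceeds the number of CPUs."""
    return load_per_cpu(load_average, cpu_count) > 1.0


def current_load_report():
    """Check the 1-, 5- and 15-minute load averages of this machine."""
    cpus = os.cpu_count() or 1
    windows = ("1min", "5min", "15min")
    return {w: is_overloaded(avg, cpus) for w, avg in zip(windows, os.getloadavg())}
```

For example, a c3.2xlarge has 8 vCPUs, so `is_overloaded(12.0, 8)` is true while `is_overloaded(4.0, 8)` is not. On Linux the load average also counts tasks blocked on IO, which is why a high load with only ~50% CPU utilization points toward IO rather than compute.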

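Check and tune the kernel parameters accordingly; the relevant net.core, net.ipv4 and conntrack keys appear to be the ones listed above, just before this answer.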
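Before tuning anything it helps to see the current values. A minimal sketch, assuming a Linux host where sysctl keys are exposed as files under /proc/sys; the helper names are hypothetical, and keys that do not exist on a given kernel (for instance net.ipv4.tcp_tw_recycle, which was removed in Linux 4.12, or the conntrack keys when the module is not loaded) simply come back as None.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under /proc/sys."""
    return Path(root).joinpath(*name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the current value as a string, or None if the key is absent."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


def conntrack_usage(count, maximum):
    """Fraction of the connection-tracking table currently in use."""
    return count / maximum
```

Reading `net.netfilter.nf_conntrack_count` and `net.netfilter.nf_conntrack_max` this way and passing them (as integers) to `conntrack_usage` shows how close the table is to full. A table that fills up makes the kernel drop new connections, which can look like an unresponsive web server from the load balancer's side.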

If Apache is still not responding after that, it is not the fault of the ELB at all.
