AWS ELB Latency issue

amazon-ec2 amazon-elb amazon-rds amazon-web-services

I have two c3.2xlarge EC2 instances running Ubuntu, both in the us-west-2a AZ. Both run the same code and use a MySQL database on AWS RDS (db.r3.2xlarge). Both instances are attached to an ELB, and each has one cron job scheduled to run twice a day.

The ELB is configured to raise an alarm once the threshold crosses 5.0. CPU utilization on both instances averages 30 to 50%. At peak hours it hits 100% for a minute or two and then returns to normal. Still, the ELB alarm fires three times a day. At those times, both instances show:

CPU     - ~50%
Memory  - total - 14979
          used  - ~6000
          free  - ~9000
RDS CPU - ~30%
          Connections - 200 to 300 /5,000

I went through https://aws.amazon.com/premiumsupport/knowledge-center/elb-latency-troubleshooting/ and could find nothing wrong with the instances. Still, latency peaks and both instances fail to respond.

So far, my workaround is to remove one instance from the load balancer, restart Apache, add the instance back, and then do the same for the other instance. This works, and the instances and the ELB behave well for the next 6-10 hours. But it is not acceptable, since someone has to tend to the servers and restart them two or three times every day.

I need to know whether anything is wrong here, and what steps I can take to resolve the problem.

[Graph: Latency]

[Graph: Memory]

Apache server-status shows a large number of entries like this one (~200-250 processes):

7-0 23176   1/2373/5118 C   30.95   3986    0   0.0 7.01    15.78   127.0.0.1   ip-xxx-xxx-xxx-xxx.us-west-2.comp   OPTIONS * HTTP/1.0
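A row like the one above can be decoded field by field. The following is a minimal sketch, assuming the classic mod_status extended-table column order (Srv, PID, Acc, M, CPU, SS, Req, Conn, Child, Slot, Client, VHost, Request); newer Apache releases add columns, so the list may need adjusting. The names `parse_status_row` and `summarize` are hypothetical helpers, not part of Apache.

```python
from collections import Counter

# Column order assumed from the classic mod_status extended table;
# adjust this list if your Apache version prints additional columns.
COLUMNS = ["srv", "pid", "acc", "mode", "cpu", "ss", "req",
           "conn", "child", "slot", "client", "vhost"]

# Meaning of the single-letter worker modes, per the mod_status legend.
MODES = {
    "_": "waiting for connection",
    "S": "starting up",
    "R": "reading request",
    "W": "sending reply",
    "K": "keepalive (read)",
    "D": "DNS lookup",
    "C": "closing connection",
    "L": "logging",
    "G": "gracefully finishing",
    "I": "idle cleanup of worker",
    ".": "open slot with no current process",
}


def parse_status_row(line):
    """Split one whitespace-separated server-status row into named fields."""
    parts = line.split()
    if len(parts) <= len(COLUMNS):
        raise ValueError("row has fewer columns than expected")
    row = dict(zip(COLUMNS, parts))
    # Everything after the vhost column is the request line.
    row["request"] = " ".join(parts[len(COLUMNS):])
    row["ss"] = int(row["ss"])  # seconds since the most recent request began
    return row


def summarize(lines):
    """Count rows per (mode description, request) pair."""
    counts = Counter()
    for line in lines:
        row = parse_status_row(line)
        counts[(MODES.get(row["mode"], row["mode"]), row["request"])] += 1
    return counts


if __name__ == "__main__":
    sample = ("7-0 23176 1/2373/5118 C 30.95 3986 0 0.0 7.01 15.78 "
              "127.0.0.1 ip-xxx-xxx-xxx-xxx.us-west-2.comp OPTIONS * HTTP/1.0")
    row = parse_status_row(sample)
    print(row["mode"], "=", MODES[row["mode"]], "| SS =", row["ss"])
    print(summarize([sample]))
```

Read this way, the sample row is a worker in mode C (closing connection) whose most recent request began 3986 seconds earlier. The request `OPTIONS * HTTP/1.0` from 127.0.0.1 has the form Apache uses for its internal dummy connections, which suggests looking at the worker and MPM state rather than at client traffic.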

Best Answer

CPU utilization (%) is not the key metric; what matters is the CPU load average (run queue), along with networking metrics, Apache metrics, buffers, and so on. Load balancers are very simple devices. When an architecture that includes a load balancer has problems, they are usually caused not by the ELB but by how the rest of the stack behaves.

To find where the problem is, you must go through the following steps:

  • Check whether Apache responds to local requests. If it does not, the problem is NOT the ELB.
  • Check the states of the Apache workers (e.g. with mod_status) and tune the MPM settings accordingly.
  • Check the CPU load average. If the load average grows above the CPU count and iowait grows too, you have an I/O problem.
  • Check whether connection persistence (sticky sessions) is enabled and whether it is really required, i.e. whether you actually use sessions on the web servers that need to reach the same web instance.
  • Check the keepalive settings for Apache; disable keepalive or set a very low timeout value.
  • Check whether iptables is enabled on the instance and, if so, whether the nf_conntrack_max kernel parameter is set high enough relative to the current nf_conntrack_count. If you don't need iptables, disable it and do not load the modules at all.
  • Stress test single instances with HTTP requests (hint: ab, jmeter).
  • Check and tune kernel parameters accordingly:

    net.core.wmem_max
    net.core.rmem_max
    net.core.netdev_max_backlog
    
    net.core.somaxconn
    net.ipv4.tcp_rmem
    net.ipv4.tcp_wmem
    net.ipv4.tcp_no_metrics_save
    net.ipv4.tcp_timestamps
    net.ipv4.tcp_fin_timeout
    net.ipv4.tcp_max_tw_buckets
    net.ipv4.tcp_tw_recycle
    net.ipv4.tcp_synack_retries
    net.ipv4.tcp_keepalive_time
    
    net.netfilter.nf_conntrack_acct
    net.netfilter.nf_conntrack_generic_timeout
    net.netfilter.nf_conntrack_tcp_timeout_syn_sent
    net.netfilter.nf_conntrack_tcp_timeout_syn_recv
    net.netfilter.nf_conntrack_tcp_timeout_established
    net.netfilter.nf_conntrack_tcp_timeout_fin_wait
    net.netfilter.nf_conntrack_tcp_timeout_close_wait
    net.netfilter.nf_conntrack_tcp_timeout_last_ack
    net.netfilter.nf_conntrack_tcp_timeout_time_wait
    net.netfilter.nf_conntrack_tcp_timeout_close
    net.netfilter.nf_conntrack_tcp_timeout_max_retrans
    net.netfilter.nf_conntrack_tcp_timeout_unacknowledged
    net.netfilter.nf_conntrack_icmp_timeout
    net.netfilter.nf_conntrack_events_retry_timeout
    net.ipv4.netfilter.ip_conntrack_generic_timeout
    net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent
    net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_sent2
    net.ipv4.netfilter.ip_conntrack_tcp_timeout_syn_recv
    net.ipv4.netfilter.ip_conntrack_tcp_timeout_fin_wait
    net.ipv4.netfilter.ip_conntrack_tcp_timeout_close_wait
    net.ipv4.netfilter.ip_conntrack_tcp_timeout_last_ack
    net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait
    net.ipv4.netfilter.ip_conntrack_tcp_timeout_close
    net.ipv4.netfilter.ip_conntrack_tcp_timeout_max_retrans
    net.ipv4.netfilter.ip_conntrack_icmp_timeout
    net.netfilter.nf_conntrack_tcp_loose
    net.netfilter.nf_conntrack_max
    net.nf_conntrack_max
    net.netfilter.nf_conntrack_count
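Before tuning any of these, it helps to see what the instance currently has. The sketch below reads a few of the listed parameters and performs two of the checks from the steps above: load average against CPU count, and conntrack table usage. It assumes a Linux host where sysctl names map to files under /proc/sys with dots replaced by slashes; the helper names are hypothetical, and the conntrack entries only exist when the conntrack module is loaded.

```python
import os
from pathlib import Path

PROC_SYS = Path("/proc/sys")


def sysctl_path(name, root=PROC_SYS):
    """Map a sysctl name such as net.core.somaxconn to its /proc/sys file."""
    return Path(root) / name.replace(".", "/")


def read_sysctl(name, root=PROC_SYS):
    """Return the raw value of a sysctl, or None if it is not present
    (for example when the conntrack module is not loaded)."""
    try:
        return sysctl_path(name, root).read_text().strip()
    except OSError:
        return None


def conntrack_usage(count, maximum):
    """Fraction of the connection-tracking table currently in use."""
    if maximum <= 0:
        raise ValueError("maximum must be positive")
    return count / maximum


def load_exceeds_cpus(load_avg, cpu_count):
    """True when the load average is above the number of CPUs."""
    return load_avg > cpu_count


if __name__ == "__main__":
    for name in ("net.core.somaxconn", "net.ipv4.tcp_fin_timeout",
                 "net.ipv4.tcp_keepalive_time"):
        print(name, "=", read_sysctl(name))

    count = read_sysctl("net.netfilter.nf_conntrack_count")
    maximum = read_sysctl("net.netfilter.nf_conntrack_max")
    if count is not None and maximum is not None:
        usage = conntrack_usage(int(count), int(maximum))
        print("conntrack table usage: {:.1%}".format(usage))
    else:
        print("conntrack sysctls not present (module probably not loaded)")

    load1 = os.getloadavg()[0]
    cpus = os.cpu_count() or 1
    print("load1 =", load1, "cpus =", cpus,
          "above CPU count =", load_exceeds_cpus(load1, cpus))
```

Running this during a latency spike and again during a quiet period gives a quick comparison; a conntrack usage close to 100% or a load average well above the CPU count points at the instance, not the ELB.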
    

Is Apache still not responding after that? Then it's not the fault of the ELB at all.
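The first step, checking Apache locally, can be scripted so it is easy to run on the instance while the alarm is firing. This is a minimal sketch, assuming Apache listens on http://127.0.0.1/ (adjust the URL and port to your setup); `time_local_request` is a hypothetical helper, and the `opener` parameter exists only so the function can be exercised without a running server.

```python
import time
import urllib.error
import urllib.request


def time_local_request(url="http://127.0.0.1/", timeout=5.0,
                       opener=urllib.request.urlopen):
    """Fetch url directly (bypassing the load balancer) and time it.

    Returns (status, seconds). status is the HTTP status code, or None
    if the server refused the connection or did not answer in time.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            response.read()
            status = response.status
    except urllib.error.HTTPError as err:
        status = err.code  # the server answered, just not with a 2xx code
    except OSError:
        status = None      # refused, reset, or timed out
    return status, time.monotonic() - start


if __name__ == "__main__":
    status, seconds = time_local_request()
    if status is None:
        print("no answer from local Apache after {:.2f}s".format(seconds))
    else:
        print("local Apache answered {} in {:.2f}s".format(status, seconds))
```

If the local request is slow or gets no answer at the same time the ELB reports high latency, the bottleneck is on the instance.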
