Iptables – Long response time from server

apache-2.2 iptables monitoring networking response-time

We have a high-traffic website: at peak it has 1,000 concurrent users, and at minimum 100. On average it gets 40,000 to 100,000 visits a day. The problem is that it sometimes loads very slowly (we call this "disaster time" 🙂). During that time, when we try to load the website in Firefox, it just shows "waiting..." (I tried this from many providers around the world).

We monitor the server during disaster times: CPU load and memory usage are normal. The MySQL slow query log doesn't show any query taking longer than 1 second. Apache doesn't log any errors. iotop doesn't show anything that would cause this disaster.

Interestingly, disaster times and peak times are unrelated. Sometimes a disaster happens at 300 concurrent users, and other times at a completely different number. I can't find any relation between them.

How can I trace the packets at disaster time? I want to know whether this disaster is our data center's fault (such as the upstream or a firewall) or our server's fault (such as the Apache configuration, the web application, or something else I don't know about).

If you need additional data, just add a comment and I will edit my question to provide it.

Best Answer

The number of concurrent users / visits has nothing to do with the capacity/performance of the system - it's all about concurrent connections and what those requests are doing.

Adding request response times to your server log would be a start - if these don't reflect the problem then the problem is likely on the network. I notice you make no reference to your webserver logs in your question - did you check them?
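As a rough sketch of what that could look like: Apache 2.2's mod_log_config supports `%D`, the time taken to serve the request in microseconds. Assuming you append `%D` as the last field of your access log format (the format name `timed` and the log path below are just placeholders, not something from the question), a small filter can pull out the slow requests:

```shell
# Assumed Apache config (names are illustrative):
#   LogFormat "%h %l %u %t \"%r\" %>s %b %D" timed
#   CustomLog /var/log/apache2/access_timed.log timed

# slow_requests MIN_MICROSECONDS < access_log
# Prints every log line whose last field (the %D value) is at least
# MIN_MICROSECONDS.
slow_requests() {
    awk -v min="$1" '($NF + 0) >= (min + 0)'
}

# Example: requests that took one second or longer
#   slow_requests 1000000 < /var/log/apache2/access_timed.log
```

If nothing slow shows up here during a disaster time, the delay is probably happening before the request reaches Apache.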

You consider that you have high traffic volumes, and your question implies you only have a single server. Why? (Multiple servers would add complications to this specific problem, such as load distribution, but would also simplify much of the diagnostics; either way, it's a no-brainer for performance and availability.)

Tracking the number of connections and their state also provides essential data in diagnosing the problem.
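One way to collect that data, assuming a Linux box with `netstat` available (the function name is my own), is to summarise the state column of `netstat -ant` every minute or so and keep the output for comparison with the disaster times:

```shell
# count_states < output of `netstat -ant`
# Prints one "COUNT STATE" line per TCP state, most frequent first.
count_states() {
    # Only rows starting with "tcp" are connections; header lines are
    # skipped. In `netstat -ant` output the state is column 6.
    awk '$1 ~ /^tcp/ { count[$6]++ }
         END { for (s in count) print count[s], s }' | sort -k1,1nr -k2,2
}

# Example:
#   netstat -ant | count_states
```

A pile-up of SYN_RECV or a sudden jump in ESTABLISHED during the slow periods would each point in a different direction.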

How can I trace the packets at disaster time?

With a packet capture program - this can be running anywhere from the client to the server. I use Wireshark (available on Linux, MS Windows and others).
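On a headless server, tcpdump can record the traffic for later inspection in Wireshark. Since you can't predict when a disaster time starts, a rotating capture is one option. This is only a sketch: the interface name, port, and output path are assumptions, and the `TCPDUMP` variable exists only so the command can be swapped out for a dry run.

```shell
# start_capture IFACE PORT OUTFILE
# Runs tcpdump as a ring buffer: -C 100 starts a new file after roughly
# 100 million bytes, and -W 10 keeps at most 10 files, overwriting the
# oldest. -s 0 captures full packets.
start_capture() {
    iface=$1
    port=$2
    outfile=$3
    "${TCPDUMP:-tcpdump}" -i "$iface" -s 0 -C 100 -W 10 \
        -w "$outfile" tcp port "$port"
}

# Example (needs root; eth0 and the path are placeholders):
#   start_capture eth0 80 /var/tmp/web.pcap
```

Stop it with CTRL+C once a disaster time has been caught, then open the files in Wireshark and look for retransmissions or long gaps between the SYN and the first response.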

It would have been useful if you'd mentioned what version/MPM your server is using and what OS it is running on.

Related Solutions

Very long (>300s) request processing time on Apache Server serving static content from particular IP’s

When I find this sort of thing, I first check:

1. DNS. Use a network dump like tcpdump or wireshark to check for this, not just eyeballing the configuration file. If you're certain this is not the issue,
2. What do traceroute / pings look like for those users? Do they all have something in common on their end? I've seen a bad NAT box cause no end of grief. I've also seen traffic local to a user cause my site to appear slower than it did for others without loaded connections, yet they NOTICE mine being slow.
3. Firewall / tunneling. Are they doing something silly like blocking all ICMP? Are they on a tunnel? If yes to both, then chances are it's PMTU discovery timing out in some strange way.
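For the PMTU suspicion, a quick check is to ping with the "don't fragment" bit set and a payload sized to exactly fill the MTU. A minimal sketch, assuming IPv4 and Linux iputils ping (where `-M do` forbids fragmentation); the helper name and hostname are made up:

```shell
# icmp_payload_for_mtu MTU
# Prints the largest ICMP echo payload that fits in one IPv4 packet of
# the given MTU: MTU minus 20 bytes of IP header and 8 bytes of ICMP header.
icmp_payload_for_mtu() {
    echo $(( $1 - 28 ))
}

# Example (client.example.net is a placeholder):
#   ping -M do -c 3 -s "$(icmp_payload_for_mtu 1500)" client.example.net
```

If the full-size ping gets no reply at all while a smaller size works, something on the path is probably dropping the ICMP "fragmentation needed" messages.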

Note that a 300s response time probably means Apache gave up on the request, not that it was served. 5 minutes is a very long time for the server to wait, but it's even more insane for a client to wait so long.

Long waiting times before Apache 2.2 server response (Gentoo LAMP)

Do you know exactly what the apache worker processes are getting hung on? Try this to see:

mkdir -p /strace; ps auxw | grep '[h]ttpd' | awk '{print "-p " $2}' | xargs strace -o /strace/strace.log -ff -s4096 -r

Load a few new (i.e. not locally cached) pages in your browser, press CTRL+C to stop strace, then sort the strace logs by the time recorded for each call:

for f in /strace/strace.log.*; do echo "$f"; awk '{print $1}' "$f" | sort -rn | head; done

View any strace.logs with over 1.0 second calls and search by the time from the output of the previous command. This will point you to the exact step they are getting hung on.
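To save some manual searching, the lookup can be scripted. One thing to keep in mind is that `strace -r` prints, on each line, the time elapsed since the previous call started, so a large value usually implicates the call on the line before it. A small sketch (the function name is mine) that reports those previous calls:

```shell
# slow_calls THRESHOLD_SECONDS < one strace -r log file
# For every line whose relative timestamp is at least THRESHOLD_SECONDS,
# prints that delay, a tab, and the call from the preceding line.
slow_calls() {
    awk -v limit="$1" '
        NR > 1 && ($1 + 0) >= (limit + 0) { print $1 "\t" prev }
        {
            # Remember this call without its leading timestamp.
            prev = $0
            sub(/^[ \t]*[0-9.]+[ \t]+/, "", prev)
        }
    '
}

# Example:
#   for f in /strace/strace.log.*; do echo "$f"; slow_calls 1.0 < "$f"; done
```

If your strace supports it, adding `-T` to the original command also records the time spent inside each call directly.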

Do you by chance have a firewall like CSF installed? I saw this same problem on a VPS. When debugging httpd processes with strace, it was taking up to 5 seconds or more on gettimeofday calls. Strangely, I narrowed this down to CSF, which was trying to filter the venet0 interface, a loopback interface in OpenVZ or Virtuozzo containers. Setting this parameter in /etc/csf/csf.conf mostly fixed it for me:

ETH_DEVICE_SKIP = "venet0,lo"

I say mostly because sometimes there is still a 500-1000 ms wait for connections to establish, but it's a big improvement over 5000+.
