Linux – High CPU usage – symptoms moving from server to server after bouncing

Tags: central-processing-unit, cluster, java, linux, weblogic

First off, I apologize if I didn't include enough information to properly troubleshoot this issue. This sort of thing isn't my specialty, so it is a learning process. If there's something I need to provide, please let me know and I'll be happy to do what I can. The images associated with my question are at the bottom of this post.

We are dealing with a clustered environment of four WebLogic 9.2 Java application servers. The cluster utilizes a round-robin load algorithm. Other details include:

  • Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_12-b04)
  • BEA JRockit(R) (build R27.4.0-90_CR352234-91983-1.5.0_12-20071115-1605-linux-x86_64, compiled mode)

Basically, I started looking at the servers' performance because our customers are seeing lots of lag at various times of the day. Our servers should easily handle the loads they are given, so it's not clear what's going on. Using HP Performance Manager, I generated some graphs that indicate the CPU usage is completely out of whack. It seems that, at any given point, one or more of the servers is running at over 50% CPU. I know that isn't particularly high on its own, but it is a red flag compared to the CPU utilization of the other servers in the WebLogic cluster.

Interesting things to note:

  • The high CPU utilization was occurring only on server02 for several weeks. The server crashed (extremely rare; we are not sure if it's related to this) and upon starting it back up, the CPU utilization was normal on all 4 servers.
  • We restarted all 4 managed servers and the application server (on server01) yesterday, on 2/28. As you can see, server03 and server04 picked up the behavior that was seen on server02 before.
  • The high CPU utilization comes from a Java process owned by the application user (appown).
  • The number of transactions is consistent across all servers. It doesn't seem like any one server is actually handling more than another.

If anyone has any ideas or can at least point me in the right direction, that would be great. Again, please let me know if there is any additional information I should post. Thanks!

[CPU utilization graphs for server03, server02, server01, and server04]

Best Answer

Is the load balancing completely round robin, or is it doing stickiness based on IP or cookie? You could have some kind of user traffic that sticks to one server and moves upon restart - especially if another one of your servers is calling an app on the cluster. So cross-check it against the actual hits each server is receiving.
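
If it helps to quantify that, a rough sketch for counting hits per minute from each managed server's HTTP access log is below. The log path and the common-log-format timestamp field are assumptions - check your 9.2 domain for the actual location and format.

    # Count requests per minute in a WebLogic HTTP access log
    # (path/format assumed; often <domain>/servers/<name>/logs/access.log)
    awk '{ print substr($4, 2, 17) }' access.log | sort | uniq -c | sort -rn | head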

You may also have a race condition in the app where certain operations get it stuck in a loop. For that you could take a thread dump (kill -3 pid), pull it out of your stdout log, and run something like Samurai on it to see what's up.
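
A rough sketch of that workflow (the pid is whatever ps reports for the appown Java process; the intervals are just placeholders):

    # See which native threads in the process are burning CPU
    top -H -p <pid>

    # Request thread dumps; they go to the JVM's stdout, not your shell,
    # so pull them out of the managed server's stdout log afterwards.
    # Several dumps a few seconds apart make a spinning thread easy to spot.
    kill -3 <pid>; sleep 10; kill -3 <pid>; sleep 10; kill -3 <pid>

    # Feed the captured dumps to Samurai (or read them by hand) and look
    # for threads that stay runnable in the same code across every dump.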

I would also turn on garbage collection logging and see if GC times correlate with perceived lag times.
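
For example, on JRockit something along these lines in the managed servers' Java options should produce verbose GC output (the log file name is just an illustration, and it's worth double-checking the exact flags against the JRockit R27 documentation; the HotSpot equivalents are shown for comparison):

    # JRockit: verbose GC output, redirected to a file
    -Xverbose:gc -Xverboselog:gc_server02.log

    # HotSpot equivalent, for reference
    -verbose:gc -Xloggc:gc_server02.log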