AIX CPU Utilization behavior

aix · central-processing-unit · multi-core

We are facing a strange (for us) situation regarding CPU utilization management. We have an LPAR with a minimum of 2 and up to 4 online cores. While the application workload peaks, CPU usage is 100% (70% user + 30% kernel) and the physical allocation is 2.5 cores. I would expect to see a bigger physical allocation with lower usage. Is this rational? Should we define any threshold?

Regards,

Best Answer

What you describe is normal behavior. For an individual AIX LPAR to obtain more physical processor capacity (above its minimum entitlement), it needs to actually be executing code. It would only seem strange if you had seen an increased load, much higher than 8 in your case.

There are ways to persistently allocate processors to an LPAR, but statically, not dynamically:

  • use dedicated processors, or
  • increase your minimum entitlement (you currently use shared processors with a minimum of 2; you could increase that to 2.5, for example).

There is no setting to ensure that usage never reaches 100% at peak, and for a good reason. Your impression that there is overhead in assigning 2.5, and that this overhead would be alleviated if the LPAR obtained 2.8 (and retained it for a while), is just that: an impression.

In fact, the LPAR obtains processor capacity (above the 2.0 it always gets) at each quantum, and the overhead is constant; it is the same whether the LPAR grows to 2.5 at the first quantum, to 3.1 at the second, or shrinks back to 2.0 at the third. Say we are at the second quantum and the LPAR needs more: it doesn't have to explicitly request anything. If the LPAR is still executing code, the machine's hypervisor implicitly understands that it needs to continue uninterrupted (without being switched out of the processor).

The hypervisor observes the processors and says: "hmmm, this LPAR is still executing code, let's wait and see, I'll give it as much as I can, and I'll throw it out of the processor only when its time ends." It stopped at 3.1 either because the hypervisor forced it out, or because it finished everything and every process went to sleep. If the machine has plenty of free capacity and the LPAR tries to execute code that needs 4.0, it is allowed 4.0 instantly (without any interruption at 2.0 or anywhere else) and runs until it reaches 4.0; only then does the interruption come.
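The quantum-by-quantum grant described above can be sketched as a toy simulation. This is a deliberately simplified model with hypothetical numbers (entitlement 2.0, cap 4.0, and an assumed free pool); the real PowerVM dispatch mechanism is more involved:

```python
# Toy model: each dispatch quantum, an uncapped shared-processor LPAR
# keeps its 2.0 entitlement and is granted extra capacity up to its
# demand, limited by its virtual-processor cap and by whatever the
# shared pool happens to have free. Purely illustrative numbers.
ENTITLEMENT = 2.0
CAP = 4.0

def grant(demand: float, pool_free: float) -> float:
    """Physical processors the LPAR gets this quantum (toy model)."""
    wanted_extra = max(0.0, demand - ENTITLEMENT)
    extra = min(wanted_extra, CAP - ENTITLEMENT, pool_free)
    return ENTITLEMENT + extra

# Demand varies per quantum, as in the example above: 2.5, then 3.1,
# back to 2.0, then a burst that needs the full 4.0.
demands = [2.5, 3.1, 2.0, 4.0]
for q, d in enumerate(demands, 1):
    print(f"quantum {q}: demand {d} -> granted {grant(d, pool_free=2.0)}")
```

Note there is no "request" step anywhere: the grant falls out of the demand still present on the processors each quantum, which is the point of the paragraph above.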

In this example, retaining 3.1 across many quanta would mean wasting your precious machine capacity; if, as a result, you saw 90% usage, that would indicate you are now wasting 10% of your money. Nothing more.

The procedure is not that the LPAR uses 2.0, asks for more, uses 0.1, asks for more, uses another 0.1, and so on. It doesn't work this way. It receives the extra 0.1 without any request, simply because it still occupies the processors with workload; there is no additional overhead.

The 100% usage is very normal.
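For what it's worth, the figures in the question line up if you think in lparstat terms, where consumption is usually reported relative to entitlement (%entc) rather than to the cap. A hedged arithmetic sketch using the question's numbers:

```python
# Figures from the question: minimum/entitled capacity 2.0, up to 4.0
# online virtual processors, 2.5 physical processors consumed at peak.
entitlement = 2.0
cap = 4.0
physc = 2.5

# Consumption relative to entitlement (what lparstat calls %entc):
entc_pct = physc / entitlement * 100
# Headroom still available before the virtual-processor cap:
headroom = cap - physc

print(f"entitlement consumed: {entc_pct:.0f}%")
print(f"headroom to cap: {headroom:.1f} processors")
```

So 100% busy at 2.5 physical processors simply means the LPAR is running 125% of its entitlement with 1.5 processors of headroom left, which is exactly the uncapped behavior described above.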

PS. What's with this word "core"? The thing that executes machine code is called a "processor", and the AIX world correctly uses this terminology. The physical thing that you plug into a socket is a "module".