Mysql – monitoring/troubleshooting aws rds high cpu usage

amazon-rdsamazon-web-servicesmonitoringMySQL

I have an RDS instance running and it uses an average of 20-30% cpu utilization. Last night it spiked up to 80% for a few minutes and I am trying to figure out why. Error logs show nothing, and I have no setup for any other kind of log on parameter groups (just default).
I tried running

show full processlist;

but I don't know if some specific process triggered at the time of the spike.

Best Answer

Having no audit logs reduces significantly the potential of a forensic analysis, and the issue is so common and generic that there is no rule-of-thumb that can help. Your only choice in this situation is to evaluate the other available RDS metrics and correlate the information with the applications that make use of your DB instance.

First of all, you should try to match the incident with the usage of your applications or the activity of interactive human users. Typically, these kind of incidents are derived of a spike in client application's load or users doing things without full knowledge of their impact.

Do you experience any DBConnections spike during the incidence? If so, does it make sense if such spike in connections come from an unexpected traffic (supposing that your DB is the data source of an exposed web application)? If that is the case, most surely your issue is related to security in the frontend layer and not in the DB itself.

Does the Read or Write IO increase during the incident? If so, that may imply that memory resources are insufficient to handle your application load during certain conditions. Also, it could mean that you have missing indexes or inefficient queries that load your instance unnecesarily. The MySQL EXPLAIN command should help here.

I would recommend you to ensure the performance_schema is enabled and deploy the sys schema to enable the MySQL Workbench performance tools, it helps a lot to detect schema and query bottlenecks.

Also, the Audit Log available through RDS OptionGroups and relaying it to Cloudwatch Logs (with a proper retention period) could help a lot in the future to investigate what the hell your applications (or users) are doing with the database. "show processlist" is only useful when used at the moment of the incident as it is a runtime tool.

Related Topic