My guess is that you have some long-running queries in your application. When they execute, they hold connections checked out of the pool for a long time (relative to the usual usage pattern). This causes the pool to become exhausted and grow up to its maximum, at which point any remaining workers block waiting for connections to be released.
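If your application happened to be in Python using SQLAlchemy, for example, the behavior above maps onto the pool settings below (the DSN and numbers are hypothetical; most connection pools have equivalent knobs):

    from sqlalchemy import create_engine

    # Hypothetical DSN; the numbers illustrate the failure mode above.
    engine = create_engine(
        "mysql+pymysql://app:secret@db.example.com/orders",
        pool_size=5,      # connections kept open under normal load
        max_overflow=10,  # the pool may "grow" to 5 + 10 = 15 in total
        pool_timeout=30,  # seconds a worker blocks waiting for a release
    )

A handful of slow queries pinning connections for tens of seconds each is enough to walk the pool up to pool_size + max_overflow, after which every new request waits out the full pool_timeout.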
The first thing to do is track down when this happens; that is, is it a cyclical event or random? If it's the former you're in luck, as you can be ready when it happens. If you can't determine a pattern, you'll have to be vigilant.
You may be able to figure this out by looking at your website monitoring logs, or at sar output from your database server, to see if there are any correlated spikes.
If you can catch your database while it's under load, execute the following commands on the MySQL server:
    SHOW ENGINE INNODB STATUS;
    SHOW PROCESSLIST;
The former prints diagnostic information about the InnoDB engine (you are using InnoDB, right?); the latter prints the first 100 characters of each currently executing query (use SHOW FULL PROCESSLIST for the full text). Look for queries that have been running for a long time, queries generating temporary tables on disk, and queries that are blocked on a resource.
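If you would rather watch for offenders programmatically, the same data is exposed through INFORMATION_SCHEMA.PROCESSLIST. A rough sketch using PyMySQL (credentials are hypothetical, and the 10-second threshold is arbitrary):

    import pymysql

    conn = pymysql.connect(host="db.example.com", user="app",
                           password="secret", database="orders")
    with conn.cursor() as cur:
        # Flag anything that has been running for more than 10 seconds.
        cur.execute(
            "SELECT id, user, time, state, info "
            "FROM information_schema.processlist "
            "WHERE command != 'Sleep' AND time > 10 "
            "ORDER BY time DESC")
        for pid, user, secs, state, query in cur.fetchall():
            print(f"{pid} {user} {secs}s [{state}] {query}")
    conn.close()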
After that, the hard work begins. Use EXPLAIN to estimate the cost of each suspect query and the resources it uses. Avoid queries that require sorting on disk via a temporary table. Look for long-running reporting jobs, or other scheduled maintenance tasks that periodically lock or saturate your database. It could be something as simple as the backup task, or a job that rolls up old purchase order data.
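As a rough example of putting EXPLAIN to work (the query and credentials are hypothetical; substitute a statement copied from the processlist or the slow log):

    import pymysql

    conn = pymysql.connect(host="db.example.com", user="app",
                           password="secret", database="orders")
    with conn.cursor() as cur:
        cur.execute("EXPLAIN SELECT * FROM purchase_orders "
                    "WHERE created_at < NOW() - INTERVAL 90 DAY")
        for row in cur.fetchall():
            print(row)
    conn.close()

In the output, type = ALL means a full table scan, and "Using temporary; Using filesort" in the Extra column marks exactly the disk-sorting queries warned about above.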
I recommend having these three settings in your /etc/my.cnf:

    slow_query_log = 1
    log_queries_not_using_indexes = 1
    long_query_time = 1
For a web application doing 20-30 requests per second, you can't afford to have anything show up in these logs.
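On MySQL 5.1 and later you can also turn these on at runtime (requires the SUPER privilege), which is handy when you are trying to catch an incident in progress. A quick sketch, with hypothetical connection details:

    import pymysql

    conn = pymysql.connect(host="db.example.com", user="admin",
                           password="secret")
    with conn.cursor() as cur:
        # slow_query_log takes effect immediately; long_query_time
        # applies to connections opened after the change.
        cur.execute("SET GLOBAL slow_query_log = 'ON'")
        cur.execute("SET GLOBAL long_query_time = 1")
        cur.execute("SET GLOBAL log_queries_not_using_indexes = 'ON'")
    conn.close()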
By the way, IMHO it's pointless to increase your connection pool's size beyond its original size, as this will delay the onset of pool exhaustion by at best a few seconds, and only put more pressure on your database right when it doesn't need it.
The AWS RDS max_connections limit is based on instance type, so you can upgrade your RDS instance or add more read replicas.
RDS instance types and their max_connections limits:
- t2.micro 66
- t2.small 150
- m3.medium 296
- t2.medium 312
- m3.large 609
- t2.large 648
- m4.large 648
- m3.xlarge 1237
- r3.large 1258
- m4.xlarge 1320
- m2.xlarge 1412
- m3.2xlarge 2492
- r3.xlarge 2540
These figures come from max_connections at AWS RDS MySQL Instance Sizes (2015).
Update 2017-07
The current RDS MySQL max_connections setting defaults to {DBInstanceClassMemory/12582880}. For an instance class with 512MB of memory, that works out to (512 × 1024 × 1024) / 12582880 ≈ 42, and so on.
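A quick back-of-the-envelope helper for that formula. Note that DBInstanceClassMemory is the memory actually available to the instance, somewhat less than its nominal RAM, so the real defaults come out a bit lower than this naive calculation:

    # Naive estimate of the RDS MySQL default:
    #   max_connections = DBInstanceClassMemory / 12582880
    def estimated_max_connections(ram_gib):
        return int(ram_gib * 1024**3 / 12582880)

    for ram_gib in (0.5, 1, 2, 8):
        print(ram_gib, "GiB ->", estimated_max_connections(ram_gib))
    # 0.5 GiB -> 42, 1 GiB -> 85, 2 GiB -> 170, 8 GiB -> 682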
Each web server can hold many connections to RDS; how many depends on the SQL load coming from that web server.
Best Answer
http://www.mysqlperformanceblog.com/2010/09/10/cache-miss-storm/
As it turns out, our issue was a cache miss storm, a.k.a. a cache stampede.
We solved it by implementing a 50% cache expiry. Basically, for every item in memcache we create a second cache item with a similar key plus an appended 'regenerate' string. This item expires at exactly 50% of the typical cache expiry time; its absence signals that the cache is getting close to expiring and that the next request should try to regenerate it.
This prevents a storm of users trying to regenerate the cache at the same time and gives our cache the best chance of always staying fresh!
A tricky one to track down!
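For reference, a minimal sketch of that scheme in Python with the python-memcached client. The key names and TTL are illustrative, and the add() call is one way to guarantee only a single request wins the regeneration; the original answer does not specify how ties are broken:

    import memcache

    mc = memcache.Client(["127.0.0.1:11211"])
    TTL = 600  # full cache lifetime in seconds (illustrative)

    def cache_set(key, value, ttl=TTL):
        # Store the real item plus a marker that lapses at 50% of the TTL.
        mc.set(key, value, time=ttl)
        mc.set(key + ":regenerate", 1, time=ttl // 2)

    def cache_get(key, regenerate_fn, ttl=TTL):
        value = mc.get(key)
        if value is None:
            # Hard miss: the item fully expired, so rebuild synchronously.
            value = regenerate_fn()
            cache_set(key, value, ttl)
            return value
        # Soft expiry: once the marker lapses, add() succeeds for exactly
        # one request, which rebuilds the item while every other request
        # keeps serving the still-valid cached value.
        if mc.add(key + ":regenerate", 1, time=ttl // 2):
            value = regenerate_fn()
            mc.set(key, value, time=ttl)
        return value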