MySQL innodb has lots of processes hanging in “update” state once every minute

database-performanceinnodbMySQLperformance

Description of problem

I have a fairly large MySQL installation. There are at least 3 separate servers which run the innoDB storage engine. Once every minute, at the same time each minute, for about 3-4 seconds, each of my innodb machines suddenly stop performing well.

When I do a SHOW PROCESSLIST on each server at normal times, I see about 10-15 connections, doing their thing as normal:

+--------+------------------+--------------------------+------+---------+------+-------+------------------+
| Id     | User             | Host                     | db   | Command | Time | State | Info             |
+--------+------------------+--------------------------+------+---------+------+-------+------------------+
|  23457 | root             | localhost                | NULL | Query   |    0 | NULL  | show processlist | 
| 180042 | **********       | web2.***.com:49867       | ***  | Sleep   |    1 |       | NULL             | 
| 180129 | **********       | web1.***.com:54302       | ***  | Sleep   |    0 |       | NULL             | 
| 180155 | **********       | web2.***.com:50225       | ***  | Sleep   |    0 |       | NULL             | 
| 180163 | **********       | web1.***.com:54425       | ***  | Sleep   |    0 |       | NULL             | 
| 180172 | **********       | web1.***.com:54507       | ***  | Sleep   |    0 |       | NULL             | 
| 180181 | **********       | web4.***.com:34893       | ***  | Sleep   |    0 |       | NULL             | 
+--------+------------+--------------------------+------+---------+------+-------+------------------------+

Then all of a sudden, almost exactly in sync on each machine, at the same time each minute (meaning on :47 seconds after the minute each minute on each machine), processes will pile up hanging in the "update" state:

| 192938 |  **********       | web3.***.com:44248              | ***  | Query   |    3 | update | INSERT INTO user_stats (***_uid, data) VALUES (101670151,"{\"inbox\":{\"new\":12,\"spam_check\":1289 | 
| 192939 |  **********       | web4.***.com:50264              | ***  | Query   |    3 | update | INSERT INTO user_stats (***_uid, data) VALUES (17103785,"{\"inbox\":{\"new\":1,\"spam_check\":0,\"di | 
| 192940 |  **********       | web3.***.com:44258              | ***  | Query   |    3 | update | INSERT INTO user_stats (***_uid, data) VALUES (2245293,"{\"inbox\":{\"new\":14,\"spam_check\":128933 | 
| 192941 |  **********       | web3.***.com:44268              | ***  | Query   |    3 | update | INSERT INTO user_stats (***_uid, data) VALUES (105330063,"{\"inbox\":{\"new\":4,\"spam_check\":0,\"d | 
... 100-200 more just like this...
| 192941 |  **********       | web3.***.com:44268              | ***  | Query   |    3 | update | INSERT INTO user_stats (***_uid, data) VALUES (105330063,"{\"inbox\":{\"new\":4,\"spam_check\":0,\"d |

Upon closer inspection, it seems like it's high CPU usage at that moment (although I suppose high CPU could be caused by high disk I/O), because when it's in the midst of this, and I run something simple like SELECT NOW(), even that will take like 4 seconds to complete.

Here's what I know:

It isn't a rogue unoptimized query. Happens on different machines that not only run different queries, but also different tables.
It only happen on machines that do writing to innoDB tables. This does not happen on machines that only read innoDB, or machines that only write or read from MyISAM.

Questions

Is there a process that runs every minute on innoDB that takes up lots of CPU or disk I/O? Is this normal? I know that it could be a million different things, but I'm looking for known problems or solutions. Is there any more info I can provide to help solve this issue?

additional info

OS:

uname -a
Linux db04.****.com 2.6.18-194.17.4.el5 #1 SMP Wed Oct 20 13:03:08 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

File system:

/dev/sda4     ext3   785711096  80539996 665259216  11% /data

Raid config:

/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -aALL


Adapter 0 -- Virtual Drive Information:
Virtual Disk: 0 (target id: 0)
Name:Virtual Disk 0
RAID Level: Primary-1, Secondary-3, RAID Level Qualifier-0
Size:856704MB
State: Optimal
Stripe Size: 64kB
Number Of Drives:2
Span Depth:3
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default

MySQL Version

mysql> select version();
+---------------------------+
| version()                 |
+---------------------------+
| 5.0.80-enterprise-gpl-log | 
+---------------------------+
1 row in set (0.01 sec)

Best Answer

Are you sure you don't have a cron job running every minute?

What is your value for innodb_flush_method?

Since you have a RAID Controller with BBU-backed write cache (and data/logs not stored on a SAN) the recommended setting is: O_DIRECT

You might also want to use a tool like innotop to analyze your load better. Especially Pending I/O.

HTH

Edit: What's your value for innodb_buffer_pool_size as well?

Related Solutions

Mysql – Lots of NULL processes in MySQL

Check the values of your mysql server. Are you using default users in mysql? I'd also take a look at two things:

Be sure that only specific servers can connect to the mysql database. If you have a user that is connecting from % be sure that you have a good password (and good reason).
Be sure that someone isn't sending queries to your application outside of the code requirements that you have in place to run the application. Someone may be injecting data into your database without even visiting your site.

Keep track of you active connections for a while by paying close attention to netstat. The wait_timeout is an indication that someone/something has queried your server. Keeping close track will help you discover who they are. Once you do that if you need to quickly block them without making drastic changes, use iptables.

One last thing... enable mysql logging briefly. Your server will take a hit but you'll see exactly what's going on with mysql.

edit the my.cnf file and add (or update) this:

log             = /var/log/mysql/mysql.log

EDIT: I made the preceding suggestions because the NULL value in the info field indicates that no statements are being executed.

Mysql – Why do MySQL queries pile up in “Sending data” state

Well, do note that if I recall well (it's been a while since I did DB work) COUNT(*) queries without a WHERE clause on innodb tables are notoriously slower than on MyISAM and Memory tables.

Also, is this by any chance a Xen DomU?

What is the frontend language? If PHP, is it using MySQL or MySQLi? Are they using persistent connections?

You have not mentionned the underlying operating system, but in the case of Linux I would start by staring at the output of free -m, paying special attention to the last two lines to see if memory is tight overall.

[0:504] callisto:cyanotype $ free -m
             total       used       free     shared    buffers     cached
Mem:          3961       3816        144          0        184       1454
-/+ buffers/cache:       2177       1784
Swap:         2898          0       2898

Here we have a system that's healthy (it is my workstation). The second column excludes buffers and cache, so I am in fact using 2177mb of memory, and have 1784 megabytes readily available.

Last line shows that I don't use swap at all so far.

Then giving vmstat(8), to see if your system is trashing like mad would be useful, too.

[0:505] callisto:cyanotype $ vmstat 5 10
procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 0  0      0 134116 189828 1499948    0    0    11     3   44   49  1  1 98  0
 0  0      0 143112 189836 1489688    0    0     0     6  526 2177  1  1 98  0
 0  0      0 139268 190504 1491864    0    0   512     4  663 4704  2  1 96  1
 2  0      0 136688 191084 1493484    0    0   473     5  641 3039  1  1 97  1
 0  0      0  52636 191712 1518620    0    0  5066     4 1321 6600  8  2 86  4
 5  0      0  72992 193264 1377324    0    0 10742    31 1602 7441 12  3 80  5
 2  1      0  84036 193896 1202012    0    0 10126    43 2621 4305 31  2 57 10
 3  0      0  42456 195812 1060904    0    0  3970    75 55327 9806 43 5 41 10
 8  1      0  34620 197040 942940     0    0  3554    64 50892 12531 43 6 44 6
^C
[0:506] callisto:cyanotype $

(My desktop really isn't doing all that much here, sorry. What a waste of 8 perfectly good cores)

If you see a lot of process spending time in the 'b' column, that means they are blocked, waiting for something. Often that is IO. The important columns here are si and so. Check if they're populated with high values. If so, this may be your problem -- something is consuming a lot of memory, more than you can actually affort. Using top(4) and ordering the columns by memory % (shift+m while in top) might show the culprit(s).

It's not impossible that your system is trashing to and from swap, and saturating the disks, causing blocked threads and processes.The tool iostat(8)(part of package sysstat, usually) should be given a whirl to see if you have processes that are blocked, stuck on IO_WAIT. A saturated disk can spell bad news for the whole system under high load, especially if the system is swapping a lot.

You might run iostat with extended stats, every five seconds, for instance:

[0:508] callisto:cyanotype $ iostat -x 5
Linux 2.6.35-23-generic (callisto)  2010-11-30  _x86_64_    (8 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          16,55    0,12    2,70    2,60    0,00   78,02

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm      %util
sdc               0,00     2,00    1,00    0,80    27,20    22,40    27,56     0,01    3,33   3,33       0,60
sdd               0,00    12,60   67,60    4,80  4222,40   139,20    60,24     0,62    8,62   3,29      23,80
sde               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00       0,00
sdf               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          32,02    0,10    1,83    0,44    0,00   65,61

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               0,60     3,20   11,00    0,80   265,60    32,00    25,22     0,05    3,90   2,88   3,40
sdd               0,00     8,20    0,00    3,00     0,00    89,60    29,87     0,02    8,00   7,33   2,20
sde               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00
sdf               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          49,26    0,22    3,12    0,12    0,00   47,28

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               6,20     3,00    7,40    3,80   208,00    54,40    23,43     0,09    7,86   2,50   2,80
sdd               0,00    15,20    0,20    4,00     1,60   152,00    36,57     0,03    6,67   6,19   2,60
sde               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00
sdf               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          16,00    0,54    1,05    1,07    0,00   81,35

Device:         rrqm/s   wrqm/s     r/s     w/s   rsec/s   wsec/s avgrq-sz avgqu-sz   await  svctm  %util
sdc               4,20     0,00   31,40    0,00  3204,80     0,00   102,06     0,17    4,90   2,68   8,40
sdd               0,00    28,20    0,20    2,60     1,60   246,40    88,57     0,02    7,14   7,14   2,00
sde               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00
sdf               0,00     0,00    0,00    0,00     0,00     0,00     0,00     0,00    0,00   0,00   0,00

^C

This should allow you to easily see if your volumes are being saturated. For instance here, you can see that my disks are terribly underutilized, that the system spends most of its cpu cycles idling, etc etc. If that percentage is mostly in the % IOWAIT column, well you have an IO bottleneck here. You probably already know all this, but just covering all the bases to make sure.

The idea is that your config file changed, and you have no history of it (putting your config files under version control is a great idea for that very reason) -- and it is not impossible the size of a buffer suddendly changed thus making expensive queries like COUNT(*) without SELECT suddendly start to gobble up ressources.

Based on what you have learned from the previous use of the tools abive -- you should probably inspect the configuration file (being the only thing that changed, it is very likely the culprit) to see if the buffer values are sane for your average load.

How large are the buffers, like the query_cache_size value, and especially the sort_buffer sizes? (If that doesn't fit in memory, it will performed on disk, at a massive cost as I'm sure you can imagine).

How large is the innodb_buffer_pool_size?

How large is the table_cacheand most importantly, does that value fits within the system limits for file handles? (both open-files-limit in [mysqld] and at the OS level).

Also, I don't remember off the top of my head if this is still true, but I'm fairly certain that innodb actually locks the entire table whenever it has to commit an auto-increment fields. I googled and I could not find if that was still true or not.

You could also use innotop(1) to see what's going on more in detail, too.

I hope this helps somehow or gives you a starting point :)

Best Answer

Related Solutions

Mysql – Lots of NULL processes in MySQL

Mysql – Why do MySQL queries pile up in “Sending data” state

Related Topic