Linux page cache slows down IO on dual-CPU server with 64 GB RAM

dd, io, linux, memory

I have a massive problem with the Linux page cache, which slows down IO. For instance, if I copy an LVM partition with dd, Linux caches the data in buffers/cache (visible in free -m). That in itself is not the problem, but once the buffer reaches a certain value, the copy process stops or slows down to a few MB/s or even KB/s. I have done many tests writing to disk or to /dev/null; the problem has nothing to do with the source drive or the destination.
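For reference, the kind of copy I am doing looks roughly like this (vg0/lv_data is just a placeholder for the real LVM volume):

    # example only: vg0/lv_data stands in for the real LVM volume being copied
    dd if=/dev/vg0/lv_data of=/dev/null bs=1M
    free -m    # watch buffers/cache grow while dd runs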

In Detail:

  • There are two nearly identical servers. Both run CentOS 6.5 with the same kernel. They have the same disks, the same setup, and the same hardware in every other respect. The only difference is that one server has 2 CPUs and 64 GB RAM and the other has 1 CPU and 32 GB RAM.
  • Here is an image of the copy process described below: http://i.stack.imgur.com/tYlym.jpg
  • Here is a newer version that also includes meminfo. The meminfo output is from a different run, so the values are not identical, but it shows exactly the same behaviour: http://i.stack.imgur.com/4SIJG.jpg
  • Start a copy with dd or another file copy program.
  • The buffers/cache start to fill. Everything is fine.
  • The buffers/cache reach a maximum value (on the 64 GB RAM server something like 32 GB or 17 GB; on the 32 GB RAM server, all free memory).
  • On the 64 GB RAM server the copy process now stops or is limited to a few MB/s. On the 32 GB RAM server everything stays fine.
  • On the 64 GB RAM server I can solve the problem for a short moment by flushing the cache with "sync; echo 3 > /proc/sys/vm/drop_caches". But of course the buffer starts to grow again immediately and the problem recurs.

Conclusion:

The problem has something to do either with the second CPU or with the total amount of memory. I have the "feeling" that the cause could be that each CPU has its own 32 GB of RAM and the copy process runs on only one CPU. So the copy process eventually grows the buffers/cache to nearly 32 GB, or to the unused memory of the other CPU, and then Linux thinks "hey, there is still memory, so let's grow the buffer further", but the hardware underneath cannot access that memory, or something like that.

Does anybody have an idea or a solution? Sure, I can use dd with the direct flag, but that does not solve the problem, because there is also external access via Samba and so on.
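For completeness, the direct-IO variant I mean looks roughly like this (the device names are placeholders again); it bypasses the page cache, which is why it avoids the symptom but does not help the Samba clients:

    # example only: device names are placeholders
    # iflag=direct / oflag=direct bypass the page cache on the read and write side
    dd if=/dev/vg0/lv_data of=/dev/vg1/lv_copy bs=1M iflag=direct oflag=direct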

EDIT1:

Here is also the /proc/zoneinfo from the 64 GB RAM server:
1. http://pastebin.com/uSnpQbeD (before dd starts)
2. http://pastebin.com/18YVTfdb (when dd stops working)

EDIT2:

  • VM settings: http://pastebin.com/U9E9KkFS
  • /proc/sys/vm/zone_reclaim_mode was 0 on the 32 GB RAM server and 1 on the 64 GB RAM server. I never touched these values; the installer set them. I temporarily changed it to 0 and re-ran the test (commands below). Now all memory is used for buffers and cache, so it looks great and behaves like the other server. But then it instantly starts swapping at full speed… I set swappiness to 0. That helps, but it still swaps a few MB per second, and it grows the buffers every second. So it does not swap the buffers; it swaps the memory of the VMs to get more memory for growing the buffers… crazy. But maybe this is normal!?
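These are roughly the commands I used for that test (runtime-only changes, nothing persistent):

    # runtime-only changes, as used in the test above
    echo 0 > /proc/sys/vm/zone_reclaim_mode   # behave like the 32 GB server
    echo 0 > /proc/sys/vm/swappiness          # tell the kernel to avoid swapping application memory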

EDIT3:

/proc/buddyinfo and numactl --hardware:
http://pastebin.com/0PmXxxin

FINAL RESULT

  • Setting /proc/sys/vm/zone_reclaim_mode to 0 is for sure the technically right way, but the machine did not work very well afterwards. For instance: if I copy a disk, Linux now uses 100% of the free memory for buffers (not, as before, only X GB and then stopping). But the moment the last free memory is used for buffers, Linux starts swapping out VM memory and keeps growing the total amount of buffers and cache. Swap is normally not needed on my system, so the swap space sits on the same disk as some VMs. As a result, when I make a backup of these VMs, Linux writes to swap at the same time as I read from the disk for the backup. So it is bad that the VMs get swapped, but it is even worse that Linux destroys my backup read speed… So setting /proc/sys/vm/zone_reclaim_mode to 0 does not solve the whole problem… Currently I run, in a screen session, a script that syncs and flushes the cache every 10 seconds (see the sketch below)… not nice, but it works much better for me. I have no web server or normal file server on the system; I only run VMs, make backups, and store backups via Samba. I don't like this solution.
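The workaround script is roughly this (a simplified sketch of what runs in the screen session):

    #!/bin/bash
    # crude workaround: flush dirty data and drop the page cache every 10 seconds
    while true; do
        sync
        echo 3 > /proc/sys/vm/drop_caches
        sleep 10
    done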

Best Answer

The behaviour you are seeing is due to the way that Linux allocates memory on a NUMA system.

I am assuming (without knowing) that the 32 GB system is non-NUMA, or not NUMA enough for Linux to care.

How NUMA is handled is dictated by the /proc/sys/vm/zone_reclaim_mode option. By default, Linux detects whether you are running on a NUMA system and changes the reclaim flags if it decides that would give better performance.

Memory is split up into zones; on a NUMA system there is a zone for the first CPU socket and a zone for the second. These show up as node0 and node1. You can see them if you cat /proc/buddyinfo.
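For example, you can inspect the per-node layout with something like this (numactl may need to be installed from the numactl package):

    numactl --hardware                   # nodes, their CPUs, and per-node memory
    cat /proc/buddyinfo                  # free pages per node and per zone
    cat /proc/sys/vm/zone_reclaim_mode   # current reclaim behaviour (0 or 1)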

When zone reclaim mode is set to 1, an allocation from the first CPU socket causes reclaim to occur on the memory zone associated with that CPU, because it is more efficient in performance terms to reclaim from the local NUMA node. Reclaim in this sense means dropping pages, such as clearing the cache or swapping things out, on that node.

Setting the value to 0 means no reclaim occurs when the zone fills up; instead, the allocation spills over into the foreign NUMA zone. This comes at the cost of briefly locking the other CPU out in order to gain exclusive access to that memory zone.
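If you decide you want to keep the value at 0, something along these lines makes it stick across reboots (assuming the usual sysctl mechanism on CentOS 6):

    sysctl -w vm.zone_reclaim_mode=0                      # change it now
    echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf   # keep it after a reboot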

But then it instantly starts swapping! After a few seconds:
Mem: 66004536k total, 65733796k used, 270740k free, 34250384k buffers
Swap: 10239992k total, 1178820k used, 9061172k free, 91388k cached

Swapping behaviour, and when to swap, is determined by a few factors, one being how active the pages allocated to applications are. If they are not very active, they will be swapped out in favour of the busier work occurring in the cache. I assume that the pages in your VMs do not get touched very often.
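If you want to confirm that, you could watch swap activity while a copy is running, for instance:

    # the si/so columns show KB swapped in/out per second while the copy runs
    vmstat 1
    # and free -m shows buffers/cache versus memory used by applications
    free -m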