To check the top 20 largest consumers of physical memory (resident set size):
crash> ps -G | sed 's/>//g' | sort -k 8,8 -n | awk '$8 ~ /[0-9]/{ $8 = $8/1024" MB"; print }' | tail -20
To check the number of hugepages:
crash> p -d nr_huge_pages
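For reference, the memory pinned by hugepages is the page count multiplied by the hugepage size, which defaults to 2 MiB on x86_64 (a different size may be configured, so treat this as a sketch). With a hypothetical count of 1024 hugepages:
$ echo "scale=2;(1024*2048)/2^20" | bc -q    # 1024 pages * 2048 KB each, expressed in GiB
2.00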
Update:
A) The crash dump was captured from the following kernel version.
$ crash --osrelease vmcore.flat
2.6.32-279.5.2.el6.x86_64
B) Let's extract the vmlinux file from the kernel-debug-debuginfo package.
$ rpm2cpio kernel-debug-debuginfo-2.6.32-279.5.2.el6.x86_64.rpm | \
cpio -idv ./usr/lib/debug/lib/modules/*/vmlinux
C) Open the vmcore file using the crash utility.
$ bunzip2 vmcore.flat.bz2
$ crash vmcore.flat ./usr/lib/debug/lib/modules/2.6.32-279.5.2.el6.x86_64/vmlinux
D) System Information.
crash> sys
KERNEL: ./usr/lib/debug/lib/modules/2.6.32-279.5.2.el6.x86_64/vmlinux
DUMPFILE: vmcore.flat [PARTIAL DUMP]
CPUS: 32
DATE: Tue Feb 5 12:11:52 2013
UPTIME: 00:04:12
LOAD AVERAGE: 3.03, 0.95, 0.34
TASKS: 578
NODENAME: ...
RELEASE: 2.6.32-279.5.2.el6.x86_64
VERSION: #1 SMP Fri Aug 24 01:07:11 UTC 2012
MACHINE: x86_64 (2700 Mhz)
MEMORY: 64 GB
PANIC: "[ 253.529344] Kernel panic - not syncing: Out of memory and no killable processes..."
a) The panic happened due to an out-of-memory condition, but the "panic_on_oom" parameter is disabled on the system.
crash> p -d sysctl_panic_on_oom
sysctl_panic_on_oom = $6 = 0
This parameter enables or disables the panic-on-out-of-memory feature. If it is set to 0, the kernel invokes the OOM killer (oom_killer) to kill a rogue process, and usually the system survives. If it is set to 1, the kernel panics when an out-of-memory condition occurs.
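On a live system the same flag can be inspected or changed via sysctl (a minimal sketch; 0 and 1 behave as described above):
$ cat /proc/sys/vm/panic_on_oom    # 0 = run the OOM killer, 1 = panic on OOM
0
$ sysctl -w vm.panic_on_oom=1      # panic instead of killing tasks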
b) So, how did we capture a vmcore at the time of the OOM event?
Let's check the mm/oom_kill.c source code. It shows that if nothing is left on the system to kill, the kernel either hangs forever or panics.
++++++
499 /* Found nothing?!?! Either we hang forever, or we panic. */
500 if (!p) {
501 read_unlock(&tasklist_lock);
502 cpuset_unlock();
503 panic("Out of memory and no killable processes...\n"); <<<------
504 }
505
++++++
So we reached the panic state, and since the kdump service was enabled on this system, a vmcore was captured.
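On RHEL 6 you can verify ahead of time that a panic would produce a vmcore, roughly as follows (a sketch using standard RHEL 6 service names and paths):
$ chkconfig --list kdump                        # kdump enabled at boot?
$ grep -o 'crashkernel=[^ ]*' /proc/cmdline     # memory reserved for the capture kernel
$ service kdump status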
E) Let's check the kernel ring buffer.
crash> log
[..]
[ 253.351427] Node 0 DMA free:15744kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15356kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
[ 253.352234] lowmem_reserve[]: 0 2955 32245 32245
[ 253.352812] Node 0 DMA32 free:120436kB min:4120kB low:5148kB high:6180kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:32kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:3026080kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:20kB slab_unreclaimable:16600kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1 all_unreclaimable? no
[ 253.353637] lowmem_reserve[]: 0 0 29290 29290
[ 253.354216] Node 0 Normal free:40580kB min:40868kB low:51084kB high:61300kB active_anon:956kB inactive_anon:536kB active_file:260kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:29992960kB mlocked:0kB dirty:0kB writeback:0kB mapped:460kB shmem:136kB slab_reclaimable:3640kB slab_unreclaimable:75128kB kernel_stack:4448kB pagetables:428kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 253.355047] lowmem_reserve[]: 0 0 0 0
[ 253.355624] Node 1 Normal free:39896kB min:45096kB low:56368kB high:67644kB active_anon:412kB inactive_anon:1668kB active_file:288kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):220kB present:33095680kB mlocked:0kB dirty:0kB writeback:0kB mapped:92kB shmem:80kB slab_reclaimable:3496kB slab_unreclaimable:87864kB kernel_stack:216kB pagetables:564kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
[ 253.356457] lowmem_reserve[]: 0 0 0 0
[ 253.357034] Node 0 DMA: 2*4kB 1*8kB 1*16kB 1*32kB 1*64kB 0*128kB 1*256kB 0*512kB 1*1024kB 1*2048kB 3*4096kB = 15744kB
[ 253.358351] Node 0 DMA32: 41*4kB 8*8kB 7*16kB 6*32kB 10*64kB 10*128kB 7*256kB 9*512kB 7*1024kB 5*2048kB 23*4096kB = 120468kB
[ 253.359674] Node 0 Normal: 718*4kB 558*8kB 278*16kB 169*32kB 88*64kB 47*128kB 13*256kB 5*512kB 0*1024kB 1*2048kB 1*4096kB = 40872kB
[ 253.360995] Node 1 Normal: 876*4kB 447*8kB 249*16kB 174*32kB 116*64kB 40*128kB 8*256kB 1*512kB 1*1024kB 2*2048kB 1*4096kB = 40952kB
[ 253.362319] 154 total pagecache pages
[ 253.362502] 0 pages in swap cache
[ 253.362684] Swap cache stats: add 0, delete 0, find 0/0
[ 253.362869] Free swap = 0kB
[ 253.363050] Total swap = 0kB
[ 253.526814] 16777215 pages RAM
[ 253.526999] 294628 pages reserved
[ 253.527190] 114911 pages shared
[ 253.527372] 16392561 pages non-shared
[..]
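As a sanity check, each buddy-allocator line above can be verified by summing count*order-size; for Node 0 DMA this reproduces the free:15744kB figure from the zone report:
$ echo "2*4+1*8+1*16+1*32+1*64+0*128+1*256+0*512+1*1024+1*2048+3*4096" | bc -q
15744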
F) Let's check the memory status on the system at the time of the crash.
crash> kmem -i
PAGES TOTAL PERCENTAGE
TOTAL MEM 16482587 62.9 GB ---- -------------------------------+
FREE 54610 213.3 MB 0% of TOTAL MEM |
USED 16427977 62.7 GB 99% of TOTAL MEM |
SHARED 4683 18.3 MB 0% of TOTAL MEM |
BUFFERS 118 472 KB 0% of TOTAL MEM |
CACHED 82 328 KB 0% of TOTAL MEM |
SLAB 46635 182.2 MB 0% of TOTAL MEM |
|
TOTAL SWAP 0 0 ---- ----------------------+ |
SWAP USED 0 0 100% of TOTAL SWAP | |
SWAP FREE 0 0 0% of TOTAL SWAP | |
| |
| |
crash> p -d totalram_pages | |
totalram_pages = $5 = 16482587 | |
| |
crash> !echo "scale=5;(16482587*4096)/2^30"|bc -q | |
62.87607 <<<-----[ Total physical memory is 62.9 GB ] <<<--|--------+
|
crash> p -d total_swap_pages |
total_swap_pages = $6 = 0 <<<------[ No Swap on the system ] <<<-----------+
- We have a total of ~63 GiB of physical memory.
- No swap partition or file was created, so the system has no swap.
- Memory used for cache is very low: only 328 KB, and buffers account for just 472 KB.
- Memory used by slab is also low: only 182.2 MB.
G) The total memory allocated to all processes is just 0.00391006 GiB (~4 MB).
crash> ps -G | tail -n +2 | cut -b2- | gawk '{mem += $8} END {print "total " mem/1048576 "GB"}'
total 0.00391006GB
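A rough cross-check using the kmem -i figures above: of the 62.7 GB reported as USED, buffers (472 KB), cache (328 KB), slab (182.2 MB), and process RSS (~4 MB) together explain well under 200 MB, leaving roughly 62.5 GB unaccounted for:
$ echo "scale=2;62.7-(0.5+0.3+182.2+4)/1024" | bc -q    # GB used minus the MB accounted for, in GB
62.52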
H) The application processes are not the ones consuming memory on this system.
crash> ps -G | sed 's/>//g' | sort -k 8,8 -n | awk '$8 ~ /[0-9]/{ $8 = $8/1024" MB"; print }' | tail -20
965 2 21 ffff8808292f1500 IN 0.0 0 0 MB [ext4-dio-unwrit]
966 2 22 ffff8808292d4080 IN 0.0 0 0 MB [ext4-dio-unwrit]
967 2 23 ffff8808292ce040 IN 0.0 0 0 MB [ext4-dio-unwrit]
968 2 24 ffff8808299b5540 IN 0.0 0 0 MB [ext4-dio-unwrit]
969 2 25 ffff880829aa6040 IN 0.0 0 0 MB [ext4-dio-unwrit]
970 2 26 ffff880827367500 IN 0.0 0 0 MB [ext4-dio-unwrit]
971 2 27 ffff880827366aa0 IN 0.0 0 0 MB [ext4-dio-unwrit]
972 2 28 ffff880827366040 IN 0.0 0 0 MB [ext4-dio-unwrit]
97 2 23 ffff88082c1ac080 IN 0.0 0 0 MB [ksoftirqd/23]
973 2 29 ffff880827371540 IN 0.0 0 0 MB [ext4-dio-unwrit]
974 2 30 ffff880827370ae0 IN 0.0 0 0 MB [ext4-dio-unwrit]
975 2 31 ffff880827370080 IN 0.0 0 0 MB [ext4-dio-unwrit]
98 2 23 ffff88082c1bb500 IN 0.0 0 0 MB [watchdog/23]
99 2 24 ffff88082c1baaa0 IN 0.0 0 0 MB [migration/24]
3171 1 3 ffff880826ccaaa0 IN 0.0 27636 0.234375 MB auditd
1 0 1 ffff88082c41b500 UN 0.0 19348 0.339844 MB init
3772 1 0 ffff88102b257500 RU 0.0 64072 0.652344 MB sshd
1047 1 2 ffff881029524040 IN 0.0 11188 0.925781 MB udevd
4936 1047 4 ffff880ff342d540 IN 0.0 11184 0.925781 MB udevd
4937 1047 5 ffff88082a240080 IN 0.0 11184 0.925781 MB udevd
I) Let's verify the memory tuning parameters on the system.
crash> p -d sysctl_overcommit_memory
sysctl_overcommit_memory = $7 = 0
This value contains a flag that enables memory overcommitment. When this flag is 0, the kernel attempts to estimate the amount of free memory left when userspace requests more memory.
crash> p -d sysctl_overcommit_ratio
sysctl_overcommit_ratio = $8 = 50
When overcommit_memory is set to 2, the committed address space is not permitted to exceed swap plus this percentage of physical RAM.
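As a worked example, with overcommit_memory=2 the commit limit would be swap + (RAM * overcommit_ratio / 100); plugging in this system's numbers (0 swap, ~62.9 GB RAM, ratio 50) gives roughly 31.45 GB. Note that this limit only applies in mode 2, and this system is running the default heuristic mode 0:
$ echo "scale=2;0+(62.9*50/100)" | bc -q
31.45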
crash> p -d zone_reclaim_mode
zone_reclaim_mode = $4 = 0
zone_reclaim_mode sets a more or less aggressive approach to reclaiming memory when a zone runs out of memory. If it is set to zero, no zone reclaim occurs.
crash> p -d min_free_kbytes
min_free_kbytes = $3 = 90112 <<<--------[ 88 MB ]
The minimum number of kilobytes to keep free across the system. This value is used to compute a watermark for each low-memory zone, and each zone is then assigned a number of reserved free pages in proportion to its size. Be careful when setting this parameter, as both too-low and too-high values can be damaging.
Setting min_free_kbytes too low prevents the system from reclaiming memory, which can result in system hangs and the OOM killer terminating multiple processes. However, setting it too high (5-10% of total system memory) will cause the system to run out of memory immediately: Linux is designed to use all available RAM to cache file system data, and a high min_free_kbytes value makes the system spend too much time reclaiming memory.
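As a quick consistency check, the kernel distributes min_free_kbytes across zones in proportion to their size, so the per-zone min: watermarks from the log above should sum to roughly this value:
$ echo "20+4120+40868+45096" | bc -q    # Node0 DMA + Node0 DMA32 + Node0 Normal + Node1 Normal
90104
This is within a few pages of min_free_kbytes = 90112, as expected.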
The values of the above parameters look okay. So where is my memory?
Assumptions:
- The main offender is not in user space. In my experience, unaccounted memory like this is often due to Mellanox and DRBD modules, but I am not sure in your case.
- Since most pages were discarded from the vmcore to reduce its size (core_collector makedumpfile -d 31 -c), I am unable to check the hugepage usage.
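For reference, the makedumpfile dump level is a bitmask of page types to exclude (1 = zero pages, 2 = non-private cache, 4 = private cache, 8 = user pages, 16 = free pages), so -d 31 drops all five. A sketch of a kdump.conf line that would keep user pages for this kind of analysis, at the cost of a much larger vmcore:
# /etc/kdump.conf
core_collector makedumpfile -d 23 -c    # 31 - 8: user pages are retained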
Best Answer
First, I must ask: "shutdowns"? Do you mean that the machine reboots, or does it actually halt? If it halts, it is either misconfigured (perhaps in the BIOS) or something is actively shutting it down (e.g. init 0).
If not, your primary candidates would be /var/log/syslog and /var/log/kern.log, as your problem sounds like a kernel panic or a software-triggered hardware fault. Of course, if the server runs some service (e.g. apache), its logs may give you a clue too.
Often in situations like this, log entries are generated, but because the machine is having difficulties, it never manages to write them to disk. If the box is colocated, chances are that it is connected to a serial console by the colo partner. That is where I would look if I did not find anything suspicious in the logs above.
If the machine is not connected to a serial console and there is nothing in the log, you may want to consider sending syslog to a different box via network. Perhaps the network interface survives a bit longer, and the log messages can be read on the syslog server. Have a look at rsyslog or syslog-ng.
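A minimal rsyslog forwarding rule, assuming a collector reachable at 192.0.2.10 (a placeholder address; a single @ sends via UDP, @@ via TCP):
# /etc/rsyslog.conf on the troubled machine
*.* @192.0.2.10:514
$ service rsyslog restart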
UPDATE:
I agree with @Johann below. The most likely cause of the halt is the processor temperature watchdog. Try checking/plotting the temperature of the box via lm-sensors or smartctl (usually the easiest). I find that collectd is unparalleled at keeping track of a large number of variables over time; it can read IPMI, lm-sensors, and hddtemp sources. Also, some BIOSes log temperature halt events.
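A minimal collectd sketch for keeping temperature history, assuming the lm-sensors and collectd packages are installed:
# /etc/collectd.conf
LoadPlugin sensors     # reads lm-sensors temperatures/voltages
LoadPlugin rrdtool     # stores the time series as RRD files
$ service collectd restart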