EDIT: It turns out that this is not the only instance of the bug; it happens often on my computer. Sometimes it involves another seemingly random process, such as chromium-browser, teamviewer and mongod. I began to notice it because it crashed the MongoDB database days ago. As of today, it has happened at least three times. I had no such problem before, when I used Ubuntu 14.04 LTS. My system is a Dell Inspiron 3650 with a standard CPU, no overclocking involved.
I have an Ubuntu 16.04 system with MongoDB 3.4 installed. Several hours ago mongod's CPU usage spiked to 100%.
Here's the output of top:
top - 21:40:05 up 2 days, 8:30, 1 user, load average: 17,08, 17,03, 17,01
Tasks: 174 total, 15 running, 153 sleeping, 0 stopped, 6 zombie
%Cpu(s): 0,0 us, 66,8 sy, 0,0 ni, 33,2 id, 0,0 wa, 0,0 hi, 0,0 si, 0,0 st
KiB Mem : 8117148 total, 5307248 free, 981712 used, 1828188 buff/cache
KiB Swap: 520188 total, 520188 free, 0 used. 6427752 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1160 mongodb 20 0 0 0 0 Z 99,7 0,0 627:44.03 mongod
14214 root 20 0 26176 1356 1168 R 99,7 0,0 147:03.56 systemctl
3636 root 20 0 232068 37388 28740 S 0,3 0,5 1:04.03 Xorg
I tried to kill the process with no luck; even kill -9 <MONGOD PID> can't kill it. I also can't reboot the system; it is simply unresponsive. Below is the output of the sudo service mongod stop command:
Failed to retrieve unit: Connection timed out
Failed to stop mongod.service: Connection timed out
See system logs and 'systemctl status mongod.service' for details.
Failed to get load state of mongod.service: Connection timed out
I can still ssh into the server, but there is nothing I can do to stop the mongod process. Can anyone help me?
ADDITIONAL NOTE
The pstree -p -s 1160 command gives me:
systemd(1)───mongod(1160)─┬─{ftdc}(1247)
├─{mongod}(1239)
└─{signalP.gThread}(1214)
The tailf -100 /var/log/syslog command gives a more interesting result. It displays a repeated message; below is one instance:
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505244] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ftdc:1247]
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505245] Modules linked in: rfcomm xt_multiport iptable_filter ip_tables x_tables rtsx_usb_ms bnep memstick binfmt_misc snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel arc4 dcdbas dell_smm_hwmon kvm snd_hda_codec_realtek irqbypass snd_hda_codec_generic crct10dif_pclmul rtl8723be crc32_pclmul ghash_clmulni_intel snd_hda_intel aesni_intel snd_hda_codec btcoexist rtl8723_common aes_x86_64 snd_hda_core lrw joydev snd_hwdep glue_helper rtl_pci input_leds rtlwifi snd_pcm ablk_helper snd_seq_midi cryptd mac80211 snd_seq_midi_event snd_rawmidi intel_cstate btusb intel_rapl_perf btrtl snd_seq cfg80211 snd_seq_device snd_timer snd serio_raw soundcore mei_me mei shpchp hci_uart btbcm btqca btintel bluetooth mac_hid intel_lpss_acpi intel_lpss acpi_als kfifo_buf industrialio acpi_pad parport_pc ppdev lp parport autofs4 btrfs xor raid6_pq dm_mirror dm_region_hash dm_log rtsx_usb_sdmmc rtsx_usb hid_generic usbhid nouveau mxm_wmi i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt r8169 psmouse fb_sys_fops mii drm ahci libahci wmi pinctrl_sunrisepoint video pinctrl_intel i2c_hid hid fjes
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505277] CPU: 1 PID: 1247 Comm: ftdc Tainted: G W L 4.8.0-53-generic #56~16.04.1-Ubuntu
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505277] Hardware name: Dell Inc. Inspiron 3650/0C2XKD, BIOS 2.0.1 09/03/2015
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505278] task: ffffa024db476ac0 task.stack: ffffa024d83a4000
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505278] RIP: 0010:[<ffffffff8b50b336>] [<ffffffff8b50b336>] smp_call_function_many+0x1f6/0x250
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505281] RSP: 0018:ffffa024d83a7b38 EFLAGS: 00000202
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505281] RAX: 0000000000000003 RBX: 0000000000000200 RCX: 0000000000000003
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505282] RDX: ffffa024e659d380 RSI: 0000000000000200 RDI: ffffa024e649a288
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505282] RBP: ffffa024d83a7b70 R08: 0000000000000000 R09: 000000000000000d
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505282] R10: 0000000000000008 R11: ffffa024e649a288 R12: ffffa024e649a288
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505283] R13: ffffa024e649a280 R14: ffffffff8b472400 R15: ffffa024d83a7b80
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505284] FS: 00007f871ddd2700(0000) GS:ffffa024e6480000(0000) knlGS:0000000000000000
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505284] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505284] CR2: 00007f95bc40323f CR3: 0000000258e11000 CR4: 00000000003406e0
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505285] Stack:
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505285] 000000000001a240 0100000000000001 ffffa024d3ebf800 ffffffffffffffff
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505287] ffffa024d3ebfad8 0000000000000000 ffffffffffffffff ffffa024d83a7bb8
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505288] ffffffff8b472865 ffffa024d3ebf800 0000000000000000 ffffffffffffffff
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505289] Call Trace:
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505291] [<ffffffff8b472865>] native_flush_tlb_others+0x65/0x130
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505292] [<ffffffff8b472a43>] flush_tlb_mm_range+0x63/0x150
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505294] [<ffffffff8b5d62b4>] tlb_flush_mmu_tlbonly+0x64/0xd0
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505295] [<ffffffff8b5d75b2>] tlb_flush_mmu+0x12/0x20
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505297] [<ffffffff8b61595d>] zap_huge_pmd+0x20d/0x3b0
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505298] [<ffffffff8b5d9168>] unmap_page_range+0x928/0x940
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505299] [<ffffffff8b47fc92>] ? mmput+0x12/0x130
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505301] [<ffffffff8b5d91fd>] unmap_single_vma+0x7d/0xe0
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505302] [<ffffffff8b5d9668>] zap_page_range+0xc8/0x140
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505304] [<ffffffff8b5ef47e>] SyS_madvise+0x43e/0x930
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505305] [<ffffffff8bc9a876>] entry_SYSCALL_64_fastpath+0x1e/0xa8
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505306] Code: d2 e8 3f 94 33 00 3b 05 ed 3a e5 00 89 c1 0f 8d 99 fe ff ff 48 98 49 8b 55 00 48 03 14 c5 60 c4 35 8c 8b 42 18 a8 01 74 09 f3 90 <8b> 42 18 a8 01 75 f7 eb bf 0f b6 4d d0 4c 89 fa 4c 89 f6 44 89
Here's the output from echo l > /proc/sysrq-trigger
This is for CPU3
[207345.496706] NMI backtrace for cpu 3
[207345.496707] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G W L 4.8.0-53-generic #56~16.04.1-Ubuntu
[207345.496707] Hardware name: Dell Inc. Inspiron 3650/0C2XKD, BIOS 2.0.1 09/03/2015
[207345.496708] task: ffffa024dc428000 task.stack: ffffa024dc460000
[207345.496708] RIP: 0010:[<ffffffff8b4cf41a>] [<ffffffff8b4cf41a>] native_queued_spin_lock_slowpath+0x17a/0x1a0
[207345.496708] RSP: 0018:ffffa024e6583b30 EFLAGS: 00000002
[207345.496709] RAX: 0000000000000101 RBX: 0000000000000092 RCX: 0000000000000001
[207345.496709] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffffa024d4111d08
[207345.496709] RBP: ffffa024e6583b30 R08: 0000000000000101 R09: 000000000000002a
[207345.496710] R10: 00000000ffffffff R11: 0000000000000000 R12: ffffa024d4111d08
[207345.496710] R13: ffffa024dc583a00 R14: ffffa024d4111c00 R15: ffffa024d4111c00
[207345.496711] FS: 0000000000000000(0000) GS:ffffa024e6580000(0000) knlGS:0000000000000000
[207345.496711] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[207345.496711] CR2: 00001372a3be0000 CR3: 0000000258e11000 CR4: 00000000003406e0
[207345.496712] Stack:
[207345.496712] ffffa024e6583b48 ffffffff8bc9a7e7 000000000000002a ffffa024e6583b98
[207345.496712] ffffffffc02f9dc3 ffffa024dc583580 0000000000000010 ffffa024e6583b98
[207345.496713] ffffa024d4111c00 000000000000002a ffffa024d4110c00 ffffa024d4111c00
[207345.496713] Call Trace:
[207345.496713] <IRQ> ^Ad [<ffffffff8bc9a7e7>] _raw_spin_lock_irqsave+0x37/0x3f
[207345.496714] [<ffffffffc02f9dc3>] nvkm_fantog_update+0x43/0x110 [nouveau]
[207345.496714] [<ffffffffc02f9ee8>] nvkm_fantog_set+0x38/0x40 [nouveau]
[207345.496714] [<ffffffffc02f936f>] nvkm_fan_update+0xbf/0x200 [nouveau]
[207345.496715] [<ffffffffc02f94e9>] nvkm_therm_fan_set+0x19/0x20 [nouveau]
[207345.496715] [<ffffffffc02f8beb>] nvkm_therm_update+0x9b/0x2e0 [nouveau]
[207345.496715] [<ffffffffc02f8e47>] nvkm_therm_alarm+0x17/0x20 [nouveau]
[207345.496716] [<ffffffffc02fc0d0>] nvkm_timer_alarm_trigger+0x100/0x150 [nouveau]
[207345.496716] [<ffffffffc02fc1ef>] nvkm_timer_alarm+0x7f/0xd0 [nouveau]
[207345.496716] [<ffffffffc02f9e85>] nvkm_fantog_update+0x105/0x110 [nouveau]
[207345.496717] [<ffffffffc02f9eaa>] nvkm_fantog_alarm+0x1a/0x20 [nouveau]
[207345.496717] [<ffffffffc02fc0d0>] nvkm_timer_alarm_trigger+0x100/0x150 [nouveau]
[207345.496718] [<ffffffffc02fc4f2>] nv04_timer_intr+0x62/0xb0 [nouveau]
[207345.496718] [<ffffffffc02fbf77>] nvkm_timer_intr+0x17/0x20 [nouveau]
[207345.496718] [<ffffffffc02aa7c7>] nvkm_subdev_intr+0x17/0x20 [nouveau]
[207345.496719] [<ffffffffc02eea15>] nvkm_mc_intr+0xe5/0x190 [nouveau]
[207345.496719] [<ffffffffc02f35f3>] nvkm_pci_intr+0x53/0x80 [nouveau]
[207345.496719] [<ffffffff8b4e0011>] __handle_irq_event_percpu+0x81/0x1a0
[207345.496720] [<ffffffff8b4e0162>] handle_irq_event_percpu+0x32/0x80
[207345.496720] [<ffffffff8b4e01ee>] handle_irq_event+0x3e/0x60
[207345.496720] [<ffffffff8b4e3bf0>] handle_edge_irq+0x80/0x150
[207345.496721] [<ffffffff8b4302cd>] handle_irq+0x1d/0x30
[207345.496721] [<ffffffff8bc9d0db>] do_IRQ+0x4b/0xd0
[207345.496721] [<ffffffff8bc9b1c2>] common_interrupt+0x82/0x82
[207345.496722] <EOI> ^Ad [<ffffffff8bb1934b>] ? cpuidle_enter_state+0x12b/0x2d0
[207345.496722] [<ffffffff8bb19527>] cpuidle_enter+0x17/0x20
[207345.496722] [<ffffffff8b4c7a0a>] call_cpuidle+0x2a/0x50
[207345.496723] [<ffffffff8b4c7dee>] cpu_startup_entry+0x29e/0x350
[207345.496723] [<ffffffff8b4518b1>] start_secondary+0x151/0x190
[207345.496724] Code: 41 39 c0 74 e6 4d 85 c9 c6 07 01 74 30 41 c7 41 08 01 00 00 00 e9 51 ff ff ff 83 fa 01 0f 84 af fe ff ff 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 5d c3 f3 90 4c 8b 09
This is for CPU 0
[207345.495724] NMI backtrace for cpu 0
[207345.495725] CPU: 0 PID: 14214 Comm: systemctl Tainted: G W L 4.8.0-53-generic #56~16.04.1-Ubuntu
[207345.495725] Hardware name: Dell Inc. Inspiron 3650/0C2XKD, BIOS 2.0.1 09/03/2015
[207345.495726] task: ffffa0241a56db80 task.stack: ffffa0241a618000
[207345.495726] RIP: 0010:[<ffffffff8b50b336>] [<ffffffff8b50b336>] smp_call_function_many+0x1f6/0x250
[207345.495726] RSP: 0018:ffffa0241a61bce0 EFLAGS: 00000202
[207345.495727] RAX: 0000000000000003 RBX: 0000000000000200 RCX: 0000000000000003
[207345.495727] RDX: ffffa024e659cc68 RSI: 0000000000000200 RDI: ffffa024e641a288
[207345.495728] RBP: ffffa0241a61bd18 R08: 0000000000000000 R09: 000000000000000e
[207345.495728] R10: 0000000000000008 R11: ffffa024e641a288 R12: ffffa024e641a288
[207345.495728] R13: ffffa024e641a280 R14: ffffffffc09ca790 R15: 0000000000000000
[207345.495729] FS: 00007fe04de0f880(0000) GS:ffffa024e6400000(0000) knlGS:0000000000000000
[207345.495729] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[207345.495729] CR2: 000055f9604a6040 CR3: 000000019a651000 CR4: 00000000003406f0
[207345.495730] Stack:
[207345.495730] 000000000001a240 0100000000000001 00000000fffffffb ffffffffc09ca790
[207345.495730] 0000000000000000 0000000000000000 0000000000000000 ffffa0241a61bd40
[207345.495731] ffffffff8b50b46d 00000000fffffffb ffffffff8c267150 0000000000000001
[207345.495731] Call Trace:
[207345.495731] [<ffffffffc09ca790>] ? kvm_vcpu_block+0x300/0x300 [kvm]
[207345.495732] [<ffffffff8b50b46d>] on_each_cpu+0x2d/0x60
[207345.495732] [<ffffffffc09c941f>] kvm_reboot+0x2f/0x40 [kvm]
[207345.495732] [<ffffffff8b4a4eba>] notifier_call_chain+0x4a/0x70
[207345.495733] [<ffffffff8b4a51f7>] __blocking_notifier_call_chain+0x47/0x60
[207345.495733] [<ffffffff8b4a5226>] blocking_notifier_call_chain+0x16/0x20
[207345.495734] [<ffffffff8b4a64bd>] kernel_restart_prepare+0x1d/0x40
[207345.495734] [<ffffffff8b4a6582>] kernel_restart+0x12/0x60
[207345.495734] [<ffffffff8b4a6902>] SYSC_reboot+0x202/0x220
[207345.495735] [<ffffffff8b63341c>] ? vfs_writev+0x3c/0x50
[207345.495735] [<ffffffff8b633491>] ? do_writev+0x61/0xf0
[207345.495735] [<ffffffff8b4a696e>] SyS_reboot+0xe/0x10
[207345.495736] [<ffffffff8bc9a876>] entry_SYSCALL_64_fastpath+0x1e/0xa8
[207345.495736] Code: d2 e8 3f 94 33 00 3b 05 ed 3a e5 00 89 c1 0f 8d 99 fe ff ff 48 98 49 8b 55 00 48 03 14 c5 60 c4 35 8c 8b 42 18 a8 01 74 09 f3 90 <8b> 42 18 a8 01 75 f7 eb bf 0f b6 4d d0 4c 89 fa 4c 89 f6 44 89
For CPU1
[207345.495711] NMI backtrace for cpu 1
[207345.495712] CPU: 1 PID: 1247 Comm: ftdc Tainted: G W L 4.8.0-53-generic #56~16.04.1-Ubuntu
[207345.495712] Hardware name: Dell Inc. Inspiron 3650/0C2XKD, BIOS 2.0.1 09/03/2015
[207345.495713] task: ffffa024db476ac0 task.stack: ffffa024d83a4000
[207345.495713] RIP: 0010:[<ffffffff8b50b336>] [<ffffffff8b50b336>] smp_call_function_many+0x1f6/0x250
[207345.495714] RSP: 0018:ffffa024d83a7b38 EFLAGS: 00000202
[207345.495714] RAX: 0000000000000003 RBX: 0000000000000200 RCX: 0000000000000003
[207345.495714] RDX: ffffa024e659d380 RSI: 0000000000000200 RDI: ffffa024e649a288
[207345.495715] RBP: ffffa024d83a7b70 R08: 0000000000000000 R09: 000000000000000d
[207345.495715] R10: 0000000000000008 R11: ffffa024e649a288 R12: ffffa024e649a288
[207345.495716] R13: ffffa024e649a280 R14: ffffffff8b472400 R15: ffffa024d83a7b80
[207345.495716] FS: 00007f871ddd2700(0000) GS:ffffa024e6480000(0000) knlGS:0000000000000000
[207345.495716] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[207345.495717] CR2: 00007f95bc40323f CR3: 0000000258e11000 CR4: 00000000003406e0
[207345.495717] Stack:
[207345.495718] 000000000001a240 0100000000000001 ffffa024d3ebf800 ffffffffffffffff
[207345.495718] ffffa024d3ebfad8 0000000000000000 ffffffffffffffff ffffa024d83a7bb8
[207345.495718] ffffffff8b472865 ffffa024d3ebf800 0000000000000000 ffffffffffffffff
[207345.495719] Call Trace:
[207345.495719] [<ffffffff8b472865>] native_flush_tlb_others+0x65/0x130
[207345.495720] [<ffffffff8b472a43>] flush_tlb_mm_range+0x63/0x150
[207345.495720] [<ffffffff8b5d62b4>] tlb_flush_mmu_tlbonly+0x64/0xd0
[207345.495720] [<ffffffff8b5d75b2>] tlb_flush_mmu+0x12/0x20
[207345.495721] [<ffffffff8b61595d>] zap_huge_pmd+0x20d/0x3b0
[207345.495721] [<ffffffff8b5d9168>] unmap_page_range+0x928/0x940
[207345.495721] [<ffffffff8b47fc92>] ? mmput+0x12/0x130
[207345.495722] [<ffffffff8b5d91fd>] unmap_single_vma+0x7d/0xe0
[207345.495722] [<ffffffff8b5d9668>] zap_page_range+0xc8/0x140
[207345.495723] [<ffffffff8b5ef47e>] SyS_madvise+0x43e/0x930
[207345.495723] [<ffffffff8bc9a876>] entry_SYSCALL_64_fastpath+0x1e/0xa8
[207345.495724] Code: d2 e8 3f 94 33 00 3b 05 ed 3a e5 00 89 c1 0f 8d 99 fe ff ff 48 98 49 8b 55 00 48 03 14 c5 60 c4 35 8c 8b 42 18 a8 01 74 09 f3 90 <8b> 42 18 a8 01 75 f7 eb bf 0f b6 4d d0 4c 89 fa 4c 89 f6 44 89
And finally, CPU 2:
[207330.487609] 4c 89 fa 4c 89 f6 44 89
[207345.495645] sysrq: SysRq : Show backtrace of all active CPUs
[207345.495648] Sending NMI to all CPUs:
[207345.495699] NMI backtrace for cpu 2
[207345.495699] CPU: 2 PID: 15699 Comm: bash Tainted: G W L 4.8.0-53-generic #56~16.04.1-Ubuntu
[207345.495699] Hardware name: Dell Inc. Inspiron 3650/0C2XKD, BIOS 2.0.1 09/03/2015
[207345.495700] task: ffffa02409d30f40 task.stack: ffffa02409dfc000
[207345.495700] RIP: 0010:[<ffffffff8b83c3b0>] [<ffffffff8b83c3b0>] delay_tsc+0x0/0x60
[207345.495701] RSP: 0018:ffffa02409dffe08 EFLAGS: 00000a07
[207345.495701] RAX: 000000007c3cc000 RBX: 0000000000002710 RCX: 00000000014b0e00
[207345.495702] RDX: 0000000000290d14 RSI: 0000000000000200 RDI: 0000000000290d15
[207345.495702] RBP: ffffa02409dffe10 R08: 0000000000000000 R09: 0000000000000006
[207345.495702] R10: 0000000000000001 R11: 0000000000011bf4 R12: 0000000000000004
[207345.495703] R13: 0000000000000000 R14: ffffffff8c2c1fe0 R15: 0000000000000000
[207345.495703] FS: 00007ff3a9e23700(0000) GS:ffffa024e6500000(0000) knlGS:0000000000000000
[207345.495704] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[207345.495704] CR2: 00000000009a5008 CR3: 0000000189dae000 CR4: 00000000003406e0
[207345.495704] Stack:
[207345.495705] ffffffff8b83c32b ffffa02409dffe28 ffffffff8b833141 000000000000006c
[207345.495705] ffffa02409dffe38 ffffffff8b456019 ffffa02409dffe48 ffffffff8b93e6e3
[207345.495706] ffffa02409dffe78 ffffffff8b93ed9a 0000000000000002 fffffffffffffffb
[207345.495706] Call Trace:
[207345.495706] [<ffffffff8b83c32b>] ? __const_udelay+0x2b/0x30
[207345.495707] [<ffffffff8b833141>] nmi_trigger_all_cpu_backtrace+0xc1/0x150
[207345.495707] [<ffffffff8b456019>] arch_trigger_all_cpu_backtrace+0x19/0x20
[207345.495707] [<ffffffff8b93e6e3>] sysrq_handle_showallcpus+0x13/0x20
[207345.495708] [<ffffffff8b93ed9a>] __handle_sysrq+0xea/0x140
[207345.495708] [<ffffffff8b93f21f>] write_sysrq_trigger+0x2f/0x40
[207345.495709] [<ffffffff8b6a6872>] proc_reg_write+0x42/0x70
[207345.495709] [<ffffffff8b632748>] __vfs_write+0x18/0x40
[207345.495709] [<ffffffff8b632e98>] vfs_write+0xb8/0x1b0
[207345.495710] [<ffffffff8b6342f5>] SyS_write+0x55/0xc0
[207345.495710] [<ffffffff8bc9a876>] entry_SYSCALL_64_fastpath+0x1e/0xa8
[207345.495711] Code: 12 48 c1 e2 06 48 89 e5 48 c1 e0 02 48 29 ca f7 e2 48 8d 7a 01 ff 15 b8 59 a7 00 5d c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 <0f> 1f 44 00 00 55 48 89 e5 65 44 8b 05 27 de 7c 74 0f ae e8 0f
Here's what cat /proc/1160/task/1247/stat gives me:
1247 (ftdc) R 1 1160 1160 0 -1 4194368 3495 0 0 0 33464 4158293 0 0 20 0 4 0 645 1763782656 173550 18446744073709551615 94481603162112 94481648347376 140722953733664 140218298338104 140218408507335 256 8405507 6145 1260 0 0 0 -1 1 0 0 1 0 0 94481648352704 94481650153520 94481669767168 140722953735785 140722953735827 140722953735827 140722953736168 0
Best Answer
What you have is a multithreaded application in which one thread appears to have hit a kernel bug.
Some analysis of the bug
You have tried to shut down the process mongod with ID 1160. The main thread with ID 1160 is in a zombie state, waiting for the other threads in the process to die.
The thread ftdc with ID 1247 has hit a kernel bug at some point when calling the madvise system call, which ended up in an infinite loop.
The kernel has a watchdog which noticed the stuck thread and logged a stacktrace to the kernel log. The stacktrace included the name of the thread. Because the name of the thread and the process were different in this case, the connection between the two was not immediately obvious from the stacktrace.
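One way to make the link between a thread named in a kernel trace and the process that owns it explicit is to look the thread ID up under /proc. The following is a minimal sketch, assuming a Linux /proc file system that still answers reads for the affected thread; the function names owner_pid and list_threads are made up for illustration.

```shell
# Print the ID of the process (thread group) that owns thread ID $1.
# /proc/<tid>/status carries a "Tgid:" field with that value.
owner_pid() {
    awk '/^Tgid:/ { print $2 }' "/proc/$1/status"
}

# Print one line per thread of process $1: thread ID, thread name, state.
# State is the one-letter code from /proc (R running, S sleeping, Z zombie, ...).
list_threads() {
    for dir in /proc/"$1"/task/*; do
        [ -d "$dir" ] || continue
        printf '%s\t%s\t%s\n' \
            "${dir##*/}" \
            "$(cat "$dir/comm")" \
            "$(awk '/^State:/ { print $2 }' "$dir/status")"
    done
}
```

With the IDs from the question, owner_pid 1247 would be expected to print 1160, and list_threads 1160 would list ftdc alongside the other mongod threads.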
That thread was likely stuck in that state before you even tried to shut down mongod in the first place.
When you later ran echo l > /proc/sysrq-trigger, a stacktrace for the stuck thread was logged again. The two stacktraces are entirely identical, so it may very well have been stuck in the same place all along.
Reporting the bug
What you need to do is file a bug against the kernel. Remember to include the log output from the first time the watchdog detected that the thread was stuck.
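To pull the first watchdog report out of a log for the bug report, something like the following could be used. It is only a sketch: it assumes the messages are in a plain-text log such as /var/log/syslog and that grep supports the -m and -A options (GNU grep does). The helper name first_soft_lockup is hypothetical.

```shell
# Print the first "soft lockup" line found in log file $1, followed by
# up to $2 lines of trailing context (default 40) to capture the trace.
first_soft_lockup() {
    grep -m 1 -A "${2:-40}" 'BUG: soft lockup' "$1"
}
```

For example, first_soft_lockup /var/log/syslog > lockup.txt saves the report to a file. If the log has been rotated since the problem started, the earliest occurrence may be in an older log file instead.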
Rebooting the system
In order to get this system back into a good state you will have to reboot. And there is a significant risk that a clean shutdown won't be possible.
If you attempt a clean shutdown you may need physical access to the machine in order to reset it unless you have a way to remotely power cycle the machine.
You can attempt an unclean reboot with echo b > /proc/sysrq-trigger, which is about as disruptive as yanking the power from the machine. It will avoid the scenario where an attempted clean shutdown gets stuck and you can no longer ssh to the machine.
Whatever you do, expect a file system check to be needed during boot. So before attempting to shut down the machine in any way, you should stop services writing important data to disk and run a sync command.
There is a risk a sync command will get stuck. However, since the stacktrace of the stuck process doesn't include anything file system or I/O related, I consider that risk to be minor.
There is also a risk you will need physical access to the machine to get it through the boot due to file system inconsistencies. The probability of that is, however, less than the probability that an attempted clean shutdown will get stuck.
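The sequence above could be sketched as a small script. This is an illustration under assumptions, not a tested recovery procedure: the names step and forced_reboot, the "--really" switch and the 10-second wait are all made up here. By default it only prints the steps, and sync is started in the background so that a hung sync cannot block the reboot.

```shell
# Print a step; execute it only when MODE is "--really" (needs root).
step() {
    echo "step: $*"
    if [ "$MODE" = "--really" ]; then
        sh -c "$*"
    fi
}

# Forced reboot: flush what can be flushed, then reboot via SysRq.
# Stop services that write important data first; which ones those are
# is site-specific, so that part is left out of this sketch.
forced_reboot() {
    MODE="${1:-dry-run}"
    # Start flushing dirty data in the background.
    step 'sync &'
    # Give the flush a few seconds (arbitrary value), then go on regardless.
    step 'sleep 10'
    # SysRq "b": immediate reboot without syncing or unmounting.
    step 'echo b > /proc/sysrq-trigger'
}
```

Calling forced_reboot with no argument prints the three steps for review; forced_reboot --really runs them and reboots the machine immediately.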