Ubuntu – What is causing such frequent "BUG: soft lockup" warnings on Ubuntu?

mongodb, ubuntu

EDIT: It turns out that this is not the only instance of the bug. It happens often on my computer, and sometimes it involves another seemingly random process, such as chromium-browser, teamviewer, or mongod. I began to notice it when it crashed my MongoDB database a few days ago. As of today, it has happened at least three times. I had no such problem when I used Ubuntu 14.04 LTS. My system is a Dell Inspiron 3650 with a standard CPU; no overclocking is involved.

I have an Ubuntu 16.04 system with MongoDB 3.4 installed. Several hours ago mongod's load spiked, consuming 100% of CPU resources.

Here's the output from top:

top - 21:40:05 up 2 days,  8:30,  1 user,  load average: 17,08, 17,03, 17,01
Tasks: 174 total,  15 running, 153 sleeping,   0 stopped,   6 zombie
%Cpu(s):  0,0 us, 66,8 sy,  0,0 ni, 33,2 id,  0,0 wa,  0,0 hi,  0,0 si,  0,0 st
KiB Mem :  8117148 total,  5307248 free,   981712 used,  1828188 buff/cache
KiB Swap:   520188 total,   520188 free,        0 used.  6427752 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                            
 1160 mongodb   20   0       0      0      0 Z  99,7  0,0 627:44.03 mongod                             
14214 root      20   0   26176   1356   1168 R  99,7  0,0 147:03.56 systemctl                          
 3636 root      20   0  232068  37388  28740 S   0,3  0,5   1:04.03 Xorg   

I tried to kill the process with no luck; even kill -9 <MONGOD PID> can't kill it. I also can't reboot the system; it is simply unresponsive. Below is the output of the sudo service mongod stop command:

Failed to retrieve unit: Connection timed out
Failed to stop mongod.service: Connection timed out
See system logs and 'systemctl status mongod.service' for details.
Failed to get load state of mongod.service: Connection timed out

I can still ssh into the server, but there is nothing I can do to stop the mongod process. Can anyone help me?

ADDITIONAL NOTE

The pstree -p -s 1160 command gives me:

systemd(1)───mongod(1160)─┬─{ftdc}(1247)
                          ├─{mongod}(1239)
                          └─{signalP.gThread}(1214)

The tailf -100 /var/log/syslog command gives a more interesting result. It displays a repeated message; below is one instance:

Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505244] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ftdc:1247]
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505245] Modules linked in: rfcomm xt_multiport iptable_filter ip_tables x_tables rtsx_usb_ms bnep memstick binfmt_misc snd_hda_codec_hdmi intel_rapl x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel arc4 dcdbas dell_smm_hwmon kvm snd_hda_codec_realtek irqbypass snd_hda_codec_generic crct10dif_pclmul rtl8723be crc32_pclmul ghash_clmulni_intel snd_hda_intel aesni_intel snd_hda_codec btcoexist rtl8723_common aes_x86_64 snd_hda_core lrw joydev snd_hwdep glue_helper rtl_pci input_leds rtlwifi snd_pcm ablk_helper snd_seq_midi cryptd mac80211 snd_seq_midi_event snd_rawmidi intel_cstate btusb intel_rapl_perf btrtl snd_seq cfg80211 snd_seq_device snd_timer snd serio_raw soundcore mei_me mei shpchp hci_uart btbcm btqca btintel bluetooth mac_hid intel_lpss_acpi intel_lpss acpi_als kfifo_buf industrialio acpi_pad parport_pc ppdev lp parport autofs4 btrfs xor raid6_pq dm_mirror dm_region_hash dm_log rtsx_usb_sdmmc rtsx_usb hid_generic usbhid nouveau mxm_wmi i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt r8169 psmouse fb_sys_fops mii drm ahci libahci wmi pinctrl_sunrisepoint video pinctrl_intel i2c_hid hid fjes
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505277] CPU: 1 PID: 1247 Comm: ftdc Tainted: G        W    L  4.8.0-53-generic #56~16.04.1-Ubuntu
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505277] Hardware name: Dell Inc. Inspiron 3650/0C2XKD, BIOS 2.0.1 09/03/2015
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505278] task: ffffa024db476ac0 task.stack: ffffa024d83a4000
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505278] RIP: 0010:[<ffffffff8b50b336>]  [<ffffffff8b50b336>] smp_call_function_many+0x1f6/0x250
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505281] RSP: 0018:ffffa024d83a7b38  EFLAGS: 00000202
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505281] RAX: 0000000000000003 RBX: 0000000000000200 RCX: 0000000000000003
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505282] RDX: ffffa024e659d380 RSI: 0000000000000200 RDI: ffffa024e649a288
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505282] RBP: ffffa024d83a7b70 R08: 0000000000000000 R09: 000000000000000d
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505282] R10: 0000000000000008 R11: ffffa024e649a288 R12: ffffa024e649a288
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505283] R13: ffffa024e649a280 R14: ffffffff8b472400 R15: ffffa024d83a7b80
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505284] FS:  00007f871ddd2700(0000) GS:ffffa024e6480000(0000) knlGS:0000000000000000
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505284] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505284] CR2: 00007f95bc40323f CR3: 0000000258e11000 CR4: 00000000003406e0
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505285] Stack:
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505285]  000000000001a240 0100000000000001 ffffa024d3ebf800 ffffffffffffffff
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505287]  ffffa024d3ebfad8 0000000000000000 ffffffffffffffff ffffa024d83a7bb8
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505288]  ffffffff8b472865 ffffa024d3ebf800 0000000000000000 ffffffffffffffff
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505289] Call Trace:
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505291]  [<ffffffff8b472865>] native_flush_tlb_others+0x65/0x130
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505292]  [<ffffffff8b472a43>] flush_tlb_mm_range+0x63/0x150
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505294]  [<ffffffff8b5d62b4>] tlb_flush_mmu_tlbonly+0x64/0xd0
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505295]  [<ffffffff8b5d75b2>] tlb_flush_mmu+0x12/0x20
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505297]  [<ffffffff8b61595d>] zap_huge_pmd+0x20d/0x3b0
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505298]  [<ffffffff8b5d9168>] unmap_page_range+0x928/0x940
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505299]  [<ffffffff8b47fc92>] ? mmput+0x12/0x130
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505301]  [<ffffffff8b5d91fd>] unmap_single_vma+0x7d/0xe0
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505302]  [<ffffffff8b5d9668>] zap_page_range+0xc8/0x140
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505304]  [<ffffffff8b5ef47e>] SyS_madvise+0x43e/0x930
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505305]  [<ffffffff8bc9a876>] entry_SYSCALL_64_fastpath+0x1e/0xa8
Jan 28 22:11:48 leony-Inspiron-3650 kernel: [205318.505306] Code: d2 e8 3f 94 33 00 3b 05 ed 3a e5 00 89 c1 0f 8d 99 fe ff ff 48 98 49 8b 55 00 48 03 14 c5 60 c4 35 8c 8b 42 18 a8 01 74 09 f3 90 <8b> 42 18 a8 01 75 f7 eb bf 0f b6 4d d0 4c 89 fa 4c 89 f6 44 89

Here's the output from echo l > /proc/sysrq-trigger. This is for CPU 3:

[207345.496706] NMI backtrace for cpu 3
[207345.496707] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G        W    L  4.8.0-53-generic #56~16.04.1-Ubuntu
[207345.496707] Hardware name: Dell Inc. Inspiron 3650/0C2XKD, BIOS 2.0.1 09/03/2015
[207345.496708] task: ffffa024dc428000 task.stack: ffffa024dc460000
[207345.496708] RIP: 0010:[<ffffffff8b4cf41a>]  [<ffffffff8b4cf41a>] native_queued_spin_lock_slowpath+0x17a/0x1a0
[207345.496708] RSP: 0018:ffffa024e6583b30  EFLAGS: 00000002
[207345.496709] RAX: 0000000000000101 RBX: 0000000000000092 RCX: 0000000000000001
[207345.496709] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffffa024d4111d08
[207345.496709] RBP: ffffa024e6583b30 R08: 0000000000000101 R09: 000000000000002a
[207345.496710] R10: 00000000ffffffff R11: 0000000000000000 R12: ffffa024d4111d08
[207345.496710] R13: ffffa024dc583a00 R14: ffffa024d4111c00 R15: ffffa024d4111c00
[207345.496711] FS:  0000000000000000(0000) GS:ffffa024e6580000(0000) knlGS:0000000000000000
[207345.496711] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[207345.496711] CR2: 00001372a3be0000 CR3: 0000000258e11000 CR4: 00000000003406e0
[207345.496712] Stack:
[207345.496712]  ffffa024e6583b48 ffffffff8bc9a7e7 000000000000002a ffffa024e6583b98
[207345.496712]  ffffffffc02f9dc3 ffffa024dc583580 0000000000000010 ffffa024e6583b98
[207345.496713]  ffffa024d4111c00 000000000000002a ffffa024d4110c00 ffffa024d4111c00
[207345.496713] Call Trace:
[207345.496713]  <IRQ> ^Ad [<ffffffff8bc9a7e7>] _raw_spin_lock_irqsave+0x37/0x3f
[207345.496714]  [<ffffffffc02f9dc3>] nvkm_fantog_update+0x43/0x110 [nouveau]
[207345.496714]  [<ffffffffc02f9ee8>] nvkm_fantog_set+0x38/0x40 [nouveau]
[207345.496714]  [<ffffffffc02f936f>] nvkm_fan_update+0xbf/0x200 [nouveau]
[207345.496715]  [<ffffffffc02f94e9>] nvkm_therm_fan_set+0x19/0x20 [nouveau]
[207345.496715]  [<ffffffffc02f8beb>] nvkm_therm_update+0x9b/0x2e0 [nouveau]
[207345.496715]  [<ffffffffc02f8e47>] nvkm_therm_alarm+0x17/0x20 [nouveau]
[207345.496716]  [<ffffffffc02fc0d0>] nvkm_timer_alarm_trigger+0x100/0x150 [nouveau]
[207345.496716]  [<ffffffffc02fc1ef>] nvkm_timer_alarm+0x7f/0xd0 [nouveau]
[207345.496716]  [<ffffffffc02f9e85>] nvkm_fantog_update+0x105/0x110 [nouveau]
[207345.496717]  [<ffffffffc02f9eaa>] nvkm_fantog_alarm+0x1a/0x20 [nouveau]
[207345.496717]  [<ffffffffc02fc0d0>] nvkm_timer_alarm_trigger+0x100/0x150 [nouveau]
[207345.496718]  [<ffffffffc02fc4f2>] nv04_timer_intr+0x62/0xb0 [nouveau]
[207345.496718]  [<ffffffffc02fbf77>] nvkm_timer_intr+0x17/0x20 [nouveau]
[207345.496718]  [<ffffffffc02aa7c7>] nvkm_subdev_intr+0x17/0x20 [nouveau]
[207345.496719]  [<ffffffffc02eea15>] nvkm_mc_intr+0xe5/0x190 [nouveau]
[207345.496719]  [<ffffffffc02f35f3>] nvkm_pci_intr+0x53/0x80 [nouveau]
[207345.496719]  [<ffffffff8b4e0011>] __handle_irq_event_percpu+0x81/0x1a0
[207345.496720]  [<ffffffff8b4e0162>] handle_irq_event_percpu+0x32/0x80
[207345.496720]  [<ffffffff8b4e01ee>] handle_irq_event+0x3e/0x60
[207345.496720]  [<ffffffff8b4e3bf0>] handle_edge_irq+0x80/0x150
[207345.496721]  [<ffffffff8b4302cd>] handle_irq+0x1d/0x30
[207345.496721]  [<ffffffff8bc9d0db>] do_IRQ+0x4b/0xd0
[207345.496721]  [<ffffffff8bc9b1c2>] common_interrupt+0x82/0x82
[207345.496722]  <EOI> ^Ad [<ffffffff8bb1934b>] ? cpuidle_enter_state+0x12b/0x2d0
[207345.496722]  [<ffffffff8bb19527>] cpuidle_enter+0x17/0x20
[207345.496722]  [<ffffffff8b4c7a0a>] call_cpuidle+0x2a/0x50
[207345.496723]  [<ffffffff8b4c7dee>] cpu_startup_entry+0x29e/0x350
[207345.496723]  [<ffffffff8b4518b1>] start_secondary+0x151/0x190
[207345.496724] Code: 41 39 c0 74 e6 4d 85 c9 c6 07 01 74 30 41 c7 41 08 01 00 00 00 e9 51 ff ff ff 83 fa 01 0f 84 af fe ff ff 8b 07 84 c0 74 08 f3 90 <8b> 07 84 c0 75 f8 b8 01 00 00 00 66 89 07 5d c3 f3 90 4c 8b 09

This is for CPU 0

[207345.495724] NMI backtrace for cpu 0
[207345.495725] CPU: 0 PID: 14214 Comm: systemctl Tainted: G        W    L  4.8.0-53-generic #56~16.04.1-Ubuntu
[207345.495725] Hardware name: Dell Inc. Inspiron 3650/0C2XKD, BIOS 2.0.1 09/03/2015
[207345.495726] task: ffffa0241a56db80 task.stack: ffffa0241a618000
[207345.495726] RIP: 0010:[<ffffffff8b50b336>]  [<ffffffff8b50b336>] smp_call_function_many+0x1f6/0x250
[207345.495726] RSP: 0018:ffffa0241a61bce0  EFLAGS: 00000202
[207345.495727] RAX: 0000000000000003 RBX: 0000000000000200 RCX: 0000000000000003
[207345.495727] RDX: ffffa024e659cc68 RSI: 0000000000000200 RDI: ffffa024e641a288
[207345.495728] RBP: ffffa0241a61bd18 R08: 0000000000000000 R09: 000000000000000e
[207345.495728] R10: 0000000000000008 R11: ffffa024e641a288 R12: ffffa024e641a288
[207345.495728] R13: ffffa024e641a280 R14: ffffffffc09ca790 R15: 0000000000000000
[207345.495729] FS:  00007fe04de0f880(0000) GS:ffffa024e6400000(0000) knlGS:0000000000000000
[207345.495729] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[207345.495729] CR2: 000055f9604a6040 CR3: 000000019a651000 CR4: 00000000003406f0
[207345.495730] Stack:
[207345.495730]  000000000001a240 0100000000000001 00000000fffffffb ffffffffc09ca790
[207345.495730]  0000000000000000 0000000000000000 0000000000000000 ffffa0241a61bd40
[207345.495731]  ffffffff8b50b46d 00000000fffffffb ffffffff8c267150 0000000000000001
[207345.495731] Call Trace:
[207345.495731]  [<ffffffffc09ca790>] ? kvm_vcpu_block+0x300/0x300 [kvm]
[207345.495732]  [<ffffffff8b50b46d>] on_each_cpu+0x2d/0x60
[207345.495732]  [<ffffffffc09c941f>] kvm_reboot+0x2f/0x40 [kvm]
[207345.495732]  [<ffffffff8b4a4eba>] notifier_call_chain+0x4a/0x70
[207345.495733]  [<ffffffff8b4a51f7>] __blocking_notifier_call_chain+0x47/0x60
[207345.495733]  [<ffffffff8b4a5226>] blocking_notifier_call_chain+0x16/0x20
[207345.495734]  [<ffffffff8b4a64bd>] kernel_restart_prepare+0x1d/0x40
[207345.495734]  [<ffffffff8b4a6582>] kernel_restart+0x12/0x60
[207345.495734]  [<ffffffff8b4a6902>] SYSC_reboot+0x202/0x220
[207345.495735]  [<ffffffff8b63341c>] ? vfs_writev+0x3c/0x50
[207345.495735]  [<ffffffff8b633491>] ? do_writev+0x61/0xf0
[207345.495735]  [<ffffffff8b4a696e>] SyS_reboot+0xe/0x10
[207345.495736]  [<ffffffff8bc9a876>] entry_SYSCALL_64_fastpath+0x1e/0xa8
[207345.495736] Code: d2 e8 3f 94 33 00 3b 05 ed 3a e5 00 89 c1 0f 8d 99 fe ff ff 48 98 49 8b 55 00 48 03 14 c5 60 c4 35 8c 8b 42 18 a8 01 74 09 f3 90 <8b> 42 18 a8 01 75 f7 eb bf 0f b6 4d d0 4c 89 fa 4c 89 f6 44 89

For CPU1

[207345.495711] NMI backtrace for cpu 1
[207345.495712] CPU: 1 PID: 1247 Comm: ftdc Tainted: G        W    L  4.8.0-53-generic #56~16.04.1-Ubuntu
[207345.495712] Hardware name: Dell Inc. Inspiron 3650/0C2XKD, BIOS 2.0.1 09/03/2015
[207345.495713] task: ffffa024db476ac0 task.stack: ffffa024d83a4000
[207345.495713] RIP: 0010:[<ffffffff8b50b336>]  [<ffffffff8b50b336>] smp_call_function_many+0x1f6/0x250
[207345.495714] RSP: 0018:ffffa024d83a7b38  EFLAGS: 00000202
[207345.495714] RAX: 0000000000000003 RBX: 0000000000000200 RCX: 0000000000000003
[207345.495714] RDX: ffffa024e659d380 RSI: 0000000000000200 RDI: ffffa024e649a288
[207345.495715] RBP: ffffa024d83a7b70 R08: 0000000000000000 R09: 000000000000000d
[207345.495715] R10: 0000000000000008 R11: ffffa024e649a288 R12: ffffa024e649a288
[207345.495716] R13: ffffa024e649a280 R14: ffffffff8b472400 R15: ffffa024d83a7b80
[207345.495716] FS:  00007f871ddd2700(0000) GS:ffffa024e6480000(0000) knlGS:0000000000000000
[207345.495716] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[207345.495717] CR2: 00007f95bc40323f CR3: 0000000258e11000 CR4: 00000000003406e0
[207345.495717] Stack:
[207345.495718]  000000000001a240 0100000000000001 ffffa024d3ebf800 ffffffffffffffff
[207345.495718]  ffffa024d3ebfad8 0000000000000000 ffffffffffffffff ffffa024d83a7bb8
[207345.495718]  ffffffff8b472865 ffffa024d3ebf800 0000000000000000 ffffffffffffffff
[207345.495719] Call Trace:
[207345.495719]  [<ffffffff8b472865>] native_flush_tlb_others+0x65/0x130
[207345.495720]  [<ffffffff8b472a43>] flush_tlb_mm_range+0x63/0x150
[207345.495720]  [<ffffffff8b5d62b4>] tlb_flush_mmu_tlbonly+0x64/0xd0
[207345.495720]  [<ffffffff8b5d75b2>] tlb_flush_mmu+0x12/0x20
[207345.495721]  [<ffffffff8b61595d>] zap_huge_pmd+0x20d/0x3b0
[207345.495721]  [<ffffffff8b5d9168>] unmap_page_range+0x928/0x940
[207345.495721]  [<ffffffff8b47fc92>] ? mmput+0x12/0x130
[207345.495722]  [<ffffffff8b5d91fd>] unmap_single_vma+0x7d/0xe0
[207345.495722]  [<ffffffff8b5d9668>] zap_page_range+0xc8/0x140
[207345.495723]  [<ffffffff8b5ef47e>] SyS_madvise+0x43e/0x930
[207345.495723]  [<ffffffff8bc9a876>] entry_SYSCALL_64_fastpath+0x1e/0xa8
[207345.495724] Code: d2 e8 3f 94 33 00 3b 05 ed 3a e5 00 89 c1 0f 8d 99 fe ff ff 48 98 49 8b 55 00 48 03 14 c5 60 c4 35 8c 8b 42 18 a8 01 74 09 f3 90 <8b> 42 18 a8 01 75 f7 eb bf 0f b6 4d d0 4c 89 fa 4c 89 f6 44 89

And lastly, CPU 2:

[207330.487609] 4c 89 fa 4c 89 f6 44 89 
[207345.495645] sysrq: SysRq : Show backtrace of all active CPUs
[207345.495648] Sending NMI to all CPUs:
[207345.495699] NMI backtrace for cpu 2
[207345.495699] CPU: 2 PID: 15699 Comm: bash Tainted: G        W    L  4.8.0-53-generic #56~16.04.1-Ubuntu
[207345.495699] Hardware name: Dell Inc. Inspiron 3650/0C2XKD, BIOS 2.0.1 09/03/2015
[207345.495700] task: ffffa02409d30f40 task.stack: ffffa02409dfc000
[207345.495700] RIP: 0010:[<ffffffff8b83c3b0>]  [<ffffffff8b83c3b0>] delay_tsc+0x0/0x60
[207345.495701] RSP: 0018:ffffa02409dffe08  EFLAGS: 00000a07
[207345.495701] RAX: 000000007c3cc000 RBX: 0000000000002710 RCX: 00000000014b0e00
[207345.495702] RDX: 0000000000290d14 RSI: 0000000000000200 RDI: 0000000000290d15
[207345.495702] RBP: ffffa02409dffe10 R08: 0000000000000000 R09: 0000000000000006
[207345.495702] R10: 0000000000000001 R11: 0000000000011bf4 R12: 0000000000000004
[207345.495703] R13: 0000000000000000 R14: ffffffff8c2c1fe0 R15: 0000000000000000
[207345.495703] FS:  00007ff3a9e23700(0000) GS:ffffa024e6500000(0000) knlGS:0000000000000000
[207345.495704] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[207345.495704] CR2: 00000000009a5008 CR3: 0000000189dae000 CR4: 00000000003406e0
[207345.495704] Stack:
[207345.495705]  ffffffff8b83c32b ffffa02409dffe28 ffffffff8b833141 000000000000006c
[207345.495705]  ffffa02409dffe38 ffffffff8b456019 ffffa02409dffe48 ffffffff8b93e6e3
[207345.495706]  ffffa02409dffe78 ffffffff8b93ed9a 0000000000000002 fffffffffffffffb
[207345.495706] Call Trace:
[207345.495706]  [<ffffffff8b83c32b>] ? __const_udelay+0x2b/0x30
[207345.495707]  [<ffffffff8b833141>] nmi_trigger_all_cpu_backtrace+0xc1/0x150
[207345.495707]  [<ffffffff8b456019>] arch_trigger_all_cpu_backtrace+0x19/0x20
[207345.495707]  [<ffffffff8b93e6e3>] sysrq_handle_showallcpus+0x13/0x20
[207345.495708]  [<ffffffff8b93ed9a>] __handle_sysrq+0xea/0x140
[207345.495708]  [<ffffffff8b93f21f>] write_sysrq_trigger+0x2f/0x40
[207345.495709]  [<ffffffff8b6a6872>] proc_reg_write+0x42/0x70
[207345.495709]  [<ffffffff8b632748>] __vfs_write+0x18/0x40
[207345.495709]  [<ffffffff8b632e98>] vfs_write+0xb8/0x1b0
[207345.495710]  [<ffffffff8b6342f5>] SyS_write+0x55/0xc0
[207345.495710]  [<ffffffff8bc9a876>] entry_SYSCALL_64_fastpath+0x1e/0xa8
[207345.495711] Code: 12 48 c1 e2 06 48 89 e5 48 c1 e0 02 48 29 ca f7 e2 48 8d 7a 01 ff 15 b8 59 a7 00 5d c3 0f 1f 40 00 66 2e 0f 1f 84 00 00 00 00 00 <0f> 1f 44 00 00 55 48 89 e5 65 44 8b 05 27 de 7c 74 0f ae e8 0f 

Here's what cat /proc/1160/task/1247/stat gives me:

1247 (ftdc) R 1 1160 1160 0 -1 4194368 3495 0 0 0 33464 4158293 0 0 20 0 4 0 645 1763782656 173550 18446744073709551615 94481603162112 94481648347376 140722953733664 140218298338104 140218408507335 256 8405507 6145 1260 0 0 0 -1 1 0 0 1 0 0 94481648352704 94481650153520 94481669767168 140722953735785 140722953735827 140722953735827 140722953736168 0

Best Answer

What you have is a multithreaded application in which one thread appears to have hit a kernel bug.

Some analysis of the bug

You have tried to shut down the process mongod with ID 1160. The main thread with ID 1160 is in a zombie state waiting for the other threads in the process to die.

The thread ftdc with ID 1247 hit a kernel bug at some point while calling the madvise system call, and the call ended up in an infinite loop.
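The per-thread states described above can be read from /proc/<pid>/task/<tid>/stat. The following is a minimal Python sketch, not part of any MongoDB or kernel tooling; the helper names parse_stat and thread_states are made up for illustration. It only extracts the first few fields (ID, name, state, parent PID) of each stat line.

```python
from pathlib import Path


def parse_stat(line):
    """Decode the leading fields of a /proc/<pid>/task/<tid>/stat line.

    The thread name sits in parentheses and may itself contain spaces,
    so split on the LAST ')' rather than on whitespace.
    """
    head, _, tail = line.rpartition(")")
    tid_text, _, comm = head.partition("(")
    fields = tail.split()
    return {
        "tid": int(tid_text),    # thread ID
        "comm": comm,            # thread name, e.g. "ftdc"
        "state": fields[0],      # R = running, S = sleeping, Z = zombie, ...
        "ppid": int(fields[1]),  # parent process ID
    }


def thread_states(pid, proc_root="/proc"):
    """Return the decoded stat fields of every thread of process `pid`."""
    result = []
    task_dir = Path(proc_root, str(pid), "task")
    for task in sorted(task_dir.iterdir(), key=lambda p: int(p.name)):
        try:
            result.append(parse_stat((task / "stat").read_text()))
        except OSError:
            # The thread exited between listing and reading; skip it.
            continue
    return result
```

Calling thread_states(1160) on the affected machine would be expected to list the main thread in state Z and the ftdc thread in state R, which matches the top output and the stat line in the question.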

The kernel has a watchdog which noticed the stuck thread and logged a stacktrace to the kernel log. The stacktrace included the name of the thread. Because the name of the thread and the name of the process were different in this case, the connection between the two was not immediately obvious from the stacktrace.
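To make the link between a watchdog message and the owning process easier to see, the thread name and thread ID can be pulled out of the log line and then looked up under /proc. A rough Python sketch follows; the regular expression is written against the message format shown in the question (kernel 4.8) and is an assumption for any other kernel version, and parse_soft_lockup is a hypothetical helper name.

```python
import re

# Matches e.g. "BUG: soft lockup - CPU#1 stuck for 22s! [ftdc:1247]"
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<tid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return CPU, duration, thread name and thread ID, or None if no match."""
    match = LOCKUP_RE.search(line)
    if match is None:
        return None
    return {
        "cpu": int(match.group("cpu")),
        "seconds": int(match.group("secs")),
        "comm": match.group("comm"),
        "tid": int(match.group("tid")),
    }
```

The number in the brackets is a thread ID, not necessarily the process ID. In this case 1247 is a thread of process 1160, which is what pstree shows in the question.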

That thread was likely stuck in that state before you even tried to shut down mongod in the first place.

When you later ran echo l > /proc/sysrq-trigger, a stacktrace for the stuck thread was logged again. The two stacktraces are entirely identical, so it may very well have been stuck in the same place all along.
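Comparing two stacktraces by eye is error-prone because the timestamps differ on every line. One way to check that two dumps show the same call chain is to reduce each to its list of function names. This is a small Python sketch under the assumption that frames look like the ones in the question ("[<address>] symbol+0xoff/0xsize"); call_trace is a hypothetical helper, and frames marked with "?" are skipped because the kernel flags those as unreliable.

```python
import re

# One frame: "[<ffffffff8b472865>] native_flush_tlb_others+0x65/0x130"
# An optional "? " before the symbol marks a frame the kernel is unsure about.
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?P<unsure>\?\s+)?"
    r"(?P<sym>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def call_trace(lines):
    """Return the reliable function names found in `lines`, in log order.

    Note that a "RIP:" line has the same shape as a frame, so it is
    included too if it is part of the input.
    """
    frames = []
    for line in lines:
        match = FRAME_RE.search(line)
        if match and not match.group("unsure"):
            frames.append(match.group("sym"))
    return frames
```

If call_trace returns the same list for the watchdog dump and for the later sysrq dump of the same thread, the thread was in the same code path both times.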

Reporting the bug

What you need to do is file a bug against the kernel. Remember to include the log output from the first time the watchdog detected that the thread was stuck.

Rebooting the system

To get this system back into a good state you will have to reboot, and there is a significant risk that a clean shutdown won't be possible.

If you attempt a clean shutdown, you may need physical access to the machine in order to reset it, unless you have a way to power cycle the machine remotely.

You can attempt an unclean reboot with echo b > /proc/sysrq-trigger, which is about as disruptive as yanking the power from the machine. It will avoid the scenario where an attempted clean shutdown gets stuck and you can no longer ssh to the machine.

Whatever you do, expect a file system check to be needed during boot. So before attempting to shut down the machine in any way, you should stop services writing important data to disk and run a sync command.

There is a risk that a sync command will get stuck. However, since the stacktrace of the stuck process doesn't include anything file system or I/O related, I consider that risk to be minor.
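The sync-then-reboot sequence can be sketched in Python as follows. This is only an illustration of the order of operations, not a recommended tool: emergency_reboot is a made-up name, the function defaults to a dry run, and a real run needs root and reboots the machine immediately without unmounting anything. As noted above, the sync step could itself hang on a machine in this state.

```python
import os


def emergency_reboot(trigger_path="/proc/sysrq-trigger", dry_run=True):
    """Flush dirty data to disk, then request an immediate reboot via sysrq.

    Returns True if the reboot request was written, False for a dry run.
    Stop services that write important data BEFORE calling this.
    """
    os.sync()  # same effect as the sync command: flush file system buffers
    if dry_run:
        return False
    # Equivalent to: echo b > /proc/sysrq-trigger
    with open(trigger_path, "w") as trigger:
        trigger.write("b")
    return True
```

Keeping the dry run as the default means an accidental call only performs the sync; the reboot happens only with an explicit dry_run=False.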

There is also a risk that you will need physical access to the machine to get it through the boot due to file system inconsistencies. The probability of that is, however, lower than the probability that an attempted clean shutdown will get stuck.