NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [mysqld:10435]

I have had a crash.
I couldn’t access remotely, and after 4 days I could access to the computer, it was irresponsive, I did a hardware reboot and when I lloked at the logs I found It has beeen logging during the full 4 days.


2018-10-31T21:50:02.338301+01:00 tutatis systemd[10365]: Stopped target Basic System.
2018-10-31T21:50:02.338593+01:00 tutatis systemd[10365]: Stopped target Sockets.
2018-10-31T21:50:02.338874+01:00 tutatis systemd[10365]: Stopped target Timers.
2018-10-31T21:50:02.339190+01:00 tutatis systemd[10365]: Stopped target Paths.
2018-10-31T21:50:02.348416+01:00 tutatis systemd[10365]: Received SIGRTMIN+24 from PID 10401 (kill).
2018-10-31T21:50:02.351498+01:00 tutatis systemd: pam_unix(systemd-user:session): session closed for user wwwrun
2018-10-31T21:50:02.353029+01:00 tutatis systemd[1]: Stopped User Manager for UID 30.
2018-10-31T21:50:02.353899+01:00 tutatis systemd[1]: Removed slice User Slice of wwwrun.

2018-10-31T21:50:05.696180+01:00 tutatis CRON[10348]: pam_unix(crond:session): session closed for user fperal
2018-10-31T21:56:02.500682+01:00 tutatis kernel: [221711.648107] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [mysqld:10435]

2018-10-31T21:56:02.500699+01:00 tutatis kernel: [221711.648113] Modules linked in: parport_pc lp cdc_ether usbnet r8152 st nls_iso8859_1 nls_cp437 vfat fat uas usb_storage joydev fuse ctr ccm af_packet iscsi_ibft iscsi_boot_sysfs hwmon_vid msr arc4 rt2800usb rt2x00usb rt2800lib rt2x00lib mac80211 cfg80211 crc_ccitt rfkill snd_hda_codec_hdmi snd_hda_codec_realtek kvm snd_hda_codec_generic snd_hda_intel snd_hda_codec irqbypass snd_hda_core snd_hwdep snd_pcm snd_seq snd_seq_device snd_timer crct10dif_pclmul crc32_pclmul ghash_clmulni_intel drbg ansi_cprng snd aesni_intel aes_x86_64 lrw r8169 soundcore gf128mul glue_helper mii ablk_helper ppdev shpchp cryptd i2c_piix4 acpi_cpufreq fjes parport processor pcspkr nfsd auth_rpcgss nfs_acl lockd grace sunrpc ext4 crc16 jbd2 mbcache raid1 raid10 raid0 md_mod sr_mod cdrom sd_mod hid_generic usbhid nouveau(O) crc32c_intel video mxm_wmi i2c_algo_bit drm_kms_helper(O) syscopyarea sysfillrect sysimgblt fb_sys_fops ttm(O) xhci_pci xhci_hcd drm(O) r8168(O) ahci usbcore libahci usb_common wmi button ata_generic pata_atiixp libata sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua scsi_mod autofs4 [last unloaded: parport_pc]
2018-10-31T21:56:02.500701+01:00 tutatis kernel: [221711.648208] CPU: 7 PID: 10435 Comm: mysqld Tainted: G       O 4.4.155-68-default #1
2018-10-31T21:56:02.500702+01:00 tutatis kernel: [221711.648211] Hardware name: MSI MS-7A33/X370 GAMING PRO (MS-7A33), BIOS 4.00 03/20/2017
2018-10-31T21:56:02.500703+01:00 tutatis kernel: [221711.648214] task: ffff8800d8678280 ti: ffff8800dc9c0000 task.ti: ffff8800dc9c0000
2018-10-31T21:56:02.500704+01:00 tutatis kernel: [221711.648217] RIP: 0010:<ffffffff81112875>]  <ffffffff81112875>] smp_call_function_many+0x205/0x260
2018-10-31T21:56:02.500705+01:00 tutatis kernel: [221711.648225] RSP: 0018:ffff8800dc9c3ce8  EFLAGS: 00000202
2018-10-31T21:56:02.500706+01:00 tutatis kernel: [221711.648228] RAX: 0000000000000003 RBX: ffffffff810714e0 RCX: ffffe8ffffd82080
2018-10-31T21:56:02.500721+01:00 tutatis kernel: [221711.648230] RDX: 000000000000000e RSI: 0000000000000010 RDI: ffff8801862de0a0
2018-10-31T21:56:02.500722+01:00 tutatis kernel: [221711.648232] RBP: ffff8800dc9c3d30 R08: 0000000000004100 R09: 0000000000000000
2018-10-31T21:56:02.500723+01:00 tutatis kernel: [221711.648234] R10: 0000000000000000 R11: 0000000000000206 R12: 0000000000000010
2018-10-31T21:56:02.500724+01:00 tutatis kernel: [221711.648236] R13: 0000000000000001 R14: 00000000000175c0 R15: ffff88041edd7600
2018-10-31T21:56:02.500725+01:00 tutatis kernel: [221711.648240] FS:  00007f2f8bf24700(0000) GS:ffff88041edc0000(0000) knlGS:00000000f7541940
2018-10-31T21:56:02.500726+01:00 tutatis kernel: [221711.648242] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
2018-10-31T21:56:02.500727+01:00 tutatis kernel: [221711.648244] CR2: 00007fcdea9ab000 CR3: 0000000401d6e000 CR4: 00000000003406f0
2018-10-31T21:56:02.500728+01:00 tutatis kernel: [221711.648246] Stack:
2018-10-31T21:56:02.500729+01:00 tutatis kernel: [221711.648248]  000000018bf1f000 0000000000000007 00007f2f8bf1f000 00007f2f8bf1f000
2018-10-31T21:56:02.500730+01:00 tutatis kernel: [221711.648251]  00007f2f8bf18000 ffff880407cc8040 ffff880407cc83b0 00007f2f8bf1f000
2018-10-31T21:56:02.500731+01:00 tutatis kernel: [221711.648255]  ffffffff810719c9 ffff880407cc8040 00007f2f8bf18000 00007f2f8bf1f000
2018-10-31T21:56:02.500732+01:00 tutatis kernel: [221711.648258] Call Trace:
2018-10-31T21:56:02.500733+01:00 tutatis kernel: [221711.648274]  <ffffffff810719c9>] native_flush_tlb_others+0x119/0x1a0
2018-10-31T21:56:02.500734+01:00 tutatis kernel: [221711.648283]  <ffffffff81071af6>] flush_tlb_mm_range+0xa6/0x190
2018-10-31T21:56:02.500735+01:00 tutatis kernel: [221711.648291]  <ffffffff811c87d6>] tlb_flush_mmu_tlbonly+0x66/0xc0
2018-10-31T21:56:02.500736+01:00 tutatis kernel: [221711.648299]  <ffffffff811c96bf>] tlb_finish_mmu+0xf/0x40
2018-10-31T21:56:02.500737+01:00 tutatis kernel: [221711.648305]  <ffffffff811cb511>] zap_page_range+0xc1/0xf0
2018-10-31T21:56:02.500737+01:00 tutatis kernel: [221711.648313]  <ffffffff811df97a>] SyS_madvise+0x35a/0x770
2018-10-31T21:56:02.500738+01:00 tutatis kernel: [221711.648322]  <ffffffff816455c7>] entry_SYSCALL_64_fastpath+0x22/0xbb
2018-10-31T21:56:02.500739+01:00 tutatis kernel: [221711.651597] DWARF2 unwinder stuck at entry_SYSCALL_64_fastpath+0x22/0xbb
2018-10-31T21:56:02.500740+01:00 tutatis kernel: [221711.651598]
2018-10-31T21:56:02.500741+01:00 tutatis kernel: [221711.651600] Leftover inexact backtrace:

2018-10-31T21:56:02.500742+01:00 tutatis kernel: [221711.651600]
2018-10-31T21:56:02.500743+01:00 tutatis kernel: [221711.651604] Code: 24 00 3b 05 b2 61 e5 00 89 c2 0f 8d 8f fe ff ff 48 98 49 8b 0f 48 03 0c c5 40 43 f5 81 8b 41 18 a8 01 74 09 f3 90 8b 41 18 a8 01 <75> f7 eb bd 0f b6 4c 24 04 48 83 c4 10 48 89 de 5b 48 89 ea 44
2018-10-31T21:56:30.500664+01:00 tutatis kernel: [221739.648562] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [mysqld:10435]
2018-10-31T21:56:30.500681+01:00 tutatis kernel: [221739.648568] Modules linked in: parport_pc lp cdc_ether usbnet r8152 st nls_iso8859_1 nls_cp437 vfat fat uas usb_storage joydev fuse ctr ccm af_packet iscsi_ibft iscsi_boot_sysfs hwmon_vid msr arc4 rt2800usb rt2x00usb rt2800lib rt2x00lib mac80211 cfg80211 crc_ccitt rfkill snd_hda_codec_hdmi snd_hda_codec_realtek kvm snd_hda_codec_generic snd_hda_intel snd_hda_codec irqbypass snd_hda_core snd_hwdep snd_pcm snd_seq snd_seq_device snd_timer crct10dif_pclmul crc32_pclmul ghash_clmulni_intel drbg ansi_cprng snd aesni_intel aes_x86_64 lrw r8169 soundcore gf128mul glue_helper mii ablk_helper ppdev shpchp cryptd i2c_piix4 acpi_cpufreq fjes parport processor pcspkr nfsd auth_rpcgss nfs_acl lockd grace sunrpc ext4 crc16 jbd2 mbcache raid1 raid10 raid0 md_mod sr_mod cdrom sd_mod hid_generic usbhid nouveau(O) crc32c_intel video mxm_wmi i2c_algo_bit drm_kms_helper(O) syscopyarea sysfillrect sysimgblt fb_sys_fops ttm(O) xhci_pci xhci_hcd drm(O) r8168(O) ahci usbcore libahci usb_common wmi button ata_generic pat



The problem begins at 2018-10-31T21:56:02 and … is related to mysql ?



tutatis:~ # cat /var/log/messages |grep NMI
2018-10-31T21:56:02.500682+01:00 tutatis kernel: [221711.648107] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [mysqld:10435]
2018-10-31T21:56:30.500664+01:00 tutatis kernel: [221739.648562] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [mysqld:10435]
2018-10-31T21:56:58.500672+01:00 tutatis kernel: [221767.649018] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [mysqld:10435]
2018-10-31T21:57:26.500671+01:00 tutatis kernel: [221795.649474] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [mysqld:10435]
2018-10-31T21:57:54.500664+01:00 tutatis kernel: [221823.649930] NMI watchdog: BUG: soft lockup - CPU#7 stuck for 22s! [mysqld:10435]



have found documentation of similar errors, and this explanation
https://www.suse.com/es-es/support/kb/doc/?id=7017652

I don’t know if it something related with mysqld or just a problem with the kernel and mysqld whas just running in thet moment.

I have increase the watchdog threshols to 25:

tutatis:~ # sysctl -a |grep watchdog

kernel.nmi_watchdog = 1

kernel.soft_watchdog = 1

kernel.watchdog = 1

kernel.watchdog_cpumask = 0-15

kernel.watchdog_thresh = 10

tutatis:~ # sysctl -w kernel.watchdog_thresh=25

kernel.watchdog_thresh = 25

tutatis:~ #


And I’m waiting… but I don’t really know the cause of the error or if it will repeat or not.

any help?
thanks

The hardware is

AMD RYZEN 7 1700 ( I have checked, it is not one of the pre-25 week)
G.Skill Ripjaws V Series F4-3200C16D-16GVR 16 GB (2x8GB) DDR4
MSI X370 Gaming Pro

It seems it is related to a problem with Ryzen processors and C6 states.

I’m using this script

And I’m testing two ways:

First, I added a line in /etc/crontab

* * * * * root /usr/bin/stress-ng --cpu 2 --timeout 10

It makes 2 of the cores of the CPU (it has 16) work intensely 10 seconds each minute (I know is a quick fix, if it works)

Second, I tested the C6 states


tutatis:~/ZenStates-Linux # ./zenstates.py -l
P0 - Enabled - FID = 78 - DID = 8 - VID = 3A - Ratio = 30.00 - vCore = 1.18750
P1 - Enabled - FID = 87 - DID = A - VID = 50 - Ratio = 27.00 - vCore = 1.05000
P2 - Enabled - FID = 7C - DID = 10 - VID = 6C - Ratio = 15.50 - vCore = 0.87500
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Enabled
C6 State - Core - Enabled


and then I disabled C6

tutatis:~/ZenStates-Linux # ./zenstates.py --c6-disable
Disabling C6 state
tutatis:~/ZenStates-Linux # ./zenstates.py -l
P0 - Enabled - FID = 78 - DID = 8 - VID = 3A - Ratio = 30.00 - vCore = 1.18750
P1 - Enabled - FID = 87 - DID = A - VID = 50 - Ratio = 27.00 - vCore = 1.05000
P2 - Enabled - FID = 7C - DID = 10 - VID = 6C - Ratio = 15.50 - vCore = 0.87500
P3 - Disabled
P4 - Disabled
P5 - Disabled
P6 - Disabled
P7 - Disabled
C6 State - Package - Disabled
C6 State - Core - Disabled
tutatis:~/ZenStates-Linux #


And I added in /etc/crontab

@reboot /usr/bin/python /root/ZenStates-Linux/zenstates.py --c6-disable &


Let’s see if it works

It’s working
By the way. Maybe it’s not Ryzen fault but PSU’s fault.
See this article.
I’m using a Mars MPII750 Gaming.

It has been working 24/7 almost 3 weeks now so I can say I found the solution.
I remove the stress test in the cron and now it’s only working with the C6 disabled.
So I don’t really now if it is a problem of the CPU, of the PSU, or both, but disabling C6 solves the problem.

best regards