Plasma crash

JoeS · March 15, 2021, 3:50pm

I have been using TW for about 1.5 years in a VMWare VM with Windows 10 as the host.

Overall it has worked well and I have not encountered any major issues and generally have updated with zypper dup shortly after new updates are available.

Based on that success, I decided to install TW with dual boot.

As a dry run test, I first downloaded the latest TW ISO image and created a new VM where I installed it and then went through the process of installing other apps/packages and configuring it as I had done in the first VM.

For the last 2 weeks or so I have been having random problems and crashes in the newly built VM.

I use that VM all day long and don’t seem to run into any issues, however, sometime after the machine becomes idle, generally at night when I go to bed the machine crashes.

After it happened the first few times, I started 2 ssh sessions on the host computer, one running top and the other running journalctl --follow, so that I could see what happens just before it crashes.

From that I have found 2 common things:

plasmashell was at 100% CPU
journalctl has an error like this: watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [plasmashell:37158]
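A minimal sketch of how the soft-lockup messages could be pulled out of the journal afterwards instead of being watched live over ssh. The function name parse_lockups is hypothetical, and the pattern assumes the message format quoted in this post; other kernel versions may word it differently.

```shell
#!/bin/sh
# parse_lockups: hypothetical helper. Reads kernel log lines on stdin and
# prints "cpu seconds command pid" for every soft-lockup message found.
# Assumes the format: "soft lockup - CPU#N stuck for Ns! [command:pid]".
parse_lockups() {
    sed -n 's/.*soft lockup - CPU#\([0-9][0-9]*\) stuck for \([0-9][0-9]*\)s! \[\(.*\):\([0-9][0-9]*\)].*/\1 \2 \3 \4/p'
}

# Example usage on the guest (kernel messages from the previous boot):
#   journalctl -k -b -1 | parse_lockups
```

Reading the previous boot (-b -1) only works if the journal is persistent on the guest; that is an assumption about the setup.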

Today there was some additional info:

watchdog: BUG: soft lockup - CPU#3 stuck for 23s! [plasmashell:37158]
Modules linked in: binfmt_misc udp_diag tcp_diag inet_diag md4 cmac nls_utf8 cifs libarc4 dns_resolver fscache libdes af_packet nft_objref nf_conntrack_netbios_ns nf_conntrack_broadcast nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_tables ebtable_nat ebtable_broute ip6table_nat ip6table_mangle ip6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_mangle iptable_raw iptable_security ip_set nfnetlink ebtable_filter ebtables rfkill ip6table_filter ip6_tables iptable_filter ip_tables x_tables bpfilter vsock_loopback vmw_vsock_virtio_transport_common vmw_vsock_vmci_transport vsock snd_seq_midi snd_seq_midi_event snd_seq snd_ens1371 snd_ac97_codec ac97_bus gameport snd_rawmidi intel_rapl_msr intel_rapl_common kvm_intel snd_seq_device snd_pcm snd_timer snd kvm soundcore vmw_balloon irqbypass e1000 mptctl vmw_vmci joydev i2c_piix4 pcspkr efi_pstore
 tiny_power_button button ac nls_iso8859_1 nls_cp437 vfat fat fuse configfs hid_generic usbhid sr_mod cdrom ata_generic vmwgfx drm_kms_helper crct10dif_pclmul crc32_pclmul ghash_clmulni_intel xhci_pci xhci_pci_renesas syscopyarea sysfillrect sysimgblt xhci_hcd fb_sys_fops cec rc_core ttm uhci_hcd drm aesni_intel glue_helper crypto_simd ehci_pci ehci_hcd cryptd usbcore ata_piix serio_raw mptspi scsi_transport_spi mptscsih mptbase btrfs blake2b_generic libcrc32c crc32c_intel xor raid6_pq sg dm_multipath dm_mod scsi_dh_rdac scsi_dh_emc scsi_dh_alua msr efivarfs
CPU: 3 PID: 37158 Comm: plasmashell Tainted: G        W         5.11.4-1-default #1 openSUSE Tumbleweed
Hardware name: VMware, Inc. VMware7,1/440BX Desktop Reference Platform, BIOS VMW71.00V.16722896.B64.2008100651 08/10/2020
RIP: 0010:native_queued_spin_lock_slowpath+0x20/0x1d0
Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 0f 1f 44 00 00 ba 01 00 00 00 8b 07 85 c0 75 09 f0 0f b1 17 85 c0 75 f2 c3 f3 90 <eb> ed 81 fe 00 01 00 00 74 43 40 30 f6 85 f6 75 65 f0 0f ba 2f 08
RSP: 0000:ffffb9aa82f37c88 EFLAGS: 00000202
RAX: 0000000000000008 RBX: ffff9f5340000000 RCX: 0000000000000027
RDX: 0000000000000001 RSI: 0000000000000008 RDI: fffff8b601ee8028
RBP: ffffb9aa82f37cc0 R08: fffff8b601ee8028 R09: 0000000081000200
R10: 000000007ba00fff R11: 0000000000000000 R12: ffff9f53bba00000
R13: ffff9f54d3357080 R14: 00007f2a6e800000 R15: 0000000000000000
FS:  00007f2d543fb980(0000) GS:ffff9f5575ec0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2a6e7fffe0 CR3: 0000000107bea006 CR4: 00000000001706e0
Call Trace:

Occasionally, I come back to the machine after it has been idle for a while but before it has crashed, and I have observed that all the swap space has been consumed. That suggests a memory leak, although it seems odd that I can work on the machine all day and not have that problem until the machine is idle at night.

The VM has 8 GB of RAM allocated and an 8 GB swap partition.

Generally, when I leave the machine for the night there are only a few apps left running, Konsole, KSysGuard, and Chrome.

On the few occasions where I got to the machine before it crashed, I tried to find out what was consuming all that swap space but was unable to determine what it was.
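One way to see which processes hold swap is to read the VmSwap field that the kernel exposes in /proc/PID/status. The sketch below is only an illustration; the function name swap_by_process and its optional directory argument are hypothetical, added so the helper can be pointed at a copy of /proc data.

```shell
#!/bin/sh
# swap_by_process: hypothetical helper that lists processes by swap usage.
# Reads Name and VmSwap from <procdir>/<pid>/status (procdir defaults to
# /proc) and prints "kB pid name", largest first. Entries without a
# VmSwap line (kernel threads) or with 0 kB are skipped.
swap_by_process() {
    procdir=${1:-/proc}
    for f in "$procdir"/[0-9]*/status; do
        [ -r "$f" ] || continue
        awk -v pid="$(basename "$(dirname "$f")")" '
            /^Name:/   { name = $2 }
            /^VmSwap:/ { swap = $2 }
            END        { if (swap > 0) print swap, pid, name }
        ' "$f" 2>/dev/null
    done | sort -rn
}

# Example usage: the ten largest swap consumers
#   swap_by_process | head -n 10
```

Running it from a timer or a loop that appends to a file would leave a record even if the machine hangs before anyone can look, assuming the disk is still writable at that point.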

The host machine is an Intel i7 with 32 GB of RAM.

The “stuck” CPU problem has never happened with the host machine (Windows 10), nor has it happened with any of my Windows 10 Virtual machines.

I am only using 2 additional repos, one for Skype and the other for VLC (following the TW instructions for adding it).

Here’s the repository list:

Repository priorities in effect:                                                                                         
      90 (raised priority)  :  1 repository
      99 (default priority) :  7 repositories

# | Alias                            | Name                      | Enabled | GPG Check | Refresh
--+----------------------------------+---------------------------+---------+-----------+--------
1 | download.opensuse.org-non-oss    | Main Repository (NON-OSS) | Yes     | (r ) Yes  | Yes
2 | download.opensuse.org-oss        | Main Repository (DEBUG)   | Yes     | (r ) Yes  | Yes
3 | download.opensuse.org-oss_1      | Main Repository (Sources) | Yes     | (r ) Yes  | Yes
4 | download.opensuse.org-oss_2      | Main Repository (OSS)     | Yes     | (r ) Yes  | Yes
5 | download.opensuse.org-tumbleweed | Main Update Repository    | Yes     | (r ) Yes  | Yes
6 | google-chrome                    | google-chrome             | Yes     | (r ) Yes  | Yes
7 | openSUSE-20210215-0              | openSUSE-20210215-0       | No      | ----      | ----
8 | skype-stable                     | skype (stable)            | Yes     | (r ) Yes  | Yes
9 | vlc                              | VLC                       | Yes     | (r ) Yes  | Yes

The above error occurred last night while running TW release 20210311 but similar errors were occurring with the builds from the last 2 weeks or so. I just did a zypper dup up to 20210312.

I would greatly appreciate any help in how to best go about debugging this issue.

Thanks!

hcvv · March 15, 2021, 3:58pm

I am afraid I can’t really help you (others will come to this thread), but your repository list is missing the critical information: the URLs.
Better post

zypper lr -d

JoeS · March 15, 2021, 4:24pm

Here you go…

# | Alias                            | Name                      | Enabled | GPG Check | Refresh | Priority | Type   | URI                                                                                  | Service
--+----------------------------------+---------------------------+---------+-----------+---------+----------+--------+--------------------------------------------------------------------------------------+--------
1 | download.opensuse.org-non-oss    | Main Repository (NON-OSS) | Yes     | (r ) Yes  | Yes     |   99     | rpm-md | http://download.opensuse.org/tumbleweed/repo/non-oss/                                |  
2 | download.opensuse.org-oss        | Main Repository (DEBUG)   | Yes     | (r ) Yes  | Yes     |   99     | rpm-md | http://download.opensuse.org/debug/tumbleweed/repo/oss/                              |  
3 | download.opensuse.org-oss_1      | Main Repository (Sources) | Yes     | (r ) Yes  | Yes     |   99     | rpm-md | http://download.opensuse.org/source/tumbleweed/repo/oss/                             |  
4 | download.opensuse.org-oss_2      | Main Repository (OSS)     | Yes     | (r ) Yes  | Yes     |   99     | rpm-md | http://download.opensuse.org/tumbleweed/repo/oss/                                    |  
5 | download.opensuse.org-tumbleweed | Main Update Repository    | Yes     | (r ) Yes  | Yes     |   99     | rpm-md | http://download.opensuse.org/update/tumbleweed/                                      |  
6 | google-chrome                    | google-chrome             | Yes     | (r ) Yes  | Yes     |   99     | rpm-md | http://dl.google.com/linux/chrome/rpm/stable/x86_64                                  |  
7 | openSUSE-20210215-0              | openSUSE-20210215-0       | No      | ----      | ----    |   99     | rpm-md | cd:/?devices=/dev/disk/by-id/ata-VMware_Virtual_IDE_CDROM_Drive_10000000000000000001 |  
8 | skype-stable                     | skype (stable)            | Yes     | (r ) Yes  | Yes     |   99     | rpm-md | https://repo.skype.com/rpm/stable/                                                   |  
9 | vlc                              | VLC                       | Yes     | (r ) Yes  | Yes     |   90     | rpm-md | http://download.videolan.org/pub/vlc/SuSE/Tumbleweed/                                |

JoeS · March 16, 2021, 12:46am

Is there a better place to post this issue than here?

karlmistelberger · March 16, 2021, 7:27am

Could be true. However I presume there are better strategies to troubleshoot the problem. Try a native install into a single partition and verify Tumbleweed and your configuration are working on your hardware.

JoeS · March 16, 2021, 3:08pm

Hi karlmistelberger,

Thanks for your reply.

It appears you may have missed a critical part of my 1st message…

I have been using TW for about 1.5 YEARS in a VMWare VM with Windows 10 as the host so there is no doubt that TW is compatible with my HW since it has been working for 1.5 YEARS.

The issue occurred when I did a fresh/clean install into a new VM, using the latest ISO (as of about 2 weeks ago) and the default partition layout, and then installed my apps, which all come from the TW repositories except for Google Chrome and MS Skype. In that VM the plasma issues started occurring, and the only difference is that it was updated to a version of TW a few days newer than the prior VM.

It seems pretty clear to me that if plasmashell is always at 100% CPU just before the system hangs/crashes, there is some bug in plasmashell when the system is idle or unused because, as I also mentioned in my initial post, I use the machine ALL DAY LONG and it does not have the issue until it is left to idle at night when I go to bed.

Another possibility seems to be that there is some processing TW schedules to run at night which triggers the bug in plasmashell which is why I am able to work all day long without issue.

The issue I’m having is trying to figure out what triggers plasmashell to have the problem which is why I posted here…

Joe

susejunky · March 16, 2021, 3:35pm

First of all: Be warned i’m no heavy user of virtual machines so i can only offer some “basic ideas”.

Here are a few problem areas which come to my mind and which might be worth investigating (but probably you checked them already):

disk indexing (Plasma5: File search)
power save function (combined with a possible loss of network connection which in turn means a loss of SAMBA or NFS shares and other network related connections)
some other (timer driven) background task (=> systemctl list-timers) that “runs wild” (=> journalctl -p 3)
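The timer and journal checks above can be narrowed to the hours when the machine sits idle. This is a rough sketch: the helper names overnight_since and overnight_until are hypothetical, the 22:00 to 07:00 window is an assumption, and date -d requires GNU date.

```shell
#!/bin/sh
# Hypothetical helpers that print timestamps bounding last night's idle
# window, in the "YYYY-MM-DD HH:MM" form accepted by journalctl.
# The 22:00 and 07:00 boundaries are assumptions; adjust as needed.
overnight_since() { printf '%s 22:00' "$(date -d yesterday +%F)"; }
overnight_until() { printf '%s 07:00' "$(date +%F)"; }

# Example usage on the guest:
#   systemctl list-timers --all
#   journalctl -p 3 --since "$(overnight_since)" --until "$(overnight_until)"
```

Comparing the LAST column of list-timers with the time of the first error in that window shows whether a timer fired shortly before the trouble started.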

Regards

susejunky

karlmistelberger · March 16, 2021, 3:48pm

I am aware of this.

It seems pretty clear to me that if plasmashell is always at 100% CPU just before the system hangs/crashes that there is some bug in plasmashell when the system is idle or unused because as I also mentioned in my in initial reply I use the machine ALL DAY LONG and it does not have the issue until it is left to idle at night when I go to bed.

Another possibility seems to be that there is some processing TW schedules to run at night which triggers the bug in plasmashell which is why I am able to work all day long without issue.

The issue I’m having is trying to figure out what triggers plasmashell to have the problem which is why I posted here…

I am not sure. Your hardware is aging, you upgrade Tumbleweed, you change Tumbleweed configuration. My system was extremely shaky when running Leap (see the thread Konsole With "su -" Freezing The System in Applications, openSUSE Forums). Since switching to Tumbleweed the system is rock solid. Plasmashell runs forever without any issues until Tumbleweed gets upgraded and rebooted. However some other hardware showed puzzling behavior: https://forums.opensuse.org/showthread.php/547390-Puzzling-Keyboard

Thus I think your conclusion is premature. A working virtual machine is great. A troubled one is difficult to debug. You may post the log of the crashed boot, presumably ‘journalctl -b -1 | susepaste’.

JoeS · March 17, 2021, 12:55am

Thanks susejunky, I appreciate your input.

The file search stuff is turned on (by default as I didn’t enable it). I have heard of people complaining about a performance hit but I have not noticed any hit in performance since I built the VM 1.5 years ago.

The system is on a UPS, as is the router. If there is any hiccup power-wise an alarm goes off, so I would know (and it is also logged).

Thanks for the systemctl list-timers command, that’s very convenient (also the --all option for loaded but inactive).

Looking at that list I don’t see any that concern me, and the last run times also appear to be outside the range when the problems have occurred.

The key point is that the main difference between this VM and the original one is that the original one has had TW zypper dup applied many times over the last 1.5 years, whereas the new one was built with the 02/15/2021 ISO and has only seen a few updates. (I have not yet updated the original machine because the plan was for it to be replaced by this new one.)

I’ve been doing this stuff for 30+ years (although TW is newer for me).

Since TW is a rolling release, most people have been doing distribution upgrades for years, so my initial thought was that few people do a clean install and that it could be a situation that is not tested as well.

I have seen some discussion online about issues with the recent plasma 5.21 release (which came out right after I started the rebuild).

Looking at boombatower’s release status site: https://review.tumbleweed.boombatower.com/

It looks like the 03/11/2021 build had a much lower quality score (which had been declining the last several builds).

Yesterday I did the DUP up to the 03/12/2021 release, which brought the quality score up to 86 from 69, and last night was the first in a while where the problem did NOT occur. There were plasma updates in that so I’m wondering if that addressed the issue.

The 03/15/2021 build currently has a quality score of 92 but I have not updated to it yet as I wanted to see if the system makes it through the night again tonight.
If so, then I suspect that they fixed the issue in the 03/12 build.

Joe

JoeS · March 17, 2021, 1:10am

The TW install in question is not aging, it was a fresh install less than a month ago and has only had a few updates done to it.

Reading other forums it appears that I am not the only one that has seen the new KDE Plasma 5.21 issues with high CPU and crashes.
KDE Plasma 5.21 came out right after this system was built (I was waiting for it to be released) and a few updates later is when the issues started.

Yesterday I installed the 03/12 DUP and yesterday is the first night in a while that it did NOT have the plasma 100% CPU and then crash issue.

I see that 03/15 is available now but I am going to wait and see if it goes a second night without crashing.

Thanks

Joe

karlmistelberger · March 17, 2021, 7:11am

Plasma was shaky several years ago. KDE folks improved their stuff. Currently every new version is rock solid on all machines including the current version:

Operating System: openSUSE Tumbleweed 20210315
KDE Plasma Version: 5.21.2
KDE Frameworks Version: 5.79.0
Qt Version: 5.15.2
Kernel Version: 5.11.4-1-default
OS Type: 64-bit
Graphics Platform: X11
Processors: 8 × Intel® Core™ i7-6700K CPU @ 4.00GHz
Memory: 31.3 GiB of RAM
Graphics Processor: Radeon RX550/550 Series

susejunky · March 17, 2021, 9:01am

Sorry, i’m not a native English speaker so i probably did not phrase this correctly: I was not thinking of problems with the power supply but of things like upower or KDE’s powerdevil which - depending on your configuration - can put your system into some sort of sleep mode. And in the past (?) there have been problems with network connections being lost in sleep mode and/or not being established on wake up.

It should be possible to update and use Tumbleweed “for ever”. However here is what i do (on my bare metal installations):

On major plasma updates i build a new /home for my user (to get rid of configurations which might not properly fit the new version).
Once in a while i will do a fresh Tumbleweed installation.

Regards

susejunky

JoeS · March 17, 2021, 9:10pm

Thanks susejunky!

My systems are on 24/7 and don’t sleep, except for the monitors.

Do you consider the 5.21 plasma update a major one? The ‘start’ menu was the biggest noticeable difference to me.

plasmashell did not have the problem again last night.

Today I updated to the 03/16 release so we’ll see how it goes tonight.

JoeS · April 24, 2021, 9:31pm

I have continued to debug this plasmashell hang since I originally reported it, as it has continued to occur almost every day since then.

I wanted to provide an update.

Basically the symptoms are that when the VM is left idle for a while (generally overnight, but it has also occurred when left to sit for a few hours), I come back to the guest machine and find plasmashell at 100% CPU and see CPU stuck messages in the journal.

I leave KSysGuard on the screen so that I can see what was going on when the problem occurs.

Neither the host (Windows 10) nor the guest has sleep mode enabled (only the monitors go to sleep).

I can ssh into the machine for a short period but am unable to kill -9 the plasmashell process, and attempting to do so tends to hang things to the point of requiring a reboot.

What I have noticed since originally reporting is that the time on the system tray is generally the correct time, indicating that the problem occurs at the exact point that I come back to the machine. Since the odds of me returning to the machine at the exact point of failure dozens of times are quite low, I suspect that it is display related.

I was using the latest video driver (Intel) from my motherboard manufacturer but I did find a new version from Intel, which I installed, but the problem persisted.

VMWare also came out with a newer version which I also installed but the problem still persisted.

I tested out logging in as a different user and the VM DID NOT have the problem over a period of several days.

I logged back onto my user and the problem occurred that night.

I renamed my home directory and recreated it from the skeleton files and then slowly redid my preferences and configuration and did not have the problem until I set the wallpaper to a random slideshow. Within 30 minutes of using that the VM crashed with the issue.

I switched back to a static wallpaper but still had the crashes occur but less frequently.

I logged out and removed ~/.config/plasma-org.kde.plasma.desktop-appletsrc and then logged back in and have been running for several days without an issue so the problem appears to be related to the slideshow wallpaper. I have found issues reported in the past regarding this but it appeared that they were fixed. Since I had been running with the slideshow wallpaper prior to the problem for a long time I suspect it is a NEW problem.
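When testing whether a single configuration file is the trigger, renaming it keeps the old settings available for a later diff or for attaching to a bug report. A minimal sketch; the function name set_aside_plasma_config and the backup naming scheme are hypothetical, and it assumes the file lives under ~/.config (or $XDG_CONFIG_HOME if that is set).

```shell
#!/bin/sh
# set_aside_plasma_config: hypothetical helper that renames the Plasma
# applet configuration instead of deleting it, so it can be restored or
# compared later. Run it while logged out of the Plasma session, for
# example from a text console or over ssh.
set_aside_plasma_config() {
    cfg="${XDG_CONFIG_HOME:-$HOME/.config}/plasma-org.kde.plasma.desktop-appletsrc"
    [ -f "$cfg" ] || { echo "no config at $cfg" >&2; return 1; }
    backup="$cfg.bak.$(date +%Y%m%d-%H%M%S)"
    mv -- "$cfg" "$backup" && echo "$backup"
}

# Example usage:
#   old=$(set_aside_plasma_config)
#   ...log in, let Plasma write a fresh file, then compare:
#   diff "$old" ~/.config/plasma-org.kde.plasma.desktop-appletsrc
```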

I have a backup of the VM that has the issue from 03/12/2021, which was running the 03/08/2021 version of TW.
I renamed the VM files and the copy of the VM to TEST and then brought up the TEST VM.
It IS using the slideshow wallpaper option just like I was before this problem started, and the TEST VM ran for days and did NOT have the problem.

Since the newer updated version of the VM has the issue daily it seems pretty clear that something since then caused the problem.

Yesterday, I did a zypper dup on the TEST VM from 03/08/2021 to 04/22/2021 and today the TEST VM now has the exact same issue with the plasmashell hang so clearly something in the updates after 03/08/2021 is causing the problem.
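Because the same VM is good at 03/08 and bad at 04/22, a package list taken before and after the update narrows down what changed. The sketch below assumes each list was saved with rpm -qa in the form shown in the comment; the function name changed_packages and the file names are hypothetical.

```shell
#!/bin/sh
# changed_packages: hypothetical helper comparing two package lists.
# Each list is assumed to have been created on the VM with:
#   rpm -qa --qf '%{NAME} %{VERSION}-%{RELEASE}\n' | LC_ALL=C sort -k1,1 > list.txt
# Prints "name old-version new-version" for packages whose version differs.
# Packages present in only one list are ignored in this sketch, and
# packages installed in several versions (e.g. kernels) appear once per
# old/new combination.
changed_packages() {
    LC_ALL=C join "$1" "$2" | awk '$2 != $3 { print $1, $2, $3 }'
}

# Example usage:
#   changed_packages before-dup.txt after-dup.txt | grep -i -e plasma -e kernel
```

A list of changed packages with versions is also the kind of detail a bug report against plasmashell or the kernel would typically need.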

One odd thing that I noticed on 04/23/2021 when I did the zypper dup for the VM that I originally reported the problem with is this:

        The following 32 packages are going to be reinstalled:
          kernel-firmware-all          20210315-1.1
          kernel-firmware-amdgpu       20210315-1.1
          kernel-firmware-ath10k       20210315-1.1
          kernel-firmware-ath11k       20210315-1.1
          kernel-firmware-atheros      20210315-1.1
          kernel-firmware-bluetooth    20210315-1.1
          kernel-firmware-bnx2         20210315-1.1
          kernel-firmware-brcm         20210315-1.1
          kernel-firmware-chelsio      20210315-1.1
          kernel-firmware-dpaa2        20210315-1.1
          kernel-firmware-i915         20210315-1.1
          kernel-firmware-intel        20210315-1.1
          kernel-firmware-iwlwifi      20210315-1.1
          kernel-firmware-liquidio     20210315-1.1
          kernel-firmware-marvell      20210315-1.1
          kernel-firmware-media        20210315-1.1
          kernel-firmware-mediatek     20210315-1.1
          kernel-firmware-mellanox     20210315-1.1
          kernel-firmware-mwifiex      20210315-1.1
          kernel-firmware-network      20210315-1.1
          kernel-firmware-nfp          20210315-1.1
          kernel-firmware-nvidia       20210315-1.1
          kernel-firmware-platform     20210315-1.1
          kernel-firmware-prestera     20210315-1.1
          kernel-firmware-qlogic       20210315-1.1
          kernel-firmware-radeon       20210315-1.1
          kernel-firmware-realtek      20210315-1.1
          kernel-firmware-serial       20210315-1.1
          kernel-firmware-sound        20210315-1.1
          kernel-firmware-ti           20210315-1.1
          kernel-firmware-ueagle       20210315-1.1
          kernel-firmware-usb-network  20210315-1.1

It seems odd to me that it is reinstalling the kernel-firmware packages from 03/15/2021. This is “close” to the time that this problem originally started to occur leading me to wonder if that is related to the problem.

Why would it need to reinstall those firmware packages again???

Since the instance of the VM from 03/08/2021 did NOT have the problem with the slideshow wallpaper until after it had also been updated it seems pretty clear that something in those updates is causing the problem.

Thoughts?

Where is the best place to report this issue?

Thanks!