Machine died suddenly, on reboot no Panel. How to debug?

Hi!

I have a remote Leap 15.1 KDE box that has been running fine for years. Today it suddenly died (the remote VNC viewer window froze). I turned the power off and on again, but it did not come back (although the BIOS is configured to return to the same state as before a power loss…). Even Wake-on-LAN did not bring it back.

Someone pressed the power button and now the box is back to life, but the panel on the desktop was gone (I added an openSUSE standard panel without problems).

I would like to learn what caused this crash, but have no idea where to start looking for old logs that could shed some light on this…

Any hint highly appreciated…

If you have a persistent journal configured (/var/log/journal/ exists),

journalctl -b -2

will provide the paged journal for the boot prior to the previous boot, which might contain clue(s). KSystemLog and ~/.xsession-errors are other avenues.
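
Rather than guessing the boot offset, it can help to list the recorded boots first and then look at the end of the one that crashed. A minimal sketch, assuming a systemd journal; the helper names journal_is_persistent and tail_of_boot are made up for illustration, while --list-boots, -b, -n, -p and --no-pager are standard journalctl options:

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given directory (default
# /var/log/journal) exists and is non-empty, i.e. journald has
# actually written something that can survive a reboot.
journal_is_persistent() {
    dir="${1:-/var/log/journal}"
    [ -d "$dir" ] && [ -n "$(ls -A "$dir" 2>/dev/null)" ]
}

# Hypothetical helper: print the last lines logged during an earlier boot.
# $1 = boot offset as shown by "journalctl --list-boots" (e.g. -1)
# $2 = number of lines to show (defaults to 200)
tail_of_boot() {
    journalctl -b "$1" -n "${2:-200}" --no-pager
}
```

journalctl --list-boots prints one line per recorded boot together with its offset. tail_of_boot -1 would then show the final entries before that boot ended, and journalctl -b -1 -p warning narrows the same boot down to warnings and worse.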

Ever since the switch from init to systemd at startup, there are times when the panel does not start or aborts. journalctl does not seem to show a reason for that. I suspect that a dependency in systemd is not right, though it is not critical every time. I never see it on CPUs with 2 cores, but with 4 or more cores there is a chance for something to start in the wrong order, and thus no panel.

I found that if I reboot, the panel will reappear. I keep a terminal on my desktop so I can run sudo init 6 to reboot when there is no panel.

@mrmazda: I found the log, but the size and granularity of the logging is somewhat shocking (every access to an SMB share, plus a refresh every 5 min if Dolphin remains open. SRSLY? Maybe even the password?)

What I see as the end of the file:

Oct 13 10:08:56 j1 NetworkManager[865]: <info>  [1570954136.1306] connectivity: (eth2) timed out
Oct 13 10:09:26 j1 NetworkManager[865]: <info>  [1570954166.7527] dhcp4 (eth2):   address 192.168.100.5
Oct 13 10:09:26 j1 NetworkManager[865]: <info>  [1570954166.7528] dhcp4 (eth2):   plen 25 (255.255.255.255)
Oct 13 10:09:26 j1 NetworkManager[865]: <info>  [1570954166.7840] dhcp4 (eth2):   gateway 192.168.100.100
Oct 13 10:09:29 j1 dbus-daemon[742]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher>
Oct 13 10:09:26 j1 NetworkManager[865]: <info>  [1570954166.8195] dhcp4 (eth2):   lease time 7200
Oct 13 10:09:26 j1 NetworkManager[865]: <info>  [1570954166.8196] dhcp4 (eth2):   nameserver '192.168.100.100'
Oct 13 10:09:26 j1 NetworkManager[865]: <info>  [1570954166.8196] dhcp4 (eth2):   domain name 'ASDFGH.home.arpa'
Oct 13 10:09:26 j1 NetworkManager[865]: <info>  [1570954166.8196] dhcp4 (eth2): state changed bound -> bound
Oct 13 10:09:34 j1 dbus-daemon[742]: [system] Successfully activated service 'org.freedesktop.nm_dispatcher'
Oct 13 10:09:30 j1 systemd[1]: Starting Network Manager Script Dispatcher Service...
Oct 13 10:09:35 j1 nm-dispatcher[17991]: req:1 'dhcp4-change' [eth2]: new request (4 scripts)
Oct 13 10:09:34 j1 systemd[1]: Started Network Manager Script Dispatcher Service.
Oct 13 10:09:35 j1 nm-dispatcher[17991]: req:1 'dhcp4-change' [eth2]: start running ordered scripts...
Oct 13 10:13:57 j1 NetworkManager[865]: <info>  [1570954437.1873] connectivity: (eth2) timed out
Oct 13 10:14:06 j1 plasmashell[1649]: qml: temp unit: 0
Oct 13 10:16:51 j1 plasmashell[1649]: qml: temp unit: 0
Oct 13 10:18:56 j1 NetworkManager[865]: <info>  [1570954736.1737] connectivity: (eth2) timed out
Oct 13 10:23:56 j1 NetworkManager[865]: <info>  [1570955036.1527] connectivity: (eth2) timed out
Oct 13 10:28:24 j1 smartd[759]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Temperature_Case changed from >
Oct 13 10:28:57 j1 NetworkManager[865]: <info>  [1570955337.1932] connectivity: (eth2) timed out
Oct 13 10:33:57 j1 NetworkManager[865]: <info>  [1570955637.1998] connectivity: (eth2) timed out
Oct 13 10:38:56 j1 NetworkManager[865]: <info>  [1570955936.1977] connectivity: (eth2) timed out
Oct 13 10:43:56 j1 NetworkManager[865]: <info>  [1570956236.3148] connectivity: (eth2) timed out
Oct 13 10:48:57 j1 NetworkManager[865]: <info>  [1570956537.2026] connectivity: (eth2) timed out
Oct 13 10:53:57 j1 NetworkManager[865]: <info>  [1570956837.3225] connectivity: (eth2) timed out
Oct 13 10:57:16 j1 NetworkManager[865]: <info>  [1570957036.9764] dhcp4 (eth2):   address 192.168.100.5
Oct 13 10:57:16 j1 NetworkManager[865]: <info>  [1570957036.9931] dhcp4 (eth2):   plen 25 (255.255.255.255)
Oct 13 10:57:16 j1 NetworkManager[865]: <info>  [1570957036.9949] dhcp4 (eth2):   gateway 192.168.100.100
Oct 13 10:57:17 j1 NetworkManager[865]: <info>  [1570957037.2942] dhcp4 (eth2):   lease time 7200
Oct 13 10:57:17 j1 NetworkManager[865]: <info>  [1570957037.3764] dhcp4 (eth2):   nameserver '192.168.100.100'
Oct 13 10:57:31 j1 dbus-daemon[742]: [system] Activating via systemd: service name='org.freedesktop.nm_dispatcher>
Oct 13 10:57:17 j1 NetworkManager[865]: <info>  [1570957037.3764] dhcp4 (eth2):   domain name 'ASDFGH.home.arpa'
Oct 13 10:57:17 j1 NetworkManager[865]: <info>  [1570957037.3975] dhcp4 (eth2): state changed bound -> bound
Oct 13 10:57:37 j1 systemd[1]: Starting Network Manager Script Dispatcher Service...
Oct 13 10:57:56 j1 dbus-daemon[742]: [system] Failed to activate service 'org.freedesktop.nm_dispatcher': timed o>
Oct 13 10:57:57 j1 systemd[1]: Started Network Manager Script Dispatcher Service.
Oct 13 10:58:56 j1 NetworkManager[865]: <info>  [1570957136.1944] connectivity: (eth2) timed out
Oct 13 11:00:21 j1 plasmashell[1649]: libkcups: Renew-Subscription last error: 0 successful-ok
Oct 13 11:00:35 j1 systemd[1]: Started Timeline of Snapper Snapshots.
Oct 13 11:00:42 j1 dbus-daemon[742]: [system] Activating service name='org.opensuse.Snapper' requested by ':1.162>
Oct 13 11:01:05 j1 dbus-daemon[742]: [system] Successfully activated service 'org.opensuse.Snapper'
Oct 13 11:03:57 j1 NetworkManager[865]: <info>  [1570957437.1795] connectivity: (eth2) timed out
Oct 13 11:08:57 j1 NetworkManager[865]: <info>  [1570957737.2129] connectivity: (eth2) timed out
Oct 13 11:13:57 j1 NetworkManager[865]: <info>  [1570958037.3168] connectivity: (eth2) timed out
Oct 13 11:18:57 j1 NetworkManager[865]: <info>  [1570958337.6541] connectivity: (eth2) timed out
Oct 13 11:23:57 j1 NetworkManager[865]: <info>  [1570958637.3425] connectivity: (eth2) timed out
Oct 13 11:28:25 j1 smartd[759]: Device: /dev/sda [SAT], SMART Usage Attribute: 190 Temperature_Case changed from >
Oct 13 11:28:59 j1 NetworkManager[865]: <info>  [1570958938.5247] connectivity: (eth2) timed out

Renewing the IP lease at 10:57 was successful (I see that in the firewall, too), but even before that there are a lot of “(eth2) timed out” entries, and I don’t see why…

At around 11:28 I cut off power, as the box was not reachable via ssh or VNC…

Any ideas what’s going wrong there?
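
The five-minute spacing of the “connectivity: (eth2) timed out” lines matches the default 300-second interval of NetworkManager's periodic connectivity check, so they may simply be that check failing to reach its test URL rather than the link itself dropping. To see what else happened, it can help to filter them out. A rough sketch; the function names are invented for illustration, and the input is assumed to be plain journalctl output on stdin:

```shell
#!/bin/sh
# Drop the recurring connectivity-check lines so rarer events stand out.
without_conncheck() {
    grep -v 'connectivity: (.*) timed out'
}

# Count log lines per program. In syslog-style output the program is the
# 5th whitespace-separated field; strip the "[pid]" suffix and the colon.
lines_per_program() {
    awk '{
             name = $5
             sub(/\[[0-9]+\]:?$/, "", name)
             sub(/:$/, "", name)
             count[name]++
         }
         END { for (n in count) print count[n], n }' | sort -rn
}
```

For example, journalctl -b -1 --no-pager | without_conncheck | tail -n 50 would show the last 50 remaining lines of the earlier boot, and piping the same output through lines_per_program gives a quick idea of who is filling the log.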

@larry: I had that in the past only after upgrading from 15.0 to 15.1 on this machine, and on a very old Athlon with TW. A reboot never brought the panel back. But when I reinstall my preferred panel widgets, I see they are already installed (a little count number “1” on the widget). If the panel got lost more often in the past, as on the Athlon, the number increases every time the panel gets lost; I think I was at “6” installs of the widget before I reinstalled TW…

I cannot help with dhcp or networkmangler problems. All my own installations are fixed IP, so I have no material exposure to the problems in your log.

I don’t expect the NetworkManager to cause the loss of connectivity. Or could that explain why wol did not work after turning power off and on again after some minutes?

I have no idea why snapper should run on this system. I don’t see anything related in YaST → Services, and the system is ext4-only after all…

If btrfs is not, and will not be, employed on the system,

sudo zypper rm 'btrfs*' snapper; sudo zypper al 'snappe?' 'btrfs*'

can eliminate them and prevent their return.
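
Before removing the packages, it may be worth confirming that nothing is actually mounted as btrfs. A small sketch; has_btrfs_mounts is a made-up name, and it expects a mount table in /proc/mounts format (device, mount point, filesystem type, ...):

```shell
#!/bin/sh
# Hypothetical helper: succeeds if any entry in the mount table
# (default /proc/mounts) has the filesystem type "btrfs".
has_btrfs_mounts() {
    awk '$3 == "btrfs" { found = 1 } END { exit !found }' "${1:-/proc/mounts}"
}
```

If has_btrfs_mounts succeeds, something is still using btrfs and the packages should probably stay. After the removal, zypper ll lists the locks that zypper al added.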

The machine died away again some hours ago… Hardware? Power supply? …

Power supply is the usual prime suspect when power is an issue or adding an additional device creates a bad surprise. They’re typically easy to change.

…remotely without supervision not so easy…

The machine now has a brand new power supply, a brand new Intel SSD and a fresh TW install (instead of Leap 15.1), but apparently logging was not switched on during the install by the idi*t installing TW (aka me…). The machine was doing fine for about 8 days and then died again.

Now I managed to reboot…

New suspects: the network “card” (Mini-ITX, so it’s fixed to the board), the Mini-ITX power converter, or the RAM/mobo (Gigabyte from 2016)…

I’m a bit out of luck lately with my hardware.


PS: I have to correct myself:

/var/log/journal

exists, but

journalctl -b -2

only provides info on the latest boot (today), nothing on previous boots.
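
If only one boot shows up even though the directory exists, the Storage= setting and the list of recorded boots are worth a look. A sketch under stated assumptions: journald_storage is an invented helper that reads a single journald.conf-style file, ignores drop-in directories, and falls back to journald's documented default of "auto" when nothing is set:

```shell
#!/bin/sh
# Hypothetical helper: print the last uncommented Storage= value from the
# given config file (default /etc/systemd/journald.conf), or "auto" if
# the file is missing or does not set it.
journald_storage() {
    conf="${1:-/etc/systemd/journald.conf}"
    value=$(sed -n 's/^[[:space:]]*Storage=[[:space:]]*//p' "$conf" 2>/dev/null | tail -n 1)
    echo "${value:-auto}"
}
```

A result of volatile or none would explain the missing history. journalctl --list-boots shows which boots are actually recorded, and journalctl --disk-usage shows how much journal data is stored.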

…since yesterday doing fine, only things I see in dmesg:

[  214.008438] fuse: init (API version 7.31)
[36301.327312] perf: interrupt took too long (2534 > 2500), lowering kernel.perf_event_max_sample_rate to 78750
[51335.777556] perf: interrupt took too long (3181 > 3167), lowering kernel.perf_event_max_sample_rate to 62750
[61419.664801] perf: interrupt took too long (3981 > 3976), lowering kernel.perf_event_max_sample_rate to 50000
[65271.703385] st: Version 20160209, fixed bufsize 32768, s/g segs 256

https://bbs.archlinux.org/viewtopic.php?id=187636