Tumbleweed keeps randomly cold rebooting, no idea where to start troubleshooting

Hey guys. Sorry in advance, super new both to this forum and to Linux in general.

TL;DR: After removing an empty SSD from my computer a few days ago, Tumbleweed will last ~30 minutes after booting before cold rebooting, making the OS almost completely unusable. Not only has the issue persisted after a fresh install of the OS, but nothing like it even remotely happens on Windows 10, which I have installed on a separate drive.

Basically, a few days ago I decided to remove one of my old (empty, but partitioned) SSDs from my computer to use in a different project. Being new, I neglected to unmount the drive before doing this, so when I booted my PC back up it took me into emergency mode. No big, after some brain racking I figured out how to edit the fstab and was back in in no time. Afterwards, however, I started experiencing intermittent crashes. I’d be in the middle of something and suddenly my PC would shut down cold, as if I’d pulled the cord out of the wall, then reboot.

I did a little testing, which included seeing if specific programs were the cause (couldn’t find any, might happen if I’m playing a game, watching YouTube, downloading something, or if the PC is just sitting idle on the lock screen after boot), seeing if the same thing happened on my separate-drive Windows install (it didn’t), seeing if a part was overheating (GPU is a little warm but nowhere near the point of shutdowns), checking ‘last’ in the terminal (just listed as ‘crash’, very helpful), checking logs (doesn’t seem like the OS had a chance to even write a log), and dusting the inside of my PC (didn’t help).

I even went so far as to completely reinstall Tumbleweed from scratch about an hour before making this post, and it seemed fine at first… until it happened again, a little later than it usually would, before I even had a chance to change any of the drivers.

No idea what’s happening, and Reddit was as helpful as one might expect (as in, my post got no replies), and my Linux wiz friend had no idea either. Please save me, I love Tumbleweed!

Component Info:

  • Processor: 16 × AMD Ryzen 7 7800X3D 8-Core Processor
  • GPU: AMD Radeon RX 6800
  • RAM: 32GB, about a year old
  • Root SSD: Samsung SSD 970 EVO Plus 2TB (2B2QEXM7)

This might be a loose connection. When you opened the case and removed the SSD, you might have accidentally loosened a cable. Recheck all the power connectors inside the case, and all the other connectors too.

All of my power connectors should be clipped in, but I checked anyway. The root SSD is an NVME that’s screwed in, so that wouldn’t be going anywhere.

I jiggled and pushed in every cord I could find and booted, cold reboot after about 30 minutes again.

Likely hardware issue (not openSUSE-related). Check for possible overheating, faulty power supply, malfunctioning RAM (or bad seating perhaps). I would fire up a live test Linux distro, and run some diagnostic tests. You may also be able to undertake some testing from the BIOS/UEFI.
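
To help rule overheating in or out, here is a minimal sketch that reads the kernel's hwmon sensors, assuming your sensors show up under /sys/class/hwmon with temp*_input files in millidegrees Celsius. The function name read_temps and the log path in the usage comment are made up for illustration. Writing each reading to disk means the last entry before a sudden power-off survives.

```shell
# Print one line per hwmon temperature sensor: "<chip> <file> <degrees C>".
# The sysfs root is a parameter so the function can be pointed at a test tree.
# read_temps is a hypothetical helper name, not an existing tool.
read_temps() {
    root="${1:-/sys/class/hwmon}"
    for f in "$root"/hwmon*/temp*_input; do
        # Skip the unexpanded glob pattern and unreadable sensors.
        [ -r "$f" ] || continue
        dir=$(dirname "$f")
        chip=$(cat "$dir/name" 2>/dev/null || echo unknown)
        # Some sensors fail on read; skip those instead of aborting.
        milli=$(cat "$f" 2>/dev/null) || continue
        printf '%s %s %s\n' "$chip" "$(basename "$f")" "$((milli / 1000))"
    done
}

# Usage: append a timestamped reading every 10 seconds and flush it to disk.
#   while true; do
#       read_temps | sed "s/^/$(date '+%F %T') /" >> "$HOME/temps.log"
#       sync
#       sleep 10
#   done
```

If the last readings before a reboot look normal, heat becomes a less likely suspect, though a sketch like this cannot see PSU or VRM problems.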

Run this:

journalctl | grep "Hardware Error"

I had one on March 12, 18 and the 19th.
Then April 3 and the 13th. This is what the error looks like.

Mar 12 06:19:21 Suse-R5950X kernel: mce: [Hardware Error]: Machine check events logged
Mar 12 06:19:21 Suse-R5950X kernel: mce: [Hardware Error]: CPU 0: Machine Check: 0 Bank 1: bc800800060c0859
Mar 12 06:19:21 Suse-R5950X kernel: mce: [Hardware Error]: TSC 0 ADDR 5b42f03c0 MISC d012000000000000 IPID 100b0000000

Note that “Bank 1” can be “Bank 5” or whatever.
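
If the grep turns up a lot of lines, a rough sketch like this can tally them by CPU and bank. It assumes the lines follow the format shown above; summarize_mce is just a name I made up, not an existing tool.

```shell
# Tally machine-check lines by CPU and bank.
# Reads journal text on stdin and prints "<count> CPU <n> Bank <m>".
# summarize_mce is a hypothetical helper name.
summarize_mce() {
    # Keep only the lines that carry the CPU and bank numbers.
    grep -F 'Machine Check:' |
        sed -E 's/.*(CPU [0-9]+): Machine Check: [0-9]+ (Bank [0-9]+):.*/\1 \2/' |
        sort | uniq -c
}

# Usage (needs permission to read the system journal):
#   journalctl -k --no-pager | summarize_mce
```

The same bank showing up over and over would at least tell you the errors are not scattered at random.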

Mine seems to be caused by new kernels and the firmware not being updated at the same time. I boot a kernel that’s one step back: for example, with kernel 6.14.1-1.1 installed, I booted kernel 6.13.8-1.1 instead and it quit doing it. I’ve done this for quite a while and it seems to work for me.

When the firmware files get updated it seems to fix it. It will run great for a long time, then I get a new kernel and it’ll just randomly reboot.

I’ve tried slowing the memory down, limiting CPU speed and none of it makes a difference at all. All hardware is good, plenty of power, everything connected good, and it doesn’t do it when I boot to Mageia.

I have update logs that I save at update time. I’ve been meaning to go through them and see if I can pinpoint which firmware it is. I believe it’s the Intel firmware for my ARC A750 graphics card, but I need to check and make sure.

If you have no hardware errors, then you have a totally different problem.

Check everything the mcelog service reports:

erlangen:~ # journalctl --quiet --unit mcelog.service --priority notice 
erlangen:~ # 

No events logged since July 2024 on IHE.

I can’t save you. Providing information on all components may help tracking down the root cause of your problem:

erlangen:~ # inxi -zFmy333
System:    Kernel: 6.14.1-1-default arch: x86_64 bits: 64
           Console: pty pts/1 Distro: openSUSE Tumbleweed 20250411
Machine:   Type: Desktop System: Micro-Star product: MS-7C56 v: 2.0 serial: N/A
           Mobo: Micro-Star model: B550-A PRO (MS-7C56) v: 2.0 serial: <filter> UEFI: American Megatrends LLC. v: A.90 date: 03/17/2022
Memory:    System RAM: total: 32 GiB available: 27.3 GiB used: 12.5 GiB (45.8%)
           Array-1: capacity: 128 GiB slots: 4 modules: 2 EC: None
           Device-1: Channel-A DIMM 0 type: no module installed
           Device-2: Channel-A DIMM 1 type: DDR4 size: 16 GiB speed: 2133 MT/s
           Device-3: Channel-B DIMM 0 type: no module installed
           Device-4: Channel-B DIMM 1 type: DDR4 size: 16 GiB speed: 2133 MT/s
CPU:       Info: 8-core model: AMD Ryzen 7 5700G with Radeon Graphics bits: 64 type: MT MCP cache: L2: 4 MiB
           Speed (MHz): avg: 3538 min/max: 400/4673 cores: 1: 3538 2: 3538 3: 3538 4: 3538 5: 3538 6: 3538 7: 3538 8: 3538 9: 3538 10: 3538 11: 3538 12: 3538 13: 3538 14: 3538 15: 3538 16: 3538
Graphics:  Device-1: Advanced Micro Devices [AMD/ATI] Cezanne [Radeon Vega Series / Radeon Mobile Series] driver: amdgpu v: kernel
           Display: unspecified server: X.Org v: 21.1.15 with: Xwayland v: 24.1.6 driver: X: loaded: modesetting unloaded: vesa dri: radeonsi gpu: amdgpu resolution: 3840x2160~60Hz
           API: OpenGL v: 4.6 vendor: amd mesa v: 25.0.3 renderer: AMD Radeon Graphics (radeonsi renoir ACO DRM 3.61 6.14.1-1-default)
           API: Vulkan v: 1.4.309 drivers: N/A surfaces: xcb,xlib
           API: EGL Message: EGL data requires eglinfo. Check --recommends.
           Info: Tools: api: clinfo, glxinfo, vulkaninfo de: kscreen-console,kscreen-doctor wl: wayland-info x11: xdriinfo, xdpyinfo, xprop, xrandr
Audio:     Device-1: Advanced Micro Devices [AMD/ATI] Renoir Radeon High Definition Audio driver: snd_hda_intel
           Device-2: Advanced Micro Devices [AMD] Family 17h/19h/1ah HD Audio driver: snd_hda_intel
           API: ALSA v: k6.14.1-1-default status: kernel-api
Network:   Device-1: Intel Wi-Fi 6E AX210/AX1675 2x2 [Typhoon Peak] driver: iwlwifi
           IF: wlan0 state: down mac: <filter>
           Device-2: Realtek RTL8111/8168/8211/8411 PCI Express Gigabit Ethernet driver: r8169
           IF: eth0 state: up speed: 1000 Mbps duplex: full mac: <filter>
Drives:    Local Storage: total: 5.46 TiB used: 2.18 TiB (40.0%)
           ID-1: /dev/nvme0n1 vendor: Samsung model: SSD 970 EVO Plus 2TB size: 1.82 TiB
           ID-2: /dev/nvme1n1 vendor: Samsung model: SSD 990 EVO 2TB size: 1.82 TiB
           ID-3: /dev/sda vendor: Crucial model: CT2000BX500SSD1 size: 1.82 TiB
Partition: ID-1: / size: 1.82 TiB used: 911.43 GiB (48.9%) fs: btrfs dev: /dev/nvme1n1p2
           ID-2: /boot/efi size: 99.8 MiB used: 662 KiB (0.6%) fs: vfat dev: /dev/nvme1n1p1
           ID-3: /home size: 1.82 TiB used: 911.43 GiB (48.9%) fs: btrfs dev: /dev/nvme1n1p2
           ID-4: /opt size: 1.82 TiB used: 911.43 GiB (48.9%) fs: btrfs dev: /dev/nvme1n1p2
           ID-5: /var size: 1.82 TiB used: 911.43 GiB (48.9%) fs: btrfs dev: /dev/nvme1n1p2
Swap:      Alert: No swap data was found.
Sensors:   System Temperatures: cpu: 40.1 C mobo: N/A gpu: amdgpu temp: 36.0 C
           Fan Speeds (rpm): fan-1: 631 fan-2: 0 fan-3: 0 fan-4: 0 fan-5: 0 fan-6: 0 fan-7: 0 fan-8: 0 fan-9: 0 fan-10: 0
Info:      Processes: 450 Uptime: 4d 7h 35m Shell: Bash inxi: 3.3.37
erlangen:~ # 

Was your PSU bought new at the same time as the CPU and motherboard? If it’s older, remove it, take the cover off, and inspect for two things: (1) electrolytic capacitors that are oozing or swollen, and (2) electrolytic capacitors not branded Nichicon, Panasonic, UCC (Chemicon) or Rubycon, which are the top brands for reliability. If those brands are all you see and there’s no leakage or swelling, PSU failure is much less likely. One brand that you don’t want to see is OST, which often fail without evident leakage or swelling. The brand of the single or pair of giant caps usually doesn’t matter. The 10mm wide ones, 16mm-35mm tall and bunched together, are the common failures. Visit the badcaps.net forums for more cap details, like the long list of names of low- or non-reliable caps.

Run a memory tester multiple passes, at least 4 full passes. The more the better.

If any SATA cables are old and vermilion, replace them.

Make sure your main power cord is a snug fit in the PSU. You might also check where the power cord goes. If directly to wall rather than UPS, and it’s known to be old, or used with aluminum wire, make sure it remains up to code with tight connections. If UPS and/or surge-protector is old, maybe it’s due for batteries or replacement.

Thank you for all of your responses. I think I may have narrowed down the problem.

It turns out the other drive I had partitioned for use in Tumbleweed was not playing nicely with other programs, and may have somehow been corrupted when I removed the other drive (and somehow whatever that was didn’t get formatted when I reinstalled the OS?). When I would boot my PC, inevitably something would try to access the drive (probably Steam) and invoke bad juju into active memory.

I disconnected the drive fully and the crashes stopped. Probably explains why it didn’t happen in Windows since my Windows install doesn’t have access to that drive.
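
For anyone who lands here after pulling a drive: below is a minimal sketch, assuming your fstab identifies partitions with UUID= entries, that lists the entries whose device is no longer present. The function name missing_fstab_uuids is made up for illustration. A stale entry like that is what dropped me into emergency mode the first time.

```shell
# List UUID= entries in an fstab whose device is not currently present.
# Prints "<uuid> <mountpoint>" for each missing device.
# Both paths are parameters so the function can be run against test files;
# missing_fstab_uuids is a hypothetical helper name.
missing_fstab_uuids() {
    fstab="${1:-/etc/fstab}"
    byuuid="${2:-/dev/disk/by-uuid}"
    # Comment lines start with '#', so they never match ^UUID=.
    awk '$1 ~ /^UUID=/ { sub(/^UUID=/, "", $1); print $1, $2 }' "$fstab" |
        while read -r uuid mountpoint; do
            [ -e "$byuuid/$uuid" ] || printf '%s %s\n' "$uuid" "$mountpoint"
        done
}

# Usage:
#   missing_fstab_uuids
```

Anything it prints is worth removing or commenting out in fstab before the next boot. As far as I understand it, adding the nofail option to a non-essential entry is another way to keep a missing drive from blocking boot.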

one of the VERY first things i check is " DUST " !!!
as in when was the last time you removed it ?