Which Newer Kernel Version to Use?

Hi folks.

I’m having an issue with a new computer that we have.

It has openSUSE Leap 15.2 installed with an RTX 3060 and the whole CUDA stack working. We chose 15.2 instead of 15.3 just to make absolutely sure that CUDA would work (since we have other, older stable computers with this “combo”). However, we are facing severe freezing issues. :frowning:

It was originally provided with a Ryzen 9 5950X, but because it froze under load, downgrading to a 5900X seemed like a good idea. It improved: instead of freezing in less than 20 minutes under load with 16 threads in use, it seemed OK for a while, until longer tests showed that it now freezes within 36h with the same number of threads (we are used to keeping machines under full load for much longer periods of time).

Before downgrading the CPU we tried to change the PSU, GPU (it freezes even when the GPU is not in use, so this was actually a long shot), memory. The only things left for testing are the 3 fan water cooler, HD (makes no sense) and motherboard.

Given that replacing the MB is a painful attempt (that might not fix it), I resumed my online searches for other clues. I then found on Phoronix that temperature monitoring on Zen 3 was not available until kernel 5.10 (https://www.phoronix.com/scan.php?page=news_item&px=Linux-5.10-HWMON-Zen-3). I just checked and discovered that openSUSE Leap 15.2 has (in the standard repositories) only up to 5.3.18, and even 15.3 doesn’t have a much newer kernel by default.

I began searching the repositories and actually found several kernel versions >5.10, for example:
5.15 in https://download.opensuse.org/repositories/Kernel:/vanilla/standard/x86_64/
5.14.12 in https://download.opensuse.org/repositories/Kernel:/stable:/Backport/standard/x86_64/
5.14.13 in https://download.opensuse.org/repositories/Kernel:/stable/standard/x86_64/

There might be more. However, I’ve never used non-mainstream kernel versions in any upgrade, so I’m worried: which would be the most recommended version to use?

Moreover, I’m worried about running into unexpected issues during this procedure: is there anything I should look out for in advance? (I think it should give me access to both kernel versions at boot time, so an additional version wouldn’t prevent me from logging in and having KDE up, am I right? Also, should I expect it to need keys for secure boot, as with the GPU driver installation?)

Could someone give me some advice on that?

Thanks a lot in advance! :wink:

  1. Why openSUSE Leap 15.2:

https://en.opensuse.org/Lifetime

  1. Kernel:stable is built against Factory and will not work in Leap 15.x

  2. Kernel:stable:Backport is built against Leap 15.3 and may also not work

I would install Leap 15.3; it's a backported kernel 5.9

When I was using openSUSE 15.2, I was using a kernel from Tumbleweed.
What I did was enable the Tumbleweed repo only when I wanted to install a newer kernel,
and use yast2 to install it. When done installing, I disabled it again.

I am not endorsing this procedure, because I guess I was just too daring to do it.
I was just fortunate that I did not encounter issues, or if there was one, I just reverted my
kernel back to the openSUSE 15.2 one.

Just because it is a proven and tested platform for CUDA on our previous systems, so we were certain that we would not encounter any software issues (however, we hit a roadblock with a possible software-hardware incompatibility… :frowning: ).

According to the Phoronix news, 5.9 is still not enough; it should be at least 5.10…

If there is no other option, it might be the (very risky) way to move forward. How precisely did you do that so that you were mostly certain that reverting would be possible? Did you install two kernel versions simultaneously (I remember it was possible a long time ago), or did you just rely on BTRFS snapshot rollback?

That’s probably not going to work now. With the Tumbleweed “usr-merge”, there’s now an incompatibility between Tumbleweed kernels and Leap.

Sorry for being away: I was getting 503 and 504 errors when trying to reach the forums over the last few days.

Thanks @nrickert. That is actually bad news to me, as that seemed to be the best first option to try. :frowning:

If I still choose to go that route, are there any recommendations on how to proceed so that I could safely revert to the previous kernel (without having to reinstall the whole system)?

I like to think of CUDA as a “very powerful yet sensitive piece of software, which is also very stressful to install”, and as such I would prefer to not upgrade the whole opensuse from 15.2 to 15.3 as a first attempt…

(btw, where can I look to be certain that the necessary modules for temperature monitoring of zen3 processors were backported into the kernel 5.9?)
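One way to approach that last question on a running system is sketched below. It assumes that the upstream AMD sensor driver, `k10temp`, is the module in question, and that a working driver registers itself under `/sys/class/hwmon`; whether a given Leap kernel actually carries the Zen 3 backport is not confirmed anywhere in this thread. The helper name `version_ge` is made up for illustration.

```shell
#!/bin/sh
# Sketch: is the running kernel at least 5.10, and is a k10temp sensor present?
# version_ge is a hypothetical helper; it compares major.minor only.
version_ge() {
    have=${1%%-*}                 # "5.3.18-59.27-default" -> "5.3.18"
    need=$2
    have_major=${have%%.*}; rest=${have#*.}; have_minor=${rest%%.*}
    need_major=${need%%.*}; rest=${need#*.}; need_minor=${rest%%.*}
    if [ "$have_major" -gt "$need_major" ]; then
        return 0
    fi
    [ "$have_major" -eq "$need_major" ] && [ "$have_minor" -ge "$need_minor" ]
}

running=$(uname -r)
if version_ge "$running" 5.10; then
    echo "kernel $running is 5.10 or newer"
else
    echo "kernel $running is older than 5.10; Zen 3 sensors would need a backport"
fi

# List the hwmon devices the kernel has registered and flag k10temp if present.
found=no
for dir in /sys/class/hwmon/hwmon*; do
    [ -r "$dir/name" ] || continue
    name=$(cat "$dir/name")
    echo "$dir: $name"
    if [ "$name" = "k10temp" ]; then
        found=yes
    fi
done
echo "k10temp sensor registered: $found"
```

If `k10temp` shows up on the Zen 3 machine with the distribution kernel, the driver has bound to the CPU regardless of the version number; if it does not, that points to missing support in that kernel.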

I think you will get a conflict on the file system version, if you attempt to install a Tumbleweed kernel.

Best would be to use the kernel available at

http://download.opensuse.org/repositories/Kernel:/stable:/Backport/standard/

The kernels there should work for Leap systems. In particular, they should be okay for Leap 15.2 (as far as I know).

The previous kernel should remain on your system, and you can select it with the grub menu.

You can edit the line

multiversion.kernels = latest,latest-1,running

which is in “/etc/zypp/zypp.conf”. That sets which kernels will be retained. There are comments in the file to suggest what changes you can make.
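Before and after installing an extra kernel, it may help to look at what is currently configured and installed. The sketch below only reads; it assumes the config path quoted above and that the kernel package is named `kernel-default` (as in the standard Leap install). The function name `show_multiversion` is hypothetical.

```shell
#!/bin/sh
# Print the value of multiversion.kernels from a zypp.conf-style file.
# Commented-out lines (starting with '#') are ignored.
show_multiversion() {
    conf=${1:-/etc/zypp/zypp.conf}
    if [ ! -r "$conf" ]; then
        echo "cannot read $conf" >&2
        return 1
    fi
    sed -n 's/^[[:space:]]*multiversion\.kernels[[:space:]]*=[[:space:]]*//p' "$conf"
}

# Show which kernels zypper is told to keep (no output if the line is commented out).
show_multiversion || true

# Show the kernel-default packages currently installed, if rpm is available.
if command -v rpm >/dev/null 2>&1; then
    rpm -q kernel-default || true
fi
```

If more than one `kernel-default` version is listed, the older ones should normally also appear as entries in the grub menu.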

Use UPS. Try to update motherboard BIOS. Test video card with another machine.
Use Leap 15.3. Soon you’ll get no help with Leap 15.2.
Thermal sensors are needed only for temperature monitoring.

Hi @nrickert!

oh-oh, more troubles ahead… :frowning:

Isn’t that one of the options @conram advised against? :frowning:

However, it is becoming my only alternative… Is there anything else I should look for to be absolutely certain that both kernels would be available at “grub-time”?

Hi Svyatko! Thanks for the suggestion, but already done that, and problem is still there.

Already done, problem still there.

Already done, problem still there.

Also changed RAM, CPU. :wink: It happens randomly under heavy, intensive computation, and not only when the GPU is used (it also happens when only the CPU is used, with a non-CUDA configuration and compilation of the application)

That might be my last resort. And I might need to go for it. However, CUDA is a bit* to make work properly (heart attack level bit*), so I’m considering all options before moving on to it.

Are thermal sensors not needed for CPU throttling?

Thank you all.

If you use Kernel:stable:Backport, all KMPs built against kernel 5.3 from the OSS repo will not work anymore.
That includes the NVIDIA KMP.

You have to build them using the .run file from the NVIDIA site.

Provide info about motherboard and PSU. ILL power circuit overheats.

Meaning there is no escape from having to rebuild everything related to nvidia anyway?

Just out of curiosity, would it be the same in any other kernel upgrade situation?

Moreover, just trying to add a new module to the kernel would not suffice, correct?

Hi @Svyatko. The configuration is:

MB:Gigabyte B550M Aorus Elite - AM4
PSU: Seasonic 850W 80 Plus Gold - GM-85
WC: Alseye Halo H360 360mm
GPU: Galax GeForce RTX 3060 12GB 1-Click OC GDDR6 - 36NOL7MD1VOC
RAM: Kingston HyperX Fury 16GB DDR4 3200MHz (1x16GB) - HX432C16FB3/16 (4 modules)

If you need any more information please let me know. Thanks a lot!

Regular kernel updates for the Leap release should remain binary compatible, so modules built for previous kernels remain usable also with updated kernels. For Leap nVidia modules are rebuilt only when modules themselves change, not for every new kernel update.

Moreover, just trying to add a new module to the kernel would not suffice, correct?

I do not understand this question.

Please do not post inside a quote:

If you use Kernel:stable:Backport, all KMPs built against kernel 5.3 from the OSS repo will not work anymore.
That includes the NVIDIA KMP.

Hi @Sauerland, thanks for the reply.

You have to build them using the .run file from the NVIDIA site.

Post directly before or after the quote.

Hi everybody.

After all your suggestions, I concluded that the best path to follow is (unfortunately) to upgrade the opensuse as a whole from 15.2 to 15.3.

Tomorrow I’ll attempt this upgrade, and hopefully it will solve the issue with CPU computations (I’m uncertain whether I’ll be able to reinstall CUDA in time, since it usually takes too long and local access during the pandemic is always an issue). However, I’ll have to leave the system running for some time before it can be considered stable.

Also, even if the standard 15.3 kernel version proves itself to not be enough, it will be easier to use one of the newer versions previously mentioned.

I’ll keep you all posted. Anyway, thanks a lot for all suggestions! :wink:

Connect PSU to 1 x 8-pin ATX 12V power connector on mobo.
IRL B550M AORUS ELITE is not intended for top CPUs - this is only marketing bs.
B550M AORUS PRO (rev. 1.0) could be better, or look at ATX mobos.
You need additional cooling for power circuits, especially when using water cooling. You need to cool both lines of power circuits - leftward and upward. Try to use additional fan to blow air on them.
Possibly you need one more fan to cool RAM.

It is better to change motherboard. And X570 chipset can be more useful in that case.

Your Gigabyte B550M Aorus Elite is intended for office use with low-end CPUs.
Test of cheap B550 mobos with a Ryzen 7 3800X (in Russian): https://3dnews.ru/1026727/obzor-7-materinskih-plat-amd-b550-deshevle-10-000-rubley/page-4.html
2 of 7 mobos work OK, 1 works OK with additional cooling, 4 fail.
As I wrote: both cheap Gigabyte mobos, GIGABYTE B550M DS3H and GIGABYTE B550M S2H, were unstable.

You need good engineer in your team, not new Linux kernel.

Hi @Svyatko.

Thanks a lot… Unfortunately, the information arrived after I did the distribution upgrade: now I have the 15.3 but without gpgpu working and with some well-known resolution issues (nothing I can’t deal with in person, the problem is just being able to be in front of the computer in person… :stuck_out_tongue: ). And, of course, the program still fails after some time… :frowning:

I was going to ask here for instructions on how to additionally upgrade the kernel, but since that is not the issue and I’m already looking for a MB replacement at the seller at a time when stock has almost disappeared… Could I just quickly ask your opinion on the Gigabyte X570 UD PCIe 4.0 AM4 and the ASRock X570 Phantom Gaming 4 - AM4 MBs for such an application? Those are the only X570 MBs they have in stock right now… If those are not good options, we will need to buy new ones from another seller (for the full price), and then the alternatives are the Gigabyte X570 Gaming X (Socket AM4) AMD X570 and the ASUS TUF GAMING X570-PLUS (Socket AM4/AMD X570/M.2): once again, what is your opinion on them?

Post:

zypper se -si nvidia kernel
uname -a
zypper lr -d

Hi @Sauerland. Sorry for the delayed answer, once again I ran into login problems.

S  | Name                        | Type   | Version                          | Arch        | Repository 
---+-----------------------------+--------+----------------------------------+-------------+-------------------------------------------------------------
i+ | kernel-default              | pacote | 5.3.18-59.27.1                   | x86_64      | Update repository with updates from SUSE Linux Enterprise 15 
i+ | kernel-default              | pacote | 5.3.18-57.3                      | x86_64      | Main Repository 
i+ | kernel-default-devel        | pacote | 5.3.18-59.27.1                   | x86_64      | Update repository with updates from SUSE Linux Enterprise 15 
i+ | kernel-default-devel        | pacote | 5.3.18-57.3                      | x86_64      | Main Repository 
i  | kernel-default-extra        | pacote | 5.3.18-59.27.1                   | x86_64      | Update repository with updates from SUSE Linux Enterprise 15 
i  | kernel-default-extra        | pacote | 5.3.18-57.3                      | x86_64      | Main Repository 
i  | kernel-default-optional     | pacote | 5.3.18-59.27.1                   | x86_64      | Update repository with updates from SUSE Linux Enterprise 15 
i  | kernel-default-optional     | pacote | 5.3.18-57.3                      | x86_64      | Main Repository 
i  | kernel-devel                | pacote | 5.3.18-59.27.1                   | noarch      | Update repository with updates from SUSE Linux Enterprise 15 
i  | kernel-devel                | pacote | 5.3.18-57.3                      | noarch      | Main Repository 
i  | kernel-firmware-all         | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-amdgpu      | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-ath10k      | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-ath11k      | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-atheros     | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-bluetooth   | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-bnx2        | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-brcm        | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-chelsio     | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-dpaa2       | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-i915        | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-intel       | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-iwlwifi     | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-liquidio    | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-marvell     | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-media       | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-mediatek    | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-mellanox    | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-mwifiex     | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-network     | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-nfp         | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-nvidia      | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-platform    | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-prestera    | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-qlogic      | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-radeon      | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-realtek     | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-serial      | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-sound       | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-ti          | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-ueagle      | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-firmware-usb-network | pacote | 20210208-2.4                     | noarch      | Main Repository 
i  | kernel-macros               | pacote | 5.3.18-59.27.1                   | noarch      | Update repository with updates from SUSE Linux Enterprise 15 
i  | nvidia-computeG05           | pacote | 495.29.05-0                      | x86_64      | cuda-opensuse15-x86_64 
i  | nvidia-gfxG05-kmp-default   | pacote | 495.29.05_k4.12.14_lp150.12.82-0 | x86_64      | cuda-opensuse15-x86_64 
i  | nvidia-glG05                | pacote | 495.29.05-0                      | x86_64      | cuda-opensuse15-x86_64 
i  | purge-kernels-service       | pacote | 0-8.3.1                          | noarch      | Update repository with updates from SUSE Linux Enterprise 15 
i  | x11-video-nvidiaG05         | pacote | 495.29.05-0                      | x86_64      | cuda-opensuse15-x86_64
Linux localhost.localdomain 5.3.18-59.27-default #1 SMP Tue Oct 5 10:00:40 UTC 2021 (7df2404) x86_64 x86_64 x86_64 GNU/Linux

Sorry, I’ve had to split the answer because the outputs made it exceed the character limit…
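One detail in the listing above may be worth a second look: the version string of `nvidia-gfxG05-kmp-default` contains `_k4.12.14_lp150`, while `uname -a` reports a 5.3.18 kernel. The sketch below assumes the usual KMP naming convention, where the part after `_k` is the kernel version the package was built for; the helper name `kmp_kernel_version` is made up, and the two strings are copied from the output above rather than queried live. A mismatch is only a hint, since some KMP packages rebuild or relink the module at install time.

```shell
#!/bin/sh
# Hypothetical helper: pull the kernel version out of a KMP version string,
# e.g. "495.29.05_k4.12.14_lp150.12.82-0" -> "4.12.14".
kmp_kernel_version() {
    printf '%s\n' "$1" | sed -n 's/^[^_]*_k\([0-9][0-9.]*\)_.*/\1/p'
}

kmp="495.29.05_k4.12.14_lp150.12.82-0"   # nvidia-gfxG05-kmp-default, from the listing
running="5.3.18-59.27-default"           # release field from the uname -a output

built_for=$(kmp_kernel_version "$kmp")
running_base=${running%%-*}              # strip the "-59.27-default" suffix

echo "KMP built for kernel: $built_for"
echo "running kernel:       $running_base"
if [ "$built_for" != "$running_base" ]; then
    echo "versions differ: worth checking that the nvidia module actually loads"
fi
```

On the machine itself, `lsmod | grep nvidia` shows whether the module is currently loaded.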