I’m having an issue with a new computer that we have.
It has openSUSE Leap 15.2 installed with an RTX 3060 and the whole CUDA stack working. We chose version 15.2 instead of 15.3 just to make absolutely sure that CUDA would work (since we have other, older, stable computers with this “combo”). However, we are facing severe freezing issues.
It was originally provided with a Ryzen 9 5950X, but because of the freezing issues under load, downgrading it to a 5900X seemed like a good idea. It improved: instead of freezing in less than 20 minutes under load with 16 threads in use, it seemed OK for a while, until longer tests showed that it now freezes within 36h with the same number of threads (we are used to keeping machines under full load for much longer periods of time).
Before downgrading the CPU we tried changing the PSU, the GPU (it freezes even when the GPU is not in use, so this was actually a long shot) and the memory. The only things left to test are the 3-fan water cooler, the HD (makes no sense) and the motherboard.
There might be more. However, I’ve never used non-mainstream kernel versions in any upgrade, so I’m worried: which would be the most recommended version to use?
Moreover, I’m worried about facing unexpected issues during this procedure: is there anything I should look out for in advance? (I think it should give me access to both kernel versions at boot time, so an additional version wouldn’t prevent me from logging in and having KDE up, am I right? Also, should I pay attention to it needing keys for secure boot, as in the GPU installation, or not?)
When I was using openSUSE 15.2, I was using the kernel from Tumbleweed.
What I did was enable the Tumbleweed repo only when I wanted to install a newer kernel, and use yast2 to install it. When done installing, I disabled it again.
I am not endorsing this procedure, because I guess I was just too daring to do it.
I was just fortunate that I did not encounter issues, or if there was one, I just reverted my kernel back to the openSUSE 15.2 one.
Just because it is a proven and tested platform for CUDA on previous systems, so we were certain that we would not encounter any software issues (however, we hit a roadblock with a possible software-hardware incompatibility…).
According to the Phoronix news, 5.9 is still not enough, it should be at least 5.10…
If there is no other option, it might be the very risky way to move forward: how precisely did you do that so that you were mostly certain that reverting would be possible? Did you install two kernel versions simultaneously (I remember it was possible a long time ago), or just rely on the BTRFS snapshots rollback?
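For anyone who wants to check a machine against that 5.10 threshold, here is a minimal Python sketch. The function names and the default threshold are my own, chosen only for illustration. It compares just the leading major.minor of the release string, so it says nothing about fixes a distribution may have backported into an older kernel.

```python
import platform
import re


def parse_kernel_release(release):
    """Return (major, minor) from a string like '5.3.18-lp152.106-default'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def kernel_at_least(release, minimum=(5, 10)):
    """True if the release's major.minor is >= minimum (tuple comparison)."""
    return parse_kernel_release(release) >= minimum


if __name__ == "__main__":
    running = platform.release()  # same string that `uname -r` prints on Linux
    try:
        print(running, "meets 5.10:", kernel_at_least(running))
    except ValueError as err:
        print(err)
```

Tuples are compared element by element, which avoids the classic mistake of comparing "5.9" and "5.10" as strings or floats.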
Sorry for being away: I was getting 503 and 504 errors when trying to reach the forums over the last few days.
Thanks @nrickert. That is actually bad news to me, as that seemed to be the best first option to try.
If I still choose to go that route, are there any recommendations on how to proceed so that I could safely revert to the previous kernel (without having to reinstall the whole system)?
I like to think of CUDA as a “very powerful yet sensitive piece of software, which is also very stressful to install”, and as such I would prefer not to upgrade the whole of openSUSE from 15.2 to 15.3 as a first attempt…
(btw, where can I look to be certain that the necessary modules for temperature monitoring of zen3 processors were backported into the kernel 5.9?)
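One way to see at least whether the running kernel exposes a CPU temperature sensor is to look at the hwmon entries in sysfs. The sketch below is only that, a sketch: it assumes the driver of interest is `k10temp` (the hwmon driver I know of for AMD CPU temperatures), and the function names are made up for this post. It tells you whether the driver is active on the running kernel, not whether a particular patch was backported.

```python
import glob
import os


def hwmon_driver_names(root="/sys/class/hwmon"):
    """Collect the driver name reported by each hwmon device under root."""
    names = []
    for name_file in sorted(glob.glob(os.path.join(root, "hwmon*", "name"))):
        try:
            with open(name_file) as fh:
                names.append(fh.read().strip())
        except OSError:
            continue  # device vanished or is unreadable; skip it
    return names


def has_sensor_driver(driver="k10temp", root="/sys/class/hwmon"):
    """True if a hwmon device with the given driver name is present."""
    return driver in hwmon_driver_names(root)


if __name__ == "__main__":
    print("hwmon drivers:", hwmon_driver_names())
    print("k10temp present:", has_sensor_driver())
```

If `k10temp` does not show up, the module may simply not be loaded, so that is worth ruling out before concluding the kernel lacks support.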
Isn’t that one of the options @conram advised against?
However, it is becoming my only alternative… Is there anything else I should look for to be absolutely certain that both kernels would be available at “grub-time”?
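As a quick sanity check before rebooting, one can list which kernel images are actually sitting in /boot. This is a rough sketch under an assumption: that kernel images are named `vmlinuz-<release>` in /boot, as on the x86_64 openSUSE installs I have seen. The function name is hypothetical. Finding two images there does not prove that GRUB lists both, only that both are installed.

```python
import glob
import os


def installed_kernels(boot_dir="/boot"):
    """Return the release strings of all vmlinuz-<release> images in boot_dir."""
    prefix = "vmlinuz-"
    versions = []
    for path in glob.glob(os.path.join(boot_dir, prefix + "*")):
        versions.append(os.path.basename(path)[len(prefix):])
    return sorted(versions)  # plain string sort, not a version-aware sort


if __name__ == "__main__":
    kernels = installed_kernels()
    print("kernel images found:", kernels)
    if len(kernels) < 2:
        print("only one kernel image: no fallback to boot into")
```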
Hi Svyatko! Thanks for the suggestion, but I’ve already done that, and the problem is still there.
Already done, problem still there.
Already done, problem still there.
Also changed RAM and CPU. It happens randomly under heavy, intensive computing, and not only when the GPU is used (it also happens when only the CPU is used, with a non-CUDA configuration and compilation of the application).
That might be my last resort. And I might need to go for it. However, CUDA is a bit* to make work properly (heart attack level bit*), so I’m considering all options before moving on to it.
Regular kernel updates for a Leap release should remain binary compatible, so modules built for previous kernels remain usable with updated kernels. For Leap, NVIDIA modules are rebuilt only when the modules themselves change, not for every new kernel update.
Moreover, just trying to add a new module to the kernel would not suffice, correct?
After all your suggestions, I concluded that the best path to follow is (unfortunately) to upgrade openSUSE as a whole from 15.2 to 15.3.
Tomorrow I’ll attempt this upgrade, and hopefully it will solve the issue with CPU computations (I’m not sure I’ll be able to reinstall CUDA successfully in time, since it usually takes too long and local access during the pandemic is always an issue). However, I’ll have to leave the system running for some time before it can be considered stable.
Also, even if the standard 15.3 kernel version proves not to be enough, it will be easier to use one of the newer versions previously mentioned.
I’ll keep you all posted. Anyway, thanks a lot for all suggestions!
Connect PSU to 1 x 8-pin ATX 12V power connector on mobo.
IRL the B550M AORUS ELITE is not intended for top CPUs - this is only marketing bs. The B550M AORUS PRO (rev. 1.0) could be better, or look at ATX mobos.
You need additional cooling for the power circuits, especially when using water cooling. You need to cool both lines of power circuits - leftward and upward. Try using an additional fan to blow air on them.
Possibly you need one more fan to cool the RAM.
It is better to change the motherboard. And the X570 chipset can be more useful in that case.
Thanks a lot… Unfortunately, the information arrived after I did the distribution upgrade: now I have 15.3, but without GPGPU working and with some well-known resolution issues (nothing I can’t deal with in person, the problem is just being able to be in front of the computer in person…). And, of course, the program still fails after some time…
I was going to ask here for instructions on how to additionally upgrade the kernel, but since that is not the issue and I’m already looking for an MB replacement at the seller, at a time when all stock has almost disappeared… Could I just quickly ask your opinion on the Gigabyte X570 UD PCIe 4.0 AM4 and the ASRock X570 Phantom Gaming 4 - AM4 MBs for such an application? Those are the only X570 MBs they have in stock right now… If those are not good options, we will need to buy new ones from another seller (for the full price), and then the alternatives are the Gigabyte X570 Gaming X (Socket AM4) AMD X570 and the ASUS TUF GAMING X570-PLUS (Socket AM4/AMD X570/M.2): once again, what is your opinion on them?