Boot fails after irqbalance failure

Cyclonick · July 8, 2025, 10:01am

I am trying to install Tumbleweed on a new machine consisting of an AMD Ryzen 5 8600G processor mounted on an ASRock A620M Pro RS motherboard, which uses the AMD A620 chipset.

Previously (the first chapter of this saga - https://forums.opensuse.org/t/virtual-hub-in-amd-chipset-blocks-booting/186191/38 ) it was established that there is a kernel bug, and a patch was applied.

Now boot hangs at a command line prompt. I looked at journalctl (error messages only, here: https://paste.opensuse.org/pastes/f3c968016d69 ). Essentially there is a sequence of 8 pairs of entries like the ones below, and what follows seems to be boot shutting itself down.

Jul 06 18:29:16 localhost.localdomain irqbalance[932]: Cannot change IRQ 47 affinity: Permission denied
Jul 06 18:29:16 localhost.localdomain irqbalance[932]: IRQ 47 affinity is now unmanaged

The IRQs concerned are 55, 53, 51, 48, 56, 54, 62. 47 has disappeared from that list - I seem to have cocked up the pasting; it was the first in the sequence, I think.

I know very little about this but what I understand (or think I do) is that IRQs are hardware nodes (?) and they are used to manage how they interact with the system.

I have zero skills, so I need command line instructions to help analyze the situation.

Is it a kernel problem (access denied suggests it could be)? If it is, I need the proof to report it to the kernel devs. Or is it an openSUSE problem?

Is it reasonable to suggest a first step is to find out what the IRQs refer to ?

malcolmlewis · July 8, 2025, 12:21pm

@Cyclonick AFAIK some hardware IRQs are read-only. I see them here on my systems to no ill effect. Run systemctl status irqbalance.service and check that it’s active; if so, all good…

Cyclonick · July 8, 2025, 12:52pm

I think part of the shutting down is that irqbalance can’t work.

Is there a way of getting the system to list the IRQs and to what they correspond - if I look in Kinfo on this computer there’s a sort of table with a list of IRQs and some numbers per cpu core - I haven’t a clue what they mean, but the last column has names / origins of each one.

Of course it could be something else entirely… If you remember, it looked as though the graphics were the problem when in fact it was the virtual hub business.

Thing is journalctl is the only tool I know of and I don’t know how to use that effectively.

My bike is looking at me accusingly, I promised it an outing…

malcolmlewis · July 8, 2025, 1:15pm

@Cyclonick cat /proc/interrupts > irq.txt and open in a text editor, then there is irqtop
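
To see what particular IRQ numbers refer to without reading the whole table, here is a minimal sketch. It assumes the usual /proc/interrupts layout (the IRQ number followed by a colon in the first column, the device name in the last one). The function name irq_names is made up for illustration, and for an IRQ shared by several devices it only shows the last name on the line.

```shell
# irq_names FILE IRQ...
# Print "IRQ: device" for each requested IRQ number, where "device" is the
# last whitespace-separated field on that IRQ's line in FILE.
# Example call (not run here): irq_names /proc/interrupts 47 48 51
irq_names() {
    local file=$1
    shift
    local irq
    for irq in "$@"; do
        # The first column looks like "47:", so compare against IRQ plus colon.
        awk -v irq="$irq" '$1 == (irq ":") { print irq ": " $NF }' "$file"
    done
}
```

Calling it as irq_names /proc/interrupts 47 48 51 53 should print one line per IRQ that exists; numbers that are not in the table print nothing.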

Cyclonick · July 8, 2025, 2:14pm

Thanks - I’ll try them, and also see what I can glean with systemctl… I’m having to learn things I don’t want to know…!

malcolmlewis · July 8, 2025, 2:22pm

@Cyclonick the joys of new hardware

Cyclonick · July 8, 2025, 9:10pm

I did cat /proc/interrupts > irq.txt and I find that the IRQ numbers are all associated with nvme0q0, nvme0q1, nvme0q2, nvme0q3 etc. up to nvme0q11, one for each of 12 processors (I thought Ryzen 5 had 6), but anyway nvme0q is the name given to my SSD. Why, I don’t know, but my partitions for root, home etc. are nvme0q1, nvme0q2 etc.

So not being able to access them would be fairly significant I would think…

I did notice before the beginning of this saga, that, as my first reaction was to try a live image and see if things were normal as far as files are concerned - it’s never happened to me before, but the live image files app was denied access to files on the disk, wouldn’t even mount.

I tried dmesg - no errors - 2 warnings:
platform regulatory.0: Direct firmware load for regulatory.db failed with error -2
amdgpu 0000:0c:00.0: [drm] REG_WAIT timeout 1us * 100000 tries - optc314_disable_crtc line:145

I tried various systemctl instructions, but it looked as though it thought everything is normal…

Cyclonick · July 8, 2025, 9:30pm

By numbers of IRQs I mean the series of numbers that irqbalance said were now unmanaged.

(trying to be clear)

malcolmlewis · July 8, 2025, 9:54pm

@Cyclonick yes, understood, but since they are not writeable, then unmanageable… again nothing AFAIK to worry about.

regulatory.db is related to wireless devices, do you have one of those present?

[drm] see here https://gitlab.freedesktop.org/drm/amd/-/issues/3368

Cyclonick · July 8, 2025, 10:32pm

The point is, those unwritable partitions are my system, home and data partitions… somewhere, perhaps, some security thing has happened (perhaps)

There are both bluetooth and wifi attached.

malcolmlewis · July 9, 2025, 11:07am

@Cyclonick Not sure what you mean; your partitions are writeable (look at the output from the command mount), and that has nothing to do with interrupts.
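
A quick way to check this, as a sketch: it assumes the standard /proc/mounts layout (device, mount point, filesystem type, options) and prints each /dev-backed mount together with rw or ro. The name mount_modes is made up for illustration.

```shell
# mount_modes FILE
# FILE is expected in /proc/mounts format:
#   device mountpoint fstype options dump pass
# Print "mountpoint rw" or "mountpoint ro" for every mount whose device
# starts with /dev/. A mount counts as read-only if "ro" is one of its options.
# Example call (not run here): mount_modes /proc/mounts
mount_modes() {
    awk '$1 ~ /^\/dev\// {
        mode = "rw"
        n = split($4, opts, ",")
        for (i = 1; i <= n; i++) if (opts[i] == "ro") mode = "ro"
        print $2, mode
    }' "$1"
}
```

Run as mount_modes /proc/mounts; a root or home partition showing ro would be worth reporting, while rw means the partition is mounted writeable.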

Are you using Bluetooth and Wifi?

Cyclonick · July 9, 2025, 11:46am

I think you’re right, I found this https://www.suse.com/support/kb/doc/?id=000021663

During my systemctl session, which I looked at again this morning, there’s this : irqbalance.service... /usr/sbin/irqbalance --banmod megaraid_sas which after a bit of searching means affinity won’t work if the megaraid_sas module is working.

Which, if indeed the megaraid_sas module is running, would explain the message - except that the IRQ isn’t un-managed, it is managed by something else.

Both bluetooth and wi-fi are supposed to fire up on boot (that’s part of the install) - usually I turn them both off unless needed when I can get into the machine ! - I haven’t tried using them - it’s connected through eth0

How do I find out which kernel modules are installed and their status? Could be megaraid is installed, but not working properly…

malcolmlewis · July 9, 2025, 12:02pm

@Cyclonick are you running SAS drives and/or hardware RAID in this setup?

I would check if bluetooth and Wifi are actually working, if they do and the warning disappears, likely can ignore.
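
On the question of which kernel modules are loaded: lsmod lists them, and it takes its data from /proc/modules. As far as I know, --banmod is an irqbalance option that tells it to leave the IRQs of the named module alone, so on its own it does not show that megaraid_sas is loaded. A small sketch to check for one module, where module_loaded is a made-up helper name and the file is assumed to have the module name in its first column:

```shell
# module_loaded FILE NAME
# Succeed (exit status 0) if NAME appears as a module name in the first
# column of FILE, which is expected to be in /proc/modules format.
# Example call (not run here): module_loaded /proc/modules megaraid_sas
module_loaded() {
    awk -v name="$2" '$1 == name { found = 1 } END { exit !found }' "$1"
}
```

For example, module_loaded /proc/modules megaraid_sas && echo loaded || echo "not loaded" gives a yes/no answer. Module names in that file are written with underscores.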

Cyclonick · July 9, 2025, 12:17pm

I have no idea. This chipset / motherboard is probably capable of navigating to the moon. I think somewhere it claims to support RAID. The setup is 3 components: the motherboard, which includes Wi-Fi and Bluetooth; the processor, which has onboard graphics; and a 1 TB SSD which may or may not be SAS, I’d have to go in to find out.

I suppose I’ll have to approach it component by component. But I don’t know how.

I’m worn out with this.

malcolmlewis · July 9, 2025, 12:30pm

@Cyclonick You need to go look around in the system BIOS to see what is on, off, in use etc.

Is the SSD attached to the megaraid connection on the motherboard? Or just a normal SATA connection and set to AHCI?

Cyclonick · July 9, 2025, 12:48pm

I have no idea. I’ll try and find out.

Thing is, if there isn’t RAID or SAS in use, there is no reason for the megaraid module to be running and there is no reason for IRQ management to turn itself off.

malcolmlewis · July 9, 2025, 1:08pm

@Cyclonick The interrupts are allocated to specific items in your motherboard, all that it’s doing is telling you that, the managed ones are available and under control of the operating system (irq balance) as and when needed.

For example, you don’t want the IRQ for your storage device to change, so it’s fixed and it can’t touch it, so permission is denied.
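
To see which CPUs a given IRQ is currently bound to, the kernel exposes /proc/irq/<number>/smp_affinity_list. A minimal sketch for reading it; irq_affinity is a made-up helper name, and the first argument is the directory to look in (normally /proc/irq).

```shell
# irq_affinity ROOT IRQ...
# Print "IRQ: cpu-list" using ROOT/IRQ/smp_affinity_list for each IRQ,
# or "IRQ: (not found)" when that file is missing or unreadable.
# Example call (not run here): irq_affinity /proc/irq 47 48
irq_affinity() {
    local root=$1
    shift
    local irq
    for irq in "$@"; do
        if [ -r "$root/$irq/smp_affinity_list" ]; then
            printf '%s: %s\n' "$irq" "$(cat "$root/$irq/smp_affinity_list")"
        else
            printf '%s: (not found)\n' "$irq"
        fi
    done
}
```

This only reads the current setting, so it needs no special permissions and changes nothing.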

Like I indicated, nothing to worry about

Cyclonick · July 9, 2025, 2:16pm

Part of an answer to your question - and why my SSD is not called sda1 - NVMe is a type that runs over PCIe version 3 or 4 (don’t know if that’s what SAS is). They are absurdly and terrifyingly fast, up to 7500 MB/s depending on quality and protocol.

So probably the module is running and affinity is behaving !

I can add 4 SATA devices as well

Honestly, what the hell have I bought ?!

Today’s episode with the keyboard is going to be gathering information - finding what is inside this monster. I’ll see if the Wi-Fi and Bluetooth function, too.

malcolmlewis · July 9, 2025, 2:30pm

@Cyclonick no, generally SAS drives are rotating rust at 12 Gb/s rather than standard SATA at 6 Gb/s…

NVMe’s are in use here… as well as Hardware RAID for SSD’s on my HP Workstation.

Drives:
  Local Storage: total: 1.97 TiB used: 533.83 GiB (26.4%)
  ID-1: /dev/nvme0n1 vendor: Silicon Power model: SPCC M.2 PCIe SSD
    size: 953.87 GiB speed: 63.2 Gb/s lanes: 4 serial: <filter> temp: 37.9 C
  ID-2: /dev/sda vendor: Silicon Power model: SPCC Solid State Disk
    size: 476.94 GiB speed: 6.0 Gb/s serial: <filter>
  ID-3: /dev/sdb vendor: Silicon Power model: SPCC Solid State Disk
    size: 476.94 GiB speed: 6.0 Gb/s serial: <filter>
  ID-4: /dev/sdc vendor: OCZ model: VERTEX460A size: 111.79 GiB
    speed: 6.0 Gb/s serial: <filter>

On my Dell workstation I have a SAS/SATA/NVMe tri-controller available, but no SAS drives…

Cyclonick · July 10, 2025, 5:58pm

I think it’s the video… I went in using recovery mode - and of course it went too quick and finished with text too small to read… and I couldn’t work out how to get a command line prompt.

There was a slew of entries about amdgpu, including the one about 100000 tries and stopping.

I did inxi -Gx and got

CPU: 6-core AMD Ryzen 5 8600G w/ Radeon 760M Graphics (-MT MCP-) speed/min/max: 2383/414/5076 MHz
Kernel: 6.15.4-1-default x86_64 Up: 0h 3m Mem: 734.8 MiB/14.71 GiB (4.9%)
Storage: 989.72 GiB (0.4% used) Procs: 285 Shell: Bash inxi: 3.3.37

Does “procs” mean processes? If so, process 285 doesn’t exist. I also tried to cat in the /proc directory and got no such folder or file, and if I use systemctl there’s no mention of graphics/video.

How do I get a command line prompt in recovery mode or, how do I get access to the last messages I couldn’t read in the recovery mode boot ? (the fundamental question)
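
For messages that scrolled past too quickly, journalctl can show an earlier boot, provided the journal is kept on disk (an assumption here). -b -1 selects the previous boot, -p err limits output to priority err and worse, and --no-pager prints everything in one go. A small wrapper as a sketch, with boot_messages being a made-up name:

```shell
# boot_messages [OFFSET] [PRIORITY]
# Show journal entries for one boot. OFFSET defaults to -1 (the previous
# boot; 0 is the current one) and PRIORITY defaults to err.
# Example call (not run here): boot_messages -1 warning > boot.txt
boot_messages() {
    journalctl -b "${1:--1}" -p "${2:-err}" --no-pager
}
```

Something like boot_messages -1 warning > boot.txt saves the previous boot’s warnings and errors to a file that can be opened in a text editor.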

I’m also so angry I feel sick - I discover that openSUSE has dumped X11 - my use of a computer is almost exclusively graphics, which requires colour management, and there is none in Wayland and no way to calibrate a monitor. There is a sketch of a protocol published 2 months ago which has had no evaluation by expert specialists in the subject (and it is a very complicated subject). Essentially, colour management means that a) the monitor shows the colours it should, b) applications show the images without variation from one to another, and c) if you embed the profile in an image it comes out of a printer looking more or less as intended.