Intel Arc B580 / OpenCL / darktable freezes/hangs/crashes/'engine reset' - possible kernel or firmware bug?

Good day all,

For a while now I’ve been having weird freezes/hangs/crashes in darktable (a RAW photo processor) when it uses OpenCL on Tumbleweed.

At first I thought it was an issue with darktable itself, then figured it was probably a broader OpenCL issue as it persisted through various updates of darktable.

But at the same time it persisted through a number of graphics driver updates, Mesa updates and other (possibly) related packages. So it got to the point where I just couldn’t use my discrete GPU for my photo editing anymore.

OpenCL in darktable works on my iGPU (AMD Ryzen 7 7700 with RDNA-2 arch iGPU). Not an option for daily use though, much too slow, just wanted to confirm the issue was related to the Intel GPU specifically.

OpenCL in darktable works on Kubuntu 25.10 on my Intel Arc B580 and AMD iGPU.

Doesn’t matter which version of darktable I use by the way. Even compiled a version from source and it had the exact same issue.

So I think it’s safe to assume the issue is related to the kernel drivers (Xe/i915) for the Intel GPU in Tumbleweed.

Removing all darktable related directories/files/config and OpenCL related packages, rebooting and installing them again hasn’t fixed anything.

Searching the internet hasn’t yielded any similar cases, at least nothing recent and this specific, so I dug a little deeper.

sudo dmesg:

xe 0000:03:00.0: [drm] Tile0: GT0: Engine reset: engine_class=ccs, logical_mask: 0x1, guc_id=95

[ 4870.511420] [ T80085] xe 0000:03:00.0: [drm] Xe device coredump has been created

[ 4870.511423] [ T80085] xe 0000:03:00.0: [drm] Check your /sys/class/drm/card0/device/devcoredump/data

journalctl | grep 'drm':

Dec 28 20:13:01 Tumbleweed kernel: xe 0000:03:00.0: [drm] Xe device coredump has been deleted. 
Dec 28 20:31:02 Tumbleweed kernel: xe 0000:03:00.0: [drm] Tile0: GT0: Engine reset: engine_class=ccs, logical_mask: 0x1, guc_id=113 
Dec 28 20:31:02 Tumbleweed kernel: xe 0000:03:00.0: [drm] Xe device coredump has been created 
Dec 28 20:31:02 Tumbleweed kernel: xe 0000:03:00.0: [drm] Check your /sys/class/drm/card0/device/devcoredump/data 
Dec 28 21:31:40 Tumbleweed kernel: xe 0000:03:00.0: [drm] Xe device coredump has been deleted. 
Dec 28 22:22:22 Tumbleweed kernel: xe 0000:03:00.0: [drm] Tile0: GT0: Engine reset: engine_class=ccs, logical_mask: 0x1, guc_id=93 
Dec 28 22:22:22 Tumbleweed kernel: xe 0000:03:00.0: [drm] Xe device coredump has been created 
Dec 28 22:22:22 Tumbleweed kernel: xe 0000:03:00.0: [drm] Check your /sys/class/drm/card0/device/devcoredump/data 
Dec 28 22:38:26 Tumbleweed kernel: xe 0000:03:00.0: [drm] Tile0: GT0: Engine reset: engine_class=ccs, logical_mask: 0x1, guc_id=30 
Dec 28 22:48:33 Tumbleweed kernel: xe 0000:03:00.0: [drm] Tile0: GT0: Engine reset: engine_class=ccs, logical_mask: 0x1, guc_id=30
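
To get an overview of how often this happens, the journal lines above can be tallied with a few lines of Python. This is only a rough sketch: the regular expression is modelled on the message format shown in this post, and the names `RESET_RE` and `parse_engine_resets` are made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the xe "Engine reset" lines quoted in this post.
RESET_RE = re.compile(
    r"Engine reset: engine_class=(?P<cls>\w+), "
    r"logical_mask: (?P<mask>0x[0-9a-fA-F]+), guc_id=(?P<guc_id>\d+)"
)

def parse_engine_resets(lines):
    """Return one dict per 'Engine reset' line; all other lines are skipped."""
    resets = []
    for line in lines:
        m = RESET_RE.search(line)
        if m:
            resets.append({
                "engine_class": m.group("cls"),
                "logical_mask": int(m.group("mask"), 16),
                "guc_id": int(m.group("guc_id")),
            })
    return resets

# A few lines copied from the journal output above.
sample = [
    "Dec 28 20:31:02 Tumbleweed kernel: xe 0000:03:00.0: [drm] Tile0: GT0: Engine reset: engine_class=ccs, logical_mask: 0x1, guc_id=113",
    "Dec 28 20:31:02 Tumbleweed kernel: xe 0000:03:00.0: [drm] Xe device coredump has been created",
    "Dec 28 22:22:22 Tumbleweed kernel: xe 0000:03:00.0: [drm] Tile0: GT0: Engine reset: engine_class=ccs, logical_mask: 0x1, guc_id=93",
    "Dec 28 22:38:26 Tumbleweed kernel: xe 0000:03:00.0: [drm] Tile0: GT0: Engine reset: engine_class=ccs, logical_mask: 0x1, guc_id=30",
    "Dec 28 22:48:33 Tumbleweed kernel: xe 0000:03:00.0: [drm] Tile0: GT0: Engine reset: engine_class=ccs, logical_mask: 0x1, guc_id=30",
]

resets = parse_engine_resets(sample)
by_class = Counter(r["engine_class"] for r in resets)
print(by_class)                          # resets per engine class
print([r["guc_id"] for r in resets])     # guc_id of each reset, in order
```

In the lines I collected, every reset is on the `ccs` engine class, which I understand to be the compute engine, so that would at least fit with OpenCL being the trigger.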

So I had a look at the coredump which contained among other things:

***** GuC Log *****

GuC firmware: xe/bmg_guc_70.bin

**GuC version: 70.55.3 (wanted 70.49.4)**

*Kernel timestamp: 0x46E27177B19 [4871148763929]*

GuC timestamp: 0x15E17DA2E6 [93977420518]
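
The two version numbers on the "GuC version" line can at least be compared mechanically. The sketch below only orders the numbers; it assumes the line format shown above, the helper name `parse_guc_versions` is hypothetical, and it says nothing about whether a newer loaded firmware is actually a problem.

```python
import re

# Matches the "GuC version: X.Y.Z (wanted A.B.C)" line from the coredump.
GUC_VERSION_RE = re.compile(
    r"GuC version: (\d+)\.(\d+)\.(\d+) \(wanted (\d+)\.(\d+)\.(\d+)\)"
)

def parse_guc_versions(line):
    """Return (loaded, wanted) as integer tuples, or None if the line does not match."""
    m = GUC_VERSION_RE.search(line)
    if m is None:
        return None
    nums = tuple(int(x) for x in m.groups())
    return nums[:3], nums[3:]

loaded, wanted = parse_guc_versions("GuC version: 70.55.3 (wanted 70.49.4)")

same_major = loaded[0] == wanted[0]
loaded_is_newer = loaded > wanted   # tuple comparison: major, then minor, then patch

print("loaded:", loaded, "wanted:", wanted)
print("same major version:", same_major)
print("loaded newer than wanted:", loaded_is_newer)
```

So the loaded firmware shares the major version with the one the driver asked for and is numerically newer, rather than older.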

I can only guess as to what it all means. Firmware mis-match? Kernel driver bug? Something totally different? I won’t bother speculating any further as this is way beyond what I know.

A quick search did suggest, possibly, that similar issues have occurred before and if I interpret it correctly it was/is a kernel bug.

That’s about as far as I’ve gotten after hours and hours of poking and prodding.

I hope someone here can make sense of this and maybe point me in the right direction. Honestly I’m not even sure this is the right place to ask since this is such a weird problem.

From where I’m sitting and with my lack of knowledge this could be a kernel bug that affects multiple Linux distros, various Intel GPUs and maybe it’s so specific it has largely gone unnoticed.
It could just as well be a firmware bug I guess. The log entries do seem to point at something firmware related, right?
I wasn’t going to speculate any more…

Anyway, if anybody has a clue I’d love to hear about it.

If I’m in the wrong place asking this question, I apologize.

If more info is required, please let me know and I’ll do my best to get back to you asap.

System:
  Kernel: 6.18.2-1-default arch: x86_64 bits: 64 compiler: gcc v: 15.2.1
    clocksource: tsc avail: hpet,acpi_pm
    parameters: BOOT_IMAGE=/boot/vmlinuz-6.18.2-1-default
    root=/dev/mapper/system-root splash=silent quiet security=selinux
    selinux=1 enforcing=1 amd_pstate.shared_mem=1 amd_pstate=active
    acpi_enforce_resources=lax mitigations=auto
  Desktop: KDE Plasma v: 6.5.4 tk: Qt v: N/A info: frameworks v: 6.21.0
    wm: kwin_wayland tools: avail: xscreensaver vt: 3 dm: SDDM Distro: openSUSE
    Tumbleweed 20251227
Graphics:
  Device-1: Intel Battlemage G21 [Arc B580] driver: xe v: kernel arch: Xe2
    process: TSMC n4 (4nm) built: 2024+ pcie: gen: 1 speed: 2.5 GT/s lanes: 1
    ports: active: HDMI-A-2,HDMI-A-3 empty: DP-1, DP-2, DP-3, HDMI-A-1,
    HDMI-A-4 bus-ID: 03:00.0 chip-ID: 8086:e20b class-ID: 0300
  Display: wayland server: X.org v: 1.21.1.21 with: Xwayland v: 24.1.8
    compositor: kwin_wayland driver: X: loaded: modesetting unloaded: vesa
    alternate: fbdev,intel dri: iris gpu: xe d-rect: 3840x1080 display-ID: 0
  Monitor-1: HDMI-A-2 pos: right model: LG (GoldStar) IPS FULLHD built: 2014
    res: mode: 1920x1080 hz: 60 scale: 100% (1) dpi: 102 gamma: 1.2
    size: 480x270mm (18.9x10.63") diag: 551mm (21.7") ratio: 16:9 modes:
    max: 1920x1080 min: 720x400
  Monitor-2: HDMI-A-3 pos: primary,left model: Denon DENON-AVR
    serial: <filter> built: 2022 res: mode: 1920x1080 hz: 60 scale: 100% (1)
    dpi: 61 gamma: 1.2 size: 1600x900mm (62.99x35.43") diag: 1836mm (72.3")
    ratio: 16:9 modes: max: 3840x2160 min: 720x400
  API: EGL v: 1.5 hw: drv: intel iris platforms: device: 0 drv: iris
    device: 1 drv: swrast gbm: drv: iris surfaceless: drv: iris wayland:
    drv: iris x11: drv: iris
  API: OpenGL v: 4.6 compat-v: 4.5 vendor: intel mesa v: 25.3.1 glx-v: 1.4
    direct-render: yes renderer: Mesa Intel Arc B580 Graphics (BMG G21)
    device-ID: 8086:e20b memory: 11.65 GiB unified: no display-ID: :1.0
  API: Vulkan v: 1.4.335 layers: 4 device: 0 type: discrete-gpu name: Intel
    Arc B580 Graphics (BMG G21) driver: mesa intel v: 25.3.1
    device-ID: 8086:e20b surfaces: N/A device: 1 type: cpu name: llvmpipe
    (LLVM 21.1.6 256 bits) driver: mesa llvmpipe v: 25.3.1 (LLVM 21.1.6)
    device-ID: 10005:0000 surfaces: N/A
  Info: Tools: api: clinfo, eglinfo, glxinfo, vulkaninfo
    de: kscreen-console,kscreen-doctor gpu: amd-smi, lact, radeontop
    wl: wayland-info x11: xdpyinfo, xprop, xrandr
03:00.0 VGA compatible controller: Intel Corporation Battlemage G21 [Arc B580]
        Subsystem: Device 172f:4215
        Kernel driver in use: xe
        Kernel modules: xe
03:00.0 VGA compatible controller [0300]: Intel Corporation Battlemage G21 [Arc B580] [8086:e20b] (prog-if 00 [VGA controller])
        Subsystem: Device [172f:4215]
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 0, Cache Line Size: 64 bytes
        Interrupt: pin ? routed to IRQ 103
        IOMMU group: 15
        Region 0: Memory at f5000000 (64-bit, non-prefetchable) [size=16M]
        Region 2: Memory at f800000000 (64-bit, prefetchable) [size=16G]
        Expansion ROM at f6000000 [disabled] [size=2M]
        Capabilities: [40] Vendor Specific Information: Intel Capabilities v1
                CapA: Peg60Dis- Peg12Dis- Peg11Dis- Peg10Dis- PeLWUDis- DmiWidth=x4
                      EccDis- ForceEccEn- VTdDis- DmiG2Dis- PegG2Dis- DDRMaxSize=Unlimited
                      1NDis- CDDis- DDPCDis- X2APICEn- PDCDis- IGDis- CDID=0 CRID=0
                      DDROCCAP- OCEn- DDRWrtVrefEn+ DDR3LEn+
                CapB: ImguDis- OCbySSKUCap- OCbySSKUEn- SMTCap- CacheSzCap 0x0
                      SoftBinCap- DDR3MaxFreqWithRef100=Disabled PegG3Dis-
                      PkgTyp- AddGfxEn- AddGfxCap- PegX16Dis- DmiG3Dis- GmmDis-
                      DDR3MaxFreq=2932MHz LPDDR3En-
        Capabilities: [70] Express (v2) Endpoint, IntMsgNum 0
                DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
                        ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset+ SlotPowerLimit 0W TEE-IO-
                DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
                        RlxdOrd+ ExtTag+ PhantFunc- AuxPwr- NoSnoop+ FLReset-
                        MaxPayload 256 bytes, MaxReadReq 512 bytes
                DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
                LnkCap: Port #0, Speed 2.5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
                        ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
                LnkCtl: ASPM L1 Enabled; RCB 64 bytes, LnkDisable- CommClk-
                        ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt- FltModeDis-
                LnkSta: Speed 2.5GT/s, Width x1
                        TrErr- Train- SlotClk- DLActive- BWMgmt- ABWMgmt-
                DevCap2: Completion Timeout: Range B, TimeoutDis+ NROPrPrP- LTR+
                         10BitTagComp+ 10BitTagReq+ OBFF Not Supported, ExtFmt+ EETLPPrefix-
                         EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
                         FRS- TPHComp- ExtTPHComp-
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-
                DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-
                         AtomicOpsCtl: ReqEn-
                         IDOReq- IDOCompl- LTR+ EmergencyPowerReductionReq-
                         10BitTagReq- OBFF Disabled, EETLPPrefixBlk-
                LnkCap2: Supported Link Speeds: 2.5GT/s, Crosslink- Retimer- 2Retimers- DRS-
                LnkCtl2: Target Link Speed: 2.5GT/s, EnterCompliance- SpeedDis-
                         Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
                         Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
                LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
                         EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
                         Retimer- 2Retimers- CrosslinkRes: unsupported, FltMode-
        Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable+ 64bit+
                Address: 00000000fee00000  Data: 0000
                Masking: 00000000  Pending: 00000000
        Capabilities: [d0] Power Management version 3
                Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
                Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
        Capabilities: [100 v1] Alternative Routing-ID Interpretation (ARI)
                ARICap: MFVC- ACS-, Next Function: 0
                ARICtl: MFVC- ACS-, Function Group: 0
        Capabilities: [110 v1] Null
        Capabilities: [200 v1] Address Translation Service (ATS)
                ATSCap: Invalidate Queue Depth: 00
                ATSCtl: Enable+, Smallest Translation Unit: 00
        Capabilities: [420 v1] Physical Resizable BAR
                BAR 2: current size: 16GB, supported: 256MB 512MB 1GB 2GB 4GB 8GB 16GB
        Capabilities: [400 v1] Latency Tolerance Reporting
                Max snoop latency: 1048576ns
                Max no snoop latency: 1048576ns
        Kernel driver in use: xe
        Kernel modules: xe

@Tumblebatch Hi, you likely don’t have the Level Zero RPMs installed. Is libze_intel_gpu1 installed? You might also wish to add libze_intel_gpu_raytracing, which I pushed to a development repository (I have an upstream issue with this and Blender, so I’m not willing to push it to Factory yet).

https://build.opensuse.org/package/show/X11:XOrg/intel-level-zero-gpu-raytracing

I only have A-series GPUs but can switch to the Xe driver… It’s definitely a lot better with the Mesa improvements.

Oh, and in darktable under Preferences → Processing, is only the Intel GPU selected?

Hi @malcolmlewis,

Thanks for your quick reply!

If I remember correctly the issue presented itself with and without libze_intel_gpu1 installed.
In an attempt to resolve things I’ve removed all OpenCL related packages including this one. Currently libze_intel_gpu1 is not installed but clinfo reports no issues.

In darktable’s preferences I’ve tried various combinations of settings including only selecting Intel GPU for OpenCL.
I’ve also changed OpenCL related settings including device priority in the darktablerc file. Funny thing was, when I changed these so that the B580 GPU was allowed to be used for exporting photos only, things worked okay-ish.
What else is odd is that so far I’ve only identified 2 specific functions in darktable that cause this behavior - the ‘clipping indication’ toggle/switch and the ‘local contrast’ module. Gets weirder still - the local contrast module only freezes darktable when it is set to use ‘local laplacian filter’. Switching to ‘bilateral grid’ here works fine.
When it freezes/hangs darktable shows a few warnings in the terminal:

    44.9608 [dt_opencl_events_wait_for] reported CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST for device 0
    44.9608 [opencl_events_flush] execution of 'laplacian_assemble' failed: -777
    44.9608 [opencl_events_flush] execution of 'laplacian_assemble' failed: -777
    44.9608 [opencl_events_flush] execution of 'laplacian_assemble' failed: -777
    44.9608 [opencl_events_flush] execution of 'laplacian_assemble' failed: -777
    44.9608 [opencl_events_flush] execution of 'laplacian_assemble' failed: -777
    44.9608 [opencl_events_flush] execution of 'laplacian_assemble' failed: -777
    44.9608 [opencl_events_flush] execution of 'laplacian_assemble' failed: -777
    44.9608 [opencl_events_flush] execution of 'write_back' failed: -777
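
To see at a glance which kernels fail, the warnings above can be counted per kernel name and error code. This is just a sketch that assumes the `[opencl_events_flush]` message format shown here; `FAIL_RE` and `tally_failures` are names I made up, and I’m not attaching any meaning to the code -777 itself.

```python
import re
from collections import Counter

# Matches darktable's "[opencl_events_flush] execution of '<kernel>' failed: <code>" lines.
FAIL_RE = re.compile(
    r"\[opencl_events_flush\] execution of '(?P<kernel>[^']+)' failed: (?P<code>-?\d+)"
)

def tally_failures(lines):
    """Count failures per (kernel name, error code); non-matching lines are ignored."""
    counts = Counter()
    for line in lines:
        m = FAIL_RE.search(line)
        if m:
            counts[(m.group("kernel"), int(m.group("code")))] += 1
    return counts

# The terminal output quoted above: one wait-list error, seven
# 'laplacian_assemble' failures and one 'write_back' failure.
sample = (
    ["44.9608 [dt_opencl_events_wait_for] reported CL_EXEC_STATUS_ERROR_FOR_EVENTS_IN_WAIT_LIST for device 0"]
    + ["44.9608 [opencl_events_flush] execution of 'laplacian_assemble' failed: -777"] * 7
    + ["44.9608 [opencl_events_flush] execution of 'write_back' failed: -777"]
)

counts = tally_failures(sample)
for (kernel, code), n in counts.most_common():
    print(f"{kernel}: {n} failure(s) with code {code}")
```

Both kernel names look like they belong to the local laplacian code path, which matches the observation that only the ‘local laplacian filter’ mode of the local contrast module freezes.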

Maybe there’s a clue in there, not sure.

So all in all it looks like a very odd and very specific issue which is so far only triggered by darktable.
I did have some trouble with ComfyUI / PyTorch as well a while ago, didn’t think it was related but now I’m beginning to suspect it could be.

I’ll reinstall libze_intel_gpu1 and see what that does, have a look at your link and report back in a bit.

Thanks!

@Tumblebatch have you read this? https://docs.darktable.org/usermanual/development/en/special-topics/opencl/activate-opencl/

Hard to say if it’s the Xe driver, Mesa or Darktable. I’m assuming everything else is working fine with the Xe driver (I’m on GNOME here)? I have a nvidia gpu for Prime Render Offload and most processing…

@malcolmlewis ,

I’ve read just about every scrap of darktable documentation related to OpenCL including your link.
Prior to having this issue I’ve used darktable with OpenCL for several years, mostly with an Intel iGPU and for the past year or so with the Arc B580, and while there have been some issues with OpenCL they usually resolved themselves pretty quickly.
In any case I’m quite familiar with using OpenCL both in darktable and other software, but this is beyond the scope of my knowledge.

I forgot to mention that I also ran clpeak and the Phoronix OpenCL benchmarks as a test; both ran just fine.
At first sight it might look like the issue is caused by something in darktable then, everything else seems fine.
But after reading a bit about these ‘engine reset’ messages in the system logs I began to have my doubts.

This bug report describes it: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/2046986, and a few other sources mentioned it as well.

It’s a bit too technical for me to understand it all but it looks similar to what I’m seeing and if I’m not mistaken it was flagged as a kernel bug.
Any idea about this?

Again I’m sorry if this is the wrong place to ask. I’ve mentioned the issue in r/DarkTable as well but so far nothing useful has come from that.
Next stop: PIXLS.US I guess. Maybe darktable’s Github page.
There has to be somebody who might know what’s going on, right? :grimacing:

@Tumblebatch well your GPU is using the newer xe driver as opposed to i915. I suspect it’s a combination of Xe, Mesa and OpenCL support…

That appears to be the case indeed, at least as far as my understanding goes.
I was kind of hoping it was a bug in the Linux kernel (driver) itself that would be picked up by the developers. Similar things have happened before and usually they’re quite quick to squash those bugs, but this particular issue has persisted through at least 8-10 kernel updates I think, so for the moment we can only guess where it’s going sideways.

As for darktable itself, it still works on my Kubuntu installation on the same computer so all is not lost. We’ll just call it a minor inconvenience for the time being. :upside_down_face:

When I have a bit more time (and feel like it…) I’ll look into it further. Probably next year.

@malcolmlewis , thanks for your input. I appreciate it.
If anybody else has an idea or suggestion I’d love to hear it.

@Tumblebatch So what Darktable version, what kernel version, what Mesa version and what driver in use for the GPU on the other distribution?