Jellyfin Forum
New system boots with a380 GPU wedged and will not playback video


New system boots with a380 GPU wedged and will not playback video - aj_pinner - 2024-04-29

Hi Jellyfinners,

I just built a new Jellyfin media server with a dedicated A380 GPU for transcoding. I followed the guide here exactly: https://jellyfin.org/docs/general/administration/hardware-acceleration/intel.

About 1 boot in 5, my GPU does not work at all, and I see a "Failed to initialize GPU, declaring it wedged!" error in the kernel log. When this error happens, ffmpeg errors out and the Jellyfin client can't play back the video.

Sometimes rebooting fixes the issue and Jellyfin works as expected; sometimes the system comes back up with a wedged GPU again. Either way, I can't have a media server that fails 20% of the time. I believe the problem is with the last part of the documentation: https://jellyfin.org/docs/general/administration/hardware-acceleration/intel#configure-and-verify-lp-mode-on-linux. Has anyone successfully followed the doc and got a working system? Is it an Intel driver issue, or is it possibly a bad GPU? Should I try another driver, and does anyone know a stable version?

System info:
Version: 10.8.13 from official docker image
Host: Ubuntu Server 22.04 with Hardware Enablement Stack and firmware-linux-nonfree driver
Kernel: Linux itx 6.5.0-28-generic #29~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Apr  4 14:39:20 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
Graphics card: ASRock Challenger A380

Here are some logs when the system will not playback video:

The key message here is: *ERROR* GT0: Failed to initialize GPU, declaring it wedged!

dmesg | grep i915:
Code:
[    1.919628] i915 0000:03:00.0: vgaarb: deactivate vga console
[    1.919672] i915 0000:03:00.0: [drm] Local memory IO size: 0x000000017c800000
[    1.919675] i915 0000:03:00.0: [drm] Local memory available: 0x000000017c800000
[    1.933363] i915 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[    1.936221] i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_08.bin (v2.8)
[    1.982883] i915 0000:03:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.bin version 70.5.1
[    1.982887] i915 0000:03:00.0: [drm] GT0: HuC firmware i915/dg2_huc_gsc.bin version 7.10.3
[    3.018230] i915 0000:03:00.0: [drm] GT0: GUC: load failed: status = 0x80000534, time = 1001ms, freq = 2400MHz, ret = -110
[    3.018261] i915 0000:03:00.0: [drm] GT0: GUC: load failed: status: Reset = 0, BootROM = 0x1A, UKernel = 0x05, MIA = 0x00, Auth = 0x02
[    3.018278] i915 0000:03:00.0: [drm] GT0: GUC: still extracting hwconfig table.
[    3.018755] i915 0000:03:00.0: [drm] *ERROR* GT0: GuC initialization failed -ETIMEDOUT
[    3.018765] i915 0000:03:00.0: [drm] *ERROR* GT0: Enabling uc failed (-5)
[    3.018773] i915 0000:03:00.0: [drm] *ERROR* GT0: Failed to initialize GPU, declaring it wedged!
[    3.025801] i915 0000:03:00.0: [drm:add_taint_for_CI [i915]] CI tainted:0x9 by intel_gt_set_wedged_on_init+0x34/0x50 [i915]
[    3.084559] [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 0
[    3.126572] fbcon: i915drmfb (fb0) is primary device
[    3.228997] i915 0000:03:00.0: [drm] fb0: i915drmfb frame buffer device
[    5.080356] mei_gsc i915.mei-gscfi.768: cl:host=01 me=32 fw disconnect request received
[    5.080383] mei i915.mei-gscfi.768-e2c2afa2-3817-4d19-9d95-06b16b588a5d: cannot connect
[    5.083341] mei_gsc i915.mei-gscfi.768: FW not ready: resetting: dev_state = 2 pxp = 0
[    5.083404] mei_gsc i915.mei-gscfi.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670000 00000000 00000000 E0020002 00000000
[    5.083475] mei_gsc i915.mei-gsc.768: FW not ready: resetting: dev_state = 2 pxp = 2
[    5.083499] mei_gsc i915.mei-gsc.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670000 00000000 00000000 E0020002 00000000
[    5.167953] snd_hda_intel 0000:04:00.0: bound 0000:03:00.0 (ops i915_audio_component_bind_ops [i915])
[    5.469923] i915 0000:03:00.0: [drm] *ERROR* failed to load huc via gsc -8
[    5.469940] mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: failed to bind 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915]): -8
[    5.470322] mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: adev bind failed: -8
[    5.470776] mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: Master comp add failed -8
[    5.470780] mei_pxp: probe of i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1 failed with error -8


Researching this "wedged" issue, I found posts going back to 2014 from users reporting the error, but they are always about much older kernels and about getting iGPUs working.

Here are the driver versions:
Code:
28 -rw-r--r-- 1 root root  25716 Feb 21 09:32 icl_dmc_ver1_07.bin
28 -rw-r--r-- 1 root root  25952 Feb 21 09:32 icl_dmc_ver1_09.bin
372 -rw-r--r-- 1 root root 380096 Feb 21 09:32 icl_guc_32.0.3.bin
380 -rw-r--r-- 1 root root 385280 Feb 21 09:32 icl_guc_33.0.0.bin
320 -rw-r--r-- 1 root root 324160 Feb 21 09:32 icl_guc_49.0.1.bin
320 -rw-r--r-- 1 root root 327488 Feb 21 09:32 icl_guc_62.0.0.bin
336 -rw-r--r-- 1 root root 343360 Feb 21 09:32 icl_guc_69.0.3.bin
272 -rw-r--r-- 1 root root 274496 Feb 21 09:32 icl_guc_70.1.1.bin
488 -rw-r--r-- 1 root root 498880 Feb 21 09:32 icl_huc_9.0.0.bin
480 -rw-r--r-- 1 root root 488960 Feb 21 09:32 icl_huc_ver8_4_3238.bin

Jellyfin log when the GPU is in this state:
Code:
[10:56:46] [ERR] [360] Jellyfin.Server.Middleware.ExceptionMiddleware: Error processing request. URL GET /videos/6eedea3a-5b2a-6f34-4bf5-fc38689342f6/hls1/main/0.ts.
MediaBrowser.Common.FfmpegException: FFmpeg exited with code 1


Client error when GPU is in this state:
Code:
The client isn't compatible with the media and the server isn't sending a compatible media format.

I tried this twice, reconfiguring the entire server, and got the same results the second time: 4 out of 5 boots work and hardware transcoding appears to work as normal (I get 600 fps), but 20% of the time I get an unusable GPU. If I made a mistake in following the doc, where would it be? Any information that can help troubleshoot would be greatly appreciated, as I have 2 weeks to decide whether to return the GPU if it is bad hardware.


RE: New system boots with a380 GPU wedged and will not playback video - TheDreadPirate - 2024-04-29

I have an A380 in my server and was running 22.04 with the 6.5 HWE kernel for a while. I didn't have anything happen like what you are describing.

Do you have more than 1 GPU in the system? Including an Intel iGPU?


RE: New system boots with a380 GPU wedged and will not playback video - aj_pinner - 2024-04-29

Hi TheDreadPirate, I was hoping to hear from you since I saw you commenting on other Intel ARC threads. Thanks for replying. I have a few questions for you that might help me narrow down the issue, since we have the same GPU and a similar setup. I picked the hardware based on Jellyfin recommendations, but I am not having much luck so far.

No, I only have the A380. I bought an F-series processor, so the GPU is the only video device in the system. This build is just for Jellyfin.

Initially it looked like 20% of the boots returned the GPU wedged error. I wrote a startup script that looks at the kernel log and reboots if it sees the GPU wedged message, but now the error is happening on every boot, so the machine is caught in a boot loop.
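A boot-time check like the one described above can loop forever when every boot comes up wedged. Here is a minimal sketch of a variant that caps the number of automatic reboots. It is not the script used in this thread, and STATE_FILE, MAX_REBOOTS and the function names are hypothetical names chosen for the example.

```shell
#!/bin/sh
# Sketch: detect a wedged i915 GPU in kernel log text and cap automatic reboots.
# STATE_FILE and MAX_REBOOTS are hypothetical names chosen for this example.

STATE_FILE="${STATE_FILE:-/var/tmp/gpu_wedged_reboots}"
MAX_REBOOTS="${MAX_REBOOTS:-3}"

# Succeeds if the kernel log text on stdin contains the wedged message.
gpu_is_wedged() {
    grep -q 'declaring it wedged'
}

# Print the number of consecutive wedged boots recorded so far (0 if none).
reboot_count() {
    if [ -f "$STATE_FILE" ]; then cat "$STATE_FILE"; else echo 0; fi
}

# Read kernel log text on stdin and print "ok", "reboot" or "give-up".
decide() {
    if ! gpu_is_wedged; then
        rm -f "$STATE_FILE"      # healthy boot: reset the counter
        echo ok
        return
    fi
    count=$(reboot_count)
    if [ "$count" -ge "$MAX_REBOOTS" ]; then
        echo give-up             # stop rebooting and leave the box up for debugging
    else
        echo $((count + 1)) > "$STATE_FILE"
        echo reboot
    fi
}

# Example wiring (as root, from a boot-time unit). Commented out so the
# sketch has no side effects when sourced:
# [ "$(dmesg | decide)" = "reboot" ] && systemctl reboot
```

Keeping the counter in a file that survives reboots is what breaks the loop: after MAX_REBOOTS consecutive wedged boots the script stops rebooting, so the machine stays reachable for troubleshooting.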

I realize that I may have made a config mistake, in the Intel GPU instructions here:

Configure And Verify LP Mode On Linux

"This also applies to the bleeding edge hardware such as 12th Gen Intel processors, ARC GPU and newer but step 2 should be skipped."

So the instructions say to skip step 2, which adds this option for the i915 kernel module:

Code:
sudo sh -c "echo 'options i915 enable_guc=2' >> /etc/modprobe.d/i915.conf"
sudo update-initramfs -u && sudo update-grub

Once I realized this, I removed the i915.conf file and ran:

Code:
sudo update-initramfs -u && sudo update-grub

again. Can you confirm that doing this would have updated the kernel again and removed the options i915 enable_guc=2 setting from it, or could this be responsible for the problems I am having? I have limited knowledge of this area.
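One way to check whether the option is really gone is to look for leftover "options i915" lines and to read the value the loaded module reports. The following is a minimal sketch, assuming a stock Ubuntu layout with initramfs-tools; the function name i915_options_in is hypothetical.

```shell
# Print every "options i915 ..." line found under a modprobe.d-style directory.
# The directory is a parameter so the same check can be pointed at a copy
# of the configuration unpacked from an initramfs.
i915_options_in() {
    dir="${1:-/etc/modprobe.d}"
    grep -rhs '^[[:space:]]*options[[:space:]]\{1,\}i915' "$dir" || true
}

# On the live system (not run automatically here):
#   i915_options_in /etc/modprobe.d
#   lsinitramfs "/boot/initrd.img-$(uname -r)" | grep modprobe.d
#   sudo cat /sys/module/i915/parameters/enable_guc
```

No output from the function means no file in that directory still sets an i915 option. The value read from /sys/module/i915/parameters/enable_guc is what the running driver was loaded with; as far as I know, -1 means the driver picked its per-platform default.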

I ran sudo apt update && sudo apt install -y firmware-linux-nonfree to install the latest driver. If I uninstall this driver and revert to the original one, will hardware encoding still work with this GPU? Does the HuC firmware exist in the 6.5.0-28-generic kernel, or do we definitely need this new driver?

Another thing I was curious about is the fan on this GPU: most of the time it does not spin. Every 20-30 seconds or so it spins for 10 seconds and then stops. If I do boot into a good state, even while transcoding, the fan spins more often and for longer, but it still keeps stopping. Is this normal? Does yours do this? I saw a Reddit post from a user who described rewiring his fan because it was doing something similar, but I couldn't find any more info.

Also, I don't know how to get the GPU temperature on Ubuntu 22.04. The intel_gpu_top tool does not show temperature. Is there a way to do it?
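One generic place to look is the kernel's hwmon tree in sysfs, where drivers publish whatever sensors they support. Whether i915 exposes a temperature sensor for an Arc card on a given kernel is not something this thread confirms, so treat the following as a sketch that only lists what is present; the function name list_hwmon_temps is hypothetical.

```shell
# List temperature readings from a hwmon-style tree, in degrees Celsius.
# The root directory is a parameter for the example; the real tree lives
# at /sys/class/hwmon.
list_hwmon_temps() {
    root="${1:-/sys/class/hwmon}"
    for dev in "$root"/hwmon*; do
        [ -r "$dev/name" ] || continue
        name=$(cat "$dev/name")
        for f in "$dev"/temp*_input; do
            [ -r "$f" ] || continue      # device has no temperature inputs
            milli=$(cat "$f")
            # hwmon reports temperatures in millidegrees Celsius
            echo "$name $(basename "$f") $((milli / 1000))C"
        done
    done
}

# On the live system (not run automatically here):
#   list_hwmon_temps
```

If no line starting with i915 shows up, the driver on that kernel is not publishing a temperature input through hwmon.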

Thanks.


RE: New system boots with a380 GPU wedged and will not playback video - TheDreadPirate - 2024-04-29

I'm on 24.04 with a newer version of intel_gpu_top and it still doesn't report temps.

I pretty much didn't have to do any of those LP steps with Arc. Just enabled Low Power encoding in Jellyfin. No issues with transcoding or tone mapping.

Code:
[    4.378986] i915 0000:03:00.0: [drm] VT-d active for gfx access
[    4.465957] i915 0000:03:00.0: vgaarb: deactivate vga console
[    4.465996] i915 0000:03:00.0: [drm] Local memory IO size: 0x000000017c800000
[    4.465998] i915 0000:03:00.0: [drm] Local memory available: 0x000000017c800000
[    4.479693] i915 0000:03:00.0: vgaarb: VGA decodes changed: olddecodes=io+mem,decodes=io+mem:owns=none
[    4.483869] i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_08.bin (v2.8)
[    4.487752] i915 0000:03:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.bin version 70.20.0
[    4.487757] i915 0000:03:00.0: [drm] GT0: HuC firmware i915/dg2_huc_gsc.bin version 7.10.3
[    4.496249] i915 0000:03:00.0: [drm] GT0: GUC: submission enabled
[    4.496251] i915 0000:03:00.0: [drm] GT0: GUC: SLPC enabled
[    4.496481] i915 0000:03:00.0: [drm] GT0: GUC: RC enabled
[    4.563149] [drm] Initialized i915 1.6.0 20230929 for 0000:03:00.0 on minor 1
[    4.563922] i915 display info: display version: 13
[    4.563924] i915 display info: cursor_needs_physical: no
[    4.563925] i915 display info: has_cdclk_crawl: no
[    4.563926] i915 display info: has_cdclk_squash: yes
[    4.563926] i915 display info: has_ddi: yes
[    4.563927] i915 display info: has_dp_mst: yes
[    4.563928] i915 display info: has_dsb: yes
[    4.563928] i915 display info: has_fpga_dbg: yes
[    4.563929] i915 display info: has_gmch: no
[    4.563930] i915 display info: has_hotplug: yes
[    4.563930] i915 display info: has_hti: no
[    4.563931] i915 display info: has_ipc: yes
[    4.563932] i915 display info: has_overlay: no
[    4.563932] i915 display info: has_psr: yes
[    4.563933] i915 display info: has_psr_hw_tracking: no
[    4.563934] i915 display info: overlay_needs_physical: no
[    4.563934] i915 display info: supports_tv: no
[    4.563935] i915 display info: has_hdcp: yes
[    4.563936] i915 display info: has_dmc: yes
[    4.563936] i915 display info: has_dsc: yes
[    4.589795] snd_hda_intel 0000:04:00.0: bound 0000:03:00.0 (ops i915_audio_component_bind_ops [i915])
[    4.591429] fbcon: i915drmfb (fb0) is primary device
[    4.662768] i915 0000:03:00.0: [drm] fb0: i915drmfb frame buffer device
[    5.345749] i915 0000:03:00.0: [drm] GT0: HuC: authenticated for all workloads
[    5.345764] mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915])

I can't really find anything conclusive online.

Ensure your board's BIOS is up to date. Enable Resizable BAR in your BIOS. Try turning off SRIOV in your BIOS.


RE: New system boots with a380 GPU wedged and will not playback video - aj_pinner - 2024-04-29

Okay, so with 24.04 you wouldn't have to install linux-firmware or the HWE kernel? I think everything should be fully supported?

I just tried booting from USB as a test, using the latest KDE Neon, which is based on Ubuntu 22.04.4 (so the same kernel version) without any of those mods, and saw the GPU wedged error as well, so this more or less rules out the config.

I think I should try to update to 24.04 and see what happens. Did you upgrade or clean install? I hear a lot of horror stories about upgrading to 22.04.

In the BIOS, Resizable BAR is on. I will try disabling SRIOV.

How about your fan: does it spin as I described, or is it more constant?


RE: New system boots with a380 GPU wedged and will not playback video - TheDreadPirate - 2024-04-29

When I upgraded to 24.04 it was a clean install only because I was also upgrading the SSD (oooooold Intel 160GB SATA2 SSD to NVMe SSD in signature).

Correct. 24.04 is on kernel 6.8 by default so fully supports Arc out of the box. The linux-firmware package was already installed out of the box.

I have not peeked inside my case to check the GPU fan, nor do I care. My server sits in my utility closet doing its thing.


RE: New system boots with a380 GPU wedged and will not playback video - aj_pinner - 2024-04-29

SRIOV was already off so that wasn't it.

Also, what size power supply do you have? I have a 400W unit, which should be enough: 12th-gen F-series i5 CPU, nothing overclocked, no extra fans or other peripherals, no SATA drives, stock CPU cooler. I wonder if it is enough for this ASRock A380 card.


RE: New system boots with a380 GPU wedged and will not playback video - aj_pinner - 2024-04-29

This is my goal: to get the server stable enough to sit there and do its thing without my intervention. When it boots successfully, as far as I can tell, it works. The CPU usage seems a little higher than expected when I am transcoding, but fps shows 600+, so that seems positive. Unfortunately, when it throws this GPU wedged error it doesn't work at all, so I can't tell if it's a software problem or a hardware problem.


After like 50 failed boots, I just had a successful one:

Code:
[    1.447596] i915 0000:03:00.0: vgaarb: deactivate vga console
[    1.447613] i915 0000:03:00.0: [drm] Local memory IO size: 0x000000017c800000
[    1.447615] i915 0000:03:00.0: [drm] Local memory available: 0x000000017c800000
[    1.461206] i915 0000:03:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=none
[    1.464118] i915 0000:03:00.0: [drm] Finished loading DMC firmware i915/dg2_dmc_ver2_08.bin (v2.8)
[    1.485580] i915 0000:03:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.bin version 70.5.1
[    1.485584] i915 0000:03:00.0: [drm] GT0: HuC firmware i915/dg2_huc_gsc.bin version 7.10.3
[    1.497628] i915 0000:03:00.0: [drm] GT0: GUC: submission enabled
[    1.497632] i915 0000:03:00.0: [drm] GT0: GUC: SLPC enabled
[    1.497858] i915 0000:03:00.0: [drm] GT0: GUC: RC enabled
[    1.517399] [drm] Initialized i915 1.6.0 20201103 for 0000:03:00.0 on minor 0
[    1.550000] fbcon: i915drmfb (fb0) is primary device
[    1.650900] i915 0000:03:00.0: [drm] fb0: i915drmfb frame buffer device
[    4.918698] mei i915.mei-gscfi.768-e2c2afa2-3817-4d19-9d95-06b16b588a5d: cannot connect
[    4.921583] mei_gsc i915.mei-gscfi.768: FW not ready: resetting: dev_state = 2 pxp = 0
[    4.921604] mei_gsc i915.mei-gscfi.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670000 00000000 00000000 E0020002 00000000
[    4.922336] mei_gsc i915.mei-gsc.768: FW not ready: resetting: dev_state = 2 pxp = 2
[    4.922357] mei_gsc i915.mei-gsc.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670000 00000000 00000000 E0020002 00000000
[    4.964479] snd_hda_intel 0000:04:00.0: bound 0000:03:00.0 (ops i915_audio_component_bind_ops [i915])
[    5.321272] i915 0000:03:00.0: [drm] GT0: HuC: authenticated for all workloads
[    5.321288] mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915])

It's very inconsistent, which makes me think it is a timing issue, which could be either a software or a hardware problem.


RE: New system boots with a380 GPU wedged and will not playback video - TheDreadPirate - 2024-04-29

The A380 is a 75w(?) GPU and transcoding does not use that much power. Your PSU is plenty.

Try reseating the GPU and power cables. Maybe it is not fully seated or something.


RE: New system boots with a380 GPU wedged and will not playback video - aj_pinner - 2024-04-29

Yes, my thinking as well: not the PSU.

I will try to reseat. I did notice a difference between your kernel log and mine. On the last successful boot I see these lines:

Code:
[    4.918698] mei i915.mei-gscfi.768-e2c2afa2-3817-4d19-9d95-06b16b588a5d: cannot connect
[    4.921583] mei_gsc i915.mei-gscfi.768: FW not ready: resetting: dev_state = 2 pxp = 0
[    4.921604] mei_gsc i915.mei-gscfi.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670000 00000000 00000000 E0020002 00000000
[    4.922336] mei_gsc i915.mei-gsc.768: FW not ready: resetting: dev_state = 2 pxp = 2
[    4.922357] mei_gsc i915.mei-gsc.768: unexpected reset: dev_state = ENABLED fw status = 00000345 84670000 00000000 00000000 E0020002 00000000
[    4.964479] snd_hda_intel 0000:04:00.0: bound 0000:03:00.0 (ops i915_audio_component_bind_ops [i915])
[    5.321272] i915 0000:03:00.0: [drm] GT0: HuC: authenticated for all workloads
[    5.321288] mei_pxp i915.mei-gsc.768-fbf6fcf1-96cf-4e2e-a6a6-1bab8cbe36b1: bound 0000:03:00.0 (ops i915_pxp_tee_component_ops [i915])

So it says it cannot connect, it resets, reports an unexpected reset, resets again, and then it is successful. In this state it works fully, but in your log you just get the successful line once, at the end of the sequence.

I have always seen this behavior: it always errors twice and then connects successfully on the 3rd try, so I was thinking it is some sort of timing issue.

Have you ever seen this before or did yours always connect successfully the first time?
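Apart from the mei_gsc retries, the two logs posted in this thread also differ in the GuC firmware version line: 70.5.1 in the logs from this system and 70.20.0 in TheDreadPirate's. Whether that difference matters for the wedged error is not established here. A minimal sketch for pulling that version out of dmesg output, so the two systems can be compared quickly, might look like this; the function name guc_version is hypothetical.

```shell
# Print the GuC firmware version found in kernel log text on stdin,
# e.g. from a line such as:
#   i915 0000:03:00.0: [drm] GT0: GuC firmware i915/dg2_guc_70.bin version 70.5.1
guc_version() {
    sed -n 's/.*GuC firmware [^ ]* version \([0-9.]*\).*/\1/p' | head -n 1
}

# On the live system (not run automatically here):
#   dmesg | guc_version
```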