Jellyfin Forum
CUDA updates - Printable Version

+- Jellyfin Forum (https://forum.jellyfin.org)
+-- Forum: Support (https://forum.jellyfin.org/f-support)
+--- Forum: General Questions (https://forum.jellyfin.org/f-general-questions)
+--- Thread: CUDA updates (/t-cuda-updates)



CUDA updates - k5rqo - 2024-06-23

Hi, I know this isn't strictly a Jellyfin question, but I don't know where else to ask, so I'm asking here.

I have Jellyfin running in a Docker container using the official image, and I'm passing through my NVIDIA Tesla GPU as described in the Jellyfin documentation for GPU passthrough. The correct drivers and nvidia-container-toolkit are installed on my host (Debian Bookworm).
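
The passthrough itself is essentially what the docs show, roughly something like this (container name, paths and ports here are placeholders, not my exact setup):

Code:
# Rough sketch of GPU passthrough with the official image and nvidia-container-toolkit;
# names, paths and ports below are placeholders
docker run -d \
  --name jellyfin \
  --gpus all \
  -e NVIDIA_DRIVER_CAPABILITIES=all \
  -p 8096:8096 \
  -v /srv/jellyfin/config:/config \
  -v /srv/jellyfin/cache:/cache \
  -v /srv/media:/media \
  jellyfin/jellyfin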

This works fine most of the time, but occasionally ffmpeg fails, saying there is no CUDA device available. I have attributed this to the drivers being updated on the host by unattended-upgrades, but whenever I get the ffmpeg error, I can't find any log of an NVIDIA component being updated.
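
For reference, these are the standard places on Debian I know of to check whether a package was pulled in automatically (the grep pattern is just an example):

Code:
# Was any nvidia package upgraded, and when?
grep -i nvidia /var/log/dpkg.log
zgrep -i nvidia /var/log/apt/history.log*
zgrep -i nvidia /var/log/unattended-upgrades/unattended-upgrades.log*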

Am I missing something here?


RE: CUDA updates - TheDreadPirate - 2024-06-23

I remember another user had this problem months ago. I don't recall what the solution was, if one was even found. And I can't find the thread at the moment.


RE: CUDA updates - pcm - 2024-06-24

I'd start with syslog and dmesg in the container to see what's going on when the error happens. If there's nothing in the container's syslog/dmesg, I'd check the host's dmesg.
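
Something along these lines (the container name is just an example, and a locked-down container may not be able to read the kernel log at all):

Code:
# On the host: kernel messages from the NVIDIA driver around the time of the failure
dmesg -T | grep -iE 'nvrm|nvidia'
journalctl -k --since "1 hour ago" | grep -i nvidia

# Inside the container (may be denied depending on how locked down it is)
docker exec jellyfin dmesg | tail -n 50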

Another thing you could do is enable nvlog.

Quote: I have attributed this to the drivers being updated on the host by unattended-upgrades, but whenever I get the ffmpeg error, I can't find any log of an NVIDIA component being updated.

IMHO unattended-upgrades should not cause this behavior (at least it doesn't for me, and I am way behind on upgrades for my GPU)... It could be an actual hardware issue with your specific GPU, or a bug in your specific GPU driver (either in the passthrough module or somewhere else)...


RE: CUDA updates - k5rqo - 2024-06-24

(2024-06-24, 04:29 PM)pcm Wrote: I'd start with syslog and dmesg in the container to see what's going on when the error happens. If there's nothing in the container's syslog/dmesg, I'd check the host's dmesg.
I don't think the container allows this, as it's good practice to lock containers down as much as possible.

(2024-06-24, 04:29 PM)pcm Wrote: Another thing you could do is enable nvlog.
I can't find anything about this online. Could you explain a bit more?

(2024-06-24, 04:29 PM)pcm Wrote: IMHO unattended-upgrades should not cause this behavior (at least it doesn't for me, and I am way behind on upgrades for my GPU)... It could be an actual hardware issue with your specific GPU, or a bug in your specific GPU driver (either in the passthrough module or somewhere else)...
I actually do think this could be caused by a driver upgrade. The container has loaded a user-space library that talks to the passed-through device; if the host driver suddenly changes, that library can no longer communicate with the GPU because it is now running against a mismatched kernel module.
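
A quick way to check whether that's what happened would be to compare the kernel module version on the host with what the container sees, something like this (the container name is a placeholder, and it assumes the toolkit mounts nvidia-smi into the container):

Code:
# On the host: version of the currently loaded NVIDIA kernel module
cat /proc/driver/nvidia/version

# Inside the container: the user-space libraries that were mounted in when it started.
# If the host driver was upgraded underneath it, this typically fails with
# "Failed to initialize NVML: Driver/library version mismatch".
docker exec jellyfin nvidia-smi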


RE: CUDA updates - pcm - 2024-06-24

Now that you mention it, that does make sense.
But wouldn't restarting the container fix the issue? Containers are meant to be ephemeral anyway...
Does the host machine capture any dmesg logs?

It's the nvidia-debugdump command. I just had an alias set up... my bad.
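
If I remember the flags right, basic usage would be something like:

Code:
# List the GPUs the driver can see (on the host, or in the container if the binary is mounted there)
nvidia-debugdump --list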


RE: CUDA updates - k5rqo - 2024-06-24

(2024-06-24, 08:16 PM)pcm Wrote: Now that you mention it, that does make sense.
But wouldn't restarting the container fix the issue? Containers are meant to be ephemeral anyway...
Yes, that does fix it, but my problem is that I want to know what causes the sudden driver update. :)

(2024-06-24, 08:16 PM)pcm Wrote: Does the host machine capture any dmesg logs?
I'll try to spot something next time it occurs.

(2024-06-24, 08:16 PM)pcm Wrote: It's the nvidia-debugdump command. I just had an alias set up... my bad.
All good, I'll try that too.


RE: CUDA updates - CleverId10t - 2024-06-26

I have experienced this, and turning off auto updates "fixed" it (as did a reboot of the docker host).

As I had a simple solution (turning off auto update), I didn't bother investigating further.
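
If anyone else wants to turn it off on Debian, it's roughly:

Code:
# Disable unattended-upgrades; answer "No" at the prompt.
# (Equivalent to setting APT::Periodic::Unattended-Upgrade "0"; in /etc/apt/apt.conf.d/20auto-upgrades)
sudo dpkg-reconfigure -plow unattended-upgrades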


RE: CUDA updates - k5rqo - 2024-06-27

(2024-06-26, 09:31 PM)CleverId10t Wrote: I have experienced this, and turning off auto updates "fixed" it (as did a reboot of the docker host).

What method did you use for auto updates?


RE: CUDA updates - k5rqo - 2024-06-30

I just encountered the issue again. It seems I wasn't able to find previous automatic installations of NVIDIA-related packages because the unattended-upgrades log was overwritten each time unattended-upgrades ran. I will now blacklist these packages from auto-updating with the following:

Code:
# /etc/apt/apt.conf.d/50unattended-upgrades
Unattended-Upgrade::Package-Blacklist {
       ".*nvidia.*";
};
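
To make sure the blacklist is actually picked up, a dry run should show the NVIDIA packages being excluded, something like:

Code:
# Dry run; blacklisted packages should be reported as excluded in the debug output
sudo unattended-upgrade --dry-run --debug 2>&1 | grep -i nvidia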

Even if you would prefer to keep the NVIDIA packages auto-updated, they don't play nicely with unattended-upgrades (it's NVIDIA, after all). I think the best solution for everyone is to update them manually once in a while.