Docker - lost nvidia/cuda after power outage

Docker - lost nvidia/cuda after power outage - turbochamp - 2024-09-28

My working docker container is suddenly throwing cuda errors after a power outage. Nothing else has changed; it worked for months until this damn storm. nvidia-container-toolkit is installed. Any thoughts?

Quote:[AVHWDeviceContext @ 0x64df00579a40] Cannot load libcuda.so.1
[AVHWDeviceContext @ 0x64df00579a40] Could not dynamically load CUDA
Device creation failed: -1.
Failed to set value 'cuda=cu:0' for option 'init_hw_device': Operation not permitted
Error parsing global options: Operation not permitted

And here's my docker-compose:

Quote:
services:
  jellyfin:
    image: jellyfin/jellyfin
    container_name: jellyfin
    user: 962:962
    network_mode: "host"
    environment:
      - JELLYFIN_CACHE_DIR=/var/cache/jellyfin
      - JELLYFIN_CONFIG_DIR=/etc/jellyfin
      - JELLYFIN_DATA_DIR=/var/lib/jellyfin
      - JELLYFIN_LOG_DIR=/var/log/jellyfin
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
    volumes:
      - /etc/jellyfin:/etc/jellyfin
      - /var/cache/jellyfin:/var/cache/jellyfin
      - /var/lib/jellyfin:/var/lib/jellyfin
      - /var/log/jellyfin:/var/log/jellyfin
      - /mnt/jellyfin12:/mnt/jellyfin12
      - /mnt/Media-SSD:/mnt/Media-SSD
      - /mnt/jellyfin14:/mnt/jellyfin14
      - /mnt/jellyfin14-2:/mnt/jellyfin14-2
      - /mnt/jellyfin22:/mnt/jellyfin22
      - /mnt/jellyfin22-2:/mnt/jellyfin22-2
    restart: "unless-stopped"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]

RE: Docker - lost nvidia/cuda after power outage - TheDreadPirate - 2024-09-28

What is the output of nvidia-smi in the container?

Code:
docker exec -it jellyfin nvidia-smi

RE: Docker - lost nvidia/cuda after power outage - turbochamp - 2024-09-28

(2024-09-28, 03:14 AM)TheDreadPirate Wrote: What is the output of nvidia-smi in the container?

Code:
docker exec -it jellyfin nvidia-smi

Thanks for quick reply, here is the output:

Code:
OCI runtime exec failed: exec failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown
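
For anyone hitting the same message: a missing nvidia-smi inside the container usually suggests the container was started without GPU access at all, since the NVIDIA Container Toolkit normally mounts that binary in from the host. A minimal sketch of that check, assuming the container is named jellyfin as in the compose file above and that the image provides a POSIX sh. The gpu_injected function and the DOCKER variable are hypothetical names used only for illustration.

```shell
#!/bin/sh
# Hypothetical helper, not part of Docker or Jellyfin.
# DOCKER can be overridden (e.g. for testing); it defaults to the docker CLI.
DOCKER="${DOCKER:-docker}"

# Report whether nvidia-smi is visible inside the given container.
gpu_injected() {
    container="$1"
    # command -v exits 0 only if nvidia-smi is on the container's PATH
    if $DOCKER exec "$container" sh -c 'command -v nvidia-smi >/dev/null 2>&1'; then
        echo "nvidia-smi found in $container"
    else
        echo "nvidia-smi missing in $container"
        return 1
    fi
}
```

Usage would be `gpu_injected jellyfin`. A "missing" result points at how the container was created rather than at ffmpeg or Jellyfin itself.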

RE: Docker - lost nvidia/cuda after power outage - TheDreadPirate - 2024-09-28

Try reinstalling the nvidia container toolkit.

RE: Docker - lost nvidia/cuda after power outage - turbochamp - 2024-09-28

(2024-09-28, 03:53 AM)TheDreadPirate Wrote: Try reinstalling the nvidia container toolkit.

Reinstalled nvidia-container-toolkit and restarted the docker service. However, I get the same output and the same issue.

RE: Docker - lost nvidia/cuda after power outage - TheDreadPirate - 2024-09-28

Code:
sudo apt list --installed | egrep -i "nvidia|libnv|cuda"

RE: Docker - lost nvidia/cuda after power outage - turbochamp - 2024-09-28

(2024-09-28, 04:27 AM)TheDreadPirate Wrote:
Code:
sudo apt list --installed | egrep -i "nvidia|libnv|cuda"

I am on Arch, so I queried with pacman instead:

Code:
pacman -Qi nvidia libnv cuda

Output (libnv was not found):

Quote:pacman -Qi nvidia libnv cuda
Name : nvidia-dkms-tkg
Version : 560.35.03-258
Description : NVIDIA kernel module sources (DKMS)
Architecture : x86_64
URL : http://www.nvidia.com/
Licenses : custom:NVIDIA
Groups : None
Provides : nvidia=560.35.03 nvidia-dkms nvidia-dkms-tkg=560.35.03 NVIDIA-MODULE
Depends On : dkms nvidia-utils-tkg>=560.35.03 nvidia-libgl pahole
Optional Deps : linux-headers [installed]
                linux-lts-headers: Build the module for LTS Arch kernel
Required By : None
Optional For : None
Conflicts With : nvidia nvidia-dkms
Replaces : None
Installed Size : 80.06 MiB
Packager : Unknown Packager
Build Date : Fri 27 Sep 2024 10:12:50 PM EDT
Install Date : Fri 27 Sep 2024 10:13:36 PM EDT
Install Reason : Explicitly installed
Install Script : No
Validated By : None

error: package 'libnv' was not found
Name : cuda
Version : 12.6.1-1
Description : NVIDIA's GPU programming toolkit
Architecture : x86_64
URL : https://developer.nvidia.com/cuda-zone
Licenses : LicenseRef-NVIDIA-CUDA
Groups : None
Provides : cuda-toolkit cuda-sdk libcudart.so=12-64 libcublas.so=12-64 libcublas.so=12-64 libcusolver.so=11-64 libcusolver.so=11-64 libcusparse.so=12-64 libcusparse.so=12-64
Depends On : opencl-nvidia python gcc13
Optional Deps : gdb: for cuda-gdb [installed]
                glu: required for some profiling tools in CUPTI [installed]
                nvidia-utils: for NVIDIA drivers (not needed in CDI containers) [installed]
                rdma-core: for GPUDirect Storage (libcufile_rdma.so)
Required By : None
Optional For : openmpi openucx sunshine-git
Conflicts With : None
Replaces : cuda-toolkit cuda-sdk cuda-static
Installed Size : 4.72 GiB
Packager : Jakub Klinkovský <lahwaacz@archlinux.org>
Build Date : Fri 30 Aug 2024 12:39:20 PM EDT
Install Date : Fri 30 Aug 2024 06:00:45 PM EDT
Install Reason : Explicitly installed
Install Script : Yes
Validated By : Signature
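
Note that pacman -Qi looks up exact package names, whereas the apt command filters the whole installed list by pattern, which is why libnv comes back as "not found". A rough Arch equivalent, as a sketch: pacman -Q lists every installed package with its version, and the same pattern can be applied with grep. The function name filter_gpu_packages is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper: keep only lines mentioning nvidia, libnv or cuda
# (case-insensitive), reading a package list on stdin.
filter_gpu_packages() {
    grep -Ei 'nvidia|libnv|cuda'
}

# Usage on Arch (not run here):
#   pacman -Q | filter_gpu_packages
```

This also catches packages whose names merely contain one of the patterns, such as the -tkg variants shown above.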

RE: Docker - lost nvidia/cuda after power outage - crobibero - 2024-09-28

Try updating your docker-compose to specify nvidia devices.

Formatting may be off since I pasted from my phone

Code:
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          count: all
          capabilities: [gpu]

RE: Docker - lost nvidia/cuda after power outage - turbochamp - 2024-09-28

(2024-09-28, 09:43 AM)crobibero Wrote: Try updating your docker-compose to specify nvidia devices.

I get this error when restarting docker. Line 30 is the "count: all" line. The error is shown first, then my changes:

Code:
yaml: line 30: mapping values are not allowed in this context

Code:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
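
A YAML error like this can be caught before restarting anything. In a YAML list item, keys that follow "- driver:" have to start in the same column as "driver", and "mapping values are not allowed in this context" is the typical complaint when they do not. As a sketch, docker-compose config -q parses the compose file and prints nothing when it is valid. The wrapper function check_compose, the COMPOSE variable, and the default file name docker-compose.yml are assumptions for illustration.

```shell
#!/bin/sh
# COMPOSE can be overridden (e.g. "docker compose" or a test stub).
COMPOSE="${COMPOSE:-docker-compose}"

# Hypothetical wrapper: validate a compose file without starting containers.
check_compose() {
    file="${1:-docker-compose.yml}"
    # "config -q" only validates; a non-zero exit status means the file is invalid
    if $COMPOSE -f "$file" config -q; then
        echo "OK: $file parses cleanly"
    else
        echo "ERROR: $file failed validation" >&2
        return 1
    fi
}
```

Running `check_compose` in the directory with the compose file would surface the line number of the indentation problem without touching the running container.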

RE: Docker - lost nvidia/cuda after power outage - turbochamp - 2024-09-28

I guess my formatting was off. It now appears to be fixed and it works!

Here's with the right formatting:

Code:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Then I ran docker-compose down, restarted the docker service, and ran docker-compose up.
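
That sequence can be sketched as a small script. It assumes the docker service is managed by systemd (hence systemctl, which the post does not name), that the stack should come back up detached (-d), and that the container is named jellyfin. The restart_stack function and the RUN variable are hypothetical; setting RUN=echo turns it into a dry run that only prints the commands.

```shell
#!/bin/sh
# RUN is empty by default, so the commands execute directly.
# RUN=echo prints them instead of running them.
RUN="${RUN:-}"

# Hypothetical helper: recreate the stack, then confirm the GPU is visible.
restart_stack() {
    $RUN docker-compose down &&
    $RUN sudo systemctl restart docker &&   # assumption: systemd-managed docker
    $RUN docker-compose up -d &&
    $RUN docker exec jellyfin nvidia-smi    # should now print the GPU table
}
```

The final nvidia-smi call is the same check suggested earlier in the thread, just without -it so it also works from a non-interactive script.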