Jellyfin Forum
Docker - lost nvidia/cuda after power outage - Printable Version


Docker - lost nvidia/cuda after power outage - turbochamp - 2024-09-28

My working Docker container is suddenly throwing CUDA errors after a power outage. Nothing else has changed; it worked for months until this damn storm. nvidia-container-toolkit is installed. Any thoughts?

Quote:
[AVHWDeviceContext @ 0x64df00579a40] Cannot load libcuda.so.1
[AVHWDeviceContext @ 0x64df00579a40] Could not dynamically load CUDA
Device creation failed: -1.
Failed to set value 'cuda=cu:0' for option 'init_hw_device': Operation not permitted
Error parsing global options: Operation not permitted


And here's my docker-compose:

Quote:
services:
  jellyfin:
    image: jellyfin/jellyfin
    container_name: jellyfin
    user: 962:962
    network_mode: "host"
    environment:
      - JELLYFIN_CACHE_DIR=/var/cache/jellyfin
      - JELLYFIN_CONFIG_DIR=/etc/jellyfin
      - JELLYFIN_DATA_DIR=/var/lib/jellyfin
      - JELLYFIN_LOG_DIR=/var/log/jellyfin
      - NVIDIA_VISIBLE_DEVICES=all
      - NVIDIA_DRIVER_CAPABILITIES=all
    volumes:
      - /etc/jellyfin:/etc/jellyfin
      - /var/cache/jellyfin:/var/cache/jellyfin
      - /var/lib/jellyfin:/var/lib/jellyfin
      - /var/log/jellyfin:/var/log/jellyfin
      - /mnt/jellyfin12:/mnt/jellyfin12
      - /mnt/Media-SSD:/mnt/Media-SSD
      - /mnt/jellyfin14:/mnt/jellyfin14
      - /mnt/jellyfin14-2:/mnt/jellyfin14-2
      - /mnt/jellyfin22:/mnt/jellyfin22
      - /mnt/jellyfin22-2:/mnt/jellyfin22-2
    restart: "unless-stopped"
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]



RE: Docker - lost nvidia/cuda after power outage - TheDreadPirate - 2024-09-28

What is the output of nvidia-smi in the container?

Code:
docker exec -it jellyfin nvidia-smi



RE: Docker - lost nvidia/cuda after power outage - turbochamp - 2024-09-28

(2024-09-28, 03:14 AM)TheDreadPirate Wrote: What is the output of nvidia-smi in the container?

Code:
docker exec -it jellyfin nvidia-smi

Thanks for quick reply, here is the output:

Code:
OCI runtime exec failed: exec failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown



RE: Docker - lost nvidia/cuda after power outage - TheDreadPirate - 2024-09-28

Try reinstalling the nvidia container toolkit.


RE: Docker - lost nvidia/cuda after power outage - turbochamp - 2024-09-28

(2024-09-28, 03:53 AM)TheDreadPirate Wrote: Try reinstalling the nvidia container toolkit.

Reinstalled nvidia-container-toolkit and restarted docker (service). However same output/issue.
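
A missing nvidia-smi inside the container may indicate that the container was started without the NVIDIA driver files being injected, which reinstalling the toolkit alone would not change. One way to look at this is to ask Docker what GPU request it recorded for the container. The sketch below assumes the container is named jellyfin as in the compose file above; the helper name gpu_request_summary and its messages are made up for illustration, and the string match assumes the compact JSON that docker inspect prints.

```shell
# Classify the JSON that docker inspect prints for .HostConfig.DeviceRequests.
# $1: the JSON text (hypothetical helper, not part of Docker).
gpu_request_summary() {
    case "$1" in
        ''|null|'[]') echo "no GPU device request recorded" ;;
        *'"Driver":"nvidia"'*) echo "device request names the nvidia driver" ;;
        *) echo "device request present but no nvidia driver named" ;;
    esac
}

# Only talk to Docker when the CLI is available on this machine.
if command -v docker >/dev/null 2>&1; then
    if requests=$(docker inspect --format '{{json .HostConfig.DeviceRequests}}' jellyfin 2>/dev/null); then
        gpu_request_summary "$requests"
    else
        echo "could not inspect a container named jellyfin"
    fi
fi
```

Running nvidia-smi directly on the host is a useful companion check, since it shows whether the driver itself is loaded after the reboot.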


RE: Docker - lost nvidia/cuda after power outage - TheDreadPirate - 2024-09-28

Code:
sudo apt list --installed | egrep -i "nvidia|libnv|cuda"



RE: Docker - lost nvidia/cuda after power outage - turbochamp - 2024-09-28

(2024-09-28, 04:27 AM)TheDreadPirate Wrote:
Code:
sudo apt list --installed | egrep -i "nvidia|libnv|cuda"

I am on Arch, but with
Code:
pacman -Qi nvidia libnv cuda
(libnv not found):

Quote:
pacman -Qi nvidia libnv cuda
Name            : nvidia-dkms-tkg
Version        : 560.35.03-258
Description    : NVIDIA kernel module sources (DKMS)
Architecture    : x86_64
URL            : http://www.nvidia.com/
Licenses        : custom:NVIDIA
Groups          : None
Provides        : nvidia=560.35.03  nvidia-dkms  nvidia-dkms-tkg=560.35.03  NVIDIA-MODULE
Depends On      : dkms  nvidia-utils-tkg>=560.35.03  nvidia-libgl  pahole
Optional Deps  : linux-headers [installed]
                  linux-lts-headers: Build the module for LTS Arch kernel
Required By    : None
Optional For    : None
Conflicts With  : nvidia  nvidia-dkms
Replaces        : None
Installed Size  : 80.06 MiB
Packager        : Unknown Packager
Build Date      : Fri 27 Sep 2024 10:12:50 PM EDT
Install Date    : Fri 27 Sep 2024 10:13:36 PM EDT
Install Reason  : Explicitly installed
Install Script  : No
Validated By    : None

error: package 'libnv' was not found
Name            : cuda
Version        : 12.6.1-1
Description    : NVIDIA's GPU programming toolkit
Architecture    : x86_64
URL            : https://developer.nvidia.com/cuda-zone
Licenses        : LicenseRef-NVIDIA-CUDA
Groups          : None
Provides        : cuda-toolkit  cuda-sdk  libcudart.so=12-64  libcublas.so=12-64  libcublas.so=12-64  libcusolver.so=11-64  libcusolver.so=11-64  libcusparse.so=12-64  libcusparse.so=12-64
Depends On      : opencl-nvidia  python  gcc13
Optional Deps  : gdb: for cuda-gdb [installed]
                  glu: required for some profiling tools in CUPTI [installed]
                  nvidia-utils: for NVIDIA drivers (not needed in CDI containers) [installed]
                  rdma-core: for GPUDirect Storage (libcufile_rdma.so)
Required By    : None
Optional For    : openmpi  openucx  sunshine-git
Conflicts With  : None
Replaces        : cuda-toolkit  cuda-sdk  cuda-static
Installed Size  : 4.72 GiB
Packager        : Jakub Klinkovský <lahwaacz@archlinux.org>
Build Date      : Fri 30 Aug 2024 12:39:20 PM EDT
Install Date    : Fri 30 Aug 2024 06:00:45 PM EDT
Install Reason  : Explicitly installed
Install Script  : Yes
Validated By    : Signature
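
For reference, pacman -Qi expects exact package names, which is why libnv came back as not found. A rough Arch counterpart to the apt/egrep command requested above could filter the full installed-package list by pattern instead. This is only a sketch; the function name filter_gpu_packages is made up for illustration.

```shell
# Keep only lines that mention nvidia, libnv, or cuda (case-insensitive),
# mirroring the egrep -i pattern from the apt command.
filter_gpu_packages() {
    grep -Ei 'nvidia|libnv|cuda'
}

# pacman -Q prints one "name version" line per installed package.
# Only run it when pacman is actually present on this machine.
if command -v pacman >/dev/null 2>&1; then
    pacman -Q | filter_gpu_packages || echo "no matching packages installed"
fi
```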



RE: Docker - lost nvidia/cuda after power outage - crobibero - 2024-09-28

Try updating your docker-compose to specify nvidia devices.

Formatting may be off since I pasted from my phone
Code:
deploy:
  resources:
    reservations:
      devices:
          - driver: nvidia
             count: all
             capabilities: [gpu]



RE: Docker - lost nvidia/cuda after power outage - turbochamp - 2024-09-28

(2024-09-28, 09:43 AM)crobibero Wrote: Try updating your docker-compose to specify nvidia devices.

I get this error when restarting docker. Line 30 is the "count: all" line. Here are the error and my changes:

Code:
yaml: line 30: mapping values are not allowed in this context

Code:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]



RE: Docker - lost nvidia/cuda after power outage - turbochamp - 2024-09-28

I guess my formatting was off. It now appears to be fixed, and it works!

Here it is with the right formatting:

Code:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

Then I ran docker-compose down, restarted the docker service, and ran docker-compose up.
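
The restart sequence described above could be wrapped so that a YAML mistake like the earlier "mapping values are not allowed" error is caught before anything is stopped. This is a minimal sketch, not the exact commands used in the thread: the function name restart_stack is made up, it assumes the Docker service is managed by systemd, and it starts the stack detached with -d.

```shell
# Validate the compose file, then restart the stack and the Docker service.
# $1: compose command to use (defaults to docker-compose, as in the thread).
restart_stack() {
    compose="${1:-docker-compose}"

    # "config --quiet" only parses and validates the compose file.
    "$compose" config --quiet || {
        echo "compose file is invalid; not restarting" >&2
        return 1
    }

    "$compose" down
    sudo systemctl restart docker   # assumption: systemd manages the docker service
    "$compose" up -d
}
```

Run it from the directory that holds the compose file, for example with `restart_stack` or `restart_stack docker-compose`.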