    Docker - lost nvidia/cuda after power outage

    turbochamp
    Offline

    Junior Member

    Posts: 6
    Threads: 1
    Joined: 2024 Sep
    Reputation: 0
    Country:United States
    #1
    2024-09-28, 02:53 AM
    My working docker container is suddenly throwing cuda errors after a power outage. Nothing else has changed; it worked for months until this damn storm. nvidia-container-toolkit is installed. Any thoughts?

    Quote:
    [AVHWDeviceContext @ 0x64df00579a40] Cannot load libcuda.so.1
    [AVHWDeviceContext @ 0x64df00579a40] Could not dynamically load CUDA
    Device creation failed: -1.
    Failed to set value 'cuda=cu:0' for option 'init_hw_device': Operation not permitted
    Error parsing global options: Operation not permitted


    And here's my docker-compose:

    Quote:
    services:
      jellyfin:
        image: jellyfin/jellyfin
        container_name: jellyfin
        user: 962:962
        network_mode: "host"
        environment:
          - JELLYFIN_CACHE_DIR=/var/cache/jellyfin
          - JELLYFIN_CONFIG_DIR=/etc/jellyfin
          - JELLYFIN_DATA_DIR=/var/lib/jellyfin
          - JELLYFIN_LOG_DIR=/var/log/jellyfin
          - NVIDIA_VISIBLE_DEVICES=all
          - NVIDIA_DRIVER_CAPABILITIES=all
        volumes:
          - /etc/jellyfin:/etc/jellyfin
          - /var/cache/jellyfin:/var/cache/jellyfin
          - /var/lib/jellyfin:/var/lib/jellyfin
          - /var/log/jellyfin:/var/log/jellyfin
          - /mnt/jellyfin12:/mnt/jellyfin12
          - /mnt/Media-SSD:/mnt/Media-SSD
          - /mnt/jellyfin14:/mnt/jellyfin14
          - /mnt/jellyfin14-2:/mnt/jellyfin14-2
          - /mnt/jellyfin22:/mnt/jellyfin22
          - /mnt/jellyfin22-2:/mnt/jellyfin22-2
        restart: "unless-stopped"
        deploy:
          resources:
            reservations:
              devices:
                - capabilities: [gpu]
    TheDreadPirate
    Offline

    Community Moderator

    Posts: 15,374
    Threads: 10
    Joined: 2023 Jun
    Reputation: 460
    Country:United States
    #2
    2024-09-28, 03:14 AM
    What is the output of nvidia-smi in the container?

    Code:
    docker exec -it jellyfin nvidia-smi
    Jellyfin 10.10.7 (Docker)
    Ubuntu 24.04.2 LTS w/HWE
    Intel i3 12100
    Intel Arc A380
    OS drive - SK Hynix P41 1TB
    Storage
        4x WD Red Pro 6TB CMR in RAIDZ1
    turbochamp
    #3
    2024-09-28, 03:18 AM
    (2024-09-28, 03:14 AM)TheDreadPirate Wrote: What is the output of nvidia-smi in the container?

    Code:
    docker exec -it jellyfin nvidia-smi

    Thanks for the quick reply. Here is the output:

    Code:
    OCI runtime exec failed: exec failed: unable to start container process: exec: "nvidia-smi": executable file not found in $PATH: unknown
    TheDreadPirate
    #4
    2024-09-28, 03:53 AM
    Try reinstalling the nvidia container toolkit.
    turbochamp
    #5
    2024-09-28, 03:58 AM (This post was last modified: 2024-09-28, 04:00 AM by turbochamp. Edited 1 time in total.)
    (2024-09-28, 03:53 AM)TheDreadPirate Wrote: Try reinstalling the nvidia container toolkit.

    I reinstalled nvidia-container-toolkit and restarted the docker service, but I get the same output and the same issue.
    TheDreadPirate
    #6
    2024-09-28, 04:27 AM
    Code:
    sudo apt list --installed | egrep -i "nvidia|libnv|cuda"
    turbochamp
    #7
    2024-09-28, 04:49 AM (This post was last modified: 2024-09-28, 04:50 AM by turbochamp. Edited 1 time in total.)
    (2024-09-28, 04:27 AM)TheDreadPirate Wrote:
    Code:
    sudo apt list --installed | egrep -i "nvidia|libnv|cuda"

    I am on Arch, but with
    Code:
    pacman -Qi nvidia libnv cuda
    (libnv not found):

    Quote:
    pacman -Qi nvidia libnv cuda
    Name            : nvidia-dkms-tkg
    Version         : 560.35.03-258
    Description     : NVIDIA kernel module sources (DKMS)
    Architecture    : x86_64
    URL             : http://www.nvidia.com/
    Licenses        : custom:NVIDIA
    Groups          : None
    Provides        : nvidia=560.35.03  nvidia-dkms  nvidia-dkms-tkg=560.35.03  NVIDIA-MODULE
    Depends On      : dkms  nvidia-utils-tkg>=560.35.03  nvidia-libgl  pahole
    Optional Deps   : linux-headers [installed]
                      linux-lts-headers: Build the module for LTS Arch kernel
    Required By     : None
    Optional For    : None
    Conflicts With  : nvidia  nvidia-dkms
    Replaces        : None
    Installed Size  : 80.06 MiB
    Packager        : Unknown Packager
    Build Date      : Fri 27 Sep 2024 10:12:50 PM EDT
    Install Date    : Fri 27 Sep 2024 10:13:36 PM EDT
    Install Reason  : Explicitly installed
    Install Script  : No
    Validated By    : None

    error: package 'libnv' was not found
    Name            : cuda
    Version         : 12.6.1-1
    Description     : NVIDIA's GPU programming toolkit
    Architecture    : x86_64
    URL             : https://developer.nvidia.com/cuda-zone
    Licenses        : LicenseRef-NVIDIA-CUDA
    Groups          : None
    Provides        : cuda-toolkit  cuda-sdk  libcudart.so=12-64  libcublas.so=12-64  libcublas.so=12-64  libcusolver.so=11-64  libcusolver.so=11-64  libcusparse.so=12-64  libcusparse.so=12-64
    Depends On      : opencl-nvidia  python  gcc13
    Optional Deps   : gdb: for cuda-gdb [installed]
                      glu: required for some profiling tools in CUPTI [installed]
                      nvidia-utils: for NVIDIA drivers (not needed in CDI containers) [installed]
                      rdma-core: for GPUDirect Storage (libcufile_rdma.so)
    Required By     : None
    Optional For    : openmpi  openucx  sunshine-git
    Conflicts With  : None
    Replaces        : cuda-toolkit  cuda-sdk  cuda-static
    Installed Size  : 4.72 GiB
    Packager        : Jakub Klinkovský <lahwaacz@archlinux.org>
    Build Date      : Fri 30 Aug 2024 12:39:20 PM EDT
    Install Date    : Fri 30 Aug 2024 06:00:45 PM EDT
    Install Reason  : Explicitly installed
    Install Script  : Yes
    Validated By    : Signature
    crobibero
    Offline

    Core Team (Server & Plugins)

    Posts: 243
    Threads: 0
    Joined: 2023 Jun
    Reputation: 17
    Country:United States
    #8
    2024-09-28, 09:43 AM (This post was last modified: 2024-09-28, 09:45 AM by crobibero. Edited 1 time in total.)
    Try updating your docker-compose to specify nvidia devices.

    Formatting may be off since I pasted from my phone
    Code:
    deploy:
      resources:
        reservations:
          devices:
              - driver: nvidia
                 count: all
                 capabilities: [gpu]
    turbochamp
    #9
    2024-09-28, 01:31 PM (This post was last modified: 2024-09-28, 01:34 PM by turbochamp. Edited 2 times in total.)
    (2024-09-28, 09:43 AM)crobibero Wrote: Try updating your docker-compose to specify nvidia devices.

    I get this error when restarting docker. Line 30 is the "count: all" line. Here are my changes:

    Code:
    yaml: line 30: mapping values are not allowed in this context

    Code:
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: all
                  capabilities: [gpu]
    turbochamp
    #10
    2024-09-28, 01:47 PM
    I guess my formatting was off. It now appears to be fixed, and it works!

    Here it is with the right formatting:

    Code:
        deploy:
          resources:
            reservations:
              devices:
                - driver: nvidia
                  count: all
                  capabilities: [gpu]

    Then I ran docker-compose down, restarted the docker service, and ran docker-compose up.
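
    For reference, a minimal sketch of how the fix from this thread sits inside the full service definition. Only the driver, count, and capabilities keys come from the posts above; the volumes are left out for brevity. The comment about the YAML error is an assumption about the likely cause, not something confirmed in the thread. Running docker-compose config before bringing the stack up should surface indentation mistakes like the one in post #9.

    ```yaml
    services:
      jellyfin:
        image: jellyfin/jellyfin
        container_name: jellyfin
        environment:
          - NVIDIA_VISIBLE_DEVICES=all
          - NVIDIA_DRIVER_CAPABILITIES=all
        # volumes and the remaining settings from the original post omitted for brevity
        restart: "unless-stopped"
        deploy:
          resources:
            reservations:
              devices:
                # "count" and "capabilities" must start in the same column as "driver".
                # Assumption: indenting them deeper than "driver" is one common way to get
                # "mapping values are not allowed in this context".
                - driver: nvidia
                  count: all
                  capabilities: [gpu]
    ```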