Thread Pool Starvation - Printable Version

+- Jellyfin Forum (https://forum.jellyfin.org)
+-- Forum: Support (https://forum.jellyfin.org/f-support)
+--- Forum: Troubleshooting (https://forum.jellyfin.org/f-troubleshooting)
+--- Thread: Thread Pool Starvation (/t-thread-pool-starvation)

Thread Pool Starvation - natzilla - 2023-09-04

I am facing a situation where my system is being flooded with processes from /usr/bin/jellyfin.

[Image: sid-e42ef379647a1bdb1a6b3468f51b79df0ceb...b?type=raw]

There are hundreds, possibly thousands, of these entries in htop. It's obvious something is hung here, and I'd like to know some ways to investigate it further. I have not rebooted the server, which in my experience does clear it, because I want to find the root cause first.

More confirmation details regarding CPU usage being starved.

[Image: sid-81831d7faa9aa79844de86033e2b84486d22...3?type=raw]

[Image: sid-4d1e1731e230f52b49d43cdda56195b0cf66...c?type=raw]

● jellyfin.service - Jellyfin Media Server
     Loaded: loaded (/lib/systemd/system/jellyfin.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/jellyfin.service.d
             └─jellyfin.service.conf
     Active: active (running) since Fri 2023-09-01 19:29:02 UTC; 2 days ago
   Main PID: 732 (jellyfin)
      Tasks: 3636 (limit: 18546)
     Memory: 13.7G
        CPU: 2d 19h 15min 58.966s
     CGroup: /system.slice/jellyfin.service
             └─732 /usr/bin/jellyfin --webdir=/usr/share/jellyfin/web --restartpath=/usr/lib/jellyfin/restart.sh --ffmpeg=/usr/lib/jellyfin-ffmpeg/ffmpeg

Sep 04 14:59:29 jellyfin jellyfin[732]: [14:59:29] [WRN] As of "09/04/2023 14:59:09 +00:00", the heartbeat has been running for "00:00:20.7576155" which is longer than "00:00:01". This could be caused by thread pool starvation.
Sep 04 14:59:50 jellyfin jellyfin[732]: [14:59:50] [WRN] As of "09/04/2023 14:59:31 +00:00", the heartbeat has been running for "00:00:10.8371778" which is longer than "00:00:01". This could be caused by thread pool starvation.
Sep 04 15:00:13 jellyfin jellyfin[732]: [15:00:13] [WRN] As of "09/04/2023 14:59:51 +00:00", the heartbeat has been running for "00:00:21.2132112" which is longer than "00:00:01". This could be caused by thread pool starvation.
Sep 04 15:00:42 jellyfin jellyfin[732]: [15:00:42] [WRN] As of "09/04/2023 15:00:22 +00:00", the heartbeat has been running for "00:00:20.3549039" which is longer than "00:00:01". This could be caused by thread pool starvation.
Sep 04 15:00:53 jellyfin jellyfin[732]: [15:00:53] [WRN] As of "09/04/2023 15:00:43 +00:00", the heartbeat has been running for "00:00:10.0880467" which is longer than "00:00:01". This could be caused by thread pool starvation.
Sep 04 15:01:16 jellyfin jellyfin[732]: [15:01:16] [WRN] As of "09/04/2023 15:00:55 +00:00", the heartbeat has been running for "00:00:21.4189816" which is longer than "00:00:01". This could be caused by thread pool starvation.
Sep 04 15:01:25 jellyfin jellyfin[732]: [15:01:25] [WRN] As of "09/04/2023 15:01:18 +00:00", the heartbeat has been running for "00:00:07.2229611" which is longer than "00:00:01". This could be caused by thread pool starvation.
Sep 04 15:01:37 jellyfin jellyfin[732]: [15:01:37] [WRN] As of "09/04/2023 15:01:26 +00:00", the heartbeat has been running for "00:00:10.6905181" which is longer than "00:00:01". This could be caused by thread pool starvation.
Sep 04 15:01:39 jellyfin jellyfin[732]: [15:01:39] [WRN] As of "09/04/2023 15:01:38 +00:00", the heartbeat has been running for "00:00:01.5668541" which is longer than "00:00:01". This could be caused by thread pool starvation.
Sep 04 15:01:43 jellyfin jellyfin[732]: [15:01:43] [WRN] As of "09/04/2023 15:01:41 +00:00", the heartbeat has been running for "00:00:02.4147117" which is longer than "00:00:01". This could be caused by thread pool starvation.

System details

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 22.04.3 LTS
Release: 22.04
Codename: jammy

NVIDIA-SMI 525.125.06 Driver Version: 525.125.06 CUDA Version: 12.0
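
One way to investigate further: the status output above lists only PID 732 under the CGroup while reporting 3636 tasks, and htop shows userland threads as separate rows by default (the H key toggles this). That suggests the entries may be threads of a single Jellyfin process rather than separate instances. A minimal sketch for checking this on Linux, assuming /proc is mounted; `count_threads` is a hypothetical helper name, not a Jellyfin or systemd tool:

```shell
# Count the threads (kernel tasks) of one process by listing /proc/<pid>/task.
# count_threads is a hypothetical helper name used only for illustration.
count_threads() {
    pid="$1"
    if [ -d "/proc/$pid/task" ]; then
        # Each entry under task/ is one thread of the process.
        ls "/proc/$pid/task" | wc -l | tr -d ' '
    else
        echo "no such process: $pid" >&2
        return 1
    fi
}
```

Running `count_threads 732` (the Main PID from the status output) and comparing the result with the Tasks line would show whether nearly all of those tasks belong to that one process.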

RE: Thread Pool Starvation - TheDreadPirate - 2023-09-04

Can you describe your setup? Number of users, GPU used for transcoding, storage for the VM/container, is this storage local or remote? Some local, some remote?

RE: Thread Pool Starvation - natzilla - 2023-09-04

(2023-09-04, 03:27 PM)TheDreadPirate Wrote: Can you describe your setup? Number of users, GPU used for transcoding, storage for the VM/container, is this storage local or remote? Some local, some remote?

Day to day, the number of active users can reach 4-6, but it is mostly around 2-3.
GPU is a Quadro P400. Not everything needs to transcode, but I did install the patch for unlocking the limit a while ago.
Storage for this VM is 200GB for the system; media storage is a local NFS share.

RE: Thread Pool Starvation - Venson - 2023-09-04

Although I cannot put my finger on it, there seems to be something fundamentally wrong with this setup. I see some ffmpeg processes crashing for no apparent reason, lots of network issues with corrupt packets, the Playback tracker not being cleaned up, and more. Chapter extractions are also being aborted.

I don't think it's actually JF's issue, but you really have somehow started tons of JF instances.

RE: Thread Pool Starvation - natzilla - 2023-09-04

(2023-09-04, 03:32 PM)Venson Wrote: Although I cannot put my finger on it, there seems to be something fundamentally wrong with this setup. I see some ffmpeg processes crashing for no apparent reason, lots of network issues with corrupt packets, the Playback tracker not being cleaned up, and more. Chapter extractions are also being aborted.

I don't think it's actually JF's issue, but you really have somehow started tons of JF instances.

Your comment made me think it might be requests coming from my reverse proxy, but I paused that container and it had no effect. I am watching the CPU counter drop and then shoot back up, so you are right.

RE: Thread Pool Starvation - TheDreadPirate - 2023-09-04

(2023-09-04, 03:30 PM)natzilla Wrote: Storage for this VM is 200GB for the system

Can you get more specific about the 200GB VM storage? What I'm trying to get at is whether the storage is local and what file system. All of the problems here and what Venson mentioned tell me that there is an issue with disk I/O and throughput.

How many VMs are you running on this machine?

RE: Thread Pool Starvation - natzilla - 2023-09-04

(2023-09-04, 04:08 PM)TheDreadPirate Wrote:
(2023-09-04, 03:30 PM)natzilla Wrote: Storage for this VM is 200GB for the system

Can you get more specific about the 200GB VM storage? What I'm trying to get at is whether the storage is local and what file system. All of the problems here and what Venson mentioned tell me that there is an issue with disk I/O and throughput.

How many VMs are you running on this machine?

Sure. Jellyfin is currently the only VM running on this specific drive in my hypervisor. I have other disks for other VMs but kept Jellyfin on its own. It's a Samsung 870 EVO with a total capacity of 500GB, but I limited the VM to 200GB.

Storage is fully local to the hypervisor. The file system should be ext4 inside the guest, and the VM disk is raw.

[Image: sid-71ea680157634557fa07ea1b08d6b691e7bc...0?type=raw]

RE: Thread Pool Starvation - natzilla - 2023-09-04

The system appears to have calmed down now. I didn't do anything to it at all, so I am at a loss. I checked the scheduled tasks page for anything that was running, but everything there ran hours ago and took less than a minute.

Edit: I take that back, the issue returned.

RE: Thread Pool Starvation - natzilla - 2023-10-01

This still appears to be a problem after the 10.8.11 update. It is still very random, and I'm not sure what's causing it.

RE: Thread Pool Starvation - pcm - 2024-06-05

I'm wondering if /usr/lib/jellyfin/restart.sh has something to do with it. This is a wild stab in the dark, but I'm thinking the jellyfin process somehow decides it is not healthy and keeps trying to restart itself using the restart.sh script.

Someone familiar with how the --restartpath flag works might be able to weigh in better.

In the meantime, could you provide the last few lines of journalctl?

Code:
journalctl -u jellyfin -n 200 --no-pager
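
To see how the heartbeat delays change over time, the warnings quoted earlier in the thread can be reduced to plain numbers. The following is a rough sketch, not an official Jellyfin tool: `heartbeat_delays` is a hypothetical helper name, and it assumes the warning text keeps the exact wording `has been running for "HH:MM:SS.fraction"` seen in the log excerpt above.

```shell
# heartbeat_delays: hypothetical helper for illustration only.
# Reads journal lines on stdin and prints, for each heartbeat warning,
# the reported duration in seconds (one value per line).
heartbeat_delays() {
    awk -F'has been running for "' '
        NF > 1 {
            split($2, rest, "\"")     # rest[1] is e.g. 00:00:20.7576155
            split(rest[1], t, ":")    # t[1]=hours, t[2]=minutes, t[3]=seconds
            printf "%.1f\n", t[1] * 3600 + t[2] * 60 + t[3]
        }'
}
```

For example, `journalctl -u jellyfin --no-pager | heartbeat_delays | sort -n | tail -n 1` would print the longest delay in the journal. Lines that do not contain the warning text are ignored.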