I forgot simple incompetence repeating old errors; silly me.
Threads are fine, up to a limit of two per physical core. Beyond that you want "job fragments" in a queue, and those are handled by one of a few worker threads. That can be extremely scalable.
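For anyone who wants the shape of that spelled out, here is a minimal sketch in Python using only the standard `queue` and `threading` modules. The names `run_fragments` and `square` and the worker count of 4 are made up for illustration; they are not from any real system discussed here.

```python
import os
import queue
import threading

def run_fragments(fragments, handler, workers=None):
    """Process job fragments with a fixed pool of worker threads."""
    # Assumption: os.cpu_count() reports logical cores; it is used here
    # only as a rough default for "a few workers".
    workers = workers or (os.cpu_count() or 1)
    jobs = queue.Queue()
    results = []
    results_lock = threading.Lock()

    def worker():
        while True:
            item = jobs.get()        # blocks while idle, no spinning
            if item is None:         # sentinel: shut this worker down
                break
            out = handler(item)
            with results_lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for fragment in fragments:
        jobs.put(fragment)
    for _ in threads:
        jobs.put(None)               # one sentinel per worker
    for t in threads:
        t.join()
    return results, len(threads)

def square(n):
    # Hypothetical stand-in for a real job fragment.
    return n * n

results, thread_count = run_fragments(range(1000), square, workers=4)
print(len(results), thread_count)
```

The point is that the thread count is fixed by `workers`, not by how many fragments are queued: 1000 fragments still run on 4 threads.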
It's always the boring errors when solving boring problems.
That's technically what this is supposed to be doing. There should be one worker per core. However, somewhere along the line idiocy crept in, and the consumer end of the queue drains the entire queue until there is no work left, spawning a thread for each item. It just didn't explode until the queue was busy enough for the entire machine's resources to be pissed away on the cumulative spinlocks, which eventually exceeded its ability to process work, leading to a lock concurrency issue and deadlock. That happened after it had worked fine for a couple of years, by the looks of it. So the people originally looking at it were obviously working from a false assumption: "it worked fine up until now, so I don't see what the issue is".
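To make the difference concrete, here is a rough Python sketch of the two consumer shapes as I understand them from the description above. It is not the original code (which is .NET), and `drain_and_spawn`, `fixed_pool` and `fill` are hypothetical names invented for the example.

```python
import queue
import threading

def drain_and_spawn(jobs, handler):
    """Anti-pattern: drain the whole queue, one new thread per item."""
    spawned = []
    while True:
        try:
            item = jobs.get_nowait()
        except queue.Empty:
            break
        t = threading.Thread(target=handler, args=(item,))
        t.start()
        spawned.append(t)
    for t in spawned:
        t.join()
    return len(spawned)

def fixed_pool(jobs, handler, workers):
    """Intended design: a fixed number of workers share the queue."""
    def worker():
        while True:
            try:
                item = jobs.get_nowait()
            except queue.Empty:
                return               # queue drained, worker exits
            handler(item)

    pool = [threading.Thread(target=worker) for _ in range(workers)]
    for t in pool:
        t.start()
    for t in pool:
        t.join()
    return len(pool)

def fill(n):
    """Build a queue pre-loaded with n dummy work items."""
    q = queue.Queue()
    for i in range(n):
        q.put(i)
    return q

done = []                            # list.append is atomic in CPython
spawned_threads = drain_and_spawn(fill(200), done.append)
pooled_threads = fixed_pool(fill(200), done.append, workers=4)
print(spawned_threads, pooled_threads)
```

With 200 queued items the first version creates 200 threads and the second creates 4. Both finish the work on a quiet machine, which is exactly why the first one can look fine for years until the queue gets busy.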
The horrible outcome is actually caused by this line of code, which is courtesy of Microsoft.
https://referencesource.microsoft.com/#system.core/system/threading/ReaderWriterLockSlim/ReaderWriterLockSlim.cs,1662 ... this is used by all "thread safe" (hahahaha) data structures. This is used as a lock implementation to aggressively cache data across all the threads to speed up a poorly written processing implementation.
Looking at the problem it solves, it was solved with a sledgehammer rather than some grey matter, so I'm going to look at cheaper ways to solve the problem than throwing £50k of hardware at it.
At least it's a distracting problem
I have located the cause. One blocking IO-bound thread and about 1000 threads spin-waiting on a resource lock.
Threads, they said. A good idea, they said.
That's why I'm a big fan of lock-free/wait-free architectures. However, they do need the programming to be done by grown-ups who understand how to take advantage of the 'still make progress' possibilities of that architecture rather than cargo culting them and just treating them as a new kind of spin lock.
There are 100 lessons in your sentence there which 100 people I know have never learned and never will.
Edit: correct uncaffeinated use of words.