Hard to believe it takes 16 CPUs running at 60%+ just to feed the GPU with data.
This is often misunderstood: high CPU load doesn't mean the CPU is busy doing anything useful. For example, let's look at the OpenGL call `glFinish` on NVIDIA hardware. This call blocks until all previously submitted commands have completed, which in practice means until the current frame is done. Games normally don't call it, because they don't care exactly when a frame is done; once it is, the card simply flips buffers and puts it on screen. When rendering or encoding, however, the CPU needs to know when the "frame" is done so it can read it back from GPU memory into system RAM.
Btw, I know I'm using OpenGL as the example here and the application likely uses DirectX, but under the hood the driver implements these primitives in much the same way. I've used `glFinish` for simplicity, but other synchronization primitives, such as fence sync objects, behave the same at the driver level.
So let's imagine the render pipeline:
- Application reads a frame off disk into RAM
- Application feeds the frame to the GPU to encode it
- Application waits for the GPU to finish encoding the frame
- Application reads the frame from the GPU RAM to local RAM
- Application writes the encoded frame to disk
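
As a rough illustration, the five steps above can be sketched in C++ like this. Everything in it is made up for the sketch: `FakeGpu`, `read_frame_from_disk` and `run_pipeline` are hypothetical stand-ins, not real driver or application code, and the "GPU" simply reports that it is done after a few polls. The point is only to show where the wait in step 3 sits.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Frame = std::vector<std::uint8_t>;

// Hypothetical stand-in for the GPU: a submitted job only reports
// "finished" on the third poll.
struct FakeGpu {
    Frame job;
    int polls_until_done = 0;

    void submit(Frame raw) {
        job = std::move(raw);
        for (auto& byte : job) byte ^= 0xFF;  // pretend this is the encode
        polls_until_done = 3;
    }
    bool finished() {
        if (polls_until_done > 0) --polls_until_done;
        return polls_until_done == 0;
    }
    Frame read_back() { return std::move(job); }
};

struct PipelineResult {
    std::vector<Frame> written;    // frames "written to disk"
    std::size_t wasted_polls = 0;  // loop iterations spent only waiting
};

// Hypothetical disk read: returns a 4-byte frame filled with the index.
Frame read_frame_from_disk(std::size_t index) {
    return Frame(4, static_cast<std::uint8_t>(index));
}

PipelineResult run_pipeline(std::size_t frame_count) {
    FakeGpu gpu;
    PipelineResult result;
    for (std::size_t i = 0; i < frame_count; ++i) {
        Frame raw = read_frame_from_disk(i);           // 1. disk -> system RAM
        gpu.submit(std::move(raw));                    // 2. hand frame to the GPU
        while (!gpu.finished()) {                      // 3. wait for the GPU
            ++result.wasted_polls;                     //    (nothing useful happens here)
        }
        Frame encoded = gpu.read_back();               // 4. GPU memory -> system RAM
        assert(encoded.size() == 4);
        result.written.push_back(std::move(encoded));  // 5. "write to disk"
    }
    return result;
}
```

Calling `run_pipeline(3)` pushes three frames through the loop; `wasted_polls` counts the iterations of step 3 that did nothing except ask "are you done yet?".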
For the application to wait, it needs a mechanism to synchronize with the card. I know for a fact that on NVIDIA hardware `glFinish` and the sync-object wait call (`glClientWaitSync`) perform what is called a "spinlock" (more precisely, a busy-wait) while blocking, which is essentially this:
while(frameNotReady) {}
Run that code and your core jumps to 100% usage, because it is spinning in a tight loop polling for the completed frame. As far as I can tell, the driver polls rather than sleeping on an event because Windows is not a real-time operating system (RTOS): even the shortest sleep would be too long (the OS scheduler wouldn't wake the process up again soon enough) and performance would suffer. Frame size changes how long the spin lasts. Say you're rendering 1080p: the card completes the frame very quickly, so the CPU doesn't spin for long and shows lower usage. With a 4K frame, which has four times as many pixels, the CPU spins for much longer, doing nothing, yet it reads as high CPU usage.
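
If you want to see the "even the shortest sleep is too long" part for yourself, here is a minimal sketch that asks the OS for a tiny sleep and measures how long it actually took. The helper names `measure_sleep` and `worst_sleep` are mine, and this only measures the standard library's `sleep_for` on whatever OS you run it on; it says nothing about what any driver does internally.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <thread>

// Asks the OS to sleep for `requested` and returns how long the call
// really took, measured with a monotonic clock.
std::chrono::nanoseconds measure_sleep(std::chrono::nanoseconds requested) {
    const auto start = std::chrono::steady_clock::now();
    std::this_thread::sleep_for(requested);
    const auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration_cast<std::chrono::nanoseconds>(elapsed);
}

// Repeats the measurement and returns the longest sleep observed, since
// the worst case is what would stall a render loop.
std::chrono::nanoseconds worst_sleep(std::chrono::nanoseconds requested,
                                     int rounds) {
    assert(rounds > 0);
    std::chrono::nanoseconds worst{0};
    for (int i = 0; i < rounds; ++i) {
        worst = std::max(worst, measure_sleep(requested));
    }
    return worst;
}
```

For example, `worst_sleep(std::chrono::microseconds(1), 20)` requests 1 microsecond twenty times. On a desktop OS the result is usually far larger than the request, and every bit of that overshoot would be added to each frame if the wait loop slept instead of spinning.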
I hope this makes it clear that high CPU usage doesn't mean the CPU is actually doing much at all. The 60% load only looks high; those cores are likely just spinning for something like 99% of that time. The other cores that seemingly aren't doing much are probably the ones writing to disk and so on, and in reality they are doing more work than the CPUs that read high usage.
Btw, I have noticed that AMD hardware behaves differently and seems to have better wait logic, likely using hardware interrupts rather than polling.
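
To make the polling versus interrupt contrast concrete, here is a small sketch that waits for the same pretend "GPU" in two ways and reports how much CPU time each wait cost. This is an analogy only: the "GPU" is just a thread that sleeps, `spin_wait` and `blocking_wait` are names I made up, and a condition variable stands in for an interrupt-style wakeup. It is not how either vendor's driver is actually written. I'm also assuming a POSIX-style `std::clock()` that reports processor time; on Windows the MSVC runtime's `clock()` returns wall-clock time, so the CPU figure would not be meaningful there.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <ctime>
#include <mutex>
#include <thread>

using Clock = std::chrono::steady_clock;

struct WaitCost {
    double wall_ms;  // elapsed real time
    double cpu_ms;   // processor time charged to the whole process
};

double wall_ms_since(Clock::time_point start) {
    return std::chrono::duration<double, std::milli>(Clock::now() - start).count();
}

double cpu_ms_since(std::clock_t start) {
    return 1000.0 * static_cast<double>(std::clock() - start) / CLOCKS_PER_SEC;
}

// Polling: the caller spins on a flag until the "GPU" thread sets it.
WaitCost spin_wait(std::chrono::milliseconds gpu_time) {
    std::atomic<bool> frame_ready{false};
    const auto wall_start = Clock::now();
    const std::clock_t cpu_start = std::clock();
    std::thread gpu([&] {
        std::this_thread::sleep_for(gpu_time);  // pretend to encode
        frame_ready.store(true, std::memory_order_release);
    });
    while (!frame_ready.load(std::memory_order_acquire)) {
        // busy loop: no useful work, but the core reads as fully loaded
    }
    WaitCost cost{wall_ms_since(wall_start), cpu_ms_since(cpu_start)};
    gpu.join();
    return cost;
}

// Event-style: the caller sleeps in the OS until it is notified.
WaitCost blocking_wait(std::chrono::milliseconds gpu_time) {
    std::mutex m;
    std::condition_variable cv;
    bool frame_ready = false;
    const auto wall_start = Clock::now();
    const std::clock_t cpu_start = std::clock();
    std::thread gpu([&] {
        std::this_thread::sleep_for(gpu_time);  // pretend to encode
        {
            std::lock_guard<std::mutex> lock(m);
            frame_ready = true;
        }
        cv.notify_one();
    });
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [&] { return frame_ready; });
        assert(frame_ready);
    }
    WaitCost cost{wall_ms_since(wall_start), cpu_ms_since(cpu_start)};
    gpu.join();
    return cost;
}
```

Both calls take about the same wall-clock time for the same `gpu_time`, since the "frame" is ready at the same moment either way. The difference is in `cpu_ms`: the spinning version is charged for roughly the whole wait, while the blocking version should be charged for almost none of it. That is the same gap Task Manager shows between a core that polls and one that sleeps until it is woken.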