@Mike,
I'm actually curious about this issue. There would be trade-offs, but I think a shorter integration (if they even have the ability to do so) might actually create a more stable image. Just hear me out...
So basically, the one thing I notice in particular, and which you clearly identified, is thermal drift. I'm chalking it up to low-capacity pixels: the sensor heats up, and each pixel drifts a lot because it appears to change temperature quickly. So we see these blocks at various temperature offsets, and I consider that noise, as most would. A shorter integration time should decrease the thermal drift: less time running power through the pixel means less time to produce heat, and thus less swing in pixel temperature. Of course, as you pointed out, it would also mean less time to heat the pixel from incoming radiation. There's a trade-off.

But then you get more frames per average, which lowers the noise floor and further reduces the pixel drift in the signal. You also get better frame-to-frame alignment, so the signal has a better shot at averaging into the dataset. Remember, most of the time the field of view isn't static; even a slight shake of the hand will throw off the average, blurring edges and lines and letting random noise outweigh the signal. As long as there is a measurable signal per frame, above the read noise, it should improve the image quality. It might raise the minimum temperature difference the sensor can detect, though, which is another trade-off.
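Just to put rough numbers on the averaging argument, here's a quick toy simulation. All the values are made up (signal_rate, read_noise, drift_rate are hypothetical), and drift is crudely modeled as a random per-frame error whose RMS grows with the per-frame integration time, so treat it as a sketch of the trade-off, not a real sensor model:

```python
# Toy model: split a fixed observation window into N shorter frames and average.
# Drift error (assumed to grow with per-frame integration time) averages down,
# but read noise from the extra readouts stacks up, so there's a sweet spot.
import numpy as np

rng = np.random.default_rng(0)

signal_rate = 100.0   # hypothetical signal, counts per unit time on one pixel
read_noise = 5.0      # per-readout noise (counts), independent of integration
drift_rate = 50.0     # RMS drift error per unit of integration time (counts)
total_time = 1.0      # fixed total observation window

def averaged_pixel(n_frames, trials=20000):
    """Estimate the signal rate by averaging n_frames short integrations."""
    t = total_time / n_frames
    per_frame_signal = signal_rate * t
    # each short frame: signal + read noise + drift error that scales with t
    frames = (per_frame_signal
              + rng.normal(0.0, read_noise, (trials, n_frames))
              + rng.normal(0.0, drift_rate * t, (trials, n_frames)))
    estimate = frames.mean(axis=1) / t     # convert back to counts per unit time
    return estimate.mean(), estimate.std()

for n in (1, 4, 16, 64, 256):
    mean, std = averaged_pixel(n)
    print(f"{n:>4} frames: estimate = {mean:7.2f} +/- {std:6.2f} "
          f"(per-frame signal = {signal_rate * total_time / n:6.2f}, "
          f"read noise = {read_noise})")
```

With those made-up numbers the scatter bottoms out somewhere around 10 frames: the drift term shrinks roughly as 1/sqrt(N) while the accumulated read noise grows roughly as sqrt(N), which is basically the trade-off I was hand-waving at. Once the per-frame signal drops well below the read noise, shorter frames stop helping.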
But hey, I'm just messing with the idea. I'm sure there's a flaw in this; I'm obviously missing a crucial part, otherwise I'd think they would have already done it. Or perhaps they're already taking the shortest integration they can. Maybe you're right, and the sensor is already at the limits of what can be done with the signal.