Can you explain the cause?
The receiver/mixer of course has no image rejection. Unlike an SA, a network analyzer does not need it, since the stimulus is not wide-band but only a single frequency at a time (as long as it is guaranteed that no foreign signal is present). It is important, though, that the IF filter rejects harmonics of the IF significantly 1), since both stimulus and LO are square waves. Harmonics of the IF are therefore expected at the mixer output, but we are only interested in the IF fundamental (or alternatively in a single particular harmonic, when harmonic mode is used at high frequencies).
It would also be interesting to know how well f+540kHz (9th IF harmonic) is suppressed, since 540kHz folds (aliases) back to 60kHz when sampled at 600kSa/s, and 60kHz is exactly the IF. So it cannot (and must not) be eliminated by the digital filter; the analog lowpass in front of the ADC already has to do that. How well does it do its job? The same applies to the 11th, 19th, 21st, ... IF harmonics as well, but the 9th is the first (and strongest) one folding back to 60kHz.
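To check the folding arithmetic, here is a minimal Python sketch. It assumes only the numbers quoted above (60kHz IF, 600kSa/s ADC rate) and ideal sampling with no filtering; the helper name alias_frequency is made up for illustration and is not taken from any firmware.

```python
def alias_frequency(f_hz, fs_hz):
    """Apparent frequency of a real tone at f_hz after sampling at fs_hz."""
    f = f_hz % fs_hz            # fold into one sampling period [0, fs)
    return min(f, fs_hz - f)    # mirror into the first Nyquist zone [0, fs/2]

FS_HZ = 600_000   # ADC sample rate quoted in the post
IF_HZ = 60_000    # IF quoted in the post

# IF harmonics (2nd to 21st) that land exactly on the IF after sampling
folding = [n for n in range(2, 22) if alias_frequency(n * IF_HZ, FS_HZ) == IF_HZ]
print(folding)    # [9, 11, 19, 21]
```

The pattern is n = 10k ± 1, because the sample rate is exactly ten times the IF.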
1) Either it notches them out, or it needs high stopband rejection in general.
EDIT: The bizarre shape in the first image is the (expected) result of averaging. Since the window function of the coherent detector spans only a fraction of the measured samples, it effectively gets repeated over the total measurement window, which leads to the depicted frequency response. In the second image, the window function obviously spans all measured samples.