I think the key here is that mixers (or multipliers, depending on how you want to define them) are not multiplying voltages - they are multiplying signals, which in your case happen to be represented by voltages. They could just as well be represented by currents (as is common when discussing Gilbert cell mixers) or power levels. Given that in most cases your ports will have a well-controlled impedance, you can switch between these units with some basic math. You can't really carry the units of both signals through the multiplication, so the signals are treated as unit-less and the units are absorbed into the K factor.
The multiplication is just a simple way of representing it. Think of a very basic diode being used as a mixer:
On the input we apply the sum of our signals:
$$ V_{in}(t) = V_{LO}(t) + V_{RF}(t) = v_{LO} \cdot \sin(\omega_{LO} t) + v_{RF} \cdot \sin(\omega_{RF} t) $$
If we use the Shockley diode equation, the output current is given by:
$$ I_{out}(t) = I_{S} \Bigg(\text{exp}\bigg(\frac{V_{in}(t)}{n V_{T}}\bigg)-1\Bigg)$$
This is then usually approximated with the Taylor series:
$$e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots $$
$$\text{exp}\bigg(\frac{V_{in}(t)}{n V_{T}}\bigg) = 1 + \frac{V_{in}(t)}{n V_{T}} + \frac{\bigg(\frac{V_{in}(t)}{n V_{T}}\bigg)^2}{2!} + \frac{\bigg(\frac{V_{in}(t)}{n V_{T}}\bigg)^3}{3!} + \cdots $$
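To make the product explicit, here is the cross term of the second-order part written out for the two-tone input above. This is only a sketch: the name \$I_{cross}\$ is mine, the \$\sin^2\$ terms (which only give DC and second harmonics) are left out, and the only tool used is the identity \$\sin a \sin b = \tfrac{1}{2}[\cos(a-b) - \cos(a+b)]\$.

$$ I_{cross}(t) = \frac{I_{S}}{2 (n V_{T})^2} \cdot 2 \, v_{LO} v_{RF} \sin(\omega_{LO} t) \sin(\omega_{RF} t) $$
$$ I_{cross}(t) = \frac{I_{S} \, v_{LO} v_{RF}}{2 (n V_{T})^2} \Big[ \cos\big((\omega_{LO} - \omega_{RF}) t\big) - \cos\big((\omega_{LO} + \omega_{RF}) t\big) \Big] $$

Note that the two voltages in the numerator are cancelled by the two thermal voltages in the denominator, so the result is still a current.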
And that is how we get that typical multiplying behavior with all those harmonics: the squared term contains the product of the two inputs, and the higher-order terms give the harmonics and the remaining mixing products. Note that, unlike what some people have suggested, there are actually 'true' multiplying mixers that do not generate all these harmonics, such as potentiometric mixers. However, we then drive them so hard for noise reasons that all those harmonics still end up on the output, not because of the mixer but because our LO is a square wave that has a bunch of harmonics itself.
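Going back to the single diode for a moment, the claim is easy to sanity-check numerically. The sketch below (Python with NumPy) pushes two small tones through the Shockley equation and compares the FFT amplitude at the sum and difference frequencies with what the second-order term of the series predicts. All component values and names are made up for illustration; the tones are kept small (5 mV) so that the low-order terms dominate.

```python
import numpy as np

# All numeric values are illustrative assumptions, not taken from a datasheet.
I_S = 1e-12          # saturation current, A
N_IDEAL = 1.0        # ideality factor n, unit-less
V_T = 0.02585        # thermal voltage near room temperature, V
V_LO_AMP = 5e-3      # LO amplitude, V
V_RF_AMP = 5e-3      # RF amplitude, V
F_LO, F_RF = 1000, 1300   # tone frequencies, Hz
FS = 10_000          # sample rate, Hz; 1 s of samples gives 1 Hz FFT bins


def diode_mixer_spectrum():
    """Amplitude spectrum (A) of the diode current, one bin per Hz."""
    t = np.arange(FS) / FS
    v_in = (V_LO_AMP * np.sin(2 * np.pi * F_LO * t)
            + V_RF_AMP * np.sin(2 * np.pi * F_RF * t))
    x = v_in / (N_IDEAL * V_T)      # V / V, so x is unit-less
    i_out = I_S * np.expm1(x)       # I_S * (exp(x) - 1), in A
    # Scale so a sine of amplitude a shows up as a (valid for non-DC bins).
    return 2 * np.abs(np.fft.rfft(i_out)) / len(t)


def predicted_product_amplitude():
    """Amplitude (A) of the sum/difference tones from the x^2/2! term only."""
    return I_S * V_LO_AMP * V_RF_AMP / (2 * (N_IDEAL * V_T) ** 2)


spectrum = diode_mixer_spectrum()
predicted = predicted_product_amplitude()
print(f"predicted        : {predicted:.3e} A")
print(f"at f_RF - f_LO   : {spectrum[F_RF - F_LO]:.3e} A")
print(f"at f_RF + f_LO   : {spectrum[F_RF + F_LO]:.3e} A")
```

The simulated values should land within about a percent of the prediction; the small difference comes from the higher-order terms of the series, which grow as the drive level increases.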
But let me get back to the question at hand - where do the units go in the first place?
The Shockley diode equation actually gave us a clue. Look at all the units there, and you will figure out what is going on. Inside the exponential we have the input signal, in volts, divided by the thermal voltage, which is in, you guessed it, volts. In other words, the entire exponential is a unit-less value. The unit of ampere actually comes from the saturation current \$I_{S}\$, which depends on the design of our diode. So when we use the Taylor series (well, Maclaurin series) to give us the multiplying representation, we need to take into account that the value of x in the exponential has already lost its unit, and thus squaring it doesn't give any strange squared voltages. I think it speaks for itself that you can just use your impedance to get from output current to output voltage. All of this scaling and conversion is lumped together in the \$K_{mixer}\$ term, as is the scaling by the \$V_{T}\$ term.
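As a rough illustration of that lumping for the single diode, assume the mixer is written as \$V_{out}(t) = K_{mixer} V_{LO}(t) V_{RF}(t)\$ and that the output current is read out over a hypothetical impedance \$Z_{out}\$ (both are assumptions made for this sketch). Keeping only the cross term of the second-order part of the series gives:

$$ V_{out}(t) \approx Z_{out} \cdot \frac{I_{S}}{2 (n V_{T})^2} \cdot 2 \, V_{LO}(t) V_{RF}(t) \quad \Rightarrow \quad K_{mixer} \approx \frac{I_{S} Z_{out}}{(n V_{T})^2} $$
$$ [K_{mixer}] = \frac{\text{A} \cdot \Omega}{\text{V}^2} = \frac{1}{\text{V}} $$

So under these assumptions the volts have not disappeared; they sit in the denominator of the constant.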
Of course, this is just for a very simple diode mixer (not even a ring mixer, but literally just a diode). However, you will find similar conversions in other mixer types. For example, in the potentiometric mixer, the LO input voltage is converted into a conductance with the saturated MOS equation, and this conductance is then multiplied with the second input voltage to give us the output current, which can be turned into an output voltage using the output impedance of the system. Again, this is all captured in the \$K_{mixer}\$ term, and we just pretend our input signals are unit-less. You could, I guess, just say that \$K_{mixer}\$ contains a 1/V term, but the reason we don't is that we are already making a lot of hand-wavy, unit-ignoring assumptions: we pretend that the DC term is not there, that we have only one output frequency (which we don't), and we generally ignore any phase component, etc. So while we are at it, why don't we just ditch those units too?
EDIT: I'm fixing some of the math appearance