It works *if and only if* the signal is sparse in some transform domain. Hopefully what follows is a better description of how to evaluate the products of a mixer.
The products are known to be sparse. Solve Ax=y, where y is the signal sampled in the time domain at random times. A is a Fourier basis evaluated at those times: each row of A contains all the frequencies of the Fourier series evaluated at t(j). The elements of the vector y are the amplitude samples at the random times t(j), and the elements of x are the unknown coefficients of the discrete Fourier transform of the signal.
The requirement for a successful solution is that the columns of A have low coherence: the cross-correlation between columns must be small. In that case an L1 solution *is* the optimal (i.e. L0) solution; the first paper by Donoho that I cited proves this. The L0 case is NP-hard, but the L1 solution can be obtained by standard linear programming methods. Most elements of x will be zero, and the non-zero terms will be the amplitudes of the respective frequency components. Typically one would use the frequencies of the FFT for the frequencies of x, so that one recovers the FFT of the input signal. However, one could just as easily use a wavelet transform basis.
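As a toy sketch of the setup above (my own illustrative example, not from any of the cited papers): build A from a real cosine basis at random sample times, then solve the basis pursuit problem min ||x||_1 subject to Ax=y as a standard LP using scipy's linprog.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)

n = 64    # number of candidate frequency bins (unknowns)
m = 28    # number of random time samples, m << n

# Random sample times in [0, 1) and a real cosine Fourier basis:
# A[j, k] = cos(2*pi*k*t_j), one column per candidate frequency k.
t = rng.uniform(0.0, 1.0, size=m)
A = np.cos(2.0 * np.pi * np.outer(t, np.arange(n)))

# Sparse spectrum: only two frequency bins are non-zero.
x_true = np.zeros(n)
x_true[5], x_true[12] = 1.0, 0.6
y = A @ x_true

# Basis pursuit:  min ||x||_1  s.t.  A x = y.
# Split x = u - v with u, v >= 0 so it becomes a standard LP.
c = np.ones(2 * n)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y, bounds=(0, None))
x_hat = res.x[:n] - res.x[n:]
```

With enough random samples the recovered x_hat typically concentrates in the two true bins; with too few, L1 starts splitting energy across correlated columns.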
Using L1, the presence of broadband noise doesn't matter so long as it is low in amplitude relative to the signal. That is *not* the case with the traditional Gm=d L2 (least-squared-error) solution. Shannon still applies, but the relevant criterion is the information content: how sparse the signal is in some transform domain.
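To see the contrast with L2, the minimum-norm least-squares solution of the same underdetermined system (what np.linalg.lstsq returns for a wide A) fits the samples exactly but smears energy across all the columns instead of concentrating it in the sparse bins. A toy setup of my own, same shape as the L1 example:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 64, 28
t = rng.uniform(0.0, 1.0, size=m)
A = np.cos(2.0 * np.pi * np.outer(t, np.arange(n)))

x_true = np.zeros(n)
x_true[5], x_true[12] = 1.0, 0.6
y = A @ x_true

# Minimum-norm L2 solution: reproduces y exactly (the system is
# consistent) but is dense — it does not recover the sparse spectrum.
x_ls, *_ = np.linalg.lstsq(A, y, rcond=None)
```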
Shannon solved the continuum case using calculus before Wiener presented the discrete case. However, if you consider Shannon's result in the form of a sum of band-limited terms, one should arrive at the same result. That's speculation on my part, but it seems unlikely that both Shannon and Donoho could be correct unless that is the case.
The actual sampling requirement is much more complex than Shannon's result, so it's a bit of an abuse to reference Shannon, but I think it appropriate because it is an information-theory concept. The proper mathematical treatment of the sampling required is *very* painful to read. The correct treatment involves the properties of the A matrix, and determining those properties is at least as hard as solving the problem. A good rule of thumb is that you need 10-20% of the samples required if one sampled regularly at the Nyquist rate.
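The column cross-correlations mentioned above can at least be spot-checked cheaply via the mutual coherence of A — the largest normalized inner product between distinct columns — even though coherence alone is a crude bound compared to the full treatment. A small snippet of my own:

```python
import numpy as np

rng = np.random.default_rng(0)

n, m = 64, 28
t = rng.uniform(0.0, 1.0, size=m)
A = np.cos(2.0 * np.pi * np.outer(t, np.arange(n)))

# Mutual coherence: max over i != j of |<a_i, a_j>| / (||a_i|| ||a_j||).
An = A / np.linalg.norm(A, axis=0)   # normalize each column
G = np.abs(An.T @ An)                # Gram matrix of correlations
np.fill_diagonal(G, 0.0)             # ignore self-correlations
mu = G.max()
```

Smaller mu means the random-time cosine columns are closer to orthogonal and sparse recovery is better behaved.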
There's a lot more to this than just sampling. L1 basis pursuit is the solution to matrix completion (aka the Netflix problem), inverse problems (what I was doing when I stumbled into this), blind source separation (listen in on table 12 with a small number of microphones scattered around the room), identifying genome alleles responsible for some trait, and a slew of other things. Not surprisingly, a number of people were applying the technique before Donoho and Candes put it on a firm mathematical footing, much as Heaviside was using operational calculus before the mathematicians explained why it worked.