You guys, it seems, have a lot of references sitting around...
Have you tried to do a Battle Royale between them?
Like take 3 of those, power them from separate batteries, tie their 0 V (common) terminals together so they form a Benz-like star, and let them warm up, of course.
Now the voltages between the tips should be less than 100 mV (1 V, if you're unlucky). In that range even my humble 34401As are precise enough (like 60 nV stddev).
Take two DMMs (3 for sanity checks) and log the voltages between neighboring tips (with simultaneous triggering) for several hours.
Now you have 3 rows of data (if only 2 DMMs are used, the third is computed from the first two: their sum or difference, depending on lead polarity).
Now, estimate the variances. Each of those 3 variances is the sum of the variances of the 2 references involved, assuming the references' noise is uncorrelated and the DMMs' own noise is negligible.
Now you have 3 unknowns and 3 equations so you can solve for individual variances!
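
Here is a rough sketch of that solve step, in case it helps. It assumes the three logs are the differences A-B, B-C and C-A, that the references' noise is uncorrelated, and that DMM noise can be ignored. The function name three_cornered_hat and the simulated sigma values are made up for illustration; they are not measurements.

```python
import random
from statistics import variance


def three_cornered_hat(v_ab, v_bc, v_ca):
    """Solve for per-reference variances from three pairwise difference logs.

    Assumes uncorrelated reference noise and negligible DMM noise, so that
    var(A-B) = var(A) + var(B), and likewise for the other two pairs.
    """
    s_ab, s_bc, s_ca = variance(v_ab), variance(v_bc), variance(v_ca)
    # Each reference shows up in two of the three pairs; add those two
    # pair variances, subtract the pair it is not part of, and halve.
    var_a = (s_ab + s_ca - s_bc) / 2
    var_b = (s_ab + s_bc - s_ca) / 2
    var_c = (s_bc + s_ca - s_ab) / 2
    return var_a, var_b, var_c


# Simulated check with made-up noise levels (units: uV), not real data.
rng = random.Random(0)
n = 20000
sigmas = (0.2, 0.3, 0.5)
a = [rng.gauss(0, sigmas[0]) for _ in range(n)]
b = [rng.gauss(0, sigmas[1]) for _ in range(n)]
c = [rng.gauss(0, sigmas[2]) for _ in range(n)]

v_ab = [x - y for x, y in zip(a, b)]
v_bc = [y - z for y, z in zip(b, c)]
v_ca = [z - x for z, x in zip(c, a)]

est = three_cornered_hat(v_ab, v_bc, v_ca)
for name, s, v in zip("ABC", sigmas, est):
    print(f"ref {name}: true var {s**2:.4f}, estimated {v:.4f} (uV^2)")
```

One thing to watch for: with real data an estimate can come out negative. That usually means the assumptions don't hold (correlated noise, DMM noise not negligible, or too short a log), not that a reference is magically quiet.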
Maybe rotate the setup and redo it to see if the result is repeatable.
Thus you can find the quietest reference of all.