To calculate what uncertainty relative to the national standards you could achieve with the meters that were calibrated, proceed as described in GUM sections 1-6: identify all the terms contributing to the uncertainty. Start with the value at the time of calibration:
- If you have the data, then the value at the time of calibration was the measured value plus the offset of the calibrator's stimulus from nominal. It might be that the calibrator was not sourcing 1.0000 V but 1.0001 V; see the first paragraph I wrote here for how to correct for that. The uncertainty at that time was the stated expanded uncertainty, which will most likely be a 95% confidence interval, so divide by 1.96 to get the standard uncertainty.
- If you don't have the data, you should assume the meter was at its nominal values at that time, but with the uncertainty of the tolerance it was adjusted to (1 year? 24 h? it might say on the certificate). Ideally with some guard banding: if the lab says they adjust whenever the value is beyond 80% of the tolerance band, then you can assume the value was within 80% of the tolerance. Calculate the absolute value in volts for a power-of-ten value near the top of the range (e.g. a 1 V reading) by multiplying the value by the "of value" specification and adding the "of range" specification multiplied by the full range (check the datasheet: the full range could be 10 V, 12 V or 20 V, for example). Treat this as a rectangular distribution as described in section 4.3.7 of the GUM, so the standard uncertainty is $\sqrt{a^2/3} = a/\sqrt{3}$, where $a$ is the half-width of the band. A sketch of both cases follows after this list.
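Here is a minimal Python sketch of the two bullets above; all the numbers (the certificate uncertainty, the 0.004%/0.0005% specs, the 80% guard band) are made-up placeholders, not real 8502A or 8846A figures:

```python
import math

# Case 1: you have the certificate data.
U95 = 20e-6           # V, expanded uncertainty from the certificate (placeholder)
u_cal = U95 / 1.96    # standard uncertainty, assuming a 95% interval

# Case 2: no data, so fall back on the tolerance the meter was adjusted to.
# Placeholder spec: 0.004% of value + 0.0005% of range, 1 V reading on a 10 V range.
reading = 1.0         # V
full_range = 10.0     # V (check the datasheet: could be 10 V, 12 V, 20 V...)
tol = reading * 0.004e-2 + full_range * 0.0005e-2   # tolerance band half-width, V
a = 0.8 * tol                # guard banding: assume adjusted to 80% of the band
u_adj = a / math.sqrt(3)     # rectangular distribution, GUM 4.3.7

print(f"u_cal = {u_cal * 1e6:.2f} uV, u_adj = {u_adj * 1e6:.2f} uV")
```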
This calibration was some time ago, so apply the stability or accuracy specification (90 days, 180 days or 1 year, depending on how long it has been and on what kinds of specifications the instrument has). Convert to absolute volts (not percentages or ppm) as described above, and treat this as a rectangular distribution as described above.
The calibration was done at a certain temperature, which should be on the certificate, and the 90-day etc. stability/accuracy specifications will say something like "within 5 degrees of the calibration temperature". If the ambient temperature at the time of measurement is outside this range, apply the temperature coefficient from the specs multiplied by the number of degrees outside the range. This should again be assumed to be a rectangular distribution; a short sketch follows.
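For example (placeholder numbers again; check your instrument's datasheet for the real tempco and band):

```python
import math

# Placeholders: calibrated at 23 degC with a +/-5 degC spec band,
# tempco 5 ppm/degC of value, measured at 30 degC on a 1 V point.
t_meas, t_cal, band = 30.0, 23.0, 5.0   # degC
tempco = 5e-6                           # fraction of value per degC
value = 1.0                             # V

deg_outside = max(0.0, abs(t_meas - t_cal) - band)   # 2 degC here
a_temp = deg_outside * tempco * value                # half-width in volts
u_temp = a_temp / math.sqrt(3)                       # rectangular distribution
print(f"u_temp = {u_temp * 1e6:.2f} uV")
```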
Then calculate the geometric sum (root of the sum of squares) of all these components, and multiply by the coverage factor (k = 2 is recommended by the GUM, giving roughly 95% coverage). This is an estimate of the kind of uncertainty you might achieve, though the actual uncertainty will be slightly higher because you also need to add the standard error of the mean of your measurement as a component, although that one can be reduced by collecting more samples. A sketch of the combination follows.
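Putting the pieces together, with placeholder standard uncertainties from the steps above:

```python
import math
import statistics

# Standard uncertainties in volts from the steps above (placeholder values):
components = [
    10.2e-6,   # calibration: certificate U95 / 1.96, or tolerance a / sqrt(3)
    23.1e-6,   # stability/accuracy spec since the calibration
    5.8e-6,    # temperature coefficient outside the spec band
]

# Standard error of the mean of your own readings; shrinks with more samples.
readings = [1.000412, 1.000408, 1.000415, 1.000410]   # V, placeholder data
sem = statistics.stdev(readings) / math.sqrt(len(readings))

u_combined = math.sqrt(sum(u ** 2 for u in components) + sem ** 2)
U_expanded = 2 * u_combined   # coverage factor k = 2, roughly 95%
print(f"U (k=2) = {U_expanded * 1e6:.1f} uV")
```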
Do this for your 8502A and 8846A for every range and frequency. You could do this in Excel or in a Jupyter notebook with Python and Pandas. What I did for MM2022 was to make a table with all the calibration certificate data and one with the datasheet specs, and then have a function that looked up the correct value in each and did the calculation; a rough sketch of that idea follows. You could do the same in Excel with VLOOKUP etc. Regarding the 8846A with replaced guts: it should still be within the manufacturer's specifications, but it might drift more than usual a couple of years from now.
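Not the actual MM2022 code, just a minimal sketch of that table-plus-lookup structure, with invented column names and numbers; real tables would come from your certificates and datasheets:

```python
import pandas as pd

# One table with calibration certificate data (invented values):
certs = pd.DataFrame({
    "meter": ["8502A", "8846A"],
    "range_V": [10.0, 10.0],
    "freq_Hz": [1e3, 1e3],
    "U95_V": [15e-6, 30e-6],      # expanded uncertainty from the certificate
})

# One table with datasheet specs (invented values, as fractions):
specs = pd.DataFrame({
    "meter": ["8502A", "8846A"],
    "range_V": [10.0, 10.0],
    "freq_Hz": [1e3, 1e3],
    "of_value": [30e-6, 60e-6],   # 1-year "of value" spec
    "of_range": [5e-6, 10e-6],    # "of range" spec
})

def lookup(df, meter, rng, freq):
    """Return the single row matching meter, range and frequency."""
    row = df[(df.meter == meter) & (df.range_V == rng) & (df.freq_Hz == freq)]
    return row.iloc[0]

cert = lookup(certs, "8846A", 10.0, 1e3)
spec = lookup(specs, "8846A", 10.0, 1e3)
print(cert.U95_V, spec.of_value, spec.of_range)
```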
For every range/frequency I'd pick the meter with the lowest calculated uncertainty. This might well be the same meter for all. As a sanity check, the intervals meter1_result +/- meter1_uncertainty and meter2_result +/- meter2_uncertainty should overlap for every value; a trivial way to script that check follows below. The 34401A can vote, but especially for the higher-frequency AC ranges I wouldn't attribute too much importance to it, since I believe those are the most likely to drift out of tolerance.

Compare this expanded uncertainty to the accuracy and stability of the 5200A and to the offset that you measure now. The stability of the 5200A sets a lower limit: there's no point in adjusting the 5200A down to 1 uV/V if the 24 h stability is 100 uV/V. The ~400-600 uV/V offset is the upper limit: if you measure the 5200A to be high by 400 uV/V, but the uncertainty of your measurement is 800 uV/V, then it might as well be 400 uV/V low and you wouldn't be able to tell from the measurement. Ideally you aim for the accuracy specifications in the datasheet, but otherwise you'll need to note that the uncertainty is higher than the datasheet value for those specific ranges.
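The overlap check is a one-liner (the values here are placeholders):

```python
def intervals_overlap(x1, u1, x2, u2):
    """True if [x1 - u1, x1 + u1] and [x2 - u2, x2 + u2] overlap."""
    return abs(x1 - x2) <= u1 + u2

# Same point measured on two meters: (result, expanded uncertainty) in volts.
print(intervals_overlap(1.000412, 40e-6, 1.000395, 35e-6))   # expect True
```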
You can do the same math I described above for the 3458A. The 3458A isn't really good enough above 100 kHz or 300 kHz to calibrate the 5200A, but it might be the best you have available. I would first focus on using it to measure the most stable standards you have, probably the 5440A. Depending on how long ago it was calibrated, the uncertainty might be about 4 uV/V for 10 VDC. If you plan any adjustments (like the 5200A) with the 3458A, I would rehearse the adjustments with one of your other meters before you get the 3458A, and investigate the best settings for the 3458A, since just punching the 'ACV' button may not give you the best results.
I see 0V measurements in your spreadsheet. I don't think measuring 0 VAC is of any use since multimeters are usually not specified below 1% or 3% of full scale. I don't think the 5200A performance verification procedure calls for a 0V AC measurement either.