Author Topic: Practical math for measurements pre-processing (Read 3024 times)

iMo · « **on:** August 25, 2019, 12:02:40 pm »

I've been streaming the data off my 34401a's rs232 (yes, I know it is not a metrology grade instrument

) into an MCU (stm32) where I've been playing with pre-processing the data (and sending the data via BT in the .csv format out @115k2).

Now, my intention is to put inside the MCU some "useful math", such I can see some useful results without messing with excel. For example, when I hang an LCD display on the MCU I can see some useful data directly there.

While reading the older posts I saw 3roomlab's measurements where he applied stdev, skeweness, kurtosis, trimmed average. I've put some of the calcs inside, while I do moving averages on the raw data.

For example below a shot from TeraTerm with results I get at 10NPLC, from left hand side:
the measurement timestamp since start (ms), Tdmm(C), Tamb(C), raw voltage + stddev, moving average + stddev, 3xMAV multipass + stddev, skew, kurt, lin interp "a and b", and finally the processing time in ms (subtract 60ms for SD ADCs).
The skew, kurt and lin_interp is done upon 4xMAVed data here (is that the right way to do it??).

The MAVs of the voltages are made of 10 samples, the standard deviations are rolling 200 samples, temperature (16b ADCs) are MAV 10 samples, skew and kurt rolling 200 samples, linear interpolation is rolling 500 samples. All calcs are double fp. The number of rolling samples could be set differently, is the matter of free memory. The MCU runs at 48MHz (incl. BT powered off the outguard's 5V regulator).

Any hint what could be actually taken as an useful math in such an case?

@3roomlab: what formula you have actually taken for the kurtosis?

Kleinstein · « **Reply #1 on:** August 25, 2019, 02:41:34 pm »

If one has memory for all the data used, one can calculate the kurtosis just like the definition is. Otherwise there should be a way to do the calculation just as the data coming in and update a few (some 5) sums ina way that is common for the std. dev.

Normally the way to go is more like to log the raw data (e.g. 10 PLC readings) and calculate what is really needed. For real time display one may choose a few interesting points, like std. dev, drift rate, moving average.
Later processing can center calculations to that data forward and back are used.

RandallMcRee · « **Reply #2 on:** August 25, 2019, 02:49:25 pm »

Here is a streaming algorithm to calculate data moments. This is java but its trivial to put this into c++, # etc. The amount of memory is strictly bounded and small. (Knowing the authors I would bet that no better algorithm is possible!)

I think you might also consider throwing out outliers before updating the data moments. It would be interesting to see well-reasoned protocols to do that. In this vein google produces this EDN article:
https://www.edn.com/design/integrated-circuit-design/4442272/Moving-averager-rejects-noisy-outlier-values
but not sure how much credence to give to it. The code below is what I use for my data logging.

Code: [Select]

package arduino;
/*
* Copyright (c) 2003, the JUNG Project and the Regents of the University
* of California
* All rights reserved.
*
* This software is open-source under the BSD license.
* See http://jung.sourceforge.net/license.txt for a description.
*/

public class DataMoments {
    /**
     * A data structure representing the central moments of a distribution including: <ul>
     * <li> the mean </li>
     * <li> the variance </li>
     * <li> the skewness</li>
     * <li> the kurtosis </li></ul> <br>
     * Data values are each passed into this data structure via the accumulate(...) method
     * and the corresponding central moments are updated on each call
     *
     * @author Didier H. Besset (modified by Scott White and, subsequently, by Leland Wilkinson)
     */
    private double[] moments;

    public DataMoments() {
        moments = new double[5];
    }

    public void accumulate(double x) {
        if (Double.isNaN(x) || Double.isInfinite(x))
            return;
        double n = moments[0];
        double n1 = n + 1;
        double n2 = n * n;
        double delta = (moments[1] - x) / n1;
        double d2 = delta * delta;
        double d3 = delta * d2;
        double r1 = n / n1;
        moments[4] += 4 * delta * moments[3] + 6 * d2 * moments[2] + (1 + n * n2) * d2 * d2;
        moments[4] *= r1;
        moments[3] += 3 * delta * moments[2] + (1 - n2) * d3;
        moments[3] *= r1;
        moments[2] += (1 + n) * d2;
        moments[2] *= r1;
        moments[1] -= delta;
        moments[0] = n1;
    }

    public double mean() {
        return moments[1];
    }

    public double count() {
        return moments[0];
    }

    public double kurtosis() {
        if (moments[0] < 4)
            return Double.NaN;
        double kFact = (moments[0] - 2) * (moments[0] - 3);
        double n1 = moments[0] - 1;
        double v = variance();
        return (moments[4] * moments[0] * moments[0] * (moments[0] + 1) / (v * v * n1) - n1 * n1 * 3) / kFact;
    }

    public double skewness() {
        if (moments[0] < 3)
            return Double.NaN;
        double v = variance();
        return moments[3] * moments[0] * moments[0] / (Math.sqrt(v) * v * (moments[0] - 1) * (moments[0] - 2));
    }

    public double standardDeviation() {
        return Math.sqrt(variance());
    }

    public double variance() {
        if (moments[0] < 2)
            return Double.NaN;
        return moments[2] * moments[0] / (moments[0] - 1);
    }
}

iMo · « **Reply #3 on:** August 25, 2019, 04:48:02 pm »

Interesting streaming code, indeed.
We may also do "median" which throws outliers out too. And then the MAV upon it..

Kurtosis - there is not a standard formula for it afaik, I can see several variants. I've been using the wiki skewness and kurtosis formulas.

The question here is "what is really needed". For example de-trending the data based on the meter's temperature..

3roomlab · « **Reply #4 on:** August 26, 2019, 02:27:08 pm »

Quote from: imo on August 25, 2019, 12:02:40 pm

@3roomlab: what formula you have actually taken for the kurtosis?

I use the function inside opencalc. so im not sure about formula.
https://wiki.openoffice.org/wiki/Documentation/How_Tos/Calc:_KURT_function

I originally (in 2015) wanted to find some statistics functions (in opencalc spreadsheet) that could tell me which set of data is "flatter", I wanted to log very long data sets with very high relative NPLC as experiments. i left the logger to auto collect many hours of xls sets.
I added a kurt and skew summary to the end of all data sets (not as a function of 1 data point, but as a indicator of say 40000 points). in the end, most datasets are good captures and i no longer look specifically at kurt and skew to determine anything. TRIMMEAN is compared vs AVERAGE (w/o looking at plots, again as a summary of 1000 5000 30000 points of data), most of the very flat plots have very little difference. but again after some time, I no longer use TRIMMEAN to select data. but it is useful in a list of say 100 data sets, you could sort using them

example from a list of TRIMMEAN/AVERAGE i chose 1 which is larger than normal, and open that file and plot to see that it has many spikey things vs 1 which is relatively flat

edit: i think rolling data is used to help indicate a up/down trend (moving average). but to sample noise, or an absolute point, every sample must not use any previous data (repeating average). but i might be wrong about this

edit : but after all this math, I think the most fascinating is alan variance, if your dmm drift it shows as a problem, if ambient varies it shows as a problem, your dmm is noisy? it shows as a problem. the best summary tool i think.

edit : try write some auto logging that runs 1 or 2 days with thousands samples per file. esp those dusk dawn changeovers, might find some interesting noises when the neighbors turn on/off power

iMo · « **Reply #5 on:** August 27, 2019, 11:51:22 am »

Quote from: 3roomlab on August 26, 2019, 02:27:08 pm

..
edit: i think rolling data is used to help indicate a up/down trend (moving average). but to sample noise, or an absolute point, every sample must not use any previous data (repeating average). but i might be wrong about this

edit : but after all this math, I think the most fascinating is alan variance, if your dmm drift it shows as a problem, if ambient varies it shows as a problem, your dmm is noisy? it shows as a problem. the best summary tool i think.
..

The moving average (rolling average) - when calculated off a ring buffer - produces a new MAV value every sample, but it does not use the new value for calculation of the next one. It does simply an average of "the last N samples".

Not sure the "streaming algorithm" does the same for the mean, as it uses the new value for the next calculation.

Anyhow, ie. with the MAV10 (10 10NPLC samples) it is basically the same as the 34401a does with 100NPLC, with MAV10 you see data at 10NPLC rate with 10x10NPLC propagation delay. The multipass NxMAV10 filters improve the filter characteristics a little bit too (with Nx longer prop delay).

Allan Dev - you may pass the live dmm's voltage data into the TimeLab app as the "frequency" and you get it "on-line". Would be great if TimeLab would work with voltages, and provide graphs for "absolute" values in "freq difference" (now only ratios).

Andreas · « **Reply #6 on:** August 27, 2019, 01:59:40 pm »

Hello,

Interesting: I always wanted to have a automatic popcorn detection from my scope pictures.
the kurtosis seems to be a way.
Did you calculate it on the raw data and on how many samples?

with best regards

Andreas

iMo · « **Reply #7 on:** August 27, 2019, 02:16:41 pm »

The kurtosis in the above example is calculated upon the last 200 results coming from 4x MAV10 filter..
"Nx MAV10" filter means the actual result from a MAV10 filter is fed into the next MAV10 filter and so on.. (the MAV10 means MAV is done upon 10 last samples/inputs).
The first MAV10 is fed by the raw 10NPLC coming from the meter.

I do not use the streaming formulas for MAVs (and for skew, stdevs, and lin_interp) but ring buffers (for each separately) and brute force calcs.

For example the results in my first post come from calcs made upon 12 ring buffers.

3roomlab · « **Reply #8 on:** August 27, 2019, 03:13:17 pm »

I think I have low math-fu, when I look at the results in opencalc, i could understand it. but i dont understand it when i look at yours, esp when you explain it with MAV.

RandallMcRee · « **Reply #9 on:** August 27, 2019, 03:15:52 pm »

Imo: Can you publish your code and the dataset you used to get the popcorn detection?

I think it can easily be duplicated using just two Streaming objects, long term, short term.

That would be neat. If I can get it working I'll put it up on github.

Thanks,
Randall

iMo · « **Reply #10 on:** August 27, 2019, 04:15:05 pm »

@3roomlab: you do a measurement, say 10000 samples at 10NPLC, and then at the end you do all the statistics upon the entire set once.

What I do is different - my data at 10NPLC is coming in, and with each new voltage sample (and with each new Tamb and Tdmm sample) I do all the calculation, and the results are printed into a .csv file as a single line (see the line in the TeraTerm above).

Thus at 10NPLC (404.4ms period here) I do all calcs upon "the last xxx input data" I accumulated in my various ring buffers (for each output parameter there is an input ring buffer). My buffers have got various lengths - see my first post. Therefore all is "rolling" (except the raw input).

I do MAVs to low-pass filter the data out a bit (and MAV is fast). MAV10s are applied to Tdmm, Tamb, Raw voltage (MAV10 and 4xMAV10).
The standard devs, skew, kurt are with rolling 200 data, the LinearInterp with 500. All double precision fp.

@Randall - I will put it on my github soon..
Attached the data set in the .csv file - the input and calculated outputs..

3roomlab · « **Reply #11 on:** August 28, 2019, 02:27:33 am »

ah math ...

looks like i made some mistakes

at the time i used kurtosis, i simply read the description off the opencalc wiki.
ok lets take that and go math some "peaks" in the sheet. my understanding is only as good as that then
due to my confusion with "mr imo" posts and my lack of mathiness, i had a little more reading

it appears kurtosis is about a "1 parameter" description of a gaussian-ness. and not a full description of peakiness. at this point, for sanity purpose, does this sound like the kurtosis you all knew?

https://brownmath.com/stat/shape.htm

i start to understand that a high kurt value is not quite the "peak" detector i assume it to be
but again this is based on what is being read of the few sites
so what i have recorded is which set of data is more/less "gaussian", and a higher kurt number mean it is a data set with "most points nearer to zero"

for sanity checking again, does the above 2nd description sound right?

iMo · « **Reply #12 on:** August 28, 2019, 06:34:45 am »

From wikipedia:

Quote

Depending on the particular measure of kurtosis that is used, there are various interpretations of kurtosis, and of how particular measures should be interpreted.

The standard measure of kurtosis, originating with Karl Pearson, is based on a scaled version of the fourth moment of the data or population. This number is related to the tails of the distribution, not its peak;[1] hence, the sometimes-seen characterization as "peakedness" is mistaken. For this measure, higher kurtosis is the result of infrequent extreme deviations (or outliers), as opposed to frequent modestly sized deviations.

https://en.wikipedia.org/wiki/Kurtosis

I've been using the "sample excess kurtosis" g2 as in the wikipedia.

3roomlab · « **Reply #13 on:** August 28, 2019, 01:57:42 pm »

a messy overview of the kurt responses (or how gaussian is kurt =0 ?)
the yellow box tells the range/nplc and kurt vs the curve. all curves are normalized to 1.0
1v and 10v are shifted down/up for clarity

iMo · « **Reply #14 on:** August 28, 2019, 05:35:57 pm »

https://www.spcforexcel.com/knowledge/basic-statistics/are-skewness-and-kurtosis-useful-statistics

z01z · « **Reply #15 on:** August 29, 2019, 11:57:09 am »

Even though it is debated that kurtosis is a measure of peakedness, I think the way you are using it makes it just that.
“The kurtosis parameter is a measure of the combined weight of the tails relative to the rest of the distribution.”
As you are using it on a subset of the data, with a moving window, the jumps at the edges will be indicated.

mrflibble · « **Reply #16 on:** August 30, 2019, 06:00:37 pm »

Quote from: RandallMcRee on August 25, 2019, 02:49:25 pm

Here is a streaming algorithm to calculate data moments. This is java but its trivial to put this into c++, # etc. The amount of memory is strictly bounded and small. (Knowing the authors I would bet that no better algorithm is possible!)

Thanks for the code snippet. Saved for future reference.

On the subject of processing measurements, RStudio is a pretty good tool for statistical number crunching. And you get good quality plots (ggplot2 is awesome) included. You can of course also use plain R from the command line, but RStudio is an easy way to get started and see if it's a useful tool for you.

As for what definition of kurtosis, personally I'd stick with the strict moments based definition, if only for being consistent. So basically the definition given on wikipedia. Plus, it's computationally convenient, since chances are that you're also calculating the 1th, 2nd and 3rd moment.

As for moving windows and outliers .... that's a tricky one. Does anyone know of any good references on the subject? Havent looked yet, but maybe convolution with a windowing function is in order? Or maybe just something like applying a regularizer and then doing an l-1 fit, and then doing something clever. The "clever" bit depending on your prefered method of handling outliers. Personally I am not a big fan of discarding "outliers", since that essentially means throwing away information about the underlying process that generates your measurements aka samples from the distribution.

Mmmh, if one wanted to do that online (as in, process data as it is coming in) in the above manner, that would essentially boil down to solving a linear program that changes by a, hopefully small, amount. It seems reasonable to expect that the new minimum/maximum is going to be computationally cheaper to find than the initial run. Thoughts?

RandallMcRee · « **Reply #17 on:** August 30, 2019, 07:23:46 pm »

Yes, I think processing "bad" values is actually quite interesting, but non-trivial. In addition to the first link I already posted here is another:

https://anomaly.io/moving-median-robust-anomaly/index.html Points to a whole book on robust statistics!
https://www.researchgate.net/publication/302565982_Robust_Filtering_of_Time_Series_with_Trends

In general, the term to google seems to be "anomaly detection" or "time series anomaly detection". Seems like a research topic. Also try "anomaly detection signal processing", "robust filtering", etc.

iMo · « **Reply #18 on:** August 31, 2019, 08:16:29 am »

FYI - raw data (34401a 10V 10NPLC), rolling median 100 samples, rolling 2x MAV 10 samples.

iMo · « **Reply #19 on:** September 03, 2019, 06:50:55 am »

Users of the HP-34401A (fw ver 10-5-2 here) - be aware:

When processing the data coming from serial (or GPIB - doublecheck) for example in talk only mode, the values in the strings are with 7 decimal places.

My current understanding is the place which does represent 1uV is the last "valid" one.

With voltages <10V, like "+9.999.996.6E+00" the 7th digit (after the 3rd ".") is a garbage.

The 7th garbage digit is NOT random (or complete 0..9), but the pattern is as follows:

Code: [Select]

..0.2E..
..1.2E..
..2.3E..
..3.4E..
..4.4E..
..5.5E..
..6.6E..
..7.6E..
..8.7E..
..9.8E..

thus there are only 10 values for the ..0.0E.. till ..9.9E.., moreover showing rounding issue, and therefore not suitable for any subsequent calculation, imho.

I have not investigated the other ranges yet, however.

Edit: after below clarification my quick fix is not necessary, the 7th digit represents some ADC "min step" size.
~~As a quick fix I've been replacing the 7th digit with "0" when the second character in the string is not "1" (as in the "+1.1234567E+01" the 7th digit is 1uV - 11.234.567 - that is ok).~~

PS: the question now is how the HP's fw calculates the 100NPLC value as the average of ten 10NPLC measurements..

Edit: added thousands separators '.' for better readability.

Kleinstein · « **Reply #20 on:** September 03, 2019, 07:18:58 am »

I don't think the last digit is "garbage" and useless. Chances are the ADC has a limited resolution, that is no much higher. Like with most modern DMMs the raw steps from the ADC don't correspond to exactly µV or 100 nV steps, but more arbitrary step size not directly display related. Calibration is done in software by multiplication with a constant. So there will be voltage steps that are not exactly fit the digits and the last digits probably still originate from that calculation. They are thus not random or garbage, but could still be used to reduce the rounding errors. So the best solution is likely to just include the 7 th digit just as normal. It is somewhat in the noise anyway, so the difference would not be large and rounding away the extra digits would not introduce much extra noise.

As an extra complication the 34401 gets most of the resolution from the run-up and the rest from the auxiliary ADC (µC internal) this leads to 2 scale factors involved. So the raw ADC result may not be a simple integer, but could already have uneven steps, with a few smaller ones. Chances are this is well inside the firmware and no visible outside.

iMo · « **Reply #21 on:** September 03, 2019, 07:46:09 am »

So do you say the 7th digit's (100s nV) "irregularity" comes from the calibration calculation "y=ax+b" and not from the floating point precision/artifacts they use?
Then the "pattern" will differ based on the a and b calibration values..

PS: the pattern below 9.999.990.2 I see here in the raw data:
..89.1E
..88.0E
..87.0E
..85.9E
..84.8E
..83.8E

So it looks like the step is between 1.0 or and 1.1 uV.

Kleinstein · « **Reply #22 on:** September 03, 2019, 08:11:44 am »

The main part is from the limited ADC resolution, not the cal constant. One can see the differences between adjacent values are pretty constant at a little more than 10. SO this very much looks like a finite step size.


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

Author Topic: Practical math for measurements pre-processing (Read 3024 times)

Share me