Author Topic: "Centre of Mass" of a bitset (Read 891 times)

tom66 · « **on:** August 20, 2024, 10:00:45 am »

This is part programming challenge and part problem to solve. I have a naïve solution to this problem, but I'm interested to see if there are other solutions available that might be more efficient or elegant.

This problem involves lane training for a SERDES interface. In this case, my lane training is done via software control. A control loop sets the delay on the input data lane and determines if the data received is valid for that given delay. This validity is stored in a bitmask for each delay. To maximise the reliability of received data, it is desired to centre the valid data position in the clock, and by choosing the centre valid value, this is likely to be the case (+/- some setup/hold variation). There are 32 valid delay values in this system corresponding to some fraction of the clock.

For instance, you might have a lane delay result that looks like this:

Code: [Select]

bit pos: 0 2  4 6  8 1  1 1  1 1  2 2  2 2  2 3
                     0  2 4  6 8  0 2  4 6  8 0
         ---------------------------------------
mask:    0000 0000 0001 1111 1111 0000 0000 0000

In this case, the "centre of mass" of the result is trivial: bit 11 + bit 19 = 15. So the ideal line delay to centre the data eye is 15.

But you might also have a result that looks like this:

Code: [Select]

bit pos: 0 2  4 6  8 1  1 1  1 1  2 2  2 2  2 3
                     0  2 4  6 8  0 2  4 6  8 0
         ---------------------------------------
         1111 1100 0000 0000 0000 0000 0000 1111

In this case, the valid delay wrapped around, because the line delay put us beyond a 360 degree phase shift. So, the valid delay is not trivially calculable, since adding 28 and 5 gives you 16, which is actually just about the most invalid value you could pick.

Now, it's trivial enough to solve this latter problem with modulo arithmetic. The bit values are shifted until the lower value is greater than the upper value. So in this case, we need to shift back 6 places. That means we average 22 and 31, giving 26; then we add the shift back on and modulo 32 which gives us a value of 0.

But, imagine you now have somewhat 'noisy' validity data. What if you could only check a small amount of the data and sometimes, you do get a valid response, but the next delay value is invalid. You might have a validity pattern that looks like this:

Code: [Select]

bit pos: 0 2  4 6  8 1  1 1  1 1  2 2  2 2  2 3
                     0  2 4  6 8  0 2  4 6  8 0
         ---------------------------------------
mask:    0000 0000 1111 1110 1110 0000 0000 1000

(Note: The ideal solution here is to increase the size of each test so we get a clearer delination between pass and fail, but imagine we can't do this.)

Now, taking the mean value here is not that useful: the first set bit is 8 and the last set bit is 28, which gives a value of 18, which is at the edge of the valid values. So not an ideal delay value.

I considered an alternative option, to sum all of the valid values, and divide that by the number of set bits. This gives a value of 14, which is in the centre of mass of the valid data.

However, this method does not work for the problem that is wrapped around, for instance:

Code: [Select]

bit pos: 0 2  4 6  8 1  1 1  1 1  2 2  2 2  2 3
                     0  2 4  6 8  0 2  4 6  8 0
         ---------------------------------------
mask:    1101 1100 0001 0000 0000 1000 0000 1111

Summing the values will give us a value of 14, like before, this is the 'most invalid' value possible.

I am curious what solutions others have to this problem. My thought is that the mask has to be shifted repeatedly until the most plausible value is found. Or, some kind of pre-processing could be applied to the data set to generate a 'clean' set of data.

moffy · « **Reply #1 on:** August 20, 2024, 10:54:05 am »

I assume that the set bitmask means 100% pass and bitmask 0 means less than 100%, your data would be more meaningful if there was a success count associated with each bitmask position. Also do you have a defined minimum failure rate for the data?

tom66 · « **Reply #2 on:** August 20, 2024, 11:00:41 am »

The reason we get less than ideal pass/fail data is because we're only looking at the header data, so only a few bytes need to match to consider it a pass. Statistically this is reasonably likely to occur even with incorrect delay values that cause the remainder of the data to be received incorrectly. There is also spread spectrum on the clock applied which can cause an invalid delay value to periodically show up as correct. I could increase the number of samples to get a count in each bin, but that would increase the lane training time, which is undesirable. I'm hoping to find a solution for the difficult case that can work with one sample per delay value (this still takes around 2 seconds to complete as our packet sizes are quite large.)

moffy · « **Reply #3 on:** August 20, 2024, 11:05:24 am »

Well best guess is to take the middle value of the largest number of consecutive 1's including the wrap around, but I'm not sure how much confidence you can put in that.

rhb · « **Reply #4 on:** August 20, 2024, 05:43:43 pm »

Are you seeking the alignment of a pattern? If so, do an FFT and a linear fit to the phase after multiplying by the pattern FFT. That is the L2 least squared error solution.

If you are looking for the densest concentration of points in a stream, do an FFT and pick the narrowest, largest area peak. That will be the broadest peak in the time domain. The phase will tell you when it occurs. Peak amplitude in the frequency domain is a good approximation for biggest concentration. Create a waterfall of consecutive spectra and you can observe phase shifts in an adequately sampled input.

Have Fun!
Reg

ejeffrey · « **Reply #5 on:** August 20, 2024, 06:46:37 pm »

I have solved this exact problem for basically the same reason.

The way I did it was to double the dataset to cover two periods. I just duplicated the data, I didn't actually run the sweep through two cycles. Then I ignored the boundary conditions, found the largest run of 1's, and picked the middle of it. I would do a sanity check on the minimum acceptable length of the run, then if the result was in the second half of the array I wrapped it down to the first half.

Someone · « **Reply #6 on:** August 20, 2024, 10:05:52 pm »

Boxcar average, sum the unit vectors
Exponential moving average, wrap the delta

If the choice is free then it depends how fast you need the result and/or where the algorithm needs to run.

Tation · « **Reply #7 on:** August 21, 2024, 08:16:34 am »

Quote from: ejeffrey on August 20, 2024, 06:46:37 pm

I have solved this exact problem for basically the same reason.

The way I did it was to double the dataset to cover two periods. I just duplicated the data, I didn't actually run the sweep through two cycles. Then I ignored the boundary conditions, found the largest run of 1's, and picked the middle of it. I would do a sanity check on the minimum acceptable length of the run, then if the result was in the second half of the array I wrapped it down to the first half.

Maybe also toggling isolated 0/1.

Nominal Animal · « **Reply #8 on:** August 21, 2024, 11:01:47 am »

Given a 32-bit binary number \$v\$,
$$v = \sum_{n=0}^{31} b_n \, 2^n$$
where bits \$b_n\$ are numbered in arithmetic order (\$b_0\$ least significant, \$b_{31}\$ most significant), the barycenter is at index \$c\$,
$$c = \frac{\displaystyle \sum_{n=0}^{31} n \, b_i}{\displaystyle \sum_{n=0}^{31} b_i}$$
but this does not take into account the periodicity of the pattern.

The naïve approach to finding the barycenter considering the periodicity of the pattern would be to find the longest sequence of zeroes, and rotate the pattern so that that run of zeroes are at one end. You can do this for example using a loop, rotating the value by one bit downwards per iteration, and picking the rotation that yields the smallest unsigned integer value for the rotation. The result is then the above barycenter \$c\$, plus the number of bits rotated downwards, modulo 32.
Edit: This is not actually correct, because it does not consider the cyclicity correctly. See #10 for correct version!

Similar logic can be applied if instead of a single test you have a histogram \$h\$ of the set bits. For example, if you have 20 sets of 32-bit results, \$h_n\$ would be the number of times bit \$n\$ was set among these 20 sets. To find the rotation, convert to a binary value by comparing each count to some limit, for example half the number of sets. The barycenter of a histogram is the same,
$$c = \frac{\displaystyle \sum_{n=0}^{31} n \, h_n}{\displaystyle \sum_{n=0}^{31} h_n}$$
plus the number of bits rotated downwards, modulo 32.
Edit: Same issue, this does not consider the cyclicity correctly.

As to programming challenge, I'm pretty sure there are ways to speed things up, and even an online algorithm (updating the center whenever a new bit is obtained).

Nominal Animal · « **Reply #9 on:** August 21, 2024, 12:37:33 pm »

Instead of barycenter, you could also consider using (after rotation) the position of the N/2'th set bit, when there are N bits set.

Some (rotated) 12-bit examples, with that mid set bit underlined, to see why:
0101 1010 0111
0000 0000 1011
0000 0010 0011
0000 0100 1001

Nominal Animal · « **Reply #10 on:** August 28, 2024, 06:49:20 pm »

My handling of the cyclicity was incorrect above! My sincere apologies. 😢

The definition of a cyclic barycenter among 32 regularly placed weights (be they bits or histogram bins) \$b_n\$, \$n = 0 \dots 31\$, is by definition
$$\left\lbrace \, \begin{aligned}
x &= \sum_{n=0}^{31} b_n \cos\left(\frac{n \pi}{16}\right) \\
y &= \sum_{n=0}^{31} b_n \sin\left(\frac{n \pi}{16}\right) \\
c &= \left\lfloor 32 + \frac{16}{\pi}\operatorname{atan2}(y, x) \right\rceil \mod 32 \\
\end{aligned} \right .$$
where \$c = 0 \dots 31\$ is the barycenter index.

Consider this as an unit circle, with 32 possible "weights" placed regularly on it, at 360°/32 = 11.25° intervals. We use standard 2D barycenter for these 32 points: the weights are either 0 or 1 for binary, or bin counts for histograms. The answer is the 11.25° sector from origin the barycenter ends up in. (The distance from origin to barycenter is not relevant, although very close to origin means the barycenter is not well defined.)

Thus, one very efficient implementation in C would be cyclic_barycenter(bits),

Code: [Select]

// SPDX-License-Identifier: CC0-1.0
#include <stdint.h>

// First quadrant arctangent for 32 directions; returns 0 .. 8.
static uint_fast8_t atan32_internal(uint_fast16_t y, uint_fast16_t x)
{
    static const struct {
        uint16_t    x;
        uint16_t    y;
    } points[8] = {
        {  .x = 26581, .y =  2618 },
        {  .x = 31703, .y =  9617 },
        {  .x =  9113, .y =  4871 },
        {  .x =  5716, .y =  4691 },
        {  .x = 23963, .y = 29199 },
        {  .x = 15914, .y = 29773 },
        {  .x =  2121, .y =  6992 },
        {  .x =  1025, .y = 10407 },
    };
    uint32_t  ppos, pneg;

    int_fast8_t  imin = 0, imax = 8;
    while (1) {
        int_fast8_t  i = (imin + imax) / 2;

        ppos = (uint32_t)y * points[i].x;
        pneg = (uint32_t)x * points[i].y;

        if (i == imin)
            return i + (ppos >= pneg);

        if (ppos > pneg)
            imin = i;
        else
        if (ppos < pneg)
            imax = i;
        else
            return i + 1;
    }
}

// Verified for x,y in -32768..32767 inclusive:
//    atan32(y, x) == (31 & (int)round(atan2((double)y, (double)x) * 16/PI))
uint_fast8_t atan32(int_fast16_t y, int_fast16_t x)
{
    if (y > 0) {
        if (x > 0) {
            return 31 & (     atan32_internal(y,  x));
        } else
        if (x < 0) {
            return 31 & (16 - atan32_internal(y, -x));
        } else {
            return 8;
        }
    } else
    if (y < 0) {
        if (x > 0) {
            return 31 & (32 - atan32_internal(-y,  x));
        } else
        if (x < 0) {
            return 31 & (16 + atan32_internal(-y, -x));
        } else {
            return 24;
        }
    } else {
        if (x > 0) {
            return 0;
        } else
        if (x < 0) {
            return 16;
        } else {
            return 0;  // (0, 0)
        }
    }
}

int cyclic_barycenter(uint32_t u)
{
    // Scale 3209 minimizes error while keeping maximum sum within signed 16-bit numbers
    static const int16_t  c[40] = {
         3147,  2965,  2668,  2269,  1783,  1228,  626, 0, -626, -1228, -1783, -2269, -2668, -2965, -3147, -3209,
        -3147, -2965, -2668, -2269, -1783, -1228, -626, 0,  626,  1228,  1783,  2269,  2668,  2965,  3147,  3209,
         3147,  2965,  2668,  2269,  1783,  1228,  626, 0,
    };

    int32_t  x = 0, y = 0;
    uint_fast8_t  i = 32;
    while ((i--)|u) {
        if (u & 1) {
            x += c[i];
            y += c[i+8];
        }
        u >>= 1;
    }

    return atan32(y, x);
}

Only one loop over the 32 bits are done; then, three conditionals, followed by a binary search over 8 entries. The binary search does exactly 4 iterations, with each iteration doing two 16×16=32-bit (unsigned) multiplications, for a total of 8 16×16=32-bit multiplications. This is why I claim this to be efficient.
Note that the cost is not much different when a histogram is used, although then I would switch to 32-bit signed integers and lookup (and 32×32=64-bit unsigned multiply operations).

Two look-up tables are used. One contains 40 signed 16-bit integers, and the other 8 pairs of signed 16-bit integers, for a total of 56 signed 16-bit integers, or 112 bytes of read-only memory.

I have verified that atan32(y,x) matches 31 & (int)round(16/3.14159265358979323846*atan2(y,x)) exactly, for all 16-bit signed x and y, -32768 .. 32767.

atan32_internal(y,x) takes a point in the positive quadrant, and returns the angle by comparing the direction to eight known points closest to but within the lower boundary of each 11.5° sector, using a binary search and exterior product. Essentially, if \$(x_A, y_A)\$ and \$(x_B, y_B)\$ are two vectors, \$y_A x_B - x_A y_B > 0\$ if the former vector is at a larger (positive) angle than the latter vector. The stored points are at the lower (positive) angle boundary of the 11.5° sector.

cyclic_barycenter(bits) uses a precalculated table of sines and cosines in a funky order: cos(0°)=c[31], sin(0°)=c[39], cos(11.5°)=c[30], sin(11.5°)=c[38], and so on. The scale is 3209, because it minimizes relative errors in the individual integer variables (compared to the exact values of sine/cosine), while keeping the maximum sum combination within signed 16-bit range, -32768..+32767 (in -32581..+32581, actually). Essentially, this minimizes the quantization error in the barycenter coordinates, while keeping the barycenter within 16-bit signed integer range.

I've only lightly tested it, but it seems to produce expected results, within the precision limits.

rhb · « **Reply #11 on:** August 30, 2024, 02:08:08 am »

Tom and I have had a bit of chat on the side. I'd like to restate the problem as I think he stated the problem poorly in his first post. There have been excellent replies in response to how he stated it, but there are important nuances he omitted that somewhat alter things.

In a high speed, clocked system, gate propagation delay can have significant variation with temperature and voltage. This results in increased bit error rates for high speed data transfer operations.

*A* solution is to sweep fractional clock cycle timing delays multiple times at startup to generate a histogram of the distribution of BER vs delay. Then pick the mode of the minimum value. Resweep if the BER changes.

There are likely other solutions, but given the problem as posed this seems to me the most practical. Monitoring temperature and delay can improve that and is probably essential for 1 ns and faster clock rates.

Build an empirical predictive model of the system behavior and you can narrow the error band quite a lot by the simple expedient of placing thermistors at key points in the system.

Very intriguing issue. I suspect the "black magic" books discuss it. I'll have a look and followup when I have some time.

Have Fun!
Reg

Nominal Animal · « **Reply #12 on:** August 30, 2024, 10:13:21 am »

Quote from: rhb on August 30, 2024, 02:08:08 am

*A* solution is to sweep fractional clock cycle timing delays multiple times at startup to generate a histogram of the distribution of BER vs delay. Then pick the mode of the minimum value. Resweep if the BER changes.

How wide, in timing delay values, tends the acceptable bit error rate zone to be?

I was wondering whether it would be possible to oscillate the delay ±1 at some relatively low rate around the current optimum delay, tracking the BER for each of the three separately, and move the center delay when some threshold is exceeded (perhaps by moving away from highest BER, rather than moving towards lowest BER).


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

EEVblog Electronics Community Forum

Author Topic: "Centre of Mass" of a bitset (Read 891 times)

tom66

"Centre of Mass" of a bitset

moffy

Re: "Centre of Mass" of a bitset

tom66

Re: "Centre of Mass" of a bitset

moffy

Re: "Centre of Mass" of a bitset

rhb

Re: "Centre of Mass" of a bitset

ejeffrey

Re: "Centre of Mass" of a bitset

Someone

Re: "Centre of Mass" of a bitset

Tation

Re: "Centre of Mass" of a bitset

Nominal Animal

Re: "Centre of Mass" of a bitset

Nominal Animal

Re: "Centre of Mass" of a bitset

Nominal Animal

Re: "Centre of Mass" of a bitset

rhb

Re: "Centre of Mass" of a bitset

Nominal Animal

Re: "Centre of Mass" of a bitset

Share me