Author Topic: This code is so broken it some how fixes itself. (Read 4716 times)

paulca · « **Reply #25 on:** February 01, 2023, 03:23:33 pm »

Quote from: cv007 on February 01, 2023, 02:43:45 pm

Obviously lots of options and whether this one is better or not, I like to eliminate casting if there is a better option. In this case I would rather deal with the union than to have to (re)think about how the underlying data fits together when casting.

True. I was looking into unions for something else yesterday I can see how they can help with the constant duplicity of their types between signed 16 and unsigned 32.

It also depends on how the compiler optimises things.

That part of the code will change, as I have to try 24bit and you can't just "cast" right justified signed 24bit as easily as 16.

cv007 · « **Reply #26 on:** February 01, 2023, 04:04:10 pm »

Quote

I have to try 24bit and you can't just "cast" right justified signed 24bit as easily as 16.

Code: [Select]

typedef union { 
    volatile uint32_t dr_u32; 
    volatile int16_t eq_i16;
    volatile int32_t eq_i24 : 24; 
} sai_buffer_t;

When used, the compiler will sign extend the eq_i24 to an int32_t.

edit-
could also have a single member which would work the same for i16/i24, each sign extending to i32 when used-

Code: [Select]

static const unsigned SAI_DATA_BITS = 16;

union sai_buffer_t { 
    volatile uint32_t dr_u32; 
    volatile int32_t data : SAI_DATA_BITS;
};

Nominal Animal · « **Reply #27 on:** February 01, 2023, 05:32:57 pm »

I habitually use

Code: [Select]

typedef struct {
    union {
        double          d;
        float           f;
        uint64_t        u64;
        int64_t         i64;
        uint32_t        u32[2];
        int32_t         i32[2];
        uint16_t        u16[4];
        int16_t         i16[4];
        uint8_t         u8[8];
        int8_t          i8[8];
        unsigned char   uc[8];
        signed char     sc[8];
        char            c[8];
    };
} word64;

typedef struct {
    union {
        float           f;
        uint32_t        u32;
        int32_t         i32;
        uint16_t        u16[2];
        int16_t         i16[2];
        uint8_t         u8[4];
        int8_t          i8[4];
        unsigned char   uc[4];
        signed char     sc[4];
        char            c[4];
    };
} word32;

to examine the various parts of data elements, and reinterpret the storage as another type. For example, ((word32){ .f = value }).u32 returns the 32-bit storage representation of float value (and the code kinda-sorta assumes IEEE 754 Binary64 and Binary32 floating point formats).
When optimized, the machine code simplifies either to nothing (ARMv7-a with hardware floating point support) or a single register move (x86-64), too.

If you have a buffer with 24-bit signed integer or fixed-point samples, you can easily expand it to 32-bit signed integer or fixed point using

Code: [Select]

static inline void i24_to_i32_quad(int32_t *dst, const uint32_t src0, const uint32_t src1, const uint32_t src2)
{
    *(dst++) = ((int32_t)( src0 << 8 )) >> 8;
    *(dst++) = ((int32_t)( (src1 << 16) | ((src0 >> 24) << 8) )) >> 8;
    *(dst++) = ((int32_t)( ((src1 >> 16) << 8) | (src2 << 24) )) >> 8;
    *(dst++) = ((int32_t)( src2 )) >> 8;
}

void i24_to_i32(int32_t *buf, const size_t count)
{
    const size_t  quads = (count / 4) + !!(count & 3);

    const uint32_t *src = (uint32_t *)buf + 3*quads;
    int32_t *dst = buf + 4*quads;

    while (dst > buf) {
        dst -= 4;
        src -= 3;
        i24_to_i32_quad(dst, src[0], src[1], src[2]);
    }
}

but the buffer size must be padded with zeroes to a multiple of 16 bytes (or count a multiple of 4), and have room for 4 bytes per 24-bit value. As it progresses backwards from end of array to beginning of array, the conversion is done in-place, in units of 12 bytes in, 16 bytes out. This does do sign extension correctly, too.

Note that it does expect (negative integer) >> N to be arithmetic shift right; it is on GCC and Clang.

With GCC and Clang, one can also use

Code: [Select]

typedef struct {
    union {
       unsigned int  u24:24;
       int           i24:24;
       uint8_t       u8[3];
       int8_t        i8[3];
       char          c[3];
    } __attribute__((packed));
} __attribute__ ((packed)) word24;

typedef struct {
    union {
        uint32_t    u32[3];
        int32_t     i32[3];
        word24      w24[4];
        uint16_t    u16[6];
        int16_t     i16[6];
        uint8_t     u8[12];
        int8_t      i8[12];
        char        c[12];
    };
} word24x4;

void unpack_word24x4(int32_t out[4], const uint32_t in[3])
{
    const word24x4  temp = { .u32 = { in[0], in[1], in[2] } };
    out[0] = temp.w24[0].i24;
    out[1] = temp.w24[1].i24;
    out[2] = temp.w24[2].i24;
    out[3] = temp.w24[3].i24;
}

but this relies on the packed type attribute limiting the size of the word24 type to exactly three bytes, with byte alignment; this is true for GCC and Clang, but is not "standard" C. (Plus right shift with signed integer types being arithmetic shift right.)

As shown, unpack_word24x4() takes 12 bytes (in three 32-bit unsigned integers), and expands them to four signed 32-bit integers (with sign extension as expected).

Note that i24_to_i32_quad() is faster than unpack_word24x4() on 32-bit ARMv7 at least, if one examines the code at Compiler Explorer.
That is,

Code: [Select]

void unpack_word24x4(int32_t out[4], const uint32_t in[3])
{
    out[0] = ((int32_t)( in[0] << 8 )) >> 8;
    out[1] = ((int32_t)( (in[1] << 16) | ((in[0] >> 24) << 8) )) >> 8;
    out[2] = ((int32_t)( ((in[1] >> 16) << 8) | (in[2] << 24) )) >> 8;
    out[3] = ((int32_t)( in[2] )) >> 8;
}

is faster but harder to maintain than the other version above, on ARMv7 at least; both do yield the exact same results.

paulca · « **Reply #28 on:** February 01, 2023, 05:44:40 pm »

Most of that went over my head. However, the basic filters / processing I am using are only there because they were the first ones I found in a palatable (theivable) form.

I really should try and get biquad equivalents and use the arm math library with DSP optimisations. I might for example get a wider range of high order filters.

So I will return and review the data packing/unpacking at the same time.

For now I think I'm going to do a blunt ignorant shift and or like the HAL code does

On the optimisation... my assumption was that the C compiler would include things like hardware DP FPU and DSP extensions if they existed with no optimisations.

However now I'm wondering if it actually disables all of that so you can step over some floating point calcs in the debugger, where as if they are offloaded to teh FPU that becomes difficult?

I mean how much "extra" stuff is turned on for -O3, -Ofast

Nominal Animal · « **Reply #29 on:** February 01, 2023, 08:45:52 pm »

Quote from: paulca on February 01, 2023, 05:44:40 pm

Most of that went over my head.

Sorry!

To summarise, if you have arrays of 24-bit samples, three bytes per sample, I showed that it is best to handle them in groups of four samples (12 bytes), and very efficiently sign-expand to 32-bit. The rest was details, and two different approaches you can choose from, although one has better performance.

Quote from: paulca on February 01, 2023, 05:44:40 pm

On the optimisation... my assumption was that the C compiler would include things like hardware DP FPU and DSP extensions if they existed with no optimisations.

STM32F4 series has an ARM Cortex-M4 core, and thus ARMv7E-M architecture.
ST's AN4841: Digital signal processing for STM32 microcontrollers using CMSIS describes its DSP features (but also includes some of the STM32F7 series).
Since GCC and Clang try to implement ARM C Language Extensions, ACLE (ihi0053) is also useful.

In general, C is not an easy language to automatically SIMD-vectorize. For STM32F4, there are instructions that add, subtract, and/or multiply using pairs of signed or unsigned 16-bit values in each register, or quartets of 8-bit signed or unsigned values in each registers.

As of right now, you do need to use the ACLE macros to SIMD-vectorize your C code.

For example, let's say you have an array of 16-bit signed integer samples data, another array of 16-bit signed integer coefficients coefficient, both arrays having count elements (even; multiple of two count), and both arrays being 32-bit aligned (__attribute__((align (4)))), you can obtain the 64-bit sum (which cannot overflow since each product is between -0x3FFF8000 and +0x40000000, i.e. 31 bits):

Code: [Select]

#include <arm_acle.h>

typedef struct {
    union {
        uint32_t   u32;
        int32_t    i32;
        uint16_t   u16[2];
        int16_t    i16[2];
        uint16x2_t u16x2;
        int16x2_t  i16x2;
        uint8_t     u8[4];
        int8_t      i8[4];
        uint8x4_t  u8x4;
        int8x4_t   i8x4;
        char       c[4];
    };
} word32;

int64_t sum_i16xi16(const int16_t *data, const int16_t *coeff, const uint32_t count)
{
    const word32 *dz = (const word32 *)data + (count / 2);
    const word32 *d = (const word32 *)data;
    const word32 *c = (const word32 *)coeff;
    int64_t  result = 0;

    while (d < dz)
        result = __smlald((*(d++)).i16x2, (*(c++)).i16x2, result);

    return result;
}

using -O2 -march=armv7e-m+fp -mtune=cortex-m4 -mthumb (with gcc; use -Os instead of -O2 with clang).

The inner loop will compile to
.L3:
ldr r0, [r3], #4 ; 2 cycles
ldr r2, [r1], #4 ; 2 cycles
cmp ip, r3, #1 ; 1 cycle
smlald r4, r5, r2, r0 ; 1 cycle
bhi .L3 ; 2-5 cycles when taken
which by my count should yield 8 cycles, or 4 cycles per sample, or equivalently 0.25 samples per cycle.
Furthermore, the result is trivial to shift and then clamp to the proper range, so it is just a few additional cycles per buffer to adjust for the fractional bits in samples and/or coefficients. Me Like Dis.

If count is odd, the last sample and coefficient will be ignored. I suggest you ensure your array sizes are even, and set the final element of both arrays to zero, so you can round count up to the next even number without affecting the result.

The inner loop is equivalent to

Code: [Select]

    for (uint32_t i = 0; i < count/2; i++)
        result = __smlald(d[i].i16x2, c[i].i16x2, result);

but is written in a form that both gcc and clang can optimize to the above code or equivalent.
Each loop iteration computes two 16-bit products, summing them to result.

paulca · « **Reply #30 on:** February 01, 2023, 09:24:12 pm »

It's an H7, but I get the idea

paulca · « **Reply #31 on:** February 01, 2023, 09:26:22 pm »

While I have you

(sorry).

Is there a quick win to gain access to the DTCM SRAM? Googling finds me long linker script modifications.

Can't I just ignorantly use hard coded offsets by casting an integer like 0x200000 to a pointer? I'm only declaring 4 tiny arrays.

thm_w · « **Reply #32 on:** February 01, 2023, 10:21:59 pm »

Guess its harder than ITCM for some reason?

Quote

@note Some code parts can be executed in the ITCM-RAM (64 KB) which decrease critical task execution time, compared to code execution from Flash memory. This feature can be activated using '#pragma location = ".itcmram"' to be placed above function declaration, or using the toolchain GUI (file options) to execute a whole source file in the ITCM-RAM.

@note The application needs to ensure that the SysTick time base is always set to
1 millisecond to have correct operation.

@Note For the Cortex-M7, if the application is using the DTCM/ITCM memories (@0x20000000/ 0x0000000: not cacheable and only accessible
by the Cortex-M7 and the MDMA), no need for cache maintenance when the Cortex M7 and the MDMA access these RAMs.
If the application needs to use DMA(or other masters) based access or requires more RAM, then the user has to:
- Use a non TCM SRAM. (example : D1 AXI-SRAM @ 0x24000000)
- Add a cache maintenance mechanism to ensure the cache coherence between CPU and other masters(DMAs,DMA2D,LTDC,MDMA).
- The addresses and the size of cacheable buffers (shared between CPU and other masters)
must be properly defined to be aligned to L1-CACHE line size (32 bytes).

https://github.com/STMicroelectronics/STM32CubeH7/blob/master/Projects/STM32H747I-EVAL/Templates_LL/readme.txt
https://community.st.com/s/question/0D53W00001sKgU0SAK/stm32cubeide-doesn-not-support-declaring-variable-in-dtcm-on-stm32h7-directly

Nominal Animal · « **Reply #33 on:** February 01, 2023, 11:18:13 pm »

Quote from: paulca on February 01, 2023, 09:24:12 pm

It's an H7, but I get the idea

H7 is Cortex-M7, so also ARMv7E-M, but with newer FP with double precision support (+fp.dp), so you'll want to use
gcc -O2 -march=armv7e-m+fp.dp -mtune=cortex-m7 -mthumb ...
or
clang -Os -target=arm-arm-none-eabi -mcpu=cortex-m7+fp -mthumb ...
to compile the code. I use these for i.MX RT1062 (Teensy 4.x), too: it has a very similar ARM Cortex-M7 core (but NXP, not ST).

The same documentation applies.
The cycle counts are likely completely off, though, since I can find them only for Cortex-M0 and Cortex-M4, not Cortex-M7.

Quote from: paulca on February 01, 2023, 09:26:22 pm

Is there a quick win to gain access to the DTCM SRAM?

You can use a fixed pointer to access DTCM SRAM. Thing is, how will you ensure the addresses you use are not used by the compiler and/or linker already?

That is why modifying your linker script, typically by declaring a section that will be placed in the correct addresses, is used. If you ensure section "dtcm" is placed in the correct region, then
static type arrayname[count] __attribute__((section "dtcm"));
puts the array at the correct section, so it will end up in the correct memory addresses, and the linker will decide its exact address. Without modifications or support from the startup (startup.s, I believe), it will be uninitialized (have random contents) after bootup.

If you can show the linker script you already use (look for "Linker script for STM32H7 series", file name probably ends with .ld), and it does contain the "This software component is licensed by ST under BSD 3-Clause license", then you are allowed to publish it here and I can show how to modify it to add such a section. Most likely, it is just one added line.

paulca · « **Reply #34 on:** February 14, 2023, 06:25:46 pm »

Looked back into this today.

It would seem the linker script is already setup to use the DTC RAM. However, most things get allocated in RAM_D1 'above' it.

Linker script attached. I believe it's BSD license.

Any quick wins?

eutectique · « **Reply #35 on:** February 14, 2023, 06:58:05 pm »

Quote from: paulca on February 14, 2023, 06:25:46 pm

However, most things get allocated in RAM_D1 'above' it.

What are these things? What the map file shows? Any insights from arm-none-eabi-objdump -h yourfile.elf ? Are some symbols matching .data.* pattern located in RAM_D1 and some in DTCMRAM?

BTW, your linker script does not have section RAM_D1. Are you sure about it?

paulca · « **Reply #36 on:** February 14, 2023, 07:37:34 pm »

Quote from: eutectique on February 14, 2023, 06:58:05 pm

Quote from: paulca on February 14, 2023, 06:25:46 pm
However, most things get allocated in RAM_D1 'above' it.

What are these things? What the map file shows? Any insights from arm-none-eabi-objdump -h yourfile.elf ? Are some symbols matching .data.* pattern located in RAM_D1 and some in DTCMRAM?

BTW, your linker script does not have section RAM_D1. Are you sure about it?

Sorry, there are two linkers. The other one is attached,.

The "other" things are general variables, buffers, (.bss and .data I will assume)

The idea is to create an labelled section like .dtcm and allocate some of the buffers in there.

eutectique · « **Reply #37 on:** February 14, 2023, 08:47:38 pm »

Quote from: paulca on February 14, 2023, 07:37:34 pm

The idea is to create an labelled section like .dtcm and allocate some of the buffers in there.

Ok, then create it!

Add the following to the linker script:

Code: [Select]

.dtcm_section {
    KEEP(*dtcm*)
} > DTCMRAM

and the following, as suggested by Nominal Animal, to your source code:

Code: [Select]

static type arrayname[count] __attribute__((section "dtcm"));

paulca · « **Reply #38 on:** February 21, 2023, 04:50:30 pm »

Thanks! Got it working.

Adding /modified a few bits though.

Code: [Select]

 __attribute__((section(".dtcm_section"))) static uint32_t rx1Buffer[8];
 __attribute__((section(".dtcm_section"))) static uint32_t rx2Buffer[8];
 __attribute__((section(".dtcm_section"))) static uint32_t tx1Buffer[8] ;
 __attribute__((section(".dtcm_section"))) static uint32_t tx2Buffer[8];

Code: [Select]

  .dtcm_section :
  {
      . = ALIGN(4);
      KEEP(*dtcm*)
      . = ALIGN(4);
  } > DTCMRAM

I might test how much impact it has with the scope next time I'm checking the performance/timing.


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

EEVblog Electronics Community Forum

Author Topic: This code is so broken it some how fixes itself. (Read 4716 times)

paulca

Re: This code is so broken it some how fixes itself.

cv007

Re: This code is so broken it some how fixes itself.

Nominal Animal

Re: This code is so broken it some how fixes itself.

paulca

Re: This code is so broken it some how fixes itself.

Nominal Animal

Re: This code is so broken it some how fixes itself.

paulca

Re: This code is so broken it some how fixes itself.

paulca

Re: This code is so broken it some how fixes itself.

thm_w

Re: This code is so broken it some how fixes itself.

Nominal Animal

Re: This code is so broken it some how fixes itself.

paulca

Re: This code is so broken it some how fixes itself.

eutectique

Re: This code is so broken it some how fixes itself.

paulca

Re: This code is so broken it some how fixes itself.

eutectique

Re: This code is so broken it some how fixes itself.

paulca

Re: This code is so broken it some how fixes itself.

Share me