The undocumented DISFOLD bit needs to be set, too, for predictable branch timing (write 0x04006004 to ACTLR).
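Assuming the quoted word is the value to write, and that ACTLR sits at its standard ARMv7-M address of 0xE000E008, the write would look like this. This is a sketch; the bit layout is undocumented, so verify it against your silicon before relying on it:

```c
#include <stdint.h>

/* ACTLR at its standard ARMv7-M address; the value written is the one
   quoted above and is otherwise undocumented, so treat with caution. */
#define ACTLR (*(volatile uint32_t *)0xE000E008u)

static inline void set_disfold(void)
{
    ACTLR = 0x04006004u;   /* value from the post above, incl. DISFOLD */
}
```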
Ooh, thanks for the info. There weren't any branches in the DSP kernel anyway, so I didn't worry about that. It's unrolled 192 times over.
There is a third way: the DWT unit has a DWT_CYCCNT register, which counts CPU cycles.
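For reference, a minimal sketch of enabling and reading DWT_CYCCNT, using the standard ARMv7-M register addresses (vendor headers such as CMSIS expose the same registers as `CoreDebug->DEMCR`, `DWT->CTRL`, and `DWT->CYCCNT`):

```c
#include <stdint.h>

/* Standard ARMv7-M debug/trace registers. */
#define DEMCR      (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)

static inline void cyccnt_init(void)
{
    DEMCR |= (1u << 24);   /* TRCENA: power up the DWT unit */
    DWT_CYCCNT = 0;
    DWT_CTRL |= 1u;        /* CYCCNTENA: start the cycle counter */
}

static inline uint32_t cyccnt_read(void)
{
    return DWT_CYCCNT;
}
```

Timing a region is then `t0 = cyccnt_read(); ... cycles = cyccnt_read() - t0;`; the unsigned subtraction still gives the right answer across one wraparound of the 32-bit counter.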
Yeah, I figured there would be something like that but I just used the general-purpose interval timer which I had already figured out how to use rather than go through the docs. I'll use this in the future though, thanks.
So with hand-optimized assembly, I was able to get a 1.85x speedup per clock on the convolution inner loop.
Sorry... a 1.85x speedup compared to what? The M4F? So you've got near-perfect dual-issue (85% of the time) despite the longer FPU latency? Sounds pretty awesome!
Dual-issue speedups are often in the 1.4x to 1.6x range, which is usually reckoned to be an excellent efficiency gain as they only use about 15% more hardware.
And that's in addition to the 216/180 = 1.2x faster clock? So you've actually got 2.22x performance increase?
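The arithmetic behind that figure, as a quick sanity check:

```c
#include <assert.h>
#include <math.h>

/* Per-clock gain from dual-issue, times the clock-speed ratio. */
double combined_speedup(void)
{
    const double per_clock   = 1.85;           /* measured per-clock gain  */
    const double clock_ratio = 216.0 / 180.0;  /* M7 vs. M4F clock = 1.2x  */
    return per_clock * clock_ratio;            /* = 2.22                   */
}
```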
M7s are available at 600 MHz, and the one on my Teensy 4.0 seems perfectly happy (if a little hot) at 960 MHz, with perfect performance scaling up to that speed.
Well, that's with the hand-optimized code. I was sort of lamenting the fact that the Arm GCC compiler did quite a bad job compared to my handwritten assembly. I can't remember the exact figure, but yeah, it was in the 1.4x range. This was particularly disappointing because, from a compiler-algorithmic standpoint, what I did in the handwritten assembly to get the 1.85x speedup wasn't too complicated.
I started with a naive, unrolled implementation in assembly. The implementation used 29 floating-point registers (out of 32); having a few spare registers is key for the transformation I did. Knowing that the M7's FPU latency is 3 clocks, and knowing that I had equal numbers of loads and FPU ops, I cut and pasted all the loads from the naive implementation into a separate table, which would form the backbone of the fully optimized implementation. Loads can't dual-issue with other loads, so my goal was to pair each load with an FPU or ALU operation.

With the loads isolated from the FPU/ALU ops, I reordered the loads to minimize "latency skew" between two loads whose results would then be used by the same instruction. My original assembly code reused some registers, so I had to rename all the registers to temporary virtual names. Then I started adding FPU ops into the big table of loads, pairing them with loads and renaming the registers in the FPU ops for consistency.

The first few loads had to be unpaired, since I assumed a 3-clock latency between loading and using the result. I didn't measure load latency, so I chose 3 clocks to match the (known) FPU latency; I think the load latency from internal SRAM is actually two clocks. Were I working from external RAM, maybe I'd assume more load latency.

Because of the load and FPU latency, I had to space out dependent operations within one iteration of the unrolled loop and interleave them with independent operations from subsequent loop iterations. The FPU and load latency determines the interleave factor, and the number of operations per loop iteration determines the span of time over which a single iteration's result is calculated. Once I had everything written out, I renamed the registers from virtual names back to physical registers.
Since the computation of a single iteration's result was interleaved with the computation of the results of other iterations, a few more registers were required than I originally used in the naive assembly implementation.
Isn't this basic optimizing compiler stuff? You'd think it would do what I described automatically, since it's algorithmically pretty simple. I tried -funroll-all-loops, and it unrolled the loop, but it didn't interleave the iterations properly to hide the FPU latency. So yeah, I knew I could get almost 2x if I wrote the code myself; I'm just disappointed in the toolchain. And the 3-clock FPU latency is sort of inconvenient, though not a big deal, since you just have to use a wider interleave factor between independent loop iterations to compensate.

I would also add that I stuck this big unrolled loop in the middle of some C code as inline assembly. I chose the registers to use and, moreover, I specifically put a few instructions at the beginning and end which are not paired for dual-issue. These few unpaired instructions at the beginning and end are responsible for the speedup being only 1.85x instead of 2x. If only the compiler could do the optimization I described, maybe it could find some stuff in the preceding or subsequent C code that could be dual-issued with those unpaired instructions in my inline assembly.

Or maybe it could do the register allocation better. My optimized implementation doesn't use register s1, but that's caller-save in the usual calling convention. Maybe using s1, and avoiding one of the callee-save s16-s31 registers, would eliminate a push/pop in the surrounding function. Too many things to consider for such a small gain, and as you said, there's a 2.22x performance increase, so the problem is solved. Just disappointing that the compiler doesn't seem to know much about the latencies in the core, and that it doesn't try a slightly tricky optimization like the one I described.
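A rough model of how those unpaired instructions cap the speedup (the 8% figure is back-solved for illustration, not a measured count): if a fraction u of the instructions issue alone while the rest pair up two per cycle, the cycle count shrinks from 1 to (1 + u)/2, so the speedup is 2/(1 + u).

```c
#include <assert.h>
#include <math.h>

/* Dual-issue speedup when a fraction `unpaired_fraction` of the
   instruction stream can't be paired: cycles = (1 - u)/2 + u = (1 + u)/2,
   so speedup over single-issue = 2 / (1 + u). */
double dual_issue_speedup(double unpaired_fraction)
{
    return 2.0 / (1.0 + unpaired_fraction);
}
```

Plugging in u = 0 gives the ideal 2x, and roughly 8% unpaired instructions lands at about the 1.85x reported above.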