The undocumented DISFOLD bit needs to be set, too, for predictable branch timing (write 0x04006004 to ACTLR).
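Assuming the quoted word is the value to write, and that ACTLR sits at its standard ARMv7-M address of 0xE000E008, the write would look like this. This is a sketch; the bit layout is undocumented, so verify it against your silicon before relying on it:

```c
#include <stdint.h>

/* ACTLR at its standard ARMv7-M address; the value written is the one
   quoted above and is otherwise undocumented, so treat with caution. */
#define ACTLR (*(volatile uint32_t *)0xE000E008u)

static inline void set_disfold(void)
{
    ACTLR = 0x04006004u;   /* value from the post above, incl. DISFOLD */
}
```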
Ooh, thanks for the info. There weren't any branches in the DSP kernel anyway, so I didn't worry about that. It's unrolled 192 times over.
There is a third way: the DWT unit has a DWT_CYCCNT register, which counts CPU cycles.
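For reference, a minimal sketch of enabling and reading DWT_CYCCNT, using the standard ARMv7-M register addresses (vendor headers such as CMSIS expose the same registers as `CoreDebug->DEMCR`, `DWT->CTRL`, and `DWT->CYCCNT`):

```c
#include <stdint.h>

/* Standard ARMv7-M debug/trace registers. */
#define DEMCR      (*(volatile uint32_t *)0xE000EDFCu)
#define DWT_CTRL   (*(volatile uint32_t *)0xE0001000u)
#define DWT_CYCCNT (*(volatile uint32_t *)0xE0001004u)

static inline void cyccnt_init(void)
{
    DEMCR |= (1u << 24);   /* TRCENA: power up the DWT unit */
    DWT_CYCCNT = 0;
    DWT_CTRL |= 1u;        /* CYCCNTENA: start the cycle counter */
}

static inline uint32_t cyccnt_read(void)
{
    return DWT_CYCCNT;
}
```

Timing a region is then `t0 = cyccnt_read(); ... cycles = cyccnt_read() - t0;`; the unsigned subtraction still gives the right answer across one wraparound of the 32-bit counter.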
Yeah, I figured there would be something like that but I just used the general-purpose interval timer which I had already figured out how to use rather than go through the docs. I'll use this in the future though, thanks.
So with hand-optimized assembly, I was able to get a 1.85x speedup per clock on the convolution inner loop.
Sorry... a 1.85x speedup compared to what? The M4F? So you've got near-perfect dual-issue (85% of the time) despite the longer FPU latency? Sounds pretty awesome!
Dual-issue speedups are often in the 1.4x to 1.6x range, which is usually reckoned to be an excellent efficiency gain as they only use about 15% more hardware.
And that's in addition to the 216/180 = 1.2x faster clock? So you've actually got 2.22x performance increase?
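The arithmetic behind that figure, as a quick sanity check:

```c
#include <assert.h>
#include <math.h>

/* Per-clock gain from dual-issue, times the clock-speed ratio. */
double combined_speedup(void)
{
    const double per_clock   = 1.85;           /* measured per-clock gain  */
    const double clock_ratio = 216.0 / 180.0;  /* M7 vs. M4F clock = 1.2x  */
    return per_clock * clock_ratio;            /* = 2.22                   */
}
```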
M7s are available at 600 MHz, and the one on my Teensy 4.0 seems perfectly happy (if a little hot) at 960 MHz, with perfect performance scaling up to that speed.
Well, that's with the hand-optimized code. I was sort of lamenting the fact that the Arm GCC compiler did quite a bad job compared to my handwritten assembly. I can't remember the exact figure, but yeah, it was in the 1.4x range. This was particularly disappointing because, from a compiler-algorithmic standpoint, what I did in the handwritten assembly to get the 1.85x speedup wasn't too complicated.
I started with a naive, unrolled implementation in assembly. The implementation used 29 floating-point registers (out of 32); having a few spare registers is key for the transformation I did. Knowing that the M7's FPU latency is 3 clocks, and knowing that I had equal numbers of loads and FPU ops, I cut and pasted all the loads from the naive implementation into a separate table, which would form the backbone of the fully optimized implementation. Loads can't dual-issue with other loads, so my goal was to pair each load with an FPU or ALU operation.

With the loads isolated from the FPU/ALU ops, I reordered the loads to minimize "latency skew" between two loads whose results would then be used by the same instruction. My original assembly code reused some registers, so I had to rename all the registers to temporary virtual names. Then I started adding FPU ops into the big table of loads, pairing them with loads and renaming the registers in the FPU ops for consistency.

The first few loads had to be unpaired, since I assumed a 3-clock latency between loading and using the result. I didn't measure load latency, so I chose 3 clocks to match the (known) FPU latency; I think the load latency from internal SRAM is actually two clocks. Were I working from external RAM, maybe I'd assume more load latency.

Because of the load and FPU latency, I had to space out dependent operations within one iteration of the unrolled loop and interleave them with independent operations from subsequent loop iterations. The FPU and load latency determines the interleave factor, and the number of operations per loop iteration determines the span of time over which a single iteration's result is calculated. Once I had everything written out, I renamed the registers from virtual names back to physical registers.
Since the computation of a single iteration's result was interleaved with the computation of the results of other iterations, a few more registers were required than I originally used in the naive assembly implementation.
Isn't this basic optimizing compiler stuff? You'd think it would do what I described automatically, since it's algorithmically pretty simple. I tried -funroll-all-loops, and it unrolled the loop, but it didn't interleave the iterations properly to hide the FPU latency. So yeah, I knew I could get almost 2x if I wrote the code myself; I'm just disappointed in the toolchain. And the 3-clock FPU latency is sort of inconvenient, though not a big deal, since you just have to use a wider interleave factor between independent loop iterations to compensate.

I would also add that I stuck this big unrolled loop in the middle of some C code as inline assembly. I chose the registers to use and, moreover, I specifically put a few instructions at the beginning and end which are not paired for dual-issue. These few unpaired instructions at the beginning and end are responsible for the speedup being only 1.85x instead of 2x. If only the compiler could do the optimization I described, maybe it could find some stuff in the preceding or subsequent C code that could be dual-issued with those unpaired instructions in my inline assembly.

Or maybe it could do the register allocation better. My optimized implementation doesn't use register s1, but that's caller-save in the usual calling convention. Maybe using s1, and avoiding one of the callee-save s16-s31 registers, would eliminate a push/pop in the surrounding function. Too many things to consider for such a small gain, and as you said, there's a 2.22x performance increase, so the problem is solved. Just disappointing that the compiler doesn't seem to know much about the latencies in the core, and that it doesn't try a slightly tricky optimization like the one I described.
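A rough model of how those unpaired instructions cap the speedup (the 8% figure is back-solved for illustration, not a measured count): if a fraction u of the instructions issue alone while the rest pair up two per cycle, the cycle count shrinks from 1 to (1 + u)/2, so the speedup is 2/(1 + u).

```c
#include <assert.h>
#include <math.h>

/* Dual-issue speedup when a fraction `unpaired_fraction` of the
   instruction stream can't be paired: cycles = (1 - u)/2 + u = (1 + u)/2,
   so speedup over single-issue = 2 / (1 + u). */
double dual_issue_speedup(double unpaired_fraction)
{
    return 2.0 / (1.0 + unpaired_fraction);
}
```

Plugging in u = 0 gives the ideal 2x, and roughly 8% unpaired instructions lands at about the 1.85x reported above.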