Unfortunately I didn't know where to look for the stacked value of LR. But the key was the realisation that the text of an assert message was within the RTOS stack area, i.e. that message was being output. I should have searched the project for that message. FWIW, this one (an overflow of a 128 byte debug facility buffer) has bitten me before, and I fixed it yesterday by using vsnprintf(buf, sizeof(buf)-1, ...) instead of the previous vsprintf, which was written about 3 years ago. So this one won't happen again.
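For anyone hitting the same thing, here is a minimal sketch of the bounded version. The names debug_buf, DEBUG_BUF_SIZE and debug_format are made up for illustration and are not from my project; only the 128 byte size is real. Note that vsnprintf's size argument already includes the terminating NUL, so passing sizeof(buf) is enough; the -1 is harmless and just wastes one byte.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

#define DEBUG_BUF_SIZE 128          /* size from the post; the name is hypothetical */

static char debug_buf[DEBUG_BUF_SIZE];

/* Format into the fixed debug buffer without ever writing past its end.
 * Returns the number of characters actually stored (excluding the NUL). */
int debug_format(const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(fmt != NULL);

    va_start(ap, fmt);
    /* vsnprintf writes at most sizeof(debug_buf) bytes, NUL included, and
     * returns the length the complete output would have had. */
    n = vsnprintf(debug_buf, sizeof(debug_buf), fmt, ap);
    va_end(ap);

    if (n < 0) {                            /* formatting error: store an empty string */
        debug_buf[0] = '\0';
        return 0;
    }
    if ((size_t)n >= sizeof(debug_buf)) {   /* output was truncated */
        n = (int)sizeof(debug_buf) - 1;
    }
    return n;
}
```

Comparing the return value against the buffer size is also a cheap way to detect truncation, so an over-long debug message can be flagged instead of silently trashing whatever sits after the buffer.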
The code has now been running for an unusually long time, and the breakpoint on the assert (which tells me that malloc returned NULL) has not been hit. Time will tell...
Some candidates:
The ST USB MSC and CDC code used a malloc at the start, and it was assumed these blocks would never be freed (a legitimate use of the heap in embedded). But they do get freed, because there is a "DeInit" function (for a bizarre reason ST produced a DeInit version of almost every hardware init function) which gets called if there is a break in the USB connection! And USB has a habit of going to sleep (on the windoze end). So one would have been getting lots of fragmentation due to this. Stupid stupid code by ST - this is supposed to be an "embedded" system! And it would vary according to the USB controller; the one at work is probably better behaved, and so the board running on my desk there runs for much longer (I work at work and at home; same project). The two blocks, around 600 and 200 bytes, have been replaced with static buffers.
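Roughly what the static replacement looks like, as a sketch only. The function and macro names here are hypothetical, and the sizes are just the "around 600 and 200 bytes" figures rounded. Some versions of the ST middleware route their allocations through macros in usbd_conf.h, in which case functions like these can be plugged in there; check your own copy rather than taking my word for it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Approximate sizes from the post; all names below are hypothetical. */
#define MSC_BLOCK_BYTES 600u
#define CDC_BLOCK_BYTES 200u

/* uint32_t arrays so the blocks are word aligned, as malloc'd memory would be. */
static uint32_t msc_block[MSC_BLOCK_BYTES / 4u];
static uint32_t cdc_block[CDC_BLOCK_BYTES / 4u];

/* Hand out the one fixed MSC block; refuse requests that do not fit. */
void *msc_static_alloc(size_t size)
{
    return (size <= sizeof(msc_block)) ? (void *)msc_block : NULL;
}

/* Same idea for the CDC block. */
void *cdc_static_alloc(size_t size)
{
    return (size <= sizeof(cdc_block)) ? (void *)cdc_block : NULL;
}

/* "Freeing" is a no-op: the storage is permanent, so repeated
 * Init/DeInit cycles cannot fragment anything. */
void usb_static_free(void *p)
{
    (void)p;
}
```

Every Init after a DeInit gets the same address back, so a flaky USB host can disconnect and reconnect as often as it likes without touching the heap.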
I have over time checked code for malloc use and removed these, but in some cases it is simply not possible e.g. MbedTLS uses a lot of it, as does FreeRTOS, but these use a private heap. TLS chucks away the whole heap when a connection is closed so that should be ok, but you never know... FR seems to allocate and never unallocate.
The ETH global interrupt was not used but was left enabled. The ISR (in the ST HAL stuff) was not pointed to by a vector, so it got optimised away, and a callback routine in ethernetif.c (which was called only by the ISR) got optimised away too. I found this through some bizarre breakpoint behaviour. Try setting a breakpoint on code which the compiler later removes... No evidence that the interrupt was actually being generated though.
The printf family is still not mutex-protected, but commenting out a %7.3f use in an RTOS task has not helped. I am working on this but it is messy:
https://www.eevblog.com/forum/programming/st-cube-gcc-how-to-override-a-function-not-defined-as-weak-(no-sources)/
Last night I was tracing through some long and float printf code. It calls __retarget_lock_acquire_recursive and __retarget_lock_release_recursive (with r0=0, which I think is the function parameter, in this case a handle; but does a recursive mutex need an individual handle?), and I no longer saw the calls to malloc() which I saw before. But then I did add -u _printf_float to the linker options to remove a warning on the use of a float printf (which was emitted even though it did actually work; I don't understand that). Now I should be using the Standard C library with the newlib-nano unchecked. Maybe it does a malloc only in some more complicated cases. Also I notice that it does not do any sort of mutex initialisation; maybe that is not needed for recursive mutexes?
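On the handle question in general (not specific to what newlib does internally, which I have not verified): a recursive mutex has to remember who owns it and how many times the owner has taken it, so each independent lock normally needs its own state. A toy model, with made-up names and plain integers standing in for task IDs, just to show the bookkeeping:

```c
#include <assert.h>
#include <stddef.h>

#define NO_OWNER (-1)

/* Per-lock state: this is what a "handle" would point at. Hypothetical type. */
typedef struct {
    int      owner;   /* task id of the holder, NO_OWNER when free */
    unsigned depth;   /* how many times the owner has taken the lock */
} rec_lock_t;

/* Static initialiser: a lock set up this way needs no runtime init call. */
#define REC_LOCK_INIT { NO_OWNER, 0u }

/* Returns 1 if 'task' now holds the lock, 0 if another task holds it.
 * NOT thread-safe: a real one must do this atomically or use the RTOS. */
int rec_lock_try_acquire(rec_lock_t *lock, int task)
{
    assert(lock != NULL);
    if (lock->depth == 0u) {            /* free: take it */
        lock->owner = task;
        lock->depth = 1u;
        return 1;
    }
    if (lock->owner == task) {          /* re-entry by the owner */
        lock->depth++;
        return 1;
    }
    return 0;                           /* held by someone else */
}

/* Returns 1 on success, 0 if 'task' does not hold the lock. */
int rec_lock_release(rec_lock_t *lock, int task)
{
    assert(lock != NULL);
    if (lock->depth == 0u || lock->owner != task) {
        return 0;
    }
    if (--lock->depth == 0u) {          /* last release frees the lock */
        lock->owner = NO_OWNER;
    }
    return 1;
}
```

In a real build the body would sit on top of the RTOS primitive (FreeRTOS has xSemaphoreCreateRecursiveMutex, xSemaphoreTakeRecursive and xSemaphoreGiveRecursive). The static initialiser shows one way a lock can exist without any visible init call, though whether that is what is happening in my trace I can't say.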
Thank you all