It will work because, being burdened by a lot of extra* code, it gives the clock enough time to complete its transition before software has a chance to deassert NSS.
Also, it uses BSY internally, against the RM recommendation, and adds an extra 1 ms timeout to cater for the BSY bug (where the flag can stay high at the end of a transaction).
This kind of "fail every now and then by unexpectedly hanging around for an extra 1 ms because we write buggy shit" is disastrous to any project that is sensitive to timing.
Obviously the STM32 HAL cannot be used when the timing of SPI communication, or product reliability, is critical at all.
It is quite difficult to completely abstract SPI communications as high-level software calls. This is because:
1) SPI devices have such varied and weird timing requirements,
2) projects are so vastly different; some need cycle-accurate timing to toggle nCS (e.g., to avoid ADC sampling jitter), some need burst readouts, some need to deal with data instantaneously, others collect it into long buffers, and many deal with multiple slaves on a single bus,
3) some projects are really tight on performance. For example, you might have multiple SPI devices on a 20 MHz SPI bus, where each device requires its own software nCS, the CPU is running at only, say, 48 MHz, and the data needs to be dealt with and pushed further somewhere else. Maybe you can deal with 20 cycles of ISR latency, but you cannot deal with 20 more cycles wasted on an inefficient library call, or worse, 48000 cycles on a failed attempt to work around a bug, when better workarounds exist which are suitable for your project but not easy to write as generic library calls (see my above "use a timer" approach which sidesteps all SPI peripheral HW bugs).
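To put the numbers from point 3 side by side, here is a small arithmetic sketch. The 48 MHz, 20 MHz and 1 ms figures come from the text above; the function names are made up for illustration, and the result is only a rough budget that ignores flash wait states, bus contention and so on.

```c
#include <stdint.h>

/* CPU cycles that elapse during a time span given in microseconds. */
uint32_t cycles_in_us(uint32_t cpu_hz, uint32_t us)
{
    return (uint32_t)(((uint64_t)cpu_hz * us) / 1000000u);
}

/* CPU cycles available while `bits` bits are clocked out on the SPI bus,
 * rounded down. This is the whole budget for handling one transfer
 * if the bus is to be kept busy back to back. */
uint32_t cycles_per_transfer(uint32_t cpu_hz, uint32_t spi_hz, uint32_t bits)
{
    return (uint32_t)(((uint64_t)cpu_hz * bits) / spi_hz);
}
```

With these, cycles_in_us(48000000, 1000) gives the 48000 cycles burned by a 1 ms timeout, while cycles_per_transfer(48000000, 20000000, 8) gives 19 cycles per byte on the wire. In other words, 20 cycles of library overhead is already a full byte time, and one spurious timeout costs roughly 2500 byte times.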
It is not a sin to NOT abstract a peripheral into its own module in an embedded MCU project, if this ad-hoc strategy enables good performance, relative ease of proving (or at the very least testing) worst-case behavior, and reliability. After all, the peripheral register accesses are Just a Few Lines of Code(TM), so there is not much to be gained by abstraction and code reuse, and there is a lot to lose.
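To give an idea of how few lines that is, here is a rough sketch of a blocking one-byte transfer with software nCS. Everything in it is an assumption for illustration: mock_spi_t and mock_gpio_t are stand-in structs rather than a real register layout, and the flag bit positions are the ones commonly documented for the STM32 SPI status register, so check your own RM. The point at which nCS is released is only a placeholder; whether RXNE is a safe point on a given part is exactly what this thread is about.

```c
#include <stdint.h>

/* Hypothetical stand-ins for memory-mapped register blocks.
 * On real hardware these would be the vendor's register definitions. */
typedef struct {
    volatile uint32_t SR;   /* status register */
    volatile uint32_t DR;   /* data register */
} mock_spi_t;

typedef struct {
    volatile uint32_t ODR;  /* output data register */
} mock_gpio_t;

/* Assumed status flag positions; verify against the reference manual. */
#define SR_RXNE (1u << 0)   /* receive buffer not empty */
#define SR_TXE  (1u << 1)   /* transmit buffer empty */

/* Blocking transfer of one byte with a software-driven, active-low nCS. */
uint8_t spi_xfer_byte(mock_spi_t *spi, mock_gpio_t *gpio,
                      uint32_t ncs_mask, uint8_t tx)
{
    gpio->ODR &= ~ncs_mask;             /* assert nCS */
    while (!(spi->SR & SR_TXE)) { }     /* wait for room in the TX buffer */
    spi->DR = tx;                       /* writing DR starts the transfer */
    while (!(spi->SR & SR_RXNE)) { }    /* wait for the received byte */
    uint8_t rx = (uint8_t)spi->DR;      /* read the received byte */
    gpio->ODR |= ncs_mask;              /* release nCS (placeholder timing) */
    return rx;
}
```

On a host, the mock structs are plain memory, so with TXE and RXNE preset the function simply reads back the byte it wrote. On a target you would replace the release of nCS with whatever end-of-transfer strategy suits your part and project.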
Given this thread, OP is already beyond the training-wheel HAL level. They already know how the SPI works internally, and how it doesn't work, and they also already know why their code does not work, and why the HAL does not work either: it makes the same mistakes. The only way is up.