The ST library is exceptionally wasteful in code size, and therefore also in speed.
For example, take HAL_RCC_OscConfig from STM32Cube. This is a 315-line super-function that takes a pointer to a large struct. The struct contains a bit mask that selects which oscillators to configure (HSE/HSI/LSE/LSI/PLL), followed by the configuration variables for all of the oscillators. The function checks the bits in the mask and splits into 5 conditional branches. As a project usually does not use all of the oscillators, the user is left with lots of dead code in the branches that never get executed. On top of that, the configuration struct carries variables that are never read. The function is usually called once, with constant input, at the beginning of the program, so most of that dead code could be avoided if the input were known at compile time. For a compiler to clean up this mess, link-time optimization is required, and even then the code makes it especially difficult for the compiler to find the dead code and variables and eliminate them. Why isn't this thing split into separate functions, one for each oscillator? It would be much better not to introduce dead code at all than to hope for some compiler magic to clean it up (unless you are trying to sell a specific compiler that happens to do that?). This style is OK for a general-purpose library, but not for an embedded system.
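To make the "one function per oscillator" idea concrete, here is a minimal sketch. Everything in it is hypothetical: the clk_regs_t layout, the bit positions and the function names are made up for illustration and do not match the real RCC registers of any STM32 part or any ST API. The point is only the shape: each oscillator gets its own small function, so a project that never calls clk_pll_enable does not have to carry it (for example, GCC's -ffunction-sections together with the linker's --gc-sections can drop unreferenced functions without LTO).

```c
#include <stdint.h>

/* Hypothetical, simplified clock-control register block. The layout and
 * bit positions are illustrative only, not those of a real STM32 RCC. */
typedef struct {
    volatile uint32_t CR;    /* enable and ready bits */
    volatile uint32_t CFGR;  /* PLL multiplier field  */
} clk_regs_t;

#define CLK_CR_HSEON         (1u << 0)
#define CLK_CR_HSERDY        (1u << 1)
#define CLK_CR_PLLON         (1u << 2)
#define CLK_CR_PLLRDY        (1u << 3)
#define CLK_CFGR_PLLMUL_POS  4u
#define CLK_CFGR_PLLMUL_MASK (0xFu << CLK_CFGR_PLLMUL_POS)

/* Poll CR until all bits in `mask` are set; give up after `tries` polls.
 * Returns 0 on success, -1 on timeout. */
static int clk_wait_set(const clk_regs_t *clk, uint32_t mask, uint32_t tries)
{
    while (tries-- > 0u) {
        if ((clk->CR & mask) == mask) {
            return 0;
        }
    }
    return -1;
}

/* Enable only the external oscillator; no other oscillator is touched. */
int clk_hse_enable(clk_regs_t *clk, uint32_t tries)
{
    clk->CR |= CLK_CR_HSEON;
    return clk_wait_set(clk, CLK_CR_HSERDY, tries);
}

/* Program the multiplier field and enable only the PLL. */
int clk_pll_enable(clk_regs_t *clk, uint32_t mul_field, uint32_t tries)
{
    clk->CFGR = (clk->CFGR & ~CLK_CFGR_PLLMUL_MASK)
              | ((mul_field << CLK_CFGR_PLLMUL_POS) & CLK_CFGR_PLLMUL_MASK);
    clk->CR |= CLK_CR_PLLON;
    return clk_wait_set(clk, CLK_CR_PLLRDY, tries);
}
```

A project that runs from HSE+PLL would call clk_hse_enable and then clk_pll_enable on its register block; a project that only needs HSE calls the first one and never references the second. No config struct with unused fields is involved, and each function takes only the parameters it actually reads.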
Another example is HAL_GPIO_WritePin and similar functions that live in .c files. When the target state is constant and known at compile time (not supplied dynamically in a variable), the whole task is to write a 32-bit value to a 32-bit register. In the ST library, you get a function that takes a pointer to the GPIO register base, a bit mask that may have only one bit set, and a boolean. It is compiled into another object file, so the compiler cannot inline and optimize it. Again, unless link-time optimization is used, what should cost at worst 4 bytes for the register address, 4 bytes for the register value (or even less with shorter values) and a few bytes for the write itself now needs one more constant and many more cycles to execute.
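As a rough sketch of the alternative, the write can live in a header as a static inline function, so the compiler sees it at every call site. The gpio_regs_t type and the gpio_* names below are hypothetical and reduced to a single register so the example stands alone; the only hardware fact assumed is the usual STM32 BSRR behaviour, where writing bits 0..15 sets pins and writing bits 16..31 resets them.

```c
#include <stdint.h>

/* Hypothetical stand-in for a GPIO port, reduced to the bit set/reset
 * register. On real hardware this would be the vendor header's port
 * struct at a fixed address. */
typedef struct {
    volatile uint32_t BSRR;  /* bits 0..15 set pins, bits 16..31 reset pins */
} gpio_regs_t;

/* Drive the selected pins high: one store, no read-modify-write. */
static inline void gpio_set(gpio_regs_t *port, uint16_t pins)
{
    port->BSRR = pins;
}

/* Drive the selected pins low via the upper half of BSRR. */
static inline void gpio_clear(gpio_regs_t *port, uint16_t pins)
{
    port->BSRR = (uint32_t)pins << 16;
}

/* Generic form. When `high` is a compile-time constant the compiler can
 * fold the conditional away and emit the same single store as above. */
static inline void gpio_write(gpio_regs_t *port, uint16_t pins, int high)
{
    port->BSRR = high ? (uint32_t)pins : ((uint32_t)pins << 16);
}
```

With a constant port address and a constant pin mask, a call such as gpio_set(port, 1u << 5) can reduce to loading two constants and one store, which is the cost described above, without relying on link-time optimization.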
So, the ST libraries simply waste resources. It is quite easy for non-beginners to write their own code and save huge amounts of code space and execution time. The waste in this case is not a problem of ARM CPUs or of 32-bit microcontrollers in general, but just a bad implementation of the library.
One may argue that writing your own code takes a lot more time. But once you add up the time spent learning the non-obvious APIs, the time spent debugging issues that those APIs cause (why does reconfiguring the sysclock source to HSE+PLL need to magically set up a periodic SysTick timer interrupt that was not used? did I ask for it?) and the time spent debugging performance issues later on, it can be wise to avoid the bad parts of such libraries and use your own code instead.