So what would bare metal programming gain me?
Not necessarily anything, but it depends.
You oversimplify the question a bit. There is actually more to this:
* Bare metal programming vs. using pre-made hardware abstraction libraries,
* Which hardware abstraction libraries to use, and do they help or hinder;
Then a completely separate question,
* Which tooling to use - only barebones GNU tools, or an added point&click layer on top?
Only now can we start to answer your question.
The main marketing argument for hardware abstraction layers is portability. Because there is no standardized microcontroller peripheral API, every HAL is totally different, and code needs manual porting between different MCU families anyway, this argument goes completely out the window.
The secondary argument, ease of development, has some merit to it. OTOH, your projects are very simple; for example, initializing a UART bare-metal on STM32 is three lines of code. I write it in 5 minutes from scratch, or copy-paste it from a previous project in 30 seconds. Either is fine. Looking up how the same is done in HAL from the HAL manuals, or copy-pasting from a previous project, or letting Cube autogenerate it, is pretty much exactly the same amount of work. No difference here.
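To give an idea of the scale involved, here is a minimal sketch of a register-level UART init. The register block `uart_regs_t` is a hypothetical stand-in (on real hardware you would use the USART definition from the vendor's device header), the bit positions assume the newer STM32 USART layout (older families such as F1/F4 place the enable bit elsewhere), and the RCC clock enable and pin alternate-function setup are left out. Check everything against your reference manual.

```c
#include <stdint.h>

/* Hypothetical stand-in for the USART register block: only the two
   registers this sketch touches, NOT at their real offsets. */
typedef struct {
    volatile uint32_t CR1;   /* control register 1 */
    volatile uint32_t BRR;   /* baud rate register  */
} uart_regs_t;

/* ASSUMPTION: bit positions of the newer STM32 USART (F0/F3/L4/F7...). */
#define UART_CR1_UE (1u << 0)   /* USART enable       */
#define UART_CR1_RE (1u << 2)   /* receiver enable    */
#define UART_CR1_TE (1u << 3)   /* transmitter enable */

static void uart_init(uart_regs_t *u, uint32_t pclk_hz, uint32_t baud)
{
    /* With 16x oversampling the divider is simply clock / baud, rounded. */
    u->BRR = (pclk_hz + baud / 2u) / baud;
    /* Enable the peripheral together with its transmitter and receiver. */
    u->CR1 = UART_CR1_UE | UART_CR1_TE | UART_CR1_RE;
}
```

Called as `uart_init(&regs, 48000000u, 115200u)`, this writes 417 to `BRR` and sets the three enable bits in `CR1`; the 48 MHz peripheral clock is just an example value.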
So all this isn't me being against Cube or HAL; all this is me being against the idea of Cube and HAL being somehow mandatory or extremely beneficial. Actually, just "keep it simple" and "avoid excess layers" is argument enough not to use them.
Now to the "what you gain" question. Not necessarily anything; that depends. I feel I gained my whole "career", if you could say so, but this is impossible to prove without an "alternative reality" to compare against.
Again, two sub-points:
* Being tool-agnostic (or limited to the minimum common denominator, the GNU tools) allows you to jump between different manufacturers quickly. No problem using NXP or Nordic Semiconductor in between STM32 work. All works the same! Everything's familiar! Same linker script syntax works, same initialization code works, even the way manuals are presented is familiar. No need to install anything on your computer!
* Being in total control of the peripherals becomes important right when you hit an edge case where the HAL simply is unable to do the job. For the very simplest jobs, this might not happen, but I would be surprised if you have not seen this yet. It's a very common occurrence if you read this forum, for example. And oh boy, people really struggle trying to achieve something with a library that it was not designed to do (even if the underlying peripheral supports it).
Examples of this are: making peripherals co-operate through existing on-chip routing; managing DMA buffers differently than the HAL designer thought; very fast ISRs (for example, just a week ago, I had to write an ISR-based state machine which processes data at a 555.5 kHz interrupt rate. There is simply no room for any bloat, not even 10 instructions. I need to be in control! Now the common way to deal with this today, as evidenced on this forum for example, is just to shout "no can do" and give up. I don't have this problem because of the choices I make. Adding a CPLD was considered. It was not needed, which simplified the BOM and also saved the work of writing VHDL. See how everything's connected? By having as much choice and "power" as possible, you limit yourself the least.)
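To put the 555.5 kHz figure in perspective, the cycle budget per interrupt can be worked out with simple arithmetic. The sketch below is only an illustration: the 72 MHz core clock and the 24-cycle entry/exit overhead used in the usage note are assumed values, not numbers from the project described above, and the helper names are hypothetical.

```c
#include <stdint.h>

/* CPU cycles available between two consecutive interrupts (rounded down). */
static uint32_t cycles_per_interrupt(uint32_t cpu_hz, uint32_t irq_hz)
{
    return cpu_hz / irq_hz;
}

/* Cycles left for the handler body after a fixed entry/exit overhead.
   Returns 0 if the overhead alone already eats the whole period. */
static uint32_t handler_budget(uint32_t cpu_hz, uint32_t irq_hz,
                               uint32_t overhead_cycles)
{
    uint32_t total = cycles_per_interrupt(cpu_hz, irq_hz);
    return (total > overhead_cycles) ? (total - overhead_cycles) : 0u;
}
```

With an assumed 72 MHz clock, `cycles_per_interrupt(72000000u, 555500u)` gives 129 cycles, and with an assumed 24 cycles of overhead `handler_budget` leaves 105. Whatever the handler does not use is all the rest of the program gets, which is why a few extra instructions per interrupt matter at this rate.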
Any suggestions for the "best" site to explain how to set up a bare metal project for STM32? What tools (IDE, compiler, libraries) would you use? My guess is the compiler would have to be GCC. But how do you tie everything together?
Sad part is lack of tutorials.
The quick answer is: install the gcc-arm-none-eabi package. The IDE is your free choice. I use a text editor and makefiles. You don't even need to use makefiles if they seem difficult; compilation really can be done with a one-liner command. No libraries, except for the standard C library and special needs (like filesystems etc.).
Tie everything together? There is not much to "tie", so to speak. These are actually simple devices! All you need is startup code, a linker script, and the actual code; then run gcc to produce an .elf you can flash. Maybe objcopy for conversion of the output file format. To write peripheral code, you look at the reference manual and register descriptions, exactly like you did on PIC or AVR.
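As a rough illustration of how little startup code is needed, here is a minimal sketch in C. The linker symbol names (`_sidata`, `_sdata`, `_edata`, `_sbss`, `_ebss`, `_estack`), the `.isr_vector` section name and the `BARE_METAL_TARGET` macro are all assumptions made for this example; they have to match whatever your own linker script and build command define. A real vector table would also list the other exception and interrupt handlers.

```c
#include <stdint.h>

/* Copy words from src into [dst, dst_end): used for .data (flash -> RAM). */
static void copy_words(uint32_t *dst, const uint32_t *dst_end,
                       const uint32_t *src)
{
    while (dst < dst_end)
        *dst++ = *src++;
}

/* Zero the words in [dst, dst_end): used for .bss. */
static void zero_words(uint32_t *dst, const uint32_t *dst_end)
{
    while (dst < dst_end)
        *dst++ = 0u;
}

/* Hypothetical macro: define it only when building for the MCU, so the
   two helpers above can still be compiled and tried out on a PC. */
#ifdef BARE_METAL_TARGET

/* ASSUMED symbol names, provided by the linker script. */
extern uint32_t _sidata, _sdata, _edata, _sbss, _ebss, _estack;

int main(void);

void Reset_Handler(void)
{
    copy_words(&_sdata, &_edata, &_sidata);  /* initialized variables */
    zero_words(&_sbss, &_ebss);              /* zero-initialized variables */
    main();
    for (;;) { }                             /* main should never return */
}

/* Minimal vector table: initial stack pointer, then the reset vector. */
__attribute__((section(".isr_vector"), used))
void (*const vector_table[])(void) = {
    (void (*)(void))&_estack,
    Reset_Handler,
};

#endif /* BARE_METAL_TARGET */
```

The one-liner build could then look something like `arm-none-eabi-gcc -mcpu=cortex-m4 -mthumb -nostartfiles -DBARE_METAL_TARGET -T link.ld startup.c main.c -o app.elf`, where the CPU flag and file names are placeholders for your own part and project.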
But this is just the big picture. Small details always take the most time. There, a proper tutorial would be helpful. As much as I'd like to write it, I'm not too great at doing that; also, it takes a lot of time and effort to teach something, even if it feels completely trivial to oneself.