A push-pull output stage helps a lot, as it keeps the output stage from dropping into a slow low-current mode. The improvement might not be that large for the transient test shown, but without a two-quadrant output stage there are essentially always troublesome cases.
It also helps to take the fast local feedback at the transistor level not directly from the output, but from a point separated from the output by a low-inductance resistor. This could be the emitter resistors used for load sharing / bias current adjustment. This way the local loop can be faster, as it does not see a possibly extreme, purely capacitive load.
For the output capacitance, a combination of capacitance with very low ESR and some with noticeable ESR is good: the low-ESR part handles the very fast transients (e.g. << 1 µs), and the part with ESR helps with the stability of the loop. The capacitance is essentially the only thing that limits the voltage drop until the transistors can respond.
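As a rough illustration of that last point, the drop before the regulator reacts can be estimated as the ESR step plus the charge pulled out of the capacitor. This is only a sketch: the 1 A load step, 1 µs response delay, 10 µF capacitance and 5 mΩ ESR are assumed values for illustration, not taken from the circuit discussed, and the function name is made up.

```python
def droop_before_response(i_step, t_delay, c_out, esr):
    """Estimate the output voltage drop (in volts) while only the
    output capacitor supplies a load step.

    i_step  : load current step in A (assumed value)
    t_delay : time until the output stage responds, in s (assumed value)
    c_out   : low-ESR output capacitance in F (assumed value)
    esr     : equivalent series resistance in ohm (assumed value)
    """
    v_esr = i_step * esr                    # immediate step across the ESR
    v_discharge = i_step * t_delay / c_out  # dV = I * dt / C
    return v_esr + v_discharge


# Hypothetical example: 1 A step, 1 us delay, 10 uF with 5 mOhm ESR
drop = droop_before_response(1.0, 1e-6, 10e-6, 5e-3)
print(f"estimated drop: {drop * 1e3:.0f} mV")
```

With these assumed numbers the discharge term (about 100 mV) dominates the ESR term (5 mV), so a faster output stage or more capacitance are the only ways to reduce the drop.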
A fast output transistor also really helps - there are fast audio-grade BJTs, like the 2SC5200, about 10 times faster than the 2N3055. However, parasitic inductance may become the limit before the transistor does, so there are limits to what is possible in real life. Low impedance at high frequency means that even small parasitic inductance (e.g. tens of nH) can be important.
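To put a number on the inductance point, here is a small sketch of the reactance of a wiring inductance and the voltage it develops during a fast current step. The 20 nH, 1 MHz, and 1 A in 100 ns figures are assumptions chosen for illustration, and the function names are hypothetical.

```python
import math


def inductive_reactance(l_henry, f_hz):
    """Magnitude of the impedance of an ideal inductor: |Z| = 2*pi*f*L."""
    return 2 * math.pi * f_hz * l_henry


def step_voltage(l_henry, di_amp, dt_s):
    """Voltage across an inductor for a linear current ramp: V = L * di/dt."""
    return l_henry * di_amp / dt_s


# Hypothetical example values: 20 nH of parasitic inductance
z = inductive_reactance(20e-9, 1e6)   # reactance at 1 MHz
v = step_voltage(20e-9, 1.0, 100e-9)  # 1 A load step within 100 ns

print(f"|Z| at 1 MHz: {z * 1e3:.0f} mOhm")
print(f"voltage during step: {v * 1e3:.0f} mV")
```

Under these assumptions, 20 nH already amounts to roughly 0.13 Ω at 1 MHz and about 0.2 V during the step, which is large compared with the milliohm-level output impedance one would otherwise aim for.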
For the simulation I would start with just one transistor and add the parallel ones later - it is less confusing and easier to change.