40MHz input with the 20MHz bandwidth limit enabled just shows a reduced amplitude 40MHz signal.
No anti-aliasing though, as a 10MHz input will show (and measure) 1Hz if you slow the timebase down enough.
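To put numbers on that, here is a minimal sketch of how an ideal sampler with no anti-aliasing filter folds an input tone down to an apparent frequency. The function name `alias_frequency` and the sample rates are made up for illustration; they are not taken from any particular scope.

```python
def alias_frequency(f_in_hz, f_s_hz):
    """Apparent frequency of a tone at f_in_hz when sampled at f_s_hz.

    Assumes an ideal sampler with no filtering in front of it.
    Integer Hz values are used to avoid floating-point rounding.
    """
    folded = f_in_hz % f_s_hz             # fold into [0, f_s)
    return min(folded, f_s_hz - folded)   # reflect into [0, f_s/2]


# A tone 1 Hz above 10 MHz, sampled at a hypothetical slow-timebase rate of 1 kSa/s
print(alias_frequency(10_000_001, 1_000))

# The same function confirms that 40 MHz sampled at 1 GSa/s is not aliased
print(alias_frequency(40_000_000, 1_000_000_000))
```

The displayed frequency depends only on how far the input sits from the nearest multiple of the sample rate, which is why a small offset between the generator and the scope clock can end up reading as 1Hz.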
The filter has to be done in hardware. It would break all sorts of things if it was in software.
One needs some anti-aliasing filter in hardware. However, it does not have to be matched to a reduced data rate or be the standard 20 MHz limit. Only aliasing at the actual sampling rate of the ADC has to be avoided. The conversion rate of the ADC and the rate at which samples are stored don't have to be the same. It is quite possible to keep the ADC running at its nominal speed (e.g. 2 or 4 GSa/s) and do the filtering that prevents aliasing at the lower data rate digitally on the high-speed data, together with decimation. This does need quite some computational power, however (it could be part of the ADC chip or in the main FPGA), so it is not clear if and when this is used. Some scopes offer this as averaging over consecutive samples in an extended resolution mode. So it could depend on how the reduced data rate (shown as samples per second) is set. There may be a mode with aliasing and a different setting without aliasing.
If they include digital filtering, they could just as well use it for the 20 MHz limit and maybe for the limited-bandwidth versions too.
Tektronix does it that way on their older DSOs and MSOs. The bandwidth limit function includes both hardware and DSP options, and they perform slightly differently. As far as I know, the software bandwidth limiting occurs during decimation, so the digitizers always run at full speed, which is normal for all DSOs. Filtering before decimation in real time requires considerable hardware computing resources, but that is getting cheaper all the time.
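As a rough illustration of the difference between the two modes, the sketch below decimates the same full-rate data twice: once by simply keeping every M-th sample, and once by averaging each group of M consecutive samples first. The block average stands in for whatever digital filter a real scope might use, and the helper names, the decimation factor and the tone frequency are all assumptions chosen for the example.

```python
import math


def decimate_pick(samples, m):
    """Keep every m-th sample: no filtering, so out-of-band tones alias."""
    return samples[::m]


def decimate_average(samples, m):
    """Average each block of m consecutive samples, then keep one value per block."""
    blocks = len(samples) // m
    return [sum(samples[i * m:(i + 1) * m]) / m for i in range(blocks)]


def peak(samples):
    return max(abs(s) for s in samples)


M = 10           # hypothetical decimation factor
F_TONE = 0.101   # cycles per ADC sample, far above the decimated Nyquist of 0.05

adc = [math.sin(2 * math.pi * F_TONE * n) for n in range(1000)]

picked = decimate_pick(adc, M)
averaged = decimate_average(adc, M)

print("peak after plain decimation:   ", round(peak(picked), 3))
print("peak after averaged decimation:", round(peak(averaged), 3))
```

With plain decimation the tone reappears at nearly full amplitude as a slow alias, while the averaged version suppresses it heavily here because the tone sits close to a null of the 10-sample average. A tone elsewhere in the stop band would be attenuated much less, so this is a favourable case for the simple average.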
Doing the filtering in DSP before decimation is a variation of the "high resolution" mode available on many DSOs. High resolution mode is just boxcar averaging, which is trivial to implement with shifts and adds at high speed but makes for a poorly performing filter. A finite impulse response (FIR) bandwidth filter requires multiply-accumulate operations. Individual multipliers are not usually fast enough even now, but a parallel implementation of an FIR filter solves this.
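To show what "poorly performing" means in practice, here is a small sketch comparing the stop-band rejection of a boxcar average with that of a generic Hamming-windowed sinc low-pass. The tap counts, the cutoff and the test frequency are arbitrary assumptions, the two filters are deliberately not the same length, and nothing here is meant to represent the filter any specific scope uses.

```python
import cmath
import math


def boxcar(n):
    """n-point moving average; with n a power of two the divide is just a shift."""
    return [1.0 / n] * n


def windowed_sinc(n, cutoff):
    """Hamming-windowed sinc low-pass. cutoff is in cycles per sample (0 to 0.5)."""
    mid = (n - 1) / 2
    taps = []
    for i in range(n):
        x = i - mid
        if x == 0:
            ideal = 2 * cutoff
        else:
            ideal = math.sin(2 * math.pi * cutoff * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1))
        taps.append(ideal * window)
    total = sum(taps)
    return [t / total for t in taps]  # normalise to unity gain at DC


def gain_db(taps, f):
    """Magnitude response in dB at frequency f (cycles per sample)."""
    h = sum(t * cmath.exp(-2j * math.pi * f * k) for k, t in enumerate(taps))
    return 20 * math.log10(max(abs(h), 1e-12))


F_TEST = 1.5 / 16  # centre of the first sidelobe of a 16-point boxcar

box = boxcar(16)
fir = windowed_sinc(63, 0.03)

print("boxcar gain at test frequency: %.1f dB" % gain_db(box, F_TEST))
print("FIR gain at test frequency:    %.1f dB" % gain_db(fir, F_TEST))
```

The boxcar needs only adds and a shift, but its first sidelobe is only about 13 dB down. The windowed sinc buys far better rejection at the cost of one multiply-accumulate per tap per sample, which is the work a parallel implementation has to spread across several multipliers.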