Phew... Python skills... I sort of know how it works, but I don't really like it. I'm more of a Java guy. Your C rewrite might be easier for me to read.
Do you think you can get that much data across USB at a constant data rate? Couldn't some mouse movement or something similar come in between?
Yeah, there's going to be some tricky code in there for the time being, because we're getting close to the limits of performance on CPython (and threading is unfortunately sort of nerfed in CPython anyway because of the global interpreter lock, which makes life even harder). I am going to try to roll a new oscilloscope program, so you might want to watch out for that; I am debating how much, if any, needs to be done in C (I really need a high-speed ring buffer).
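To give an idea of what I mean by a ring buffer, here is a minimal pure-Python sketch. It is not the code from the driver or the planned program; the class name `ByteRingBuffer` and its methods are made up for illustration, and it assumes raw samples arrive as bytes and that overwriting the oldest data when full is acceptable.

```python
class ByteRingBuffer:
    """Fixed-size byte buffer that overwrites the oldest data when full.

    Hypothetical illustration only; names are not from any real driver.
    """

    def __init__(self, capacity):
        self._buf = bytearray(capacity)
        self._capacity = capacity
        self._write = 0  # index where the next byte will be written
        self._size = 0   # number of valid bytes currently stored

    def write(self, data):
        """Append a block of bytes, wrapping around at the end."""
        data = memoryview(data)
        if len(data) >= self._capacity:
            # The block alone fills the buffer: keep only its newest bytes.
            self._buf[:] = data[-self._capacity:]
            self._write = 0
            self._size = self._capacity
            return
        end = self._write + len(data)
        if end <= self._capacity:
            self._buf[self._write:end] = data
        else:
            # Split the block into a tail part and a wrapped head part.
            first = self._capacity - self._write
            self._buf[self._write:] = data[:first]
            self._buf[:end - self._capacity] = data[first:]
        self._write = end % self._capacity
        self._size = min(self._capacity, self._size + len(data))

    def latest(self, n):
        """Return the newest n bytes (fewer if less data is stored)."""
        n = min(n, self._size)
        start = (self._write - n) % self._capacity
        if start + n <= self._capacity:
            return bytes(self._buf[start:start + n])
        wrapped = (start + n) - self._capacity
        return bytes(self._buf[start:]) + bytes(self._buf[:wrapped])


rb = ByteRingBuffer(8)
rb.write(b"abcde")
rb.write(b"fghij")      # wraps around and overwrites "ab"
print(rb.latest(4))     # b'ghij'
```

The slice assignments copy whole blocks at once rather than byte by byte, which is about the best plain CPython can do; whether that is fast enough at 30 MSa/s is exactly the open question.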
As far as how much data can be moved, the big thing is that you have to make absolutely sure the device is on a bus by itself. You can do this by running lsusb on Linux (not sure what the equivalent for Mac or Windows is) and checking that no other devices are on the same bus. As far as moving data, USB 2.0 caps out at about 30 MB/s in practice, so that really sets an upper limit. With the help of Jochen Hoenicke, we've already changed the firmware and driver in a few critical ways that make them significantly more performant than what is available stock. First, Jochen has removed dead code from the firmware and added an option to pull data from only one channel (by default it always pulls both); this allows you to spend all your USB bandwidth on getting more samples for a single channel. Jochen has also added more sampling modes; the default maximum was 48 MSa/s, which isn't realizable with USB 2.0 and produces super dirty signals. Jochen has added a new 30 MSa/s mode, which in my testing is way cleaner, and I have been able to stream at 30 MSa/s with less than 2% data loss.
On the driver side, I pulled a lot of the stupid Hantek behavior out; the original Hantek driver tries to clear the FIFO on every read, which means you will always lose data and waste USB bandwidth. The stock driver also did everything synchronously and did not queue up bulk transfers, leading to more data loss and poorer use of bandwidth. I've got an async implementation that makes these changes, and I can get really good-looking traces at reasonably high speed (30 MSa/s); ideally, with a better C implementation or some clever use of multiprocessing, I will be able to realize this performance in an application.
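The bandwidth argument above can be checked with some quick arithmetic. This is only a rough sketch: it assumes each sample is one byte (8-bit ADC data, which is my assumption here and not stated above), takes the 30 MB/s figure as the usable ceiling, and ignores protocol overhead. The function names are made up for illustration.

```python
USB_BYTES_PER_SEC = 30e6  # practical USB 2.0 ceiling quoted above
BYTES_PER_SAMPLE = 1      # ASSUMPTION: 8-bit samples, one byte each


def required_bandwidth(sample_rate, channels):
    """Bytes per second needed to stream every sample from every channel."""
    return sample_rate * channels * BYTES_PER_SAMPLE


def fits_on_bus(sample_rate, channels):
    """True if the stream fits within the assumed USB ceiling."""
    return required_bandwidth(sample_rate, channels) <= USB_BYTES_PER_SEC


def max_sample_rate(channels):
    """Highest per-channel sample rate the assumed ceiling allows."""
    return USB_BYTES_PER_SEC / (channels * BYTES_PER_SAMPLE)


for rate, channels in [(48e6, 1), (30e6, 1), (30e6, 2)]:
    print(rate, channels, fits_on_bus(rate, channels))
print(max_sample_rate(2))
```

Under these assumptions, 48 MSa/s on one channel needs 48 MB/s and cannot fit, 30 MSa/s on one channel sits right at the limit, and pulling both channels halves the per-channel rate to 15 MSa/s, which is why the single-channel option matters.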
What does this mean? You probably won't get exactly the advertised "20 MHz" performance, but you'll likely get 5-7+ MHz single channel (or better, with good interpolation) and 3-5+ MHz dual channel, hopefully with good, consistent triggering. A lot of this is going to involve interesting software magic.
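As for what "good interpolation" could look like, one common choice is sinc (Whittaker-Shannon) interpolation, which reconstructs points between samples for a band-limited signal. The sketch below is just one possible approach, not necessarily what will end up in the program; the function names are hypothetical, and this naive version is O(n²), so it is for illustration and not for 30 MSa/s streaming.

```python
import math


def sinc(x):
    """Normalized sinc: sin(pi*x) / (pi*x), with sinc(0) = 1."""
    if x == 0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)


def sinc_upsample(samples, factor):
    """Insert (factor - 1) interpolated points between each pair of samples.

    Time is measured in units of the original sample period, so the
    original samples sit at t = 0, 1, 2, ...
    """
    out = []
    for i in range((len(samples) - 1) * factor + 1):
        t = i / factor
        # Each output point is a sinc-weighted sum of all input samples.
        out.append(sum(s * sinc(t - k) for k, s in enumerate(samples)))
    return out


# A coarse sine captured at 4 samples per period, upsampled 4x for display.
coarse = [0.0, 1.0, 0.0, -1.0, 0.0, 1.0, 0.0, -1.0, 0.0]
smooth = sinc_upsample(coarse, 4)
print(len(coarse), len(smooth))
```

The original samples pass through unchanged, and the in-between points fill in the curve. With a short capture the points near the edges are less accurate because the sinc sum is truncated.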
The board also appears to have traces and holes for an external trigger; I am still examining what it will take to get this integrated into the firmware.