I have a custom core (I _will_ follow up with details at some point) on a Raspberry Pi. My setup is an OpenVizsla board (a random FPGA board that I had lying around) receiving LVDS data (FPD-Link II; an interesting topic in itself), feeding that over an FT2232H, and displaying it on a RaspPi (3B, non-plus). The FPGA packetizes the LVDS data and reduces it slightly (FPD-Link II carries a 24 bit/pixel payload, but 16 bits are sufficient: 14 bits of thermal data + 2 bits for sync). The RaspPi receives the data (via scanlime's "fastftdi",
https://github.com/openvizsla/ov_ftdi/blob/master/software/host/fastftdi.c ) and does the pixel processing (all in software).
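To make the 16-bit format concrete, here's a minimal sketch of how such a word could be unpacked on the receive side. The exact bit positions of the two sync flags are made up for illustration; only "14 bits thermal + 2 bits sync" is the actual format.

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed layout: 14 bits of thermal data in the low bits,
     * two sync flags in the top bits. Bit positions are an
     * illustration only. */
    #define FRAME_SYNC_BIT 0x8000u
    #define LINE_SYNC_BIT  0x4000u
    #define THERMAL_MASK   0x3FFFu

    static inline uint16_t thermal_value(uint16_t w) { return w & THERMAL_MASK; }
    static inline bool     frame_sync(uint16_t w)    { return (w & FRAME_SYNC_BIT) != 0; }
    static inline bool     line_sync(uint16_t w)     { return (w & LINE_SYNC_BIT) != 0; }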
I then either output that in an eglfs SDL2 image, or via eglfs Qt (Qt because I wanted to provide a UI, but I struggled there...).
The data is around 324x256 at 30 fps, and I can display that in real time.
Some notes:
- The Raspberry Pi's USB controller (dwc2) is _a piece of crap_. (If you look into how it actually works, it's a horror story. One of the issues is that you need to service the stupid USB controller for every microframe. You literally need a full CPU core just for that - and in fact, that's what, for example, Windows 10 IoT on the Raspberry Pi does.)
- In Linux, just use multiple cores. I have one USB receive thread, one pixel processing thread, and one display thread; they get nicely scheduled across the four cores (see the thread-handoff sketch after this list).
- My pixel processing is plain C code, no fancy shader. I receive a uint16_t framebuffer[2][325][256] (i.e. double-buffering, so I can fill one framebuffer with USB data while the pixel processing thread is running on the other; otherwise I would have to serialize or sync the operations - but hey, memory is plenty!). I do histogram equalization (see
https://stackoverflow.com/questions/34126272/histogram-equalization-without-opencv ), which requires one pass over the image to build the histogram, one pass over the histogram (16384 entries) to build a 14-bit-to-RGB LUT, and then one more pass over the pixel data. Since a typical image uses only a few distinct pixel values, the cache performance of the 16384 * 4 byte LUT isn't that bad. The RGB values I derive from a pre-calculated palette with, say, 1024 entries (so it's really 14-bit-to-10-bit-to-RGB, but that's fair enough). A sketch of the three passes follows after this list.
- The end result is a 32-bit framebuffer that I either poke directly into /dev/fb (sketch below), or submit via whatever GUI library I'm in the mood for.
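Here's a rough sketch of the thread split (not my actual code; the handover - "latest completed frame wins" behind a mutex and condition variable - is just one way to wire it up):

    #include <pthread.h>
    #include <stdint.h>

    #define W 324
    #define H 256

    static uint16_t framebuf[2][H][W];  /* double buffer                     */
    static int ready_idx = -1;          /* index of the last completed frame */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  have_frame = PTHREAD_COND_INITIALIZER;

    static void *usb_rx_thread(void *arg)
    {
        (void)arg;
        int fill = 0;
        for (;;) {
            /* ... pull one frame's worth of 16-bit words from the
             * FTDI stream into framebuf[fill] ... */
            pthread_mutex_lock(&lock);
            ready_idx = fill;
            pthread_cond_signal(&have_frame);
            pthread_mutex_unlock(&lock);
            fill ^= 1;                  /* switch to the other buffer */
        }
        return NULL;
    }

    static void *pixel_thread(void *arg)
    {
        (void)arg;
        for (;;) {
            pthread_mutex_lock(&lock);
            while (ready_idx < 0)
                pthread_cond_wait(&have_frame, &lock);
            int idx = ready_idx;
            ready_idx = -1;
            pthread_mutex_unlock(&lock);
            /* ... histogram equalization + LUT on framebuf[idx],
             * then hand the RGB result to the display thread ... */
            (void)idx;
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t rx, px;
        pthread_create(&rx, NULL, usb_rx_thread, NULL);
        pthread_create(&px, NULL, pixel_thread, NULL);
        pthread_join(rx, NULL);   /* display loop would live here instead */
        pthread_join(px, NULL);
        return 0;
    }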
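And a sketch of the three passes of the histogram equalization (again not my actual code; the grayscale ramp is a placeholder for the real pre-calculated palette):

    #include <stdint.h>
    #include <string.h>

    #define W        324
    #define H        256
    #define LEVELS   16384   /* 14-bit input range         */
    #define PAL_SIZE 1024    /* pre-calculated RGB palette */

    static uint32_t palette[PAL_SIZE];

    /* Placeholder: grayscale ramp; the real thing would be a thermal
     * colormap computed once at startup. */
    void init_palette(void)
    {
        for (int i = 0; i < PAL_SIZE; i++) {
            uint32_t g = (uint32_t)i * 255 / (PAL_SIZE - 1);
            palette[i] = 0xFF000000u | (g << 16) | (g << 8) | g;
        }
    }

    void equalize(const uint16_t in[H][W], uint32_t out[H][W])
    {
        static uint32_t hist[LEVELS];
        static uint32_t lut[LEVELS];   /* 14-bit value -> packed RGB */

        /* Pass 1: histogram of the 14-bit thermal values. */
        memset(hist, 0, sizeof hist);
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++)
                hist[in[y][x] & (LEVELS - 1)]++;

        /* Pass 2: cumulative distribution, scaled to the palette size,
         * i.e. the 14-bit -> 10-bit -> RGB step. */
        uint32_t cdf = 0;
        for (int v = 0; v < LEVELS; v++) {
            cdf += hist[v];
            lut[v] = palette[(uint64_t)cdf * (PAL_SIZE - 1) / (W * H)];
        }

        /* Pass 3: map every pixel through the LUT. */
        for (int y = 0; y < H; y++)
            for (int x = 0; x < W; x++)
                out[y][x] = lut[in[y][x] & (LEVELS - 1)];
    }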
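For the /dev/fb path, "poke it directly" is basically an mmap of /dev/fb0 plus a per-line memcpy. The sketch below assumes the display is already in a 32 bpp mode; a real version would check vinfo.bits_per_pixel, and would of course open/mmap once at startup rather than per frame:

    #include <fcntl.h>
    #include <linux/fb.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define W 324
    #define H 256

    int blit_to_fb(const uint32_t frame[H][W])
    {
        int fd = open("/dev/fb0", O_RDWR);
        if (fd < 0)
            return -1;

        struct fb_var_screeninfo vinfo;
        struct fb_fix_screeninfo finfo;
        if (ioctl(fd, FBIOGET_VSCREENINFO, &vinfo) < 0 ||
            ioctl(fd, FBIOGET_FSCREENINFO, &finfo) < 0) {
            close(fd);
            return -1;
        }

        uint8_t *fb = mmap(NULL, finfo.smem_len, PROT_READ | PROT_WRITE,
                           MAP_SHARED, fd, 0);
        if (fb == MAP_FAILED) {
            close(fd);
            return -1;
        }

        /* Copy line by line to respect the framebuffer's stride. */
        int copy_w = W < (int)vinfo.xres ? W : (int)vinfo.xres;
        for (int y = 0; y < H && y < (int)vinfo.yres; y++)
            memcpy(fb + (size_t)y * finfo.line_length, frame[y],
                   (size_t)copy_w * sizeof(uint32_t));

        munmap(fb, finfo.smem_len);
        close(fd);
        return 0;
    }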
The most critical path is USB bandwidth. 324x256x30 is roughly 2.5 MPixel/s, which works with 16 bits per pixel (~5 MByte/s), but not with 24 bpp. I prototyped the setup with a Zynq-based ZynqBerry, which worked well, but was too expensive in the end.
I can see about sharing my code, but it's super crappy, and I don't have time to work on it right now.