Within the stated constraints, I do believe a video ADC like the
TI TVP7002 (HTQFP-100, 7.71€ at Mouser in singles, uses 3.3V and 1.8V supplies) combined with an FPGA is a must. It is what is used in the
OSSC, the Open Source Scan Converter (a low-latency video digitizer and scan-conversion board popular among retro video game enthusiasts), and the video quality should be quite good.
The VESA 1024x768 60fps mode uses a 48kHz HSYNC and a 65MHz pixel clock (1344 total pixels per scan line). Basic 640x480 60fps has a 31kHz HSYNC and a 25MHz pixel clock. Both are well within the TVP7002's capabilities. The TVP7002 is configured over I2C, so you also need a small microcontroller to adjust its configuration and parameters, while an FPGA takes the parallel RGB data, selects what to send, mixes in the UART channel(s), and uses a serdes or external PHY (XAUI/SFI) to interface to an SFP+ fiber-optic module.
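As a quick sanity check of those timing numbers (using the nominal VESA DMT figures; the pixel clock is simply the total pixels per line times the line rate):

```python
# Sanity-check the video timing numbers quoted above.
# Figures are nominal VESA DMT values.

def pixel_clock_hz(total_px_per_line, hsync_hz):
    """Pixel clock = total pixels per scan line x line (HSYNC) rate."""
    return total_px_per_line * hsync_hz

# 1024x768@60: 1344 total pixels/line, 48.363 kHz line rate
xga = pixel_clock_hz(1344, 48_363)
print(f"1024x768@60 pixel clock ~ {xga / 1e6:.2f} MHz")  # ~65 MHz

# 640x480@60: 800 total pixels/line, 31.469 kHz line rate
vga = pixel_clock_hz(800, 31_469)
print(f"640x480@60  pixel clock ~ {vga / 1e6:.2f} MHz")  # ~25.18 MHz
```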
The serdes/PHY is a problem. It converts the parallel data to the GHz-range serial (differential) pair the SFP+ module uses, and vice versa. Some FPGAs, like the Lattice ECP5, do have built-in serdes running at up to 3.2Gbps (5Gbps on the 5G parts) across up to four lanes; you need two lanes, one in each direction. Only the ECP5UM and ECP5UM5G series have the serdes, and they're BGA-only, but they might be able to drive an SFP+ module directly (see
here, XAUI mode).
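A rough lane-rate comparison shows why XAUI is the interesting mode here (nominal figures: SFI is a single lane at 10.3125 Gbaud with 64b/66b coding, XAUI is four lanes at 3.125 Gbaud each with 8b/10b coding; the ECP5 limits are the datasheet maxima):

```python
# Rough feasibility check of the serdes lane rates involved (nominal figures,
# baud rates per lane).
ECP5UM_MAX = 3.2e9    # max lane rate, ECP5UM serdes
ECP5UM5G_MAX = 5.0e9  # max lane rate, ECP5UM5G serdes

lanes = {"SFI": 10.3125e9, "XAUI": 3.125e9}
for name, rate in lanes.items():
    print(f"{name}: {rate / 1e9:.4f} Gbaud/lane -> "
          f"fits ECP5UM: {rate <= ECP5UM_MAX}, "
          f"fits ECP5UM5G: {rate <= ECP5UM5G_MAX}")
```

In other words, the single-lane 10G SFI rate is far beyond what the ECP5 serdes can do, but the 3.125 Gbaud XAUI lanes fit.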
Using a 10GBase-SR SFP+ module and raw layer 2 Ethernet (jumbo) frames for the data, at 38 bytes of overhead (preamble, header, FCS, and interframe gap) per 46-9000 payload bytes, the data would be proper 10G Ethernet you could route within a LAN. If the first few bytes identify the payload, you can use the same format on both legs of the transmission. Note that in the other direction you do need to buffer some UART bytes, as the overhead is otherwise excessive (a minimum-size frame is 84 bytes on the wire for a single useful byte, i.e. 8400%); burst, instead of sending/receiving individual bytes. I'd include a few header bytes in the payload, identifying whether it contains a full scan line (and if so, which scan row and frame/field number), and whether one or more UART payload bytes follow. Or you could use UDP/IPv4 datagrams within the Ethernet frame, so that different ports map to different UART connections, making the format IPv4-routable and easier to extend (for example, to more than one UART link).
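A payload layout along those lines could look like this; the field layout is entirely my own invention for illustration, not any standard:

```python
import struct

# Hypothetical payload header: 1-byte type, 2-byte scan row, 2-byte
# frame/field number, 2-byte UART byte count, then pixel data + UART bytes.
TYPE_SCANLINE = 0x01

def pack_payload(row, frame, pixels, uart_bytes=b""):
    """Prepend the 7-byte header to one scan line plus piggybacked UART data."""
    hdr = struct.pack(">BHHH", TYPE_SCANLINE, row, frame, len(uart_bytes))
    return hdr + pixels + uart_bytes

# One 1024-pixel 24bpp scan line with four UART bytes tacked on:
payload = pack_payload(row=42, frame=7, pixels=b"\x00" * 1024 * 3,
                       uart_bytes=b"AT\r\n")
print(len(payload))  # 7-byte header + 3072 pixel bytes + 4 UART bytes = 3083
```

A 3083-byte payload fits comfortably in a single jumbo frame, so one frame per scan line is a natural unit.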
Let's say you limit yourself to 1 Gbit/s bandwidth instead. The practical data rate, after Ethernet frame overhead, is about 800 Mbit/s. This suffices for 1024×768 24bpp at 30 fps (a little under 567 Mbit/s, plus framing overhead, plus the UART data). It would require a full framebuffer of about 19 Mbits (2.4 Mbytes), because the VGA input data rate is about twice the link rate. The other option is to transfer 60fps interlaced, every other scan line per field, so that only a few scan lines need to be buffered at a time; this also gives much lower latency, but introduces tearing, since the odd and even scan lines come from consecutive display frames. In other words, it would really be 1024×384 at 60fps. The FPGA would act as the MAC, reading the TVP7002 parallel output, selecting and buffering only the visible part of each scan line, while pushing the previous one out a byte at a time at 125 MHz to a gigabit GMII PHY, followed by a media converter to fiber (1000Base-SX).
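The bandwidth budget behind those numbers, back-of-the-envelope:

```python
# Back-of-the-envelope bandwidth budget for the 1 Gbit/s variant above.
W, H, BPP = 1024, 768, 24

full_rate_60 = W * H * BPP * 60  # progressive 60 fps, bits/s
half_rate_30 = W * H * BPP * 30  # 30 fps, or 60 fps interlaced fields
framebuffer = W * H * BPP        # one full frame, in bits

print(f"60 fps progressive: {full_rate_60 / 1e6:.0f} Mbit/s (won't fit in ~800)")
print(f"30 fps / interlaced: {half_rate_30 / 1e6:.0f} Mbit/s (fits)")
print(f"Framebuffer: {framebuffer / 1e6:.1f} Mbit"
      f" = {framebuffer / 8 / 1e6:.2f} Mbyte")
```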
I think, that is. I am not absolutely certain of any of the above, not having done it myself before; the FPGA part in particular is new to me. The state diagram isn't complicated unless you include things like mode autodetection; using a separate microcontroller to measure HSYNC/VSYNC and to control and configure both the video ADC and the FPGA should keep things much simpler and more modular.