I've had an ISR on an AVR update a quadrature counter in around 50 CPU cycles, and up to 100,000 ISR/s is possible. It can even be done reliably, but an absolute must is then of course that it has to be the only ISR in the system.
Fiddling with bits on this level was fun 20 years ago, and is still a good exercise for students, but the availability of EUR 2 uC's with built-in hardware for the counting wipes away the competition.
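For anyone who wants to see what such an ISR boils down to, here is a minimal sketch in plain C of the usual lookup-table decoder. It is not the code I ran back then; the names quad_table, quad_decoder_t and quad_update are made up for illustration, and it assumes the A and B channels have already been read into the two lowest bits of a byte. For scale: 50 cycles times 100,000 interrupts per second is 5 million cycles per second, which would be roughly a third of a 16 MHz part (the clock speed is my assumption).

```c
#include <stdint.h>

/* Transition table indexed by (previous_state << 2) | current_state,
   where a state is the 2-bit value (A << 1) | B.
   +1 / -1 are valid single steps in either direction; 0 means
   "no change" or an invalid jump where both channels changed at once. */
static const int8_t quad_table[16] = {
     0, +1, -1,  0,
    -1,  0,  0, +1,
    +1,  0,  0, -1,
     0, -1, +1,  0
};

typedef struct {
    uint8_t prev;   /* last sampled 2-bit A/B state */
    int16_t count;  /* 16-bit position counter */
} quad_decoder_t;

/* Call this on every pin change with the freshly sampled A/B bits.
   On a real AVR this would be the body of the pin-change ISR, with
   'ab' read from the port register (hypothetical wiring). */
static void quad_update(quad_decoder_t *d, uint8_t ab)
{
    ab &= 0x03u;
    d->count = (int16_t)(d->count + quad_table[(uint8_t)((d->prev << 2) | ab)]);
    d->prev = ab;
}
```

On an 8-bit part the 16-bit count cannot be read in one instruction, so the main loop would have to disable interrupts briefly (or read twice and compare) to get a consistent value.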
So just do your homework and select a uC with this capability.
Your mention of "I really need 16 bits, otherwise the communication overhead defeats the purpose." is a strong indication you're struggling to get this to work on some 8-bitter.
I'm not saying this cannot be done. I've been there and done it myself. But it's really a stubborn remnant of 20-year-old practices.
" I honestly don't want to start learning a whole new workflow method and invest a ton of time and money into getting the programmers and the software etc working."
This is also recognizable. I've been both afraid of and curious about the 32-bit uC world, and I've taken the "difficult route", and regret it. Very recently though I stumbled upon the book:
Beginning STM32 Developing with FreeRTOS, libopencm3 and GCC.
ISBN-13 (pbk): 978-1-4842-3623-9
ISBN-13 (electronic): 978-1-4842-3624-6
https://doi.org/10.1007/978-1-4842-3624-6
It's written by Warren Gay, call sign VE3WWG.
It's very well written, and has all the info I scrambled together in the few years I tinkered with the "Blue Pill" board after being tempted by its low price.
(Warning: The days of the "Blue Pills" are over. The cheap boards from Ali / Ebay can have any of some 8 different clones, with various incompatibilities. Don't buy them. Maybe the stm32f4xx series is a better starting point, or the STM Discovery boards)
You can buy STM32 "Discovery boards" for EUR 20 or so, and they have a built-in programmer. Hobbyists can also put one of the available open source projects in an STM32 and build a decent programmer that way. I don't know what the official tools cost, but very likely below EUR 100 these days. Compilers and IDE's are also free. GCC rules almost everywhere, though some vendors try to hide that their toolchain is built around it.
Ah yes, your first projects for a 32bit ARM uC will probably be a bit of a struggle, but after that it's not much different from working with the old 8-bitters. You just need a template that works for you to set up the main clock, clock domains for I/O, configure multiple ISR levels and some other things.
Digikey still sells the LM628. It only costs USD58.
LM628 is also designed for controlling Brushed DC motors (single PWM output?)
It may be a compromise for your application to make it a 2 processor job.
Putting some PID loop into a 32 bit uC capable of hardware quadrature counting is a relatively small job. With a little adapter PCB you can even make something that's pin compatible with the LM628, although almost all 32 bit uC's are 3V3 only.
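To give an idea of how small that job is, here is a rough sketch of a textbook PID step in plain C. It is not the LM628's algorithm and is not tied to any particular uC; pid_ctrl_t, pid_step and the gain values are placeholders I made up. It assumes the position comes from a free-running 16-bit counter, so the error is computed with wraparound in mind, and it leaves out anti-windup and derivative filtering.

```c
#include <stdint.h>

typedef struct {
    float kp, ki, kd;        /* gains (placeholder values, tune for your motor) */
    float integral;          /* accumulated error * dt */
    int16_t prev_error;      /* error from the previous step */
    float out_min, out_max;  /* output limits, e.g. the PWM range */
} pid_ctrl_t;

/* One control step. 'target' and 'position' are raw 16-bit counter
   values; the subtraction is done modulo 2^16 and then reinterpreted
   as signed, so the loop keeps working when the counter wraps. */
static float pid_step(pid_ctrl_t *p, uint16_t target, uint16_t position, float dt)
{
    int16_t error = (int16_t)(uint16_t)(target - position);
    float deriv = (float)(error - p->prev_error) / dt;
    float out;

    p->integral += (float)error * dt;
    out = p->kp * (float)error + p->ki * p->integral + p->kd * deriv;

    /* Clamp to the actuator range. */
    if (out > p->out_max) {
        out = p->out_max;
    } else if (out < p->out_min) {
        out = p->out_min;
    }

    p->prev_error = error;
    return out;
}
```

You would call pid_step() from a fixed-rate timer interrupt with the hardware counter value as 'position', and feed the result to the PWM.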