Pretty sure I could do that with the AVR-DA I've been playing with. It could be something like: input --> event system --> TCD "fault" input; fault clear --> output pulses begin (delayed by, I think, 1-3 clock cycles, at 24 MHz max); TCD overflow event --> TCA count event; TCA overflow --> disable TCD, or feed an input event to terminate the pulse train. And if the pulse train isn't too fast, the TCD overflow interrupt could set new PWM/frequency settings along the way.
Assuming you mean a pulse train triggered by an input pulse.
Oh, if the input isn't a sustained level, then route it through the CCL (configurable custom logic) first, as a flip-flop say.
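To make the signal flow concrete, here's a rough plain-C model of that chain. It is only a simulation of the logic (trigger latch, free-running pulse timer, pulse counter that shuts it off), not register code for the AVR-DA; every name in it (burst_model_t, burst_step, and so on) is made up for illustration, and the 50% duty and zero start delay are simplifying assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the signal flow only; no real peripheral is touched. */
typedef struct {
    bool     running;        /* "TCD" released from its fault state          */
    uint16_t period_ticks;   /* "TCD" period, in timer clocks                */
    uint16_t tick;           /* position within the current period           */
    uint16_t pulses_target;  /* "TCA" period: pulses per burst               */
    uint16_t pulses_done;    /* "TCA" count of pulse-timer overflow events   */
} burst_model_t;

void burst_init(burst_model_t *m, uint16_t period_ticks, uint16_t pulses)
{
    assert(m != NULL && period_ticks >= 2 && pulses >= 1);
    m->running = false;
    m->period_ticks = period_ticks;
    m->tick = 0;
    m->pulses_target = pulses;
    m->pulses_done = 0;
}

/* Advance the model by one timer clock and return the output level.
 * A trigger that is high for a single tick is enough to start a burst,
 * which is the job the CCL flip-flop would do for a short input pulse.
 * Triggers that arrive while a burst is running are ignored. */
bool burst_step(burst_model_t *m, bool trigger)
{
    assert(m != NULL);
    if (trigger && !m->running) {       /* fault clears, pulse timer starts */
        m->running = true;
        m->tick = 0;
        m->pulses_done = 0;
    }
    if (!m->running) {
        return false;                   /* output held idle */
    }
    bool out = m->tick < m->period_ticks / 2;     /* assumed 50% duty */
    if (++m->tick == m->period_ticks) {           /* pulse-timer overflow */
        m->tick = 0;
        if (++m->pulses_done == m->pulses_target) {
            m->running = false;                   /* counter overflow: stop */
        }
    }
    return out;
}

/* Fire one single-tick trigger, run for `ticks` clocks, and count the
 * rising edges seen on the output. */
unsigned burst_count_pulses(burst_model_t *m, unsigned ticks)
{
    unsigned edges = 0;
    bool prev = false;
    for (unsigned i = 0; i < ticks; i++) {
        bool out = burst_step(m, i == 0);
        if (out && !prev) {
            edges++;
        }
        prev = out;
    }
    return edges;
}
```

Initialise with burst_init(&m, 4, 3) and burst_count_pulses(&m, 100) sees three pulses, then the output stays idle until the next trigger. In the real part all of this would happen in hardware with the CPU asleep or busy elsewhere; the model is just there to check the sequencing.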
I think many of these peripherals have been in use, in some form or another, in other Microchip products -- tinyAVRs have 'em, I think lots of PICs had CCL or something similar, and maybe some of these more advanced timers too, I don't know. Probably something that can be adapted.
As for raw interrupt latency: not sure about other systems, but AVR is fairly exemplary here, though it's rare you get to use it to its fullest, if ever. The interrupt itself is taken within a few cycles (finish the current instruction, then jump to the interrupt vector), but each IVT entry only has space for two words, traditionally a JMP .isr instruction. If you already have the address and value loaded into dedicated registers, you could put a ST X, r25; RETI or whatever in there, and that's it. If you can guarantee you aren't using the vectors immediately following it, you could write more code into the IVT. When compiled, the typical overhead goes something like: IVT JMP, preamble (PUSH status flags and registers), actual code, postamble (POP everything), RETI. That gives something like a fractional microsecond at the best of times. Actually, that's not always the case: I've seen GCC emit the actual bare minimum when it's a copy between fixed memory locations -- no need to dirty multiple registers, save just the one and go.
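For a feel of the numbers, here's a small C helper that adds up the stages listed above and converts cycles to nanoseconds. It's only bookkeeping: the struct and function names are invented for this example, and any cycle counts you feed it are your own estimates (read them off the disassembly and the instruction set manual for your part), not figures from a datasheet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical breakdown of one compiled ISR, in CPU cycles. */
typedef struct {
    uint32_t entry;       /* finish current instruction + hardware vectoring */
    uint32_t vector_jmp;  /* JMP from the IVT slot to the handler            */
    uint32_t prologue;    /* PUSH status flags and working registers         */
    uint32_t body;        /* the code you actually wanted to run             */
    uint32_t epilogue;    /* POP everything back                             */
    uint32_t reti;        /* return from interrupt                           */
} isr_cycles_t;

/* Sum of all stages, in cycles. */
uint32_t isr_total_cycles(const isr_cycles_t *c)
{
    assert(c != NULL);
    return c->entry + c->vector_jmp + c->prologue +
           c->body + c->epilogue + c->reti;
}

/* Convert a cycle count to nanoseconds at the given CPU clock,
 * rounded to the nearest nanosecond. */
uint64_t cycles_to_ns(uint32_t cycles, uint32_t f_cpu_hz)
{
    assert(f_cpu_hz > 0);
    return ((uint64_t)cycles * 1000000000ULL + f_cpu_hz / 2u) / f_cpu_hz;
}
```

With placeholder counts of 5, 3, 4, 2, 4 and 4 cycles the total is 22 cycles, or about 917 ns at 24 MHz. Drop the JMP, prologue and epilogue (the code-in-the-vector trick) and the same placeholders give 11 cycles, about 458 ns. Treat those as illustrations of the arithmetic, not measurements.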
Similar things apply to other platforms; ARM, for example, stores the return address in LR, which you'd better push to the stack before doing much else (depending on whether and how interrupts can be re-enabled during the current ISR, and whether that saving is optional or automatic or whatever). Maybe some operations can be done entirely from registers; maybe you need to push a whole bunch of them to do much of anything.
And yeah, because of pipelining, taking the interrupt costs at least a pipeline flush and refill; as a result, latencies tend not to go down much, even as you go up in frequency and power. Even if the CPU is that much faster, all the IO operations still have to propagate through bus interfaces and caches, so the same holds. So buffered IO is mandatory for modern peripherals, which can accumulate fractional MBs in the time it takes to empty their buffers. Fast response is best handled by hardware or configured logic, which brings us back to the first point: now that MCUs often integrate such logic, it's quite possible that something even sub-cycle can be pulled off with them. But you'll have to go shopping, and read datasheets in detail, to figure out whether any particular thing is possible.
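The buffering point is just rate times latency. A minimal C sketch of that arithmetic, with a made-up function name and no particular peripheral in mind:

```c
#include <assert.h>
#include <stdint.h>

/* Bytes a peripheral delivers while the CPU is still getting around to
 * servicing it: data rate multiplied by service latency, rounded up.
 * Hypothetical helper; both inputs are the caller's own estimates. */
uint64_t bytes_during_latency(uint64_t bytes_per_s, uint64_t latency_ns)
{
    const uint64_t ns_per_s = 1000000000ULL;
    /* Guard against overflow in the 64-bit multiply and round-up below. */
    assert(latency_ns == 0 ||
           bytes_per_s <= (UINT64_MAX - (ns_per_s - 1)) / latency_ns);
    return (bytes_per_s * latency_ns + (ns_per_s - 1)) / ns_per_s;
}
```

For example, a stream at 100 MB/s with 5 us of worst-case service latency piles up 500 bytes before the first one is read, so the FIFO or DMA buffer has to be at least that deep, plus margin.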
Tim