The options discussed thus far are also not the only ones possible, not by a long shot.
For example, having looked at the STM32L151 and considered Benta's approach, a new option came to mind (one I have not tried myself, but it sounds intriguing to me):
Consider a 16×16 matrix (for simplicity; the exact dimensions don't actually matter much), but with per-keyswitch diodes, and with the inputs in the same GPIO bank so they can all be read in parallel using a single I/O access. When no keys are pressed, all 16 outputs are set high, and all inputs are configured to trigger the same interrupt. Then, the device goes to low-power/sleep mode. (No current flows unless a key is pressed, so having the outputs high should not increase current consumption much, although the internal circuitry will consume some.) Any key being pressed will trigger the interrupt. When the interrupt is triggered, the device wakes up and switches to active scanning mode, using a loop roughly similar to what I showed above.
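To make the idea a bit more concrete, here is a minimal sketch in C of the two modes: arming all rows for wake-up, and the row-by-row scan. It is not tested on real hardware. The GPIO access is simulated with plain variables (sim_keys, sim_outputs, gpio_write_rows, gpio_read_cols are all hypothetical names, not STM32 library calls), and it assumes the inputs have pull-downs so that an unconnected input reads low.

```c
#include <stdint.h>
#include <stdbool.h>

#define ROWS 16   /* outputs, driven by the MCU */
#define COLS 16   /* inputs, assumed to be in one GPIO bank */

/* Simulated hardware (hypothetical): replace with real register access. */
static uint16_t sim_keys[ROWS];   /* bit c of sim_keys[r] set = key (r,c) is pressed */
static uint16_t sim_outputs;      /* bit r set = row r is driven high */

static void gpio_write_rows(uint16_t mask) { sim_outputs = mask; }

/* One parallel read of the whole input bank. With a diode per keyswitch,
   an input reads high only when a pressed key connects it to a high row. */
static uint16_t gpio_read_cols(void)
{
    uint16_t in = 0;
    for (int r = 0; r < ROWS; r++)
        if (sim_outputs & (1u << r))
            in |= sim_keys[r];
    return in;
}

/* Idle mode: drive every row high, so any key press raises some input. */
static void matrix_arm_for_wakeup(void) { gpio_write_rows(0xFFFFu); }

/* True when at least one input is high, i.e. the wake-up interrupt would fire. */
static bool matrix_any_key(void) { return gpio_read_cols() != 0; }

/* Active scan: drive one row at a time and read the bank once per row.
   Real hardware would need a short settling delay after each row change.
   Returns the number of keys down and re-arms the matrix for wake-up. */
static int matrix_scan(uint16_t state[ROWS])
{
    int count = 0;
    for (int r = 0; r < ROWS; r++) {
        gpio_write_rows((uint16_t)(1u << r));
        state[r] = gpio_read_cols();
        for (uint16_t v = state[r]; v; v &= (uint16_t)(v - 1))
            count++;              /* count the set bits in this row */
    }
    matrix_arm_for_wakeup();
    return count;
}
```

On a real MCU, matrix_any_key() would not be polled at all; the pin-change interrupt does that job while the core sleeps. It is only here so the sketch can be exercised on a PC.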
Because the MCU will do the scan in microseconds, it can then set up a timer to trigger an interrupt in about 500 microseconds (in my case; interval dictated by USB HID protocol; your choice may differ!), and again go to sleep until the interrupt triggers and it is time to do the next scan. (Actually, on the STM32L151, I do believe one can use the USB module to signal the next wakeup, so that the next scan will be triggered immediately/very soon after the USB HID message related to the current scan has been transferred.)
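The sleep/scan cycle described above could be sketched as a small state machine, again in C. This is only an illustration under my own assumptions: kb_t, kb_on_wake and the event names are made up, the timer and interrupt enables are plain struct fields rather than real peripheral registers, and the scan result is passed in as keys_down. It also drops back to idle on the first scan with no keys down; with software debouncing you would probably want to keep scanning for a few more intervals first.

```c
#include <stdint.h>
#include <stdbool.h>

#define SCAN_INTERVAL_US 500u   /* my interval; yours may differ */

typedef enum { KB_IDLE, KB_SCANNING } kb_mode_t;
typedef enum { EV_KEY_IRQ, EV_TIMER_IRQ } kb_event_t;

typedef struct {
    kb_mode_t mode;
    bool      key_irq_enabled;  /* wake on any key (idle mode only) */
    uint32_t  timer_us;         /* wake-up timer; 0 means disarmed */
    uint32_t  scans;            /* number of scans done so far */
} kb_t;

static void kb_init(kb_t *kb)
{
    kb->mode = KB_IDLE;
    kb->key_irq_enabled = true;
    kb->timer_us = 0;
    kb->scans = 0;
}

/* Called each time the MCU wakes up. keys_down is the result of the scan
   done right after waking. On return, the MCU would go back to sleep. */
static void kb_on_wake(kb_t *kb, kb_event_t ev, int keys_down)
{
    /* Ignore events that do not belong to the current mode. */
    if (kb->mode == KB_IDLE && ev != EV_KEY_IRQ) return;
    if (kb->mode == KB_SCANNING && ev != EV_TIMER_IRQ) return;

    kb->scans++;
    if (keys_down > 0) {
        /* Keys are down: keep scanning on a timer, ignore the key interrupt. */
        kb->mode = KB_SCANNING;
        kb->key_irq_enabled = false;
        kb->timer_us = SCAN_INTERVAL_US;
    } else {
        /* Nothing pressed: stop the timer and wait for the next key press. */
        kb->mode = KB_IDLE;
        kb->key_irq_enabled = true;
        kb->timer_us = 0;
    }
}
```

The point of the split is that the timer only runs while at least one key is held, so an untouched keyboard never wakes the MCU at all.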
This is not technologically optimal, nor is it what Benta suggested, but on this kind of MCU it would give a lot of the benefits (the MCU would sleep most of the time, and only do real work while at least one key is being held down) with minimal additional hardware. That hardware is the diodes, and perhaps a 10 kΩ current-limiting resistor per row/output pin to safeguard against accidental shorts. For the diodes, I prefer SOT-23 dual Schottky ones, like the BAT54C and so on, that cost something like $0.02 apiece; you only need one per two keyswitches.
Would I do this myself? I don't know; I haven't tried this. My point here is that there are many options, and like Benta said, it starts with considering your hardware options, and especially the MCU, first. This was just an idea that came up after participating in this thread. Because it is so cheap and simple and has worked well for me thus far, I personally am quite attached to the scanning matrix and software debouncing approach. You do not need to be!
If your target will not use USB for the keyboard interface, your MCU selection widens radically, too.
In that case, I personally would use something like a Teensy (because I have them) with a native USB interface, to emulate the target machine, by bridging the keyboard interface in the target machine (emulated by the Teensy) to your development PC as USB HID messages. With Teensyduino, keyboard emulation is trivial: you'd only need to implement the translation layer from the target machine keyboard interface (perhaps I2C, SPI, UART?) to Arduino code. Having the keyboard then connected to the host computer and act as a normal keyboard means it is easy to test it. Again, just my own lazy-ass approach, not the recommended/only way!
The above-linked page also describes the fact that some OSes limit the HID interval to much longer than 1 ms, up to 63.5 ms! This is not a hardware limitation; it is a choice made by some OSes. I don't know which nor why, as I really only use Linux right now, and it respects the requested 1 ms HID interval. It could even be an EFI BIOS detail, I guess?
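For reference, the interval is requested by the device itself, in the bInterval field of the interrupt IN endpoint descriptor; on a full-speed device it is expressed in milliseconds. A sketch of such a descriptor in C, where the endpoint number and packet size are just example values I picked, not taken from any particular firmware:

```c
#include <stdint.h>

/* Standard USB endpoint descriptor for a full-speed HID interrupt IN endpoint.
   The endpoint number (1) and max packet size (8) are example values only. */
static const uint8_t hid_in_endpoint[7] = {
    7,          /* bLength: size of this descriptor in bytes */
    0x05,       /* bDescriptorType: ENDPOINT */
    0x81,       /* bEndpointAddress: endpoint 1, direction IN */
    0x03,       /* bmAttributes: interrupt transfer type */
    8, 0,       /* wMaxPacketSize: 8 bytes, little-endian */
    1           /* bInterval: requested polling interval, 1 ms at full speed */
};
```

The host is free to poll less often than requested, which is the behaviour described above.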