Just to add: I have done the above as Ian M mentioned ('595 with '165, some SOT23-5 single gates to match the clock phases, and a tristate buffer for the MISO line just to make it reasonably compliant with SPI), and frankly, it's just not worth it.
Sure, HC/VHC/LCX logic is cheap, but once you start having to add small gates, the board area gets bigger and there's more to route, and it's just easier to use a $0.25 MCU with a hardware SPI port and a small app that just configures and reads/writes its pins - that's essentially what the port expanders are. As long as you realise that it'll take time for the MCU to read its input port when you bring the CS line low, and so leave sufficient delay before you start a transfer, it's a cheap and quick (and small) way of solving the problem, at the cost of spending a couple of hours writing and verifying the code for it.
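To make the CS timing point concrete, here is a rough host-side model in C of what such MCU firmware might do. It is only a sketch: `expander_t` and the three functions are made-up names, the two edge handlers stand in for whatever CS pin-change interrupt a real part offers, and `expander_transfer` stands in for the SPI hardware shifting one byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of a small MCU used as an SPI port expander. */
typedef struct {
    uint8_t pin_in;   /* levels currently on the input pins */
    uint8_t pin_out;  /* levels driven on the output pins */
    uint8_t spi_tx;   /* byte the SPI hardware will shift out on MISO */
    uint8_t spi_rx;   /* last byte shifted in from MOSI */
} expander_t;

/* CS falling edge: snapshot the inputs into the SPI transmit buffer.
 * On real hardware this takes a few instruction cycles, which is why
 * the master has to wait after pulling CS low. */
void expander_on_cs_low(expander_t *e) {
    assert(e != NULL);
    e->spi_tx = e->pin_in;
}

/* One 8-bit SPI exchange as the hardware would perform it.
 * Whatever is in spi_tx goes out, stale or not. */
uint8_t expander_transfer(expander_t *e, uint8_t mosi) {
    assert(e != NULL);
    e->spi_rx = mosi;
    return e->spi_tx;
}

/* CS rising edge: commit the received byte to the output pins. */
void expander_on_cs_high(expander_t *e) {
    assert(e != NULL);
    e->pin_out = e->spi_rx;
}
```

If the master starts clocking before `expander_on_cs_low` has run, it simply gets back whatever was left in `spi_tx` from last time, which is exactly the failure the delay after CS is meant to avoid.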
Side note: I tend to use SPI because I hate I2C with a passion. It's a bias I have come to accept.
Edit: After reading the first post and the replies more carefully: if you're willing to "bit bang", then you can do funky things with a 4094/595 and a 165 using 3 IO (clock, in, and out), with the clock line controlling the strobe/latch of the shift registers via RC circuits. It can't be used with an SPI port, since you'll have to do funky things with the clock line to latch the data, but it's 3 IO, two ICs, and a bunch of passives.
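A rough idea of what the bit-banged side could look like, as a simulation in C rather than real firmware. Everything here is an assumption for illustration: the names (`bus_t`, `bitbang_exchange`, `LATCH_HOLD_US`), the 50 us figure, and the particular scheme, where the last clock pulse of each byte is held high long enough for the RC node to reach threshold, latching the '595 outputs and reloading the '165. Actual RC values and strobe polarity depend on the circuit you build.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BIT_HOLD_US    1u   /* assumed: short pulse, RC node stays below threshold */
#define LATCH_HOLD_US 50u   /* assumed: long pulse, RC node crosses threshold */

/* Simulated bus state; on a real board these are the chips, not variables. */
typedef struct {
    int     clk;      /* clock pin level */
    int     dout;     /* MCU data-out pin, feeds the '595 serial input */
    uint8_t sr595;    /* '595 shift stage */
    uint8_t q595;     /* '595 output latch (the actual output pins) */
    uint8_t sr165;    /* '165 shift stage; its MSB drives the MCU data-in pin */
    uint8_t pins165;  /* levels on the '165 parallel inputs */
} bus_t;

/* Rising clock edge shifts both registers by one bit. */
static void clk_high(bus_t *b) {
    if (!b->clk) {
        b->sr595 = (uint8_t)((b->sr595 << 1) | (b->dout & 1));
        b->sr165 = (uint8_t)(b->sr165 << 1);
    }
    b->clk = 1;
}

static void clk_low(bus_t *b) { b->clk = 0; }

/* Holding the clock high long enough fires the RC-derived strobe. */
static void hold_us(bus_t *b, unsigned us) {
    if (b->clk && us >= LATCH_HOLD_US) {
        b->q595  = b->sr595;    /* '595 outputs update */
        b->sr165 = b->pins165;  /* '165 takes a fresh input snapshot */
    }
}

/* Shift one byte out and one byte in, MSB first, then strobe. */
uint8_t bitbang_exchange(bus_t *b, uint8_t out) {
    assert(b != NULL);
    uint8_t in = 0;
    for (int i = 7; i >= 0; i--) {
        b->dout = (out >> i) & 1;
        /* Sample data-in before the edge shifts the '165. */
        in = (uint8_t)((in << 1) | ((b->sr165 >> 7) & 1));
        clk_high(b);
        /* Stretch only the final pulse so the strobe fires once per byte. */
        hold_us(b, i == 0 ? LATCH_HOLD_US : BIT_HOLD_US);
        clk_low(b);
    }
    return in;
}
```

In this model the byte read back is always the snapshot taken at the end of the previous exchange, so the first read after power-up is junk. The stretched final pulse is also the reason a hardware SPI port can't drive it directly.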