1. Why XOR each random byte with a previous one from X bytes ago (in this example, 512 bytes ago)? What purpose does this serve?
In bit stream generators like this, it is customary to 'scramble' the hardware-generated bits in order to mask any correlation that might show up. The general idea is that the hardware side is not truly random, but only random to some degree: an imperfect entropy source. An example is /dev/random in Linux: it gathers entropy from the timing intervals between key presses and other such events, and then passes those bits through a strong hash function (strong at the time I reviewed the code, years ago), like SHA-2. The result is a highly scrambled sequence of bits, not fully random, but good enough for most purposes.
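Roughly, that hashing step looks like this (a minimal sketch, assuming OpenSSL is available; `read_raw_entropy_byte()` is a made-up stand-in for the kernel's event-timing collection, not the actual kernel code):

```c
/* Minimal sketch of hash-based whitening, in the spirit of /dev/random.
   Assumes OpenSSL (link with -lcrypto); read_raw_entropy_byte() is a
   hypothetical stand-in for the kernel's event-timing collection. */
#include <openssl/sha.h>

#define POOL_SIZE 64

extern unsigned char read_raw_entropy_byte(void);  /* imperfect hardware source */

/* Fill a pool with raw, possibly correlated bytes, then hash it so any
   correlation gets spread thinly across all output bits. */
void get_whitened_block(unsigned char out[SHA256_DIGEST_LENGTH])
{
    unsigned char pool[POOL_SIZE];
    for (int i = 0; i < POOL_SIZE; i++)
        pool[i] = read_raw_entropy_byte();
    SHA256(pool, sizeof pool, out);   /* 32 whitened bytes per 64 raw bytes */
}
```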
In this case, I think the author tried to reduce correlation between successive bits by XORing the current output with some long-past output, in order to reduce transient artifacts.
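In code, the delayed-XOR scheme could look something like this (a sketch under my reading of the example, not the author's actual code; `next_raw_byte()` and the buffer are hypothetical names, and only the 512-byte delay comes from the example):

```c
/* Sketch of the delayed-XOR scrambling: each output byte is the current
   raw byte XORed with the raw byte from 512 positions earlier.
   next_raw_byte() is a hypothetical name for the hardware source. */
#include <stdint.h>

#define DELAY 512

extern uint8_t next_raw_byte(void);

static uint8_t history[DELAY];   /* starts zeroed, so early outputs are raw */
static unsigned pos = 0;

uint8_t next_whitened_byte(void)
{
    uint8_t raw = next_raw_byte();
    uint8_t out = raw ^ history[pos];  /* history[pos] holds the byte from DELAY ago */
    history[pos] = raw;                /* remember the raw byte for later */
    pos = (pos + 1) % DELAY;
    return out;
}
```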
2. He uses a technique of running assembly that bumps a counter and checks to see if the input pin from the hardware RNG has changed. If it has, then he uses the LSB of that counter (which is just the counter modulo 2) as the random bit. So, a signal change means grab a bit from the timer. He isn't using a real AVR timer, but a register in his loop that he is incrementing.
This is more like /dev/random as I remember it: use a discrete counter to time the interval between two events, and then discard all the bits save the least significant one, which is the most entropic, since it toggles the most often (the greater the number of counts between events, the better). This doesn't mean that last bit is truly random; it's just apparently the most random of them all, from an information theory viewpoint.
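A sketch of that sampling loop in AVR-flavoured C (the original does it in assembly, and `PD2` is just a hypothetical pin choice):

```c
/* Sketch of the counter-sampling idea: spin a software counter until the
   comparator pin toggles, then keep only the counter's LSB.
   PD2 is a hypothetical pin choice; the original does this in assembly. */
#include <stdint.h>
#include <avr/io.h>

#define RNG_PIN_MASK (1 << PD2)

uint8_t next_random_bit(void)
{
    uint8_t  last    = PIND & RNG_PIN_MASK;
    uint16_t counter = 0;

    /* The interval length jitters with the analog noise; its LSB is the
       bit that flips fastest, hence the most entropic one. */
    while ((PIND & RNG_PIN_MASK) == last)
        counter++;

    return counter & 1;   /* discard everything but the least significant bit */
}
```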
Does each of these things effectively become some type of filter? For whitening? Something else?
Think of the bit generator not as truly random, but as an entropy source of varying quality. Sadly, it's not as easy as discarding the wrong bits and keeping the good ones. The correlation is distributed among all the bits, and each one is truly random only to some degree. An intelligent attacker could predict patterns in the raw output, and build an attack from there. Since you can't discard the bad bits, at least you can mask the correlation by thoroughly hashing the bit stream with a strong cryptographic function. The XOR mechanism is not thorough at all, but it's something: an LFSR with only one tap. A longer, more thorough LFSR using several irregularly spaced past inputs would be better, and not that difficult to implement.
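For instance, extending the single delayed XOR to several irregularly spaced taps is almost free (a sketch; the offsets 3, 61 and 512 are arbitrary picks for illustration, not tuned values):

```c
/* Sketch of a multi-tap variant: XOR the current raw byte with several
   irregularly spaced past bytes instead of a single one. The offsets
   3, 61 and 512 are arbitrary illustrations, not tuned values. */
#include <stdint.h>

#define HIST_SIZE 1024U   /* power of two, so masking works as modulo */

extern uint8_t next_raw_byte(void);   /* same hypothetical raw source */

static uint8_t hist[HIST_SIZE];
static unsigned pos = 0;

uint8_t next_scrambled_byte(void)
{
    uint8_t raw = next_raw_byte();
    uint8_t out = raw
        ^ hist[(pos -   3) & (HIST_SIZE - 1)]
        ^ hist[(pos -  61) & (HIST_SIZE - 1)]
        ^ hist[(pos - 512) & (HIST_SIZE - 1)];
    hist[pos & (HIST_SIZE - 1)] = raw;   /* unsigned wraparound keeps the indexing valid */
    pos++;
    return out;
}
```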
From a hardware point of view, I don't like this circuit. As I understand it, it depends on the noise on the power rail and on the inverting input of the comparator to generate the timed events. A good part of that noise might come from clocking and I/O events in the logic ICs, which are not very random, or not random at all. The author also writes that the circuit is vulnerable to interference if not enclosed. Though I only browsed them quickly, the Diehard test results provided didn't look that good either. Anyway, without a substantial investment of time, I couldn't say whether this scheme is really good or not.