Someone pointed out that a memcpy implemented with 4-byte moves will fail if the gap between the two buffers is less than 4 bytes, which is supposedly why memcpy is normally done in byte mode.
Can anybody relate to this? I don't get it at all.
I don't get it either. A properly working memcpy implementation should neither read bytes outside the source buffer nor write bytes outside the destination. If that requirement is satisfied, it cannot fail as long as the buffers don't overlap (overlap, as you say, is not supported by memcpy and requires memmove). Some implementations do read past the end of the source, typically up to the end of the aligned word that holds the last source byte, as long as no architectural boundary is crossed (i.e., the over-read can't cause a fault). But since the extra bytes are never written to the destination, it doesn't matter whether they hold new or old data.
If the source and destination buffers are disjoint but share the same 4-byte word, then they can't both be 4-byte aligned, so a correct implementation would never cover both with a single aligned word move. Simple memcpy implementations work in byte mode purely to sidestep alignment issues. memcpy places no alignment requirements on the buffers or on the length. So memcpy implementations that want to use larger word sizes need to handle buffers that are not aligned, lengths that are not a multiple of the word size, and different alignment between source and destination, and they must do so without causing faults or severe misaligned-access penalties.
Optimized memcpy implementations obviously *do* check the sizes and alignments and try to use larger chunks when possible.
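To make the alignment handling concrete, here is a minimal sketch in C of how a word-based copy could be structured. It is not taken from any real libc: the name sketch_memcpy and the fixed 4-byte word are assumptions for illustration only. It copies bytes until the destination is aligned, then whole words, then the leftover bytes, and simply falls back to byte mode when source and destination have different alignment. It never touches a byte outside either buffer, which is why a gap of less than 4 bytes between the buffers is harmless.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical illustration, not a real library function.
 * Assumes a 4-byte word; real implementations pick the size per target. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Word moves are only attempted when both pointers sit at the same
     * offset within a 4-byte word, so aligning one aligns the other. */
    if (((uintptr_t)d % 4) == ((uintptr_t)s % 4)) {
        /* Head: byte moves until the pointers reach a word boundary. */
        while (n > 0 && ((uintptr_t)d % 4) != 0) {
            *d++ = *s++;
            n--;
        }
        /* Body: aligned 4-byte moves, entirely inside both buffers.
         * Note: casting byte pointers to uint32_t* is formally outside
         * ISO C's aliasing rules for arbitrary buffers; production code
         * uses assembly or compiler-specific guarantees here. */
        while (n >= 4) {
            *(uint32_t *)d = *(const uint32_t *)s;
            d += 4;
            s += 4;
            n -= 4;
        }
    }

    /* Tail, or the whole copy when the alignments differ: byte moves. */
    while (n > 0) {
        *d++ = *s++;
        n--;
    }
    return dst;
}
```

A real optimized version would also deal with the mismatched-alignment case using shifts or unaligned loads instead of giving up on word moves, but the head/body/tail split is the same idea.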