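The proper test is actually ((((uintptr_t)src) ^ ((uintptr_t)dst)) & 3). Note the parentheses around the XOR: in C, & binds more tightly than ^, so without them only dst would be masked. The expression is zero whenever 32-bit aligned access is possible for both pointers at the same time, and nonzero when the copy is inherently unaligned (i.e. whenever src is aligned, dst is unaligned, and vice versa).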
When the above expression is zero, ((-(uintptr_t)src) & 3) tells you the number of leading bytes that need to be transferred before an aligned transfer is possible; (((uintptr_t)src) & 3) itself is only the offset within the 32-bit word. If your MCU supports unaligned accesses, then doing a single unaligned 32-bit copy, followed by aligning src and dst to the next 32-bit boundary (so between one and three bytes will be transferred again), is likely to be more efficient than e.g. a Duff's Device bytewise copy of the initial bytes.
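As a minimal sketch of the two expressions above (the helper names same_word_phase and lead_bytes are made up for illustration, and a flat address space where the low bits of uintptr_t reflect the address alignment is assumed, as on typical 32-bit MCUs):

```c
#include <stdint.h>

/* Nonzero when src and dst have the same offset within a 32-bit word,
   i.e. when both can be brought to a 32-bit boundary together.
   The XOR is parenthesized because & binds more tightly than ^. */
static inline int same_word_phase(const void *dst, const void *src)
{
    return (((uintptr_t)src ^ (uintptr_t)dst) & 3) == 0;
}

/* Number of bytes from p up to the next 32-bit boundary (0 to 3).
   Negating an unsigned value is well defined in C. */
static inline unsigned lead_bytes(const void *p)
{
    return (unsigned)(-(uintptr_t)p & 3);
}
```

For a pointer one byte past a 32-bit boundary, lead_bytes() yields 3, whereas the plain low bits of the address are 1.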
Similarly, if there are any trailing bytes, a single unaligned 32-bit transfer covering the last four bytes of the buffer is probably more efficient than a Duff's Device bytewise copy or a trailing copy loop, whenever unaligned accesses are supported (because the extra cost tends to be single digit cycles). This obviously requires the copy to be at least four bytes long.
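A rough sketch of the overlapping head and tail transfers could look like the following. The names copy_word and copy_same_phase are hypothetical, and the sketch assumes n >= 4, non-overlapping buffers, and that src and dst share the same offset within a 32-bit word. The fixed-size memcpy() calls are only the portable C way to spell a possibly unaligned word access; on a real target they would normally compile to (or be replaced by) plain word loads and stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy one 32-bit word that may be unaligned. */
static inline void copy_word(unsigned char *d, const unsigned char *s)
{
    uint32_t w;
    memcpy(&w, s, sizeof w);
    memcpy(d, &w, sizeof w);
}

/* Hypothetical helper. Assumes n >= 4, no overlap, and
   (((uintptr_t)src ^ (uintptr_t)dst) & 3) == 0. */
static void copy_same_phase(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    unsigned char *const end = d + n;
    size_t lead = (size_t)(-(uintptr_t)s & 3);

    if (lead) {
        copy_word(d, s);        /* one unaligned word covers the head */
        d += lead;              /* step to the next 32-bit boundary;  */
        s += lead;              /* 4 - lead bytes get copied again    */
    }
    while ((size_t)(end - d) >= 4) {
        copy_word(d, s);        /* both pointers are aligned here */
        d += 4;
        s += 4;
    }
    if (d != end)               /* 1 to 3 trailing bytes remain */
        copy_word(end - 4, s + (end - d) - 4);  /* last four bytes */
}
```

The tail transfer rewrites up to three bytes the word loop already copied, which is harmless because the same values are written again.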
Of course, if the unaligned access cost is low enough, then just explicitly aligning one pointer and letting the other be unaligned is even better, because the test/selection logic before the copy has a significant cost compared to typical memcpy() sizes. Align the destination, on the assumption that it is more likely than the source to be accessed next, caching-wise.
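The destination-aligned variant might be sketched as below. The name copy_align_dst is made up, and the sketch assumes the target handles unaligned 32-bit loads cheaply; the fixed-size memcpy() calls again stand in for plain word loads and stores.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: only dst is aligned, src may stay unaligned.
   No test of the src/dst relationship is needed up front. */
static void copy_align_dst(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Leading bytes until dst reaches a 32-bit boundary. */
    while (n && ((uintptr_t)d & 3)) {
        *d++ = *s++;
        n--;
    }
    /* Aligned stores to dst, possibly unaligned loads from src. */
    for (; n >= 4; n -= 4, d += 4, s += 4) {
        uint32_t w;
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
    }
    /* Up to three trailing bytes. */
    while (n--)
        *d++ = *s++;
}
```

The bytewise head and tail loops here could be swapped for single overlapping unaligned transfers when the copy is at least four bytes long.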
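For explicit calls where you know the pointers and the data are 32-bit aligned, you can shave off the extra test/selection logic cost by picking the copy function at the call site, for example in an inline wrapper around your copy function. Even a check like
    if (_Alignof(*src) >= 4 && _Alignof(*dst) >= 4 && !(n & 3))
        memcpy32p((uint32_t *)dst, (const uint32_t *)src, (uint32_t *)dst + ((uint32_t)n / 4));
    else
        memcpy8p((unsigned char *)dst, (const unsigned char *)src, (unsigned char *)dst + n);
works, where the third parameter is the limit for dst, and not the number of units to copy. The compiler can resolve the alignment part of the condition at compile time, so only the selected call remains when n is also a constant. (Note that _Alignof applied to an expression instead of a type name is a GNU extension.)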
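One possible shape for such a wrapper is sketched here. The bodies of memcpy32p and memcpy8p are guesses at what limit-style copy functions might look like, and COPY_TYPED is a hypothetical macro name. It has to be a macro (not a function) so that the pointed-to types are still visible to _Alignof, and it relies on the GNU extension of applying _Alignof to an expression. Like the snippet above, the casts assume the data may legitimately be accessed as uint32_t (or a build with -fno-strict-aliasing).

```c
#include <stddef.h>
#include <stdint.h>

/* Word copy; 'end' is the limit for dst, not a count. */
static inline void memcpy32p(uint32_t *dst, const uint32_t *src,
                             uint32_t *const end)
{
    while (dst < end)
        *dst++ = *src++;
}

/* Byte copy; 'end' is the limit for dst, not a count. */
static inline void memcpy8p(unsigned char *dst, const unsigned char *src,
                            unsigned char *const end)
{
    while (dst < end)
        *dst++ = *src++;
}

/* Hypothetical wrapper. n is in bytes and is evaluated more than once,
   so pass an expression without side effects. */
#define COPY_TYPED(dst, src, n)                                         \
    do {                                                                \
        if (_Alignof(*(src)) >= 4 && _Alignof(*(dst)) >= 4 &&           \
            !((n) & 3))                                                 \
            memcpy32p((uint32_t *)(dst), (const uint32_t *)(src),       \
                      (uint32_t *)(dst) + (n) / 4);                     \
        else                                                            \
            memcpy8p((unsigned char *)(dst),                            \
                     (const unsigned char *)(src),                      \
                     (unsigned char *)(dst) + (n));                     \
    } while (0)
```

With uint32_t arrays and a constant size, for example COPY_TYPED(b, a, sizeof a), an optimizing compiler should be able to drop the bytewise branch entirely.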