The A53 in the Pi 3 runs 64 bit code about 20% faster than 32 bit code.
This seems odd to me. It is likely that the instruction set in 64 bit mode is better optimised than the older 32 bit instruction set. Also keep in mind that on 64 bit ARM an integer in C is still 32 bit!
With the exception of PUSH and POP (or LDM/STM) with more than two registers, A64 programs tend to use fewer instructions than A32 or T32 programs. The code size between A32 and A64 is often not much different, with A32 smaller in function entry and exit code, but A64 smaller in loops (and so executing fewer instructions overall if the loops iterate a few times).
It's certainly possible that some 64 bit CPU core might implement 32 bit instructions less efficiently due to the pipeline being optimized for only what the 64 bit instruction set needs. For example it's very likely that PUSH/POP/LDM/STM might execute at 1 clock cycle per register (as in fact 32 bit cores do), while 64 bit LDP/STP run at 1 clock cycle per *pair* of registers. It's even possible a 64 bit core might take a clock cycle for every register, whether it is included in the mask or not, at least until the mask becomes 0. As A64 doesn't have/need predication, it's possible A32 predicated instructions and T32 IT* instructions might be executed in the pipeline as a conditional branch for every predicated instruction. I can't recall off-hand whether every A32 addressing mode is also present in A64. If not then some might be split into 2 uops on some 64 bit cores.
From the ARMv8-A manual:
"ARMv8-A deprecates some uses of the T32 IT instruction. All uses of IT that apply to instructions other than a single subsequent 16-bit instruction from a restricted set are deprecated, as are explicit references to the PC within that single 16-bit instruction. This permits the non-deprecated forms of IT and subsequent instructions to be treated as a single 32-bit conditional instruction."
To me this suggests ARM intends the execution pipeline (on cores that support both 32 bit and 64 bit) to efficiently handle this restricted set of uses of IT, but other uses may be lower performance or even, after some time, not supported at all.
As far as I can tell, the A53 core executes 32 bit code as efficiently as possible (not worse than an A7, for example). But I think STP/LDP in function entry/exit code are faster than PUSH/POP (even if they are bigger code), and A64 executes fewer instructions in a number of other situations.
The size of a C integer depends on what OS / ABI / compiler you are using. It's true that most 64 bit systems have settled on an "LP64" model (long and pointer are 64 bits, int remains 32), with fewer adopting ILP64, and Windows using LLP64 (both int and long are 32 bits, so you need to cast pointers to long long for pointer arithmetic).
There are also ILP32 environments on 64 bit instruction sets, which allow you to make use of more registers and a more modern instruction set on x86 and ARM 64 bit without increasing the memory usage of data structures containing pointers, if 4 GB is enough for a single program. The x32 ABI on Linux is an example. The arm64 processor in the Apple Watch is also used with an ILP32 ABI.
The hardware doesn't care.
As long as you use types such as size_t, ptrdiff_t, uintptr_t in your C code then the programmer also doesn't care whether a pointer needs to be cast to int, long, or long long.