It is very interesting (but unclear) to me exactly why most C programmers – myself included – prefer
    rettype funcname(elemtype *ptr, size_t len);
over
    rettype funcname(size_t len, elemtype buf[len]);
even though the only difference in machine code is the order of parameters, while the latter API pattern allows much better buffer access checking at compile time, even through deep call chains (each call limiting to a smaller sub-array), helping catch buffer underrun/overrun errors.
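To make the comparison concrete, here is a minimal sketch of both styles in current C (C99 or later). The names sum_ptr, sum_arr and sum_first_half are made up for illustration, and int stands in for elemtype. How much checking you actually get depends on the compiler version and warning flags, so treat the comments about diagnostics as an assumption rather than a guarantee.

```c
#include <assert.h>
#include <stddef.h>

/* Pointer-first style: the compiler sees a bare pointer, and len is
   just another number with no declared relation to ptr. */
long sum_ptr(const int *ptr, size_t len)
{
    long total = 0;
    for (size_t i = 0; i < len; i++)
        total += ptr[i];
    return total;
}

/* Size-first style: the declared bound of buf refers to len, so the
   prototype itself documents (and lets tools see) the intended extent. */
long sum_arr(size_t len, const int buf[len])
{
    long total = 0;
    for (size_t i = 0; i < len; i++)
        total += buf[i];
    return total;
}

/* A call chain that narrows to a smaller sub-array: only the first
   half of buf is handed onwards. */
long sum_first_half(size_t len, const int buf[len])
{
    assert(len >= 2);  /* illustrative precondition, not part of the pattern */
    return sum_arr(len / 2, buf);
}
```

Calling sum_arr(6, data) and sum_ptr(data, 6) on the same six-element array produces the same result and essentially the same machine code; only the parameter order differs.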
The easy answer is inertia (or habit or familiarity or because everyone else does it that way too), but I'm not sure it is the whole answer.
Isn't it interesting how rarely anything like this (arrays-not-pointers) is suggested for "the next C", even though memory or buffer over/underrun bugs are the most common issues in C code?
replace pointers with arrays as the base memory reference type
Why would that be better?
It makes it possible for the compiler (with current gcc and clang static analysis/warning capabilities) to verify buffer accesses are valid. (That is, the compiler knows at compile time if each access is "valid" (safe, within the buffer), "invalid" (overrun/underrun), or "undetermined"; with the last one only affecting code that uses pointers or tricky indexing math whose limits are unknown at compile time.)
Ah! Of course, I was stuck in pointer mode, thinking the array would be passed as a pointer and just look like an array to the programmer. But I see now that's not the idea.
Yep. It's more like a cultural change than a technical one, even though its purpose is purely technical: help with compile-time static analysis wrt. buffer accesses.
That's more or less akin to always using the base pointer to an allocated block (rather than accessing it through a pointer that could point arbitrarily inside, or even outside of it) and some index for accessing its content. You also need to store the size.
Actually, what I want is for the compiler to be aware of the size whenever it is known at compile time.
If you consider the two funcname() definitions at the beginning of this post, you can clearly see the difference between the pointer and the array approach. This difference is the critical one; it is not about adding explicit size information to interfaces that currently use a pointer only. (Except for memory allocation functions: these should return both the allocated size and the base address, instead of just the base address. This would actually be desirable in many grow-as-needed use cases, considered completely separately. Oh, and possibly the string functions, which deserve to be redesigned anyway.)
That's something you can always do in pure C though, even if that means a bit more programming overhead and possibly a bit less opportunity for optimization (even though that would remain to be seen in practice.)
Note that the change would not cause any change to runtime code, no inherent additional runtime memory or CPU overhead at all.
Many string functions would actually add an explicit size parameter (ABI-wise), but I consider that a plus (and a deficiency in current standard C library string functions). I've discussed the related issues especially in embedded environments before; let's just say that string handling can be done much better (faster, more reliably) even in current C than what the standard C library provides.
Passing an array forwards is trivial even in current C (since C99), although the size of the array must be before the array in the parameter list, but receiving an array from a function call is not supported. Thus far, in my experiments I've simply assumed syntax "elemtype arrayname[sizetype count] = ...;" (declaring two variables at once, initialized by a single function call returning both the base pointer and the size, with the size divided by the element size to obtain the count "automagically"), but I'm sure better syntax can be devised.
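Since that receiving syntax does not exist, the closest approximation in current C that I know of is to return a small structure holding both the base pointer and the count. The following is only a sketch under that assumption: struct int_block, int_block_alloc() and fill_squares() are hypothetical names, and int stands in for elemtype.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical pair type: element count plus base address. */
struct int_block {
    size_t count;  /* number of int elements; 0 if nothing was allocated */
    int   *base;   /* NULL if nothing was allocated */
};

/* Hypothetical allocator that hands back both the base address and
   the element count, instead of just the base address. */
struct int_block int_block_alloc(size_t count)
{
    struct int_block b = { 0, NULL };
    if (count > 0) {
        int *p = calloc(count, sizeof *p);  /* zero-initialized */
        if (p) {
            b.base = p;
            b.count = count;
        }
    }
    return b;
}

/* Size-first consumer: the received pair is passed forwards as an array. */
void fill_squares(size_t len, int buf[len])
{
    for (size_t i = 0; i < len; i++)
        buf[i] = (int)(i * i);
}
```

The caller then writes fill_squares(b.count, b.base), so the size travels with the base address from the allocation onwards, and finally frees b.base. The compiler still cannot connect b.count to b.base by itself here; that connection is exactly what the assumed syntax would add.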
The basic idea is to create a type for your 'arrays', something like this:
    typedef struct
    {
        BaseType *Block; // pointer to your memory block, from static or dynamic allocation
        size_t n;        // current number of items of type 'BaseType' in the memory block
        size_t nMax;     // max number of items of type 'BaseType' in the memory block
    } Array_t;
Yes, I use this pattern extensively. For some reason, I use 'used' for the current number of items, and 'size' for the maximum number of items, and 'item' for the pointer or C99 flexible array member. It is very common to see a variant of
    typedef struct { size_t size; size_t used; elemtype item[]; } elem_array;
in my code.
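For anyone who has not used the flexible array member variant, here is a minimal sketch of how it typically gets allocated and filled. The helper names elem_array_new(), elem_array_push() and elem_sum() are invented for this example, int is assumed for elemtype, and the size calculation is not checked for overflow.

```c
#include <assert.h>
#include <stdlib.h>

typedef int elemtype;  /* assumption: int stands in for the element type */

typedef struct {
    size_t   size;    /* maximum number of items allocated */
    size_t   used;    /* current number of items */
    elemtype item[];  /* C99 flexible array member */
} elem_array;

/* Allocate header and items in one block; returns NULL on failure. */
elem_array *elem_array_new(size_t size)
{
    elem_array *a = malloc(sizeof *a + size * sizeof a->item[0]);
    if (a) {
        a->size = size;
        a->used = 0;
    }
    return a;
}

/* Append one item; returns 0 on success, -1 if the array is full. */
int elem_array_push(elem_array *a, elemtype value)
{
    if (a->used >= a->size)
        return -1;
    a->item[a->used++] = value;
    return 0;
}

/* Size-first consumer: used and item are passed forwards as an array. */
long elem_sum(size_t len, const elemtype buf[len])
{
    long total = 0;
    for (size_t i = 0; i < len; i++)
        total += buf[i];
    return total;
}
```

The point is that elem_sum(a->used, a->item) already carries everything the compiler would need; the structure just keeps the two together.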
Indeed, whenever this information is already available, why don't we tell the C compiler about it, so it can help check the array boundaries for us at compile time?
This is the core of this suggestion. Not to add size and/or used everywhere (except to functions that in my opinion should have had the size from the beginning, even in the standard C library), but to help the compiler understand better exactly what we humans intend, and help catch our thinkos at compile time.