> A dynamic-size stack would be pretty inefficient. Obviously, depending on the memory layout, it could require being reallocated.
No, that won't work: we do not know which of the values on the stack are addresses into the stack, so we cannot do a fixup pass. In other words, the stacks must never move.
We can allocate new pages (and, if using a bucket-brigade approach, the local data stack does not even need consecutive address space), and code can choose to check for and release unused stack pages back to the kernel (shrinking the stack), but it can never move or reallocate the stack.
The overhead of a dynamic hardware stack comes from the SIGBUS/SIGSEGV signal on the first access to each new stack page. The typical work done in the handler is one mprotect() call and one mmap() call, plus an update to at least one TLS variable (keeping the current stack size; another variable is a set of flags, one of which indicates whether the stack can still be grown by at least one page).
Perhaps a better way to think about it is to compare it to using a PROT_NONE mapping for the entire stack address space except for the initially used pages. Whenever a SIGBUS/SIGSEGV fires, we essentially do an mprotect() call, telling the kernel to switch that page from inaccessible (PROT_NONE) to backed by RAM (PROT_READ|PROT_WRITE) and populated.
(So why don't we just do that instead? Because no resources are really saved that way: you then do in userspace what the kernel would otherwise do automatically.)
The similarity to the fixed-size stack is that stack accesses up to the current stack size have no overhead. None; no difference. When the stack size exceeds that estimate, a SIGBUS/SIGSEGV handler gets invoked. The process itself implements it, and its purpose is to decide whether to kill the thread or to grow the stack. Without dynamically growing stacks, such a policy can only be applied at function scope: checking whether the current stack pointer or stack frame is past some previously set limit; or, if the exact amount of stack needed by the next scope is known (which is possible to do in a compiler, but not so much in one's own C without help from the compiler), checking whether the needed amount would put the stack beyond the previously set limit.
Furthermore, there is then no need to spend resources on the virtual address space mapping (the "address reservation", as is done with PROT_NONE), which genuinely frees resources compared to the rather large default stacks (8 MiB per thread is typical on Linux on AMD64).
It may sound like that amounts to only a small saving, but I do sometimes write code with dozens of worker threads in pools, and hundreds when using a dedicated thread per client. Address space is cheap, but not free. Especially for service processes run on virtual hosts, I would rather trade a bit of performance for smaller memory requirements.
Because the SIGBUS/SIGSEGV handler uses `mmap(nextpage, page, PROT_NONE, MAP_PRIVATE | MAP_ANONYMOUS | MAP_GROWSDOWN | MAP_STACK, -1, 0)` to ask for a new stack guard page just beyond the current one, no existing mappings can ever be clobbered: the kernel either provides that exact address, fails outright, or provides a different address (which is then rejected and freed, and the stack marked no longer growable). Note the lack of the MAP_FIXED flag; this is crucial.
I do not see ordinary code ever shrinking the stack; only a worker pool library implementation might (when a worker is returned to the pool), and possibly not even those. I suspect it would be more efficient to just let ballooned workers die when they complete their current work, and to create fresh ones instead. However, exactly how large a thread stack is allowed to grow is a policy question, something that only the userspace process itself can and should decide; and this approach lets it do so at runtime, even changing its mind, instead of fixing a hard limit up front.
Let's estimate, then, the overhead for a real-world case. We start with, say, PTHREAD_STACK_MIN (16,384 bytes on AMD64 in Linux) of stack, and during the lifetime of the thread it expands to the default maximum of 8,388,608 bytes, in increments of a single page (4,096 bytes in this particular case). That means an overhead of 2,044 SIGBUS/SIGSEGV signals ((8,388,608 − 16,384) / 4,096), and about twice that in system calls. Is that a lot? No, it is not. To be easily detectable, it would all have to happen within one second or so; spread over anything longer, it will be lost in the noise and be very difficult to notice.
This is exactly why I wondered if an example implementation would help, because even trivial test cases would show the *real* overhead associated with the approach. Unless you have already implemented something similar, human intuition is unlikely to give a useful initial guess as to what the overhead really is.
(I have only done enough testing myself to hold that as an *opinion*; and I know that even if I write a test case, if I am the only one who runs it, the results are still only enough to base one's own opinion on. This stuff needs to be rigorously investigated and pushed to the limit, and I think it needs more than one viewpoint to examine properly.)