Author Topic: function-call on protected stack  (Read 2280 times)


Online DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
function-call on protected stack
« on: December 23, 2020, 06:17:16 pm »
Spin off from *the other* topic
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
pure software solution
« Reply #1 on: December 23, 2020, 06:18:05 pm »
Pure software solution (MIPS example)

MIPS offers these two instructions
Code: [Select]
Jump And Link
    jal addr:
        ra <= PC_next
        PC_next <= addr

Code: [Select]
Jump to Register
    jr $r
        PC_next <= $r


caller: jal addr can be used to invoke a function
callee: jr $ra can be used to return to the caller

This simple mechanism can use a stack to save the return address. MIPS doesn't provide any specific stack operations, so any register can be used to mimic push and pop behaviour.

So, since you can use two different registers as "stack pointers", you can have two stacks: one dedicated to the return address (RA), and one dedicated to function parameters, local variables, and return values.

Code: [Select]
Load word (32bit)
    lw $t,c($s)
        EA = $s + c       
        $t = memory_read_u32[EA]

Code: [Select]
Store word (32bit)
    sw $d,c($s)
        EA = $s + c
        memory_write_u32[EA] = $d

caller
Code: [Select]
    # as the function-body stack pointer, fbosp, choose one of the available registers
    # as the function *RA* stack pointer, frasp, choose one of the available registers

    lw $p, #parameter0
    sw $p, #offset0($fbosp)
    ...
    lw $p, #parameterN
    sw $p, #offsetN($fbosp)

    addiu $fbosp, $fbosp, #adjustment   # sizeof(parameter0) + ... + sizeof(parameterN)
    jal addr

callee
Code: [Select]
    # jal has already copied the return address into $ra, so save it into the dedicated stack
    sw $ra,#0($frasp)
    ...
    # it's time to return, reload the ra from the dedicated stack
    lw $ra,#0($frasp)
    jr $ra
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
This solution needs special Hardware
« Reply #2 on: December 23, 2020, 06:22:23 pm »
This solution needs special Hardware

"RAS" stands for "Return Address Stack", sorry I am addicted to acronyms

  • dedicated RAS_area { begin, end }
  • dedicated RAS_pt: the stack pointer, hidden (not visible to user code); only the kernel can access it
  • dedicated RAS_push: special store instruction, it implicitly operates on the hidden RAS_pt
  • dedicated RAS_pop: special load instruction, it implicitly operates on the hidden RAS_pt
  • dedicated RAS_pt_mov: protected instructions to manipulate RAS_pt; only the kernel can use them
  • dedicated RAS_area_mov: protected instructions to manipulate RAS_area.*; only the kernel can use them
  • special logic to raise a trap when the RAS area is addressed by common lw/sw instructions
  • a special trap in the list of exceptions that can be raised: except_RAS_area_violation

caller
Code: [Select]
     jal addr
     
callee
Code: [Select]
     RAS_push       # save RA into the RA stack
     .... do stuff
     RAS_pop        # load RA from the RA stack
     J (RA)         # return     

Code: [Select]
special instructions to be implemented

RAS_push:
     RAS_pt++
     RAS_sw RA, (RAS_pt)

RAS_pop:
     RAS_lw RA, (RAS_pt)
     RAS_pt--

special logic (only when CPU_MODE is "user" ):
     common-load/store use EA, Effective Address
          when load_IO
               if EA is within { RAS_area.begin ... RAS_area.end } then trap, otherwise ok
          when store_IO
               if EA is within { RAS_area.begin ... RAS_area.end } then trap, otherwise ok
     RAS-load/store (RAS_push/RAS_pop) use EA, Effective Address
          when load_IO
               if EA is within { RAS_area.begin ... RAS_area.end } then ok, otherwise trap
          when store_IO
               if EA is within { RAS_area.begin ... RAS_area.end } then ok, otherwise trap


Here you have common load/store, and dedicated protected stack-load/store
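
Roughly the same check, written as a C model (purely illustrative; the names mirror the pseudo-ops above and are not an existing API):
Code: [Select]
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uintptr_t begin;    /* RAS_area.begin */
    uintptr_t end;      /* RAS_area.end   */
} ras_area_t;

/* Decide whether a user-mode access must trap.
 * ea          : effective address of the load/store
 * is_ras_insn : true for RAS_push/RAS_pop, false for common lw/sw
 */
static bool ras_access_traps(const ras_area_t *area, uintptr_t ea, bool is_ras_insn)
{
    bool inside = (ea >= area->begin) && (ea <= area->end);

    /* common lw/sw must stay outside the RAS area,
       RAS_push/RAS_pop must stay inside it */
    return is_ras_insn ? !inside : inside;
}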

Only the kernel and exception ISRs can use RAS_pt_mov and directly access RAS_pt.
So, in a multi-tasking OS, RAS_pt can be saved as part of the task's context.

These then need to be added to the task's context (a minimal sketch follows the list):
  • RAS_area.begin
  • RAS_area.end
  • RAS_pt
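
A minimal sketch of how that state could sit in the task-context structure (field names and types here are just illustrative):
Code: [Select]
#include <stdint.h>

typedef struct {
    /* ... general purpose registers, PC, status register, ... */

    /* return-address-stack state, saved/restored on every context
       switch, reachable only from kernel mode (RAS_pt_mov etc.) */
    uintptr_t ras_area_begin;
    uintptr_t ras_area_end;
    uintptr_t ras_pt;
} task_context_t;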
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6830
  • Country: fi
    • My home page and email address
Re: function-call on protected stack
« Reply #3 on: December 24, 2020, 04:31:48 pm »
Motorola 6809 has two 16-bit stacks: S (system) and U (user).
The JSR instruction pushes the instruction pointer to the S stack, and jumps to the specified address.
The RTS instruction pops the instruction pointer off the S stack.

There are two push and two pop instructions: PSHS and PULS that operate on the S stack, and PSHU and PULU that operate on the U stack.

However, there are no restrictions on the use of S or U stacks.

If you have a flat memory model, with one data address space, then two stacks just add complexity without any real benefits.  If you wish to use a fixed region for the stacks, you might think of having one grow upwards and the other downwards, but then, why not just use one stack for both?

If you use multiple regions for the stacks, you need to be careful in balancing their sizes.  Let's call the data stack to address stack size ratio r.  If r optimally reflects real world usage, then the ratio of the amount of data used in each stack is also r.  If r is too small, you'll run out of data stack before you run out of address stack.  If r is too large, you'll run out of address stack before you run out of data stack.

But, r is not a constant!  It is completely dependent on the call chains, because different functions need different amounts of data stack, but only one slot in the address stack.  So, there is always a tradeoff, an estimate on r, that means more memory reserved (for two stacks) than is actually needed, compared to the amount of memory reserved for one combined stack.  How annoying!

On the other hand, the 6809 does not force anyone to use the U stack for data; the S stack can also be used for data.  Then, the U stack can be the current data object/stream/whatever being manipulated.  But that is not what this thread is about.

If you have paged virtual memory, you can reserve a lot of address space for each stack, but not back it by RAM before it is actually needed.  If you have only 32-bit address space, this is a lot of reserved address space; potentially several megabytes off of 4096-megabyte address space.  It might not sound much at first.  (On 32- and 64-bit Intel architectures, most Linux distributions default to 8 MiB stacks per thread.)  But what about when you have multiple threads running in the same process, each with their own stacks?  In Linux, the default thread stack means that a 32-bit process is limited to around 130 threads, because there just isn't enough address space for more thread stacks!  You can still have lots of RAM unused (or as page cache, which amounts to the same thing, really), but no continuous address space left.
The minimum working stack on x86 is 16,384 bytes (PTHREAD_STACK_MIN from <limits.h>), and a 32-bit process can easily have a thousand threads if needed, if the stack size is adjusted (pthread_attr_init(), pthread_attr_setstacksize(), and using the attribute set for creating threads with pthread_create()).  (Indeed, in Linux, user and group IDs are per-thread properties – just "hidden" by the POSIX C API, and maintained across all threads using a dedicated POSIX realtime signal –, so if one uses freestanding code, that could even make sense for some particular network service.  Right now, process parallelism is used however, which has its own benefits (separate address spaces!) and downsides (interprocess communication needed).)
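
For reference, shrinking the per-thread stack with those calls looks roughly like this (a sketch only; 64 KiB is an arbitrary choice, and the program must be built with -pthread):
Code: [Select]
#include <limits.h>
#include <pthread.h>
#include <stdio.h>

static void *worker(void *arg)
{
    (void)arg;
    /* ... thread work ... */
    return NULL;
}

int main(void)
{
    pthread_attr_t attr;
    pthread_t      tid;
    size_t         stacksize = 64 * 1024;      /* per-thread stack */

    if (stacksize < PTHREAD_STACK_MIN)         /* never go below the minimum */
        stacksize = PTHREAD_STACK_MIN;

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, stacksize);

    if (pthread_create(&tid, &attr, worker, NULL) != 0) {
        perror("pthread_create");
        return 1;
    }
    pthread_attr_destroy(&attr);
    pthread_join(tid, NULL);
    return 0;
}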

Because of this, I don't think splitting the return address stack from the data stack makes sense, unless a segmented memory model is used (where the stacks are in completely separate memory address spaces, i.e. pointer p points to different data depending on which segment it is used for), or the address space is huge, at least 48-bit, preferably 64-bit, so that the address space has lots and lots more room than physical RAM, and one will always run out of accessible RAM before running out of address space to use that RAM with.

On an embedded system, say a microcontroller, it does not seem sensible to protect return addresses (or stack), because there is no real userspace there.  It can be very useful to split privileged parts from unprivileged parts – for example, a web http/https service stack does not need access to anywhere but unprivileged buffers and the TCP/IP stack provided by the privileged part –,  and even have a stack guard (in all function preambles in the unprivileged part), though.  Personally, I'd love to have the ability for the privileged part to completely restart the unprivileged part.  (Normally a reboot suffices, though, with acceptable effects for the human users.)
« Last Edit: December 24, 2020, 04:34:12 pm by Nominal Animal »
 
The following users thanked this post: DiTBho

Online DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
Re: function-call on protected stack
« Reply #4 on: December 24, 2020, 05:57:53 pm »
Quote from: Nominal Animal
Motorola 6809 has two 16-bit stacks: S (system) and U (user).
The JSR instruction pushes the instruction pointer to the S stack, and jumps to the specified address.
The RTS instruction pops the instruction pointer off the S stack.

Even 68k works this way  :D

Two stacks and the same behaviour for JSR and RTS, but then everything is on one stack: parameters, all the function's local variables, the return values, and the return address. That is exactly what you don't want, because it is exploitable by buffer overflow.

The only solution for these architectures is to copy what JSR pushes onto the stack into a second stack, and fetch it back just before RTS.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
Re: function-call on protected stack
« Reply #5 on: December 24, 2020, 06:14:18 pm »
68K software solution (workaround?)

Code: [Select]
JSR Jump to subroutine
   Operation:
   SP <- SP - 4;
   Mem[SP] <- PC
   PC <- function_address


caller
Code: [Select]
     jsr addr       # this automagically pushes RA onto the stack
     
callee
Code: [Select]
     MOVE.L (SP),D1      # RA was pushed here by JSR; copy it
     MOVE.L D1,(RAS_P)+  # push it onto the RA stack, D1 as temp;
                         # RAS_P can be A0,A1,..,A6
                         # A7 is always the SP

     .... do stuff

     MOVE.L -(RAS_P),D1  # pop RA from the RA stack
     MOVE.L D1,(SP)      # put it back where RTS expects it
     RTS                 # return
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
Re: function-call on protected stack
« Reply #6 on: December 24, 2020, 06:38:53 pm »
Quote from: Nominal Animal
If you use multiple regions for the stacks, you need to be careful in balancing their sizes

It is a serious problem. At the moment I have reserved space only for a depth of 32 function calls per task.

Quote from: Nominal Animal
Because of this, I don't think splitting the return address stack from the data stack makes sense

I am open to other solutions. 

Quote from: Nominal Animal
On an embedded system, say a microcontroller, it does not seem sensible to protect return addresses (or stack), because there is no real userspace there.

The problem here is the "buffer overflow"  :D

It can happen in kernel space as well as in user space. I have here a microcontroller (without any MMU/TLB) running uC/OS-II as the kernel, and a tiny web server as the main task. Well, it's not robust code but rather weak, so it can certainly suffer a malicious exploit: when a function that handles an HTTP request doesn't check its parameters, a request crafted to deliberately cause a buffer overflow can overwrite the saved RA with something that will then be used as the return address when the function returns. That is a serious vulnerability.
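
To make the failure mode concrete, this is the classic shape of such a bug (a purely hypothetical handler, not code from that firmware): the request is copied into a fixed-size stack buffer with no length check, so a long request runs past the buffer and can reach the saved return address.
Code: [Select]
#include <string.h>

/* hypothetical, deliberately broken request handler */
static void handle_request(const char *request)
{
    char path[64];          /* fixed-size buffer on the stack */

    strcpy(path, request);  /* no bounds check: a request longer than 63
                               bytes overruns 'path' and, on a descending
                               stack, can overwrite the saved RA */
    /* ... parse 'path', send the response ... */
}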

edit:
Please note: in the above hardware solution, the RAS_push and RAS_pop mechanisms add privilege levels on a per-instruction basis, and the scheme is fully implementable without requiring special memories.
« Last Edit: December 24, 2020, 08:13:49 pm by DiTBho »
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6830
  • Country: fi
    • My home page and email address
Re: function-call on protected stack
« Reply #7 on: December 24, 2020, 09:55:43 pm »
Quote from: DiTBho
The problem here is the "buffer overflow"  :D
Yup; and deliberate overwrite of the return address is a known exploit scheme.

Separating the return addresses into a separate stack that is not easily accidentally overwritten is definitely something I'd like to see (but with the constraints I mentioned above).

Using a stack that grows upwards (increasing addresses) would help a lot here, because then the return address is at a lower address than the temporary variables and arrays; buffer overrun then does not overwrite the return address, but the unused area of the stack (and whatever follows the stack).

A big reason for buffer overruns being so common in C code is because a lot of the C standard library functions have idiotic interfaces.  Hell, standard C still doesn't have getline(), strdup(), strndup(), etc.

An embedded RTOS, or any code running in a freestanding environment (no standard C library available, only features described by <float.h>, <iso646.h>, <limits.h>, <stdalign.h>, <stdarg.h>, <stdbool.h>, <stddef.h>, <stdint.h>, and <stdnoreturn.h>; and compiler-provided C extensions), like say the Arduino environment for example, is free to implement its own.

For specific things like serial input processing, a coroutine consuming input byte by byte, and parsing and processing it as it comes in, would work fine on AVRs.  But, the C model really doesn't provide a good interface for those.  So, buffers are usually used instead.

(If you want, try writing a C function to parse input, with the function called once whenever a new byte is available, with that byte as a function parameter.  So, basically like a finite state machine, where each call provides the new input value, so you can do a state transition, and optionally produce output – say, in the form of calling a function.  It can be done, sure; but it's not something you'd expect a C learner to be able to do correctly.)
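
A tiny illustration of that style (a made-up protocol, nothing standard): the parser keeps its state in a struct, is fed one byte per call, and fires a callback whenever a complete token has been seen, so no line buffer is needed.
Code: [Select]
/* hypothetical push-parser: accepts decimal numbers terminated by any
   non-digit byte, and reports each complete number via a callback */
struct num_parser {
    enum { NP_IDLE, NP_DIGITS } state;
    unsigned value;
    void (*on_number)(unsigned value);
};

void num_parser_feed(struct num_parser *p, unsigned char byte)
{
    switch (p->state) {
    case NP_IDLE:
        if (byte >= '0' && byte <= '9') {
            p->value = byte - '0';
            p->state = NP_DIGITS;
        }
        break;

    case NP_DIGITS:
        if (byte >= '0' && byte <= '9') {
            p->value = p->value * 10 + (byte - '0');   /* state transition only */
        } else {
            if (p->on_number)
                p->on_number(p->value);                /* produce output */
            p->state = NP_IDLE;
        }
        break;
    }
}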

If you want mutable strings or data buffers in C, without risking buffer overruns, I like to use
Code: [Select]
struct area {
    size_t         size;
    size_t         used;
    unsigned char  data[];
};
where data[] contains the actual data in a C99 flexible array member; size is the size of that array; and used is the number of bytes of data in the initial part of the array.  This is very similar to Pascal strings, where the length of the string is just before the data; except I have size, length, and then data.
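
An append helper on top of that layout could look like the sketch below (the helper names are mine, not from any library): the payload is allocated once with a fixed size, and every write is checked against it.
Code: [Select]
#include <stdlib.h>
#include <string.h>

struct area {
    size_t         size;
    size_t         used;
    unsigned char  data[];
};

/* allocate an area with room for 'size' bytes of payload */
struct area *area_new(size_t size)
{
    struct area *a = malloc(sizeof *a + size);
    if (a) {
        a->size = size;
        a->used = 0;
    }
    return a;
}

/* append 'len' bytes; returns 0 on success, -1 if it would overflow */
int area_append(struct area *a, const void *src, size_t len)
{
    if (len > a->size - a->used)
        return -1;                    /* refuse instead of overrunning */
    memcpy(a->data + a->used, src, len);
    a->used += len;
    return 0;
}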

For I/O buffers, I like to use
Code: [Select]
struct iobuffer1 {
    size_t         size;
    size_t         head;
    size_t         tail;
    unsigned char  data[];
};

struct iobuffer2 {
    size_t         size;
    unsigned char *head;
    unsigned char *tail;
    unsigned char *data;
};
where head is the index or pointer to the first byte in the buffer, tail is the index or pointer just past the last byte in the buffer (with head == tail indicating an empty buffer).  With these, the only buffer issues are off-by-one errors, especially when parsing e.g. escape codes and the escape mark is the last byte buffered.  Those I avoid by comparing the current parsing position to the buffer tail, with as many "extra" bytes allocated to data[] as the longest escape sequence is, and being able to cancel parsing and retry when there is more data available.
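
One simple (non-circular) reading of those fields, again with made-up helper names: append at the tail, consume at the head, and rewind both to zero once the buffer drains, with size as the hard limit.
Code: [Select]
#include <stddef.h>
#include <string.h>

struct iobuffer1 {
    size_t         size;
    size_t         head;
    size_t         tail;
    unsigned char  data[];
};

/* append bytes at the tail; returns the number of bytes actually copied */
size_t iob_put(struct iobuffer1 *b, const void *src, size_t len)
{
    size_t room = b->size - b->tail;
    if (len > room)
        len = room;                   /* never write past data[size] */
    memcpy(b->data + b->tail, src, len);
    b->tail += len;
    return len;
}

/* consume up to 'len' bytes from the head; returns the bytes copied */
size_t iob_get(struct iobuffer1 *b, void *dst, size_t len)
{
    size_t avail = b->tail - b->head;
    if (len > avail)
        len = avail;
    memcpy(dst, b->data + b->head, len);
    b->head += len;
    if (b->head == b->tail)           /* empty: rewind to the start */
        b->head = b->tail = 0;
    return len;
}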

Quote from: DiTBho
It can happen in kernel space as well as in user space. I have here a microcontroller (without any MMU/TLB) running uC/OS-II as the kernel, and a tiny web server as the main task. Well, it's not robust code but rather weak, so it can certainly suffer a malicious exploit: when a function that handles an HTTP request doesn't check its parameters, a request crafted to deliberately cause a buffer overflow can overwrite the saved RA with something that will then be used as the return address when the function returns. That is a serious vulnerability.
Have you read any Apache or APR sources?  Instead of heap, it uses allocation pools.  There is one "global" pool, but each request gets their own pool, and any memory that needs to be allocated, is allocated from that pool.  (It's not a pool in the sense that it has limits; it's more in the sense that all allocations within that pool can be freed with a single call.)  This does not help with buffer overruns, though?  Well, it does, because no array-like data ever lives on stack; they're always allocated from the pool instead.

Similarly, we don't really need a stack for function-local data.  It is just one of the possible ways of managing the memory needed by functions for temporary or more longer-duration data.  It's just that a stack is an obvious choice when you consider the complexity of implementing an allocator.

Let's say you had a very fast memory allocator, and the stack is only used for storing scalar data, no arrays.
If all arrays had to be dynamically allocated, then using something like struct iobuffer2 above would require four pointers (three pointers and one unsigned size), and "standard library" functions could check for array boundaries.
Thus far, such dynamic allocators tend to be considered too "slow" for general purpose stuff.
(Yet, it is exactly the direction e.g. the Linux kernel is going.)

See how it comes round to memory management schemes?  Well, at least for me.  This is why I'm looking into segmented memory models, i.e. where one has some sort of special base pointer (maybe just aligned in memory), and offsets and sizes with respect to that base pointer; and the hardware can detect buffer overruns (by simply making sure that the effective offset with respect to a given base pointer is within its size).  Return address stack would be only one particular use case, and if it was only accessible to CALL/RET instructions, ROP exploits and buffer overruns affecting function return address would be impossible.  It's just that passing three or four pointer- or pointer-size variables per address/variable is a lot of overhead; there must be a pattern that reduces the overhead, especially since typical code only juggles between a small number of separate "objects" (or variable scopes).  If the segment is hardcoded to the assembly instruction, like on x86 and x86-64, you'd need a separate function for each combination of reference segments.  If, however, the segment was coded to the "base pointer", then one function would suffice.
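
In software, that idea is essentially a "fat pointer": a base plus a size, with every access checked against them, which is what such hardware would do per instruction. A hedged sketch:
Code: [Select]
#include <stdbool.h>
#include <stddef.h>

/* software model of a (base, size) segment descriptor */
struct segment {
    unsigned char *base;
    size_t         size;
};

/* checked byte read: refuses instead of overrunning */
static bool seg_read_u8(const struct segment *s, size_t offset, unsigned char *out)
{
    if (offset >= s->size)
        return false;                 /* the hardware would raise a trap here */
    *out = s->base[offset];
    return true;
}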
« Last Edit: December 24, 2020, 10:01:07 pm by Nominal Animal »
 
The following users thanked this post: DiTBho

Online DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
Re: function-call on protected stack
« Reply #8 on: December 24, 2020, 10:23:49 pm »
WOW great post, a lot of info, thank you!  :D

Quote from: Nominal Animal
Have you read any Apache or APR sources?

Not yet; my HTTPd is very simple and primitive. I don't remember exactly how many kbytes its code takes, but the whole firmware takes less than 32 Kbyte of ROM, coupled with an RTOS, a TCP/IP library, and a few other tiny things. I know it because the Keil-ARM7 toolchain has this limit in evaluation mode.

This is also the amount of BRAM I have on my FPGA: less than 64 Kbyte for RAM and ROM.

Apache is up and running on my Linux server, but I have never looked at its source. I remember that when I compiled it, it took 2 hours, so I think it's much more advanced and rich in features and mechanisms.

I will look for the "allocation pool" concept, that's very interesting!
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6830
  • Country: fi
    • My home page and email address
Re: function-call on protected stack
« Reply #9 on: December 24, 2020, 11:17:24 pm »
Apache Portable Runtime is a bit bloaty, but a pretty reliable set of utility functions; definitely interesting stuff to look at and consider.  The bucket concept APR implements is also very useful.

You can do a minimal pool allocator implementation on top of standard C with
Code: [Select]
struct pool;

struct chunk {
    struct pool  *pool;  /* Pool this belongs to */
    struct chunk *next;  /* Next chunk in this pool */
    size_t        size;
    unsigned char data[];
};

struct pool {
    struct pool  *pools;  /* Optional: For subpools */
    struct chunk *chunks;
};
using the following functions on top of malloc() et al:
Code: [Select]
struct pool *pool_new(void);
struct pool *pool_subpool(struct pool *parent);
void         pool_free(struct pool *pool);

void *pool_get(struct pool *pool, size_t size);
void *pool_drop(void *mem);
void *pool_resize(void *mem, size_t size);
void *pool_move(struct pool *to, void *mem);
void *pool_optimize(struct pool *to, void *mem, size_t size);
pool_new() and pool_subpool() just allocate a new pool structure, and initialize it.  pool_subpool() adds the new subpool to the parent->pools chain.
pool_free() frees all pools hanging off the ->pools, and all chunks hanging off its ->chunks, as well as the ->chunks of the subpools.  Simple two-loop iterative function, unless you have multiple threads, when it becomes highly complex.

pool_get() is like malloc(), but reserves memory for the struct chunk just before the data also.  Basically, char *ptr = malloc(size + sizeof (struct chunk));, then returns ptr + sizeof (struct chunk), i.e. ((struct chunk *)ptr)->data.  The chunk is added to the ->chunks chain of the specified pool.
pool_drop() is like free(), but removes the chunk from the owner pool ->chunks chain.  Note that struct chunk *ch = (struct chunk *)((char *)mem - sizeof (struct chunk));.  pool_drop() is rarely used.
pool_resize() first has to remove the chunk from the owner pool ->chunks chain.  Then, it can use realloc() to resize the chunk.  If successful, the chunk (with a possibly changed address) is re-added to the owner pool ->chunks chain.  If realloc() fails, the old allocation is still valid, and the chunk should be added back to the owner pool ->chunks chain anyway.  This way, the ->chunks chain never contains pointers to invalid chunks.  If you only update the chain afterwards, you cannot have threadsafe structures even with locking.  (As these are now, you get thread safety by adding a mutex to each pool structure, and grab it before traversing either chain.)
pool_move() just removes the chunk from one ->chunks chain, and adds to another.
pool_optimize() combines reallocation and moving.  You remove the chunk from one ->chunks chain, then reallocate it, and if successful, add it to the to->chunks chain; if realloc fails, you again need to add the original back to its original ->chunks chain.
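
Filling in a few of those functions, minus the subpool bookkeeping and the threading parts (a sketch on top of the structures above, not a drop-in library):
Code: [Select]
#include <stdlib.h>

struct pool;

struct chunk {
    struct pool  *pool;
    struct chunk *next;
    size_t        size;
    unsigned char data[];
};

struct pool {
    struct pool  *pools;        /* subpools: ignored in this sketch */
    struct chunk *chunks;
};

struct pool *pool_new(void)
{
    struct pool *p = malloc(sizeof *p);
    if (p) {
        p->pools  = NULL;
        p->chunks = NULL;
    }
    return p;
}

void *pool_get(struct pool *pool, size_t size)
{
    struct chunk *c = malloc(sizeof *c + size);   /* chunk header + data */
    if (!c)
        return NULL;
    c->pool = pool;
    c->size = size;
    c->next = pool->chunks;                       /* prepend to the chain */
    pool->chunks = c;
    return c->data;
}

void pool_free(struct pool *pool)
{
    struct chunk *c = pool->chunks;
    while (c) {                                   /* free every chunk ...  */
        struct chunk *next = c->next;
        free(c);
        c = next;
    }
    free(pool);                                   /* ... then the pool itself */
}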

A service can have a pool for each connected session, with each request (within that session) having their own pool.  That way, if the session is destroyed, the request pools are destroyed also.

In multithreaded processes, pool_free() is a bit more complicated.  Usually, its chains are added to a garbage chain, and not immediately freed, because other threads may be holding a pointer to one of the objects in the pool.  Because objects are not individually freed, reference counting will not help.  You can only destroy the chunks in the garbage chain when you know no thread is holding a temporary pointer to its contents.  If you use a worker pool of threads, then you can mark each garbage chain with a timestamp, and have each worker thread record their timestamp when they complete ongoing work and go back to the worker pool.  When recording that timestamp, you can check the other threads' timestamps, and find the oldest.  All garbage pools with an even older timestamp can be safely destroyed.
 
The following users thanked this post: DiTBho

Online DiTBho (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4227
  • Country: gb
Re: function-call on protected stack
« Reply #10 on: January 11, 2021, 12:40:27 am »
Today I needed to recompile gcc-4.4.3 because it is used in an old project, and I added "Mudflap" to the build script.

Quote
The mudflap run time checker was removed in GCC 4.9 and it is superseded by Address Sanitizer. The mudflap options remain, but do nothing. This page applies only to earlier versions of GCC.

Mudflap Pointer Debugging

It is enabled by passing -fmudflap to the compiler. For front-ends that support it (C and very simple C++ programs), it instruments all risky pointer/array dereferencing operations, some standard library string/heap functions, and some other associated constructs with range/validity tests. Modules so instrumented should be immune to buffer overflows, invalid heap use, and some other classes of C/C++ programming errors. The instrumentation relies on a separate runtime library (libmudflap), which will be linked into a program if -fmudflap -lmudflap is given at link time. Run-time behavior of the instrumented program is controlled by the MUDFLAP_OPTIONS environment variable.

see here  :D
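
For example, a trivial overrun like the one below is the kind of access mudflap reports at run time when the program is built with those options (an illustrative snippet based on the quoted documentation, not on an actual gcc-4.4 run):
Code: [Select]
/* build with the old compiler, e.g.:  gcc-4.4 -O0 -fmudflap overrun.c -lmudflap
   run-time behaviour is tuned via the MUDFLAP_OPTIONS environment variable */
int main(void)
{
    int a[4];
    a[4] = 1;       /* one element past the end: mudflap flags this access */
    return 0;
}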
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

