The hazard, of course, is that cooperative multitasking requires cooperation, and you can't guarantee that with buggy or arbitrary code. For example, can you guarantee that all code paths and loops through a given function/module terminate within a reasonable time frame? I certainly can't. Not that I'm a particularly bad coder or anything, but even small, well-crafted instances of the Halting Problem (taking practical functions, not pathological extremes) very rapidly become nontrivial to prove.
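To make that concrete, here is a small sketch in Python (my choice of language for illustration, nothing platform-specific). The Collatz iteration is a well-known case of a tiny loop nobody has proven terminates for every positive input, so the only safe way to run it "cooperatively" is to bolt on a step budget. The name collatz_steps and the max_steps default are made up for this example.

```python
def collatz_steps(n, max_steps=10_000):
    """Count steps for n to reach 1, or return None if the budget runs out.

    Whether the unbounded loop terminates for every positive n is an
    open problem, so the step budget (an arbitrary, illustrative value)
    is the only termination guarantee we actually have here.
    """
    steps = 0
    while n != 1:
        if steps >= max_steps:
            return None  # gave up: cannot promise this ever finishes
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


print(collatz_steps(6))                 # 8
print(collatz_steps(27))                # 111
print(collatz_steps(27, max_steps=10))  # None
```

Three lines of loop body, and already the honest answer to "does it always return in reasonable time?" is "as far as anyone has checked".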
Cooperation is also fairly trivial to implement: you don't need to smash anyone's stack, and everything can be cleaned up neatly before return (yield() is more syntactic sugar for RET ... PROC entryPoint (...) than a function call as such -- an example of an inverted hierarchy). You can implement this in pretty much any language, even; you don't need ASM to glue it together, or interrupts or anything. It's great, when it works... but you can't guarantee that it works *in general*.
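As a rough illustration of how little glue that takes, here is a minimal round-robin cooperative scheduler using Python generators. This is only a sketch: worker and run_round_robin are hypothetical names, and a generator's yield stands in for the yield() discussed above.

```python
from collections import deque


def worker(name, steps, log):
    """A cooperative task: do one small unit of work, then yield."""
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # hand control back to the scheduler


def run_round_robin(tasks):
    """Resume each task in turn until all of them have returned."""
    ready = deque(tasks)
    while ready:
        task = ready.popleft()
        try:
            next(task)      # runs the task up to its next yield
        except StopIteration:
            continue        # task finished; drop it
        ready.append(task)  # still alive; back of the queue


log = []
run_round_robin([worker("A", 3, log), worker("B", 2, log)])
print(log)  # ['A:0', 'B:0', 'A:1', 'B:1', 'A:2']
```

Note there is nothing here that can take control away from a task: if one worker loops forever between yields, next(task) never returns and every other task starves, which is exactly the "in general" problem.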
Meanwhile, the user wants to see a response some time this... century, y'know, so inevitably a thread must be stopped, its state shelved on the stack, and the context switched over to another.
As I recall, this was a major complaint about early Macs: they were supposed to be cooperative, but inevitably certain user programs would hang the whole system and leave it unresponsive, whether briefly (longer thread time than specified/desired) or pathologically (infinite loop, have to break or reset).
A yield() is definitely a nice-to-have in a system -- or just a wait(tics) or what have you -- but it doesn't need to be implemented differently from a preemptive switch, and doing so adds complication. That said, there's still some justification: it can be more efficient, with the thread managing its own stack at will to save the kernel some RAM+CPU overhead sometimes (most of the time, even?). To take that advantage, you'd need a flag to say how the thread was halted and thus how to resume it. And if you're allocating memory from a pool during halt, you could skip the overhead of the standard stack frame and all that and use a smaller header. Maybe it's allocated from a different array (fixed-size allocations, but the thread object could be x or y size, so it pulls from pools {X} or {Y}), or it's arbitrarily sized and just put on the heap wherever. Mutating the size of an already-allocated thread object in situ would seem a non-starter, though, so you'd need a much more dynamic system to take advantage (say, an array of thread IDs, type/flags, and pointers to them, where the pointers can point at members of {X}, or {Y}, or {heap}). Also, since we're talking about such a limited platform, there's the question of whether the pointers are stored in extended memory, on what page(s), or reached via other means of access (HDD paging, networking, etc., depending on how deep and modern you want to go with it, lol).
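The table idea above might look something like the following, sketched in Python purely to show the bookkeeping. Everything here is an assumption for illustration: the names HaltKind, Store, ThreadEntry and ContextTable, the pool sizes (16 and 64 bytes), and the use of a dict as a stand-in for the heap. A real kernel on a small platform would do this with raw arrays and pointers.

```python
from dataclasses import dataclass
from enum import Enum, auto


class HaltKind(Enum):
    YIELDED = auto()    # thread gave up the CPU itself: small saved context
    PREEMPTED = auto()  # kernel interrupted it: full saved context


class Store(Enum):
    POOL_X = auto()     # small fixed-size slots
    POOL_Y = auto()     # large fixed-size slots
    HEAP = auto()       # arbitrary size, allocated wherever


@dataclass
class ThreadEntry:
    thread_id: int
    halt_kind: HaltKind  # the flag that says how to resume
    store: Store         # which pool (or heap) holds the context
    slot: int            # index into that pool, or key into the heap


class ContextTable:
    def __init__(self, x_size=16, x_count=4, y_size=64, y_count=2):
        # Pools are listed smallest first, so save() picks the tightest fit.
        self.pools = {
            Store.POOL_X: (x_size, [None] * x_count),
            Store.POOL_Y: (y_size, [None] * y_count),
        }
        self.heap = {}
        self.entries = {}

    def save(self, thread_id, halt_kind, context):
        for store, (size, slots) in self.pools.items():
            if len(context) <= size and None in slots:
                slot = slots.index(None)
                slots[slot] = context
                break
        else:
            # No pool slot fits (or all are full): fall back to the heap.
            store, slot = Store.HEAP, thread_id
            self.heap[slot] = context
        entry = ThreadEntry(thread_id, halt_kind, store, slot)
        self.entries[thread_id] = entry
        return entry

    def resume(self, thread_id):
        """Return (halt_kind, context) and free the storage."""
        entry = self.entries.pop(thread_id)
        if entry.store is Store.HEAP:
            context = self.heap.pop(entry.slot)
        else:
            slots = self.pools[entry.store][1]
            context, slots[entry.slot] = slots[entry.slot], None
        return entry.halt_kind, context


table = ContextTable()
a = table.save(1, HaltKind.YIELDED, b"\x00" * 8)      # fits pool X
b = table.save(2, HaltKind.PREEMPTED, b"\x00" * 48)   # needs pool Y
c = table.save(3, HaltKind.PREEMPTED, b"\x00" * 200)  # too big: heap
print(a.store.name, b.store.name, c.store.name)       # POOL_X POOL_Y HEAP
```

The thread object never changes size in place: each halt picks a fresh slot based on how the thread stopped, and resume() frees it again, so the table entry is the only thing that has to stay put.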
Tim