Author Topic: Need some advice on threading on a Z80 (Read 11842 times)

tggzzz · « **Reply #50 on:** August 19, 2024, 09:42:59 pm »

Consider not having the thread request more stack. Instead have a data structure that defines a thread's entry point, stack size, and any other resource requirements. Determine whether resources are available; if not, don't bother to create the thread.
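A minimal sketch of that idea in C, for illustration only. The names `thread_spec`, `kernel_state`, `thread_admit` and `idle_entry` are hypothetical, and the only resources assumed here are RAM and thread-table slots.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: everything the kernel needs to know up front. */
typedef struct {
    void   (*entry)(void);   /* thread entry point */
    uint16_t stack_bytes;    /* fixed stack size, never grown later */
    uint16_t heap_bytes;     /* any other RAM the thread will need */
} thread_spec;

/* Hypothetical kernel bookkeeping. */
typedef struct {
    uint16_t free_ram;       /* bytes still unallocated */
    uint8_t  free_slots;     /* thread table entries still unused */
} kernel_state;

/* Example entry point that does nothing. */
void idle_entry(void) { }

/* Reserve the resources and return true if the thread fits;
   otherwise return false and leave the kernel state untouched. */
bool thread_admit(kernel_state *k, const thread_spec *spec)
{
    uint32_t need = (uint32_t)spec->stack_bytes + spec->heap_bytes;
    if (spec->entry == NULL || k->free_slots == 0 || need > k->free_ram)
        return false;
    k->free_ram = (uint16_t)(k->free_ram - need);
    k->free_slots--;
    return true;
}
```

The point of the design is that the decision is made once, before the thread exists, so a running thread never has to cope with a failed stack request.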

Personally I like only three thread priorities: normal, interrupt, and emergency panic.

Having multiple "normal" priorities causes priority inversion problems, and invites twiddling priorities to "get it to ~~work~~ not fail (yet)".

Comms between threads is best achieved with messages+queues, not with fork/join and similar. That enables easy debugging and operation monitoring. Interrupts also communicate with threads using messages.

I also like cooperative scheduling, where each thread reaches a convenient point and then calls yield(). That does require applications to be structured sympathetically, but fits naturally with event-based and finite state machine based architectures.

JamesIsAwkward · « **Reply #51 on:** August 20, 2024, 01:22:41 pm »

Quote from: tggzzz on August 19, 2024, 09:42:59 pm

Consider not having the thread request more stack. Instead have a data structure that defines a thread's entry point, stack size, and any other resource requirements. Determine whether resources are available; if not, don't bother to create the thread.

This is exactly what I was thinking! The executable header could have the resource requirements and then the kernel could decide if it has the spare resources to run it.

Quote

Personally I like only three thread priorities: normal, interrupt, and emergency panic.

Having multiple "normal" priorities causes priority inversion problems, and invites twiddling priorities to "get it to work not fail (yet)".

Yeah I think you may be right here. I like the idea of prioritizing memory bottlenecked threads over I/O threads but it would add a lot of complexity to my thread handler. I'm a big fan of getting a base system running then add more advancements later. I'll keep it simple for now lol

Quote

Comms between threads is best achieved with messages+queues, not with fork/join and similar. That enables easy debugging and operation monitoring. Interrupts also communicate with threads using messages.

This is where my knowledge totally breaks down. I need to do more reading on inter-thread communication to figure out how to handle this. Would you care to explain the high level idea behind messages and queues vs implementing a fork system?

tggzzz · « **Reply #52 on:** August 20, 2024, 03:05:57 pm »

Quote from: JamesIsAwkward on August 20, 2024, 01:22:41 pm

Quote from: tggzzz on August 19, 2024, 09:42:59 pm
Comms between threads is best achieved with messages+queues, not with fork/join and similar. That enables easy debugging and operation monitoring. Interrupts also communicate with threads using messages.

This is where my knowledge totally breaks down. I need to do more reading on inter-thread communication to figure out how to handle this. Would you care to explain the high level idea behind messages and queues vs implementing a fork system?

That would take a book!

Start by thinking of a high-level way of describing what your application has to do, and how it does it. Structure your architecture to reflect that. Find design patterns that match your architecture's concepts. Often events, actions and FSMs are a good start. They work at many levels from hardware logic to telecom billing systems.

Understand the concepts in UML State diagrams (Harel StateCharts), and Sequence diagrams. You don't need to use all their features in their full glory! Just use the simple bits.

Understand the concepts in Hoare's Communicating Sequential Processes, as implemented in Transputers/xCORE/TMS320, Occam/xC/Go etc.

Understand the concepts in Apple's ~~"Grand Unified Junction".~~ EDIT: Grand Central Dispatch.

Events are messages that message sources put() into named FIFOs. Examples: "hotter button pushed", "set power to X", "new value of temperature is Y", "switch to emergency mode". Other threads get() messages from named FIFOs, and do whatever actions are required based on the message contents. For example, on receiving a "new temperature" message, a temperature controller might recalculate the power required and send a "set power" message to the output device.
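As a sketch of the temperature controller example in C: one message comes in, at most one goes out. The message names, the `controller` struct and the toy gain of 10 are all assumptions made up for illustration.

```c
#include <stdint.h>

typedef enum {
    MSG_NONE,
    MSG_HOTTER_PRESSED,
    MSG_NEW_TEMPERATURE,
    MSG_SET_POWER
} msg_type;

typedef struct {
    msg_type type;
    int16_t  value;   /* temperature in degrees, or power in percent */
} message;

typedef struct {
    int16_t setpoint; /* target temperature */
} controller;

/* Handle one incoming message and return the message to send on,
   or a MSG_NONE message if there is nothing to send. */
message controller_handle(controller *c, message in)
{
    message out = { MSG_NONE, 0 };
    switch (in.type) {
    case MSG_HOTTER_PRESSED:
        c->setpoint++;
        break;
    case MSG_NEW_TEMPERATURE: {
        /* Toy proportional control (assumed): 10% power per degree of error. */
        int16_t power = (int16_t)((c->setpoint - in.value) * 10);
        if (power < 0)   power = 0;
        if (power > 100) power = 100;
        out.type  = MSG_SET_POWER;
        out.value = power;
        break;
    }
    default:
        break;        /* ignore messages this thread does not understand */
    }
    return out;
}
```

Because the handler only sees messages, it can be tested on a PC by feeding it a recorded message sequence, which is where the easy debugging and monitoring comes from.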

FIFOs can be one deep (often called message/mail boxes) or a fixed depth.
If a FIFO is empty and a get() is attempted, one variant stalls the thread until a message is available, another returns immediately indicating nothing was extracted.
If a FIFO is full and a put() is attempted, one variant stalls that thread until the message has been inserted, while another returns immediately indicating it wasn't inserted. The former is effectively a "waitUntil()" operation, the latter could be used to implement a waitUntil(X or Y or...) operation.
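A minimal sketch of the "returns immediately" variants in C. `fifo`, `fifo_try_put` and `fifo_try_get` are hypothetical names, the messages are single bytes to keep it short, and a real version shared with an interrupt routine would need interrupts disabled around the updates.

```c
#include <stdbool.h>
#include <stdint.h>

#define FIFO_DEPTH 4   /* fixed depth; 1 would make it a mailbox */

typedef struct {
    uint8_t items[FIFO_DEPTH];
    uint8_t head;      /* index of the oldest item */
    uint8_t count;     /* number of items currently stored */
} fifo;

/* Non-blocking put: returns false if the FIFO is full. */
bool fifo_try_put(fifo *f, uint8_t item)
{
    if (f->count == FIFO_DEPTH)
        return false;
    f->items[(uint8_t)(f->head + f->count) % FIFO_DEPTH] = item;
    f->count++;
    return true;
}

/* Non-blocking get: returns false if the FIFO is empty. */
bool fifo_try_get(fifo *f, uint8_t *item)
{
    if (f->count == 0)
        return false;
    *item = f->items[f->head];
    f->head = (uint8_t)((f->head + 1) % FIFO_DEPTH);
    f->count--;
    return true;
}
```

The stalling variants can be layered on top: loop on the try version and yield to the scheduler each time it returns false.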

brucehoult · « **Reply #53 on:** August 20, 2024, 03:25:15 pm »

Quote from: tggzzz on August 20, 2024, 03:05:57 pm

Understand the concepts in Apples "Grand Unified Junction".

Grand Central Dispatch?

tggzzz · « **Reply #54 on:** August 20, 2024, 04:02:22 pm »

Quote from: brucehoult on August 20, 2024, 03:25:15 pm

Quote from: tggzzz on August 20, 2024, 03:05:57 pm
Understand the concepts in Apples "Grand Unified Junction".

Grand Central Dispatch?

<mutter>Bollocks</mutter>

Useful correction. I knew what I meant, honestly!

JamesIsAwkward · « **Reply #55 on:** August 20, 2024, 05:21:21 pm »

man I really appreciate the help here! I'm going to chew on this for a while and start reading!

tggzzz · « **Reply #56 on:** August 20, 2024, 05:55:32 pm »

Quote from: JamesIsAwkward on August 20, 2024, 05:21:21 pm

man I really appreciate the help here! I'm going to chew on this for a while and start reading!

It is a pleasure to help someone that formulates decent questions, listens, thinks and understands.

That isn't as common as we would like

JamesIsAwkward · « **Reply #57 on:** August 21, 2024, 03:06:31 pm »

Well I'm glad my questions are keeping you sharp!

I think I like your idea of co-op threading... how do you feel about this approach:
I scrap my current thread_handler, and repurpose the 60 Hz timer on the INT pin to fire a routine that increments a 3-4 byte section of memory. The ZX Spectrum does this with its vblank signal and updates a 24-bit section of memory to keep track of time.

Then, like you suggested, I can have my threads yield() after their main() loop reaches the end (that I will have to consciously write to be as quick/small as possible).
When this happens, the thread can check to see if a certain amount of elapsed time has passed (16ms maybe?), and if so it will gracefully yield to the thread handler. If the time allotment hasn't passed it will have the option of running its main() again.. or yield if it is done.

This means if a program/routine needs to do something that will likely take more time than 16ms or whatever I set, I will need to write the loop in a way that can do it in pieces and in a thread safe manner...
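A rough sketch of that tick counter and the elapsed-time check, written in C for readability (on the real machine it would be a few lines of Z80 assembly). The names `ticks`, `timer_isr` and `slice_expired` are made up for illustration. At 60 Hz one tick is about 16.7 ms, so a 16 ms allotment is a slice of one tick.

```c
#include <stdbool.h>
#include <stdint.h>

#define TICK_MASK 0xFFFFFFUL   /* only the low 24 bits are used */

/* Incremented by the 60 Hz interrupt routine. On real hardware a
   multi-byte read of this must happen with interrupts disabled. */
volatile uint32_t ticks;

/* Hypothetical handler hooked to the INT pin. */
void timer_isr(void)
{
    ticks = (ticks + 1) & TICK_MASK;
}

/* True once at least `slice` ticks have passed since `start`.
   The masked subtraction stays correct across 24-bit wraparound. */
bool slice_expired(uint32_t start, uint32_t slice)
{
    return ((ticks - start) & TICK_MASK) >= slice;
}
```

A thread would note `ticks` when it is resumed and call `slice_expired()` at the end of each pass through its loop to decide whether to yield.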

Pair this with a message queue, like you also recommended lol, then the thread handler can still be round-robin style but the threads can check to see if they have the message they are waiting for, and if not then they can just yield() again. I'm only planning on doing 8-16 threads max so the process of iterating through them and checking should be fast enough that it wouldn't matter.... I think?

Also I'm still thinking of my thread status flags, I'm thinking maybe "running", "suspended", and "queued" flags? Queued being the thread is ready to start but hasn't been started yet? I'm not sure about these yet though. Though I'm pretty certain on the "running" flag of course haha

That sound like a decent plan or have I missed something?

JamesIsAwkward · « **Reply #58 on:** August 21, 2024, 04:46:16 pm »

Also to add.. while I'm coding this it doesn't feel as "elegant" as the kernel handling all of this.. I mean it's not much to add a couple of lines to call routines that check the elapsed time or yield but it doesn't feel as "nice" haha

tggzzz · « **Reply #59 on:** August 21, 2024, 05:34:47 pm »

My preference is that a thread should run for as long as it deems desirable, and then yield(). That leaves the thread's application logic to state when it can safely be "interrupted" by another thread. There is no need for the RTOS or compiler or library to have a concept of thread safety.

Having a specific tick size smells of pre-emptive non-cooperative threading.

Consider just having your threads like this...

do something
getBlocking( fifoA )
do something else
getBlocking( fifoB )
etc

where getBlocking( aFifo ) implicitly yields, and tells the RTOS to only return when aFifo has something in it.

A generalisation would be whichFifo=getBlockingSelect( fifo1, fifo2 ), which tells the RTOS to only return when at least one of the FIFOs has something in it, and let the thread know which fifo is not empty.

Another alternative is tick=getTime(); getBlockingOrTimeout( aFifo, tick+5 ), which will return if either aFifo contains something or getTime() > tick+5.
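A true getBlocking() needs a stack per thread and a small assembly context switch, which portable C cannot show. As a simplified stand-in, this sketch uses run-to-completion handlers instead: the scheduler only calls a task when the FIFO it waits on has something in it, and returning from the handler is the implicit yield. All names here (`mailbox`, `task`, `scheduler_pass`, `task_a`, `task_b`) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One-deep FIFO (mailbox), enough for the sketch. */
typedef struct { bool full; uint8_t item; } mailbox;

typedef struct {
    mailbox *waits_on;                  /* FIFO this task is blocked on */
    void   (*on_message)(uint8_t msg);  /* runs to completion, then yields */
} task;

/* One round-robin pass: run every task whose FIFO is non-empty.
   Returns the number of tasks that ran. */
size_t scheduler_pass(task *tasks, size_t n)
{
    size_t ran = 0;
    for (size_t i = 0; i < n; i++) {
        mailbox *mb = tasks[i].waits_on;
        if (mb->full) {
            mb->full = false;               /* this is the get() */
            tasks[i].on_message(mb->item);
            ran++;
        }
    }
    return ran;
}

/* Example wiring: task A forwards each message, plus one, to task B. */
mailbox box_a, box_b;
uint8_t last_seen_by_b;

/* Sketch only: no check that box_b is already full. */
void task_a(uint8_t msg) { box_b.item = (uint8_t)(msg + 1); box_b.full = true; }
void task_b(uint8_t msg) { last_seen_by_b = msg; }
```

A task that is waiting costs one flag test per pass, so with 8 to 16 tasks the polling overhead should be small. Nothing in the scheduler depends on a tick size.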

T3sl4co1l · « **Reply #60 on:** August 22, 2024, 03:25:04 am »

The hazard of course is, cooperative requires cooperation, and you can't guarantee that with buggy or arbitrary code. For example, can you guarantee that all code paths and loops through a given function/module terminate within a reasonable time frame? I certainly can't, not that I'm a particularly bad coder or anything, but even small, well-crafted examples of the Halting Problem (taking practical functions over pathological extremes) are very rapidly nontrivial to prove.

Cooperation is also fairly trivial to implement, you don't need to smash anyone's stack, everything can be cleaned up neatly before return (yield() is more of a RET ... PROC entryPoint (...) syntactic sugar, than a function call as such -- an example of an inverted hierarchy); you can implement this in pretty much any language, even, you don't need ASM to glue it together, or interrupts or anything. It's great, when it works... but you can't guarantee that it works *in general*.

Meanwhile, the user wants to see a response, some time this...century y'know, so inevitably a thread must be stopped, shelved on the stack, and context switched over.

As I recall, this was a major complaint of early Macs, that were supposed to be cooperative but inevitably certain user programs would hang the whole system and become unresponsive, whether briefly (longer thread time than specified/desired) or pathologically (infinite loop, have to break or reset).

A yield() is definitely a nice-to-have in a system -- or just a wait(tics) or what have you -- but it doesn't need to be implemented differently, and adds more complication to do so. That said, there's still some justification: it can be more efficient, the thread managing its own stack at-will to save on RAM+CPU overhead by the kernel sometimes (most of the time, even?). To take that advantage, you'd need a flag to say how it was halted and thus how to resume, and, if you're allocating memory from a pool during halt, you could save the overhead of the standard stack frame and all that and use a smaller header, maybe it's allocated from a different array (fixed size allocations, but the thread object could be x or y size so pulls from pools {X} or {Y}), or arbitrarily sized and just put on the heap wherever; but mutating the size of an already-allocated thread object in-situ would seem a non-starter, and you'd need a much more dynamic system to take advantage (say, maybe there's an array of thread IDs, type/flags, and pointers to them; and the pointers can be at members of {X}, or {Y}, or {heap}). Or also since we're talking such a limited platform: whether pointers are stored in extended memory, on what page(s), or via other means of access (HDD paging, networking, etc., depending on how deep and modern you want to go with it, lol).

Tim

tggzzz · « **Reply #61 on:** August 22, 2024, 08:27:32 am »

Quote from: T3sl4co1l on August 22, 2024, 03:25:04 am

The hazard of course is, cooperative requires cooperation, and you can't guarantee that with buggy or arbitrary code. For example, can you guarantee that all code paths and loops through a given function/module terminate within a reasonable time frame? I certainly can't, not that I'm a particularly bad coder or anything, but even small, well-crafted examples of the Halting Problem (taking practical functions over pathological extremes) are very rapidly nontrivial to prove.

I'm certainly not a fan of using formal methods to prove something; it is rare to find a problem/implementation where the proofs can be useful.

However, given an appropriate coding style and decent languages+tools, there can be useful timing predictions. Suitable coding styles are somewhat predicated on the application, of course. There are many applications - especially embedded - that do match.

Quote

Cooperation is also fairly trivial to implement, you don't need to smash anyone's stack, everything can be cleaned up neatly before return (yield() is more of a RET ... PROC entryPoint (...) syntactic sugar, than a function call as such -- an example of an inverted hierarchy); you can implement this in pretty much any language, even, you don't need ASM to glue it together, or interrupts or anything. It's great, when it works... but you can't guarantee that it works *in general*.

True, but then nothing works "in general"

The context is a Z80, and how someone could implement something simple and basic. Anybody expecting to be able to implement anything much more than a CP/M program loader is going to be disappointed.

For non-cooperative scheduling of competing applications, I'd choose a different starting point!

Quote

Meanwhile, the user wants to see a response, some time this...century y'know, so inevitably a thread must be stopped, shelved on the stack, and context switched over.

As I recall, this was a major complaint of early Macs, that were supposed to be cooperative but inevitably certain user programs would hang the whole system and become unresponsive, whether briefly (longer thread time than specified/desired) or pathologically (infinite loop, have to break or reset).

I still see that with some web-based applications running in a browser

(Wadda you mean? Surely a browser is the operating system)

JCK · « **Reply #62 on:** September 06, 2024, 11:45:38 pm »

I haven't read this whole thread so I apologize if this has already been mentioned, but you might want to take a look at FreeRTOS, I believe there is a port for the Z80 and variants. This is a great RTOS and well documented and studying it can give you a lot of insight into multi-threading, because it's written by experts. I've used it for many, many years on various microprocessors with great success.

woofy · « **Reply #63 on:** September 07, 2024, 11:04:08 am »

One thing I've not seen mentioned is that the Z80 does not have a position independent instruction set. Relative branches are limited to roughly ±128 bytes and there are no relative calls. It can be worked around but it's a hassle. That means you cannot just load a program "somewhere" in RAM and expect it to work.
Are you intending to run CP/M-like programs with RAM at zero and programs starting at 0x100, or only your own code?

Do you have a schematic of your current hardware you can share? It might help in suggesting options.

radiolistener · « **Reply #64 on:** September 09, 2024, 04:33:40 pm »

Quote from: woofy on September 07, 2024, 11:04:08 am

One thing I've not seen mentioned is that the Z80 does not have a position independent instruction set.

it has, for example JR, DJNZ.
For memory access you can use (IX/IY+offset) instructions which access memory relative to IX/IY register.

For JP/CALL you can also generate a table and update it for a specific address at runtime. This approach was often used in ZX Spectrum tools, which could then be loaded and run from any address.

Regarding multi-threading on the Z80, I think this approach will not be effective due to the slow speed and expensive context switch of the Z80 core. There is just 64k of address space for RAM, so you will lose a lot of RAM and performance just for context switching.

A more effective approach would be to run two separate Z80s and have shared memory for data exchange.

tggzzz · « **Reply #65 on:** September 09, 2024, 07:15:25 pm »

Quote from: radiolistener on September 09, 2024, 04:33:40 pm

Quote from: woofy on September 07, 2024, 11:04:08 am
One thing I've not seen mentioned is that the Z80 does not have a position independent instruction set.

it has, for example JR, DJNZ.
For memory access you can use (IX/IY+offset) instructions which access memory relative to IX/IY register.

Arguably useful for Fortran arrays, but IX<-(IX+offset) would be more useful for chaining down C linked structures.

Quote

Regarding multi-threading on the Z80, I think this approach will not be effective due to the slow speed and expensive context switch of the Z80 core. There is just 64k of address space for RAM, so you will lose a lot of RAM and performance just for context switching.

Remove the adjectives, and insert numbers. Then compare with requirements.

Quote

A more effective approach would be to run two separate Z80s and have shared memory for data exchange.

And for 8 threads?

brucehoult · « **Reply #66 on:** September 10, 2024, 01:13:59 am »

Quote from: radiolistener on September 09, 2024, 04:33:40 pm

Quote from: woofy on September 07, 2024, 11:04:08 am
One thing I've not seen mentioned is that the Z80 does not have a position independent instruction set.

it has, for example JR, DJNZ.
For memory access you can use (IX/IY+offset) instructions which access memory relative to IX/IY register.

IX/IY are slow and use extra prefix bytes in the code. LD r,(IX+n) is 19 cycles vs 7 for LD r,(HL). If stepping through RAM by 1, INC HL; LD r,(HL) is 6 cycles faster.

Quote

For JP/CALL you can also generate a table and update it for a specific address at runtime. This approach was often used in ZX Spectrum tools, which could then be loaded and run from any address.

Load-time relocation was common on all old OSes for CPUs without efficient PIC. You can also relocate absolute addresses for data access, not only call/jump. It just needs a data table which can be discarded from RAM after the relocation is done.
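A minimal sketch of such a loader fix-up pass, in C for clarity. It assumes a made-up file layout: the program was assembled as if loaded at address 0, and a separate table lists the offset of every 16-bit absolute address inside the image. `relocate` and its parameters are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* Add load_addr to every 16-bit absolute address listed in reloc[].
   image      : program bytes, assembled for a base address of 0
   reloc      : offsets into image where an absolute address is stored
   n_reloc    : number of entries in reloc
   load_addr  : where the image actually sits in the Z80 address space */
void relocate(uint8_t *image, const uint16_t *reloc, size_t n_reloc,
              uint16_t load_addr)
{
    for (size_t i = 0; i < n_reloc; i++) {
        uint8_t *p = image + reloc[i];
        /* Z80 stores addresses low byte first. */
        uint16_t addr = (uint16_t)(p[0] | (p[1] << 8));
        addr = (uint16_t)(addr + load_addr);
        p[0] = (uint8_t)(addr & 0xFF);
        p[1] = (uint8_t)(addr >> 8);
    }
}
```

For example, `CALL 0005h` is stored as `CD 05 00`; with the table entry pointing at offset 1 and a load address of 8000h, the operand bytes become `05 80`. Once the pass has run, the table is no longer needed and its RAM can be reused.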

Quote

Regarding multi-threading on the Z80, I think this approach will not be effective due to the slow speed and expensive context switch of the Z80 core. There is just 64k of address space for RAM, so you will lose a lot of RAM and performance just for context switching.

A more effective approach would be to run two separate Z80s and have shared memory for data exchange.

Registers are only 16 bytes per thread! [1] You can fit the state for 64 threads into 1K of RAM. Do you seriously propose somehow using 64 separate Z80 chips instead?

RAM for each thread's stack is a far larger problem, which is not solved by using multiple Z80s with shared memory, unless only a portion of the address space is shared.

Sharing the Z80's zero page is also an issue, though it's used far less than the 6502's zero page (absolutely needed if you use indirect addressing), so you could simply not use zero page addressing on the Z80.

[1] 24 bytes if you allow each thread to own AF',BC',DE',HL' as well, which you probably should.
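To make the byte counts concrete, here is a sketch of the saved state as C structs. The struct and function names are hypothetical, and the I and R registers and the interrupt enable state are left out, as in the figures above.

```c
#include <stdint.h>

/* Main register set plus SP and PC: 8 x 2 = 16 bytes. */
typedef struct {
    uint16_t af, bc, de, hl, ix, iy, sp, pc;
} z80_context;

/* With the shadow set as well: 12 x 2 = 24 bytes. */
typedef struct {
    z80_context main;
    uint16_t af2, bc2, de2, hl2;   /* AF', BC', DE', HL' */
} z80_full_context;

/* RAM needed to park the registers of n_threads threads. */
uint16_t context_ram(uint8_t n_threads, uint8_t with_shadow)
{
    uint16_t per_thread = with_shadow ? (uint16_t)sizeof(z80_full_context)
                                      : (uint16_t)sizeof(z80_context);
    return (uint16_t)(n_threads * per_thread);
}
```

So 64 threads need 1024 bytes without the shadow registers and 1536 bytes with them.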

JamesIsAwkward · « **Reply #67 on:** September 10, 2024, 04:59:01 pm »

As far as hardware goes this is a totally custom build.

I've actually stepped back from the software side for a bit in order to finish designing my core PCBs. I'm starting to hit the limit of my breadboard space on my desk and it is time to move this project to proper PCBs.

I'm doing a modular setup so I've designed a simple backplane and some daughter cards (CPU card, memory card, UART, keyboard/mouse, Compact Flash, etc).

I think I disagree with the idea that the Z80 will struggle with threading. There are plenty of Z80 operating system projects that are threaded and run great AND most of them are preemptive threading in nature! SymbOS and KnightOS are good examples. Especially since I'm likely going to do a hard limit of 8 threads (or maybe 16 if I can make it run well enough).

My memory card v1 will just be a simple EEPROM + SRAM setup so I can get this thing fabbed and back to writing some software again.
I'm working on designing a v2 that will support bank switching and also hopefully support swapping the EEPROM with SRAM after bootloading to reclaim that chunk of memory space. (which now that I think about it I'm going to have to use I/O space to interface with the MMU so I need to take another look at my memory map and onboard address decoders on my backplane before I fab it...)

I've attached a picture of the backplane. This is the first PCB I've ever designed so I'm totally open to criticism!

SiliconWizard · « **Reply #68 on:** September 10, 2024, 10:28:02 pm »

Fun stuff.

tggzzz · « **Reply #69 on:** September 11, 2024, 01:17:55 am »

Quote from: JamesIsAwkward on September 10, 2024, 04:59:01 pm

I think I disagree with the idea that the Z80 will struggle with threading.

Everything will struggle at some point.

One of the major decision points in embedded systems is which functions must be in hardware (for low latency and/or high throughput), and which can be in software. Matching all the timing constraints is a key issue!

Quote

I've attached a picture of the backplane. This is the first PCB I've ever designed so I'm totally open to criticism!

Are you taking all Z80 control signals directly onto the backplane? That's not normal practice.

Be cautious about whether buffers are needed; the Z80 pins are not intended to drive "many" loads that are a "long" way away.

brucehoult · « **Reply #70 on:** September 11, 2024, 01:46:35 am »

Quote from: tggzzz on September 11, 2024, 01:17:55 am

Quote from: JamesIsAwkward on September 10, 2024, 04:59:01 pm
I think I disagree with the idea that the Z80 will struggle with threading.

Everything will struggle at some point.

It's about 150 clock cycles to push all main and shadow registers and PC and save the SP somewhere. And the same to load the registers for the next thread. That's 300 cycles total. Allow another 100 cycles to figure out which thread to run next and you get 400 cycles. 100 µs on a 4 MHz CPU.

A Z80 can do a full task-switch 10,000 times a second.

Most OSes from the 1970s to this day switch CPU-bound tasks (that finish their time slice) 100 times a second. So on the Z80 you'll use 1% of the CPU time to do that task switching.

That's not awful.
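The arithmetic, as a small C helper meant to be run on a PC rather than on the Z80 itself. The function names are made up, and the cycle counts fed in are the estimates above, not measurements.

```c
#include <stdint.h>

/* Time for one context switch, in microseconds. */
uint32_t switch_time_us(uint32_t cycles_per_switch, uint32_t clock_hz)
{
    return (uint32_t)(((uint64_t)cycles_per_switch * 1000000u) / clock_hz);
}

/* Share of CPU time spent switching, in hundredths of a percent
   (so a result of 100 means 1.00%). */
uint32_t switch_overhead_centipercent(uint32_t cycles_per_switch,
                                      uint32_t switches_per_sec,
                                      uint32_t clock_hz)
{
    uint64_t cycles_per_sec = (uint64_t)cycles_per_switch * switches_per_sec;
    return (uint32_t)((cycles_per_sec * 10000u) / clock_hz);
}
```

With 400 cycles per switch at 4 MHz, `switch_time_us(400, 4000000)` gives 100 µs, and 100 switches a second gives `switch_overhead_centipercent(400, 100, 4000000)` = 100, i.e. 1%.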

tggzzz · « **Reply #71 on:** September 11, 2024, 08:54:07 am »

Quote from: brucehoult on September 11, 2024, 01:46:35 am

Quote from: tggzzz on September 11, 2024, 01:17:55 am
Quote from: JamesIsAwkward on September 10, 2024, 04:59:01 pm
I think I disagree with the idea that the Z80 will struggle with threading.

Everything will struggle at some point.

It's about 150 clock cycles to push all main and shadow registers and PC and save the SP somewhere. And the same to load the registers for the next thread. That's 300 cycles total. Allow another 100 cycles to figure out which thread to run next and you get 400 cycles. 100 µs on a 4 MHz CPU.

A Z80 can do a full task-switch 10,000 times a second.

Most OSes from the 1970s to this day switch CPU-bound tasks (that finish their time slice) 100 times a second. So on the Z80 you'll use 1% of the CPU time to do that task switching.

That's not awful.

As someone using a Z80 for hard realtime embedded systems written in C (except a few lines of stack twiddling, of course!) in the early 80s, I know it is possible.

As I'm sure you know, in embedded systems, latency and predictability are usually at least as important as raw throughput.

As for "struggling at some point", it is worth calculating how many floating point operations a Z80 can do per second

brucehoult · « **Reply #72 on:** September 11, 2024, 09:19:25 am »

Quote from: tggzzz on September 11, 2024, 08:54:07 am

Quote from: brucehoult on September 11, 2024, 01:46:35 am
Quote from: tggzzz on September 11, 2024, 01:17:55 am
Quote from: JamesIsAwkward on September 10, 2024, 04:59:01 pm
I think I disagree with the idea that the Z80 will struggle with threading.

Everything will struggle at some point.

It's about 150 clock cycles to push all main and shadow registers and PC and save the SP somewhere. And the same to load the registers for the next thread. That's 300 cycles total. Allow another 100 cycles to figure out which thread to run next and you get 400 cycles. 100 µs on a 4 MHz CPU.

A Z80 can do a full task-switch 10,000 times a second.

Most OSes from the 1970s to this day switch CPU-bound tasks (that finish their time slice) 100 times a second. So on the Z80 you'll use 1% of the CPU time to do that task switching.

That's not awful.

As someone using a Z80 for hard realtime embedded systems written in C (except a few lines of stack twiddling, of course!) in the early 80s, I know it is possible.

As I'm sure you know, in embedded systems, latency and predictability are usually at least as important as raw throughput.

The above times are for fully general low priority CPU bound background tasks with forced time-slicing. The real-time parts can of course be made interrupt-driven and save and restore a lot fewer registers in most cases. You can also make the ABI treat some of the registers as temporaries, so task switches driven by completing a unit of work and voluntarily returning to the OS can save and restore fewer registers.

Quote

As for "struggling at some point", it is worth calculating how many floating point operations a Z80 can do per second

Z80 and 6502 would not be my choice for FP-heavy tasks. Even an AVR will absolutely slaughter them.

I'm not familiar with Z80 floating point libraries but I'd guess an 80 bit FP multiply would be 10,000-15,000 cycles, 40 bit 2500-3000, and 24 bit (16 bit mantissa) something approaching 1000 cycles.

Even the last of those is several times more than a task switch, emphasizing my point that a task switch isn't especially painful in relative terms. It's simply a slow CPU in general. But we were glad to have them in the late 70s and early 80s!

radiolistener · « **Reply #73 on:** September 12, 2024, 07:07:39 am »

Quote from: brucehoult on September 10, 2024, 01:13:59 am

Registers are only 16 bytes per thread! [1] You can fit the state for 64 threads into 1K of RAM.

At least 11 register pairs: AF, HL, DE, BC, IX, IY, AF', HL', DE', BC', SP

PUSH = 11 T-states, POP = 10 T-states.

At least 20 bytes memory and 231 T-states for registers.
With 3.5 MHz clock with no contention and 20 ms interrupt, it will eat about 0.3% time for context switch.

Quote from: brucehoult on September 10, 2024, 01:13:59 am

RAM for each thread's stack is a far larger problem, which is not solved by using multiple Z80s with shared memory, unless only a portion of the address space is shared.

yes, this is what I mean. And a part of 64k space is reserved for ROM.

If you use separate Z80 processors, they can have their own memory, with just one page shared between the two Z80s for data exchange. This solves the issue with stack memory because each Z80 has its own separate memory for the stack and other needs. This approach was used in some hardware controllers, which have their own Z80 processor and their own memory, separate from main Z80 processor and its memory.

The shared memory can be mapped to a switchable memory window, so you can completely hide it from each processor through the control port and use the full 64k address space for user task. In this way, only the OS core can map the shared memory to a specific window and use it for synchronization between the two Z80 processors.

tggzzz · « **Reply #74 on:** September 12, 2024, 07:31:58 am »

Quote from: radiolistener on September 12, 2024, 07:07:39 am

If you use separate Z80 processors, they can have their own memory, with just one page shared between the two Z80s for data exchange. This solves the issue with stack memory because each Z80 has its own separate memory for the stack and other needs. This approach was used in some hardware controllers, which have their own Z80 processor and their own memory, separate from main Z80 processor and its memory.

How, exactly, does your concept scale to 2,3,4... threads?

What are the disadvantages to your concept? (Yes, there are quite a few)

If you want to see your concept executed well and advantageously, understand the XMOS xCORE/xC ecosystem.

