Author Topic: Which FGPA/tool for this project? (Read 16946 times)

tggzzz · « **Reply #50 on:** September 25, 2017, 11:05:53 am »

Quote from: legacy on September 25, 2017, 09:53:25 am

Quote from: tggzzz on September 24, 2017, 11:10:39 pm
In the past there have been many many asymmetric multicore processors, and the programming environment has always been an afterthought
Which ones?

The ones that spring instantly to mind are Cell, Parallela, and - arguably - things like Zynq.

Perhaps it would help if I noted some things that I regard as indicating programming being an afterthought:

- interprocessor comms where the correctness of the implementation depends on the compiler version and compiler flags
- interprocessor comms which is part of a library, especially if the correctness cannot be checked by a compiler
- interprocessor comms which can occur without the compiler being able to detect it, especially data structures shared between processors
- interprocessor comms based on RPC and/or semaphores
- not having parallel programming abstractions at a higher level than a programming language; design patterns help, but more formality is very helpful
- a language where the defaults enable operations that, while safe in a single flow of control environment, can allow subtle rare misbehaviour in a parallel processing environment
- ditto tools that concentrate on single flow of control behaviour, and don't highlight what's happening in a parallel environment
- a belief that language deficiencies can be ~~covered up~~ ameliorated by the use of libraries
- starting from single flow of control, and thinking it can be expanded to parallel processing. That's why there is an "impedance mismatch" between MCUs and FPGAs (and C/VHDL) which embedded programmers struggle to appreciate. People have been trying for decades to bridge that gap, with only very limited success

No, that list isn't exhaustive.
No, nothing meets everything on that list, but a few environments are significantly better than the herd.

nctnico · « **Reply #51 on:** September 25, 2017, 11:32:12 am »

Why should the compiler be involved with the correctness of how a multiprocessor system works? That is a (typical) task of an OS!

tggzzz · « **Reply #52 on:** September 25, 2017, 12:14:49 pm »

Quote from: nctnico on September 25, 2017, 11:32:12 am

Why should the compiler be involved with the correctness of how a multiprocessor system works? That is a (typical) task of an OS!

A language and its compiler's task is to define what's in a user's processes. An OS's task is to execute user processes. The presence (or absence) of an OS is orthogonal to parallel processing.

The issues w.r.t. parallelism and languages are very broad; new ways of thinking and implementing are necessary. One well-known example is recasting your thinking in terms of map-reduce algorithms. Others are Communicating Sequential Processes (CSP) and tuple-based systems such as Linda; Lamport also had fundamental insights.

CSP maps neatly onto many different implementations, from hardware, through embedded software (single or multiprocessor), to distributed systems - in whatever language happens to be convenient.

The concept and tools are still being developed, but some concepts have stood the test of time over the decades. It is well worth understanding their key features and why people keep returning to them.

(There is an ancillary question: if they are so good, why aren't they in ubiquitous use? Simple: Moore's "law" has dominated progress for half a century, but now that it is running out of steam people are having to confront developing parallel systems).

legacy · « **Reply #53 on:** September 25, 2017, 12:40:16 pm »

Quote from: nctnico on September 25, 2017, 11:32:12 am

Why should the compiler be involved with the correctness of how a multiprocessor system works?

why do you think we can't safely write lib-thread in C?

tggzzz · « **Reply #54 on:** September 25, 2017, 02:32:49 pm »

Quote from: legacy on September 25, 2017, 12:40:16 pm

Quote from: nctnico on September 25, 2017, 11:32:12 am
Why should the compiler be involved with the correctness of how a multiprocessor system works?

why do you think we can't safely write lib-thread in C?

Touché!

I'm told that the latest C standard contains such revolutionary concepts as, gasp, a memory model. Of course who knows when the first complete compiler will arrive[1], and that's of little comfort to those that still have to use C89 or whatever.

[1] ISTR it took ~5 years for the first complete C99 (?C99++) compiler to appear. I certainly remember a triumphant commercial to that effect somewhere around 2004/05.
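
For readers wondering what a language-level memory model buys you: the standard being referred to is presumably C11, which added `<stdatomic.h>`. The sketch below is a minimal single-producer/single-consumer queue whose cross-thread correctness rests on the release/acquire ordering the language defines, rather than on compiler version or flags. The names `spsc_ring`, `ring_push` and `ring_pop` are made up for illustration, and the queue assumes exactly one producer thread and one consumer thread.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define RING_CAPACITY 8u  /* illustrative size; a power of two keeps the index maths valid if the counters ever wrap */

/* Hypothetical single-producer/single-consumer queue. */
struct spsc_ring {
    _Atomic size_t head;  /* count of items written; only the producer stores to it */
    _Atomic size_t tail;  /* count of items read; only the consumer stores to it */
    int slots[RING_CAPACITY];
};

void ring_init(struct spsc_ring *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side: returns false if the ring is full. */
bool ring_push(struct spsc_ring *r, int value)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_CAPACITY)
        return false;
    r->slots[head % RING_CAPACITY] = value;
    /* Release: the slot write above becomes visible before the new head. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the ring is empty. */
bool ring_pop(struct spsc_ring *r, int *value)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    /* Acquire: pairs with the release store in ring_push. */
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return false;
    *value = r->slots[tail % RING_CAPACITY];
    /* Release: the slot read above completes before the slot is handed back. */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

With plain `int` counters instead of atomics, the same code would be a data race, and whether it appeared to work would depend on the optimiser, which is the first item on the list earlier in the thread.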

legacy · « **Reply #55 on:** September 25, 2017, 03:32:35 pm »

Quote from: tggzzz on September 25, 2017, 02:32:49 pm

Touché!

Yup, if you have time, take a look at the SGI documentation; it says that they modified the common C language by adding a special pre- and post-processor and tons of #pragma directives.

SGI systems were an example of multi-CPU systems, starting from 2 to 4 CPUs per motherboard, up to 128 CPUs per rack system connected through NUMAlink.

rstofer · « **Reply #56 on:** September 25, 2017, 10:44:20 pm »

Quote from: tggzzz on September 25, 2017, 07:24:03 am

Quote from: rstofer on September 24, 2017, 11:57:44 pm
Quote from: tggzzz on September 24, 2017, 06:29:38 pm
Do you know of any current processors that have decent performance and do have such timing specs?

No, but given that these are register to register operations, it should be possible to determine how many cycles it takes to add or multiply. Add is complicated due to alignment and would probably be omitted or bounded. I can't see any reason ARM couldn't describe the number of clocks required to multiply two reals.

What guarantees the variables are kept in registers? They probably are, but compiler optimisation algorithms are notoriously fickle and, um, "heuristic".

If the system performance is solely dependent on such an inner-loop, then fine. However in most systems detailed timing is dependent on other factors, e.g. interrupts, memory accesses in other parts of the codebase, etc, etc.

Floating point arithmetic performance is more or less impossible to guarantee, especially if IEEE754 is involved. Not only can operations be short-circuited, but operations involving denormal numbers are often notoriously slow - they often require fixups in software. What happens is very implementation dependent, and therefore highly non-portable.

The RPi uses a Cortex-A53 RISC processor so all variables ARE in registers before they are operated on, even if only temporarily. Typically, the compiler loads a register with a value from memory, loads another if necessary and then does the math operation on the registers leaving the result in some register. The compiler may then emit code to store the result back to memory.

My interest is how long it takes to do a floating point multiply AFTER the values are loaded into registers. This eliminates memory accesses and, given the dedicated CPU, there won't be any interrupts (they are disabled for the dedicated core in the example).

I guess I should quit whining and sniveling and just write a bit more code into the demo program. Just enough ASM to load the registers and then loop on multiply while incrementing a counter. One of the other cores will pick up the loop count and print the value every second. That's what the existing code is doing but it is only incrementing an integer.
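
A rough sketch of that kind of multiply-loop measurement, written in C for a hosted POSIX system rather than as the Ultibo demo with hand-written ASM. The function name `multiplies_per_second` is hypothetical, and instead of a second core sampling a counter every second, the function simply times a fixed number of iterations. Because each multiply depends on the previous result, this measures multiply latency rather than peak throughput.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime */
#include <stdint.h>
#include <time.h>

/* Hypothetical helper: time `iterations` dependent double-precision
 * multiplies and return the observed rate in multiplies per second. */
double multiplies_per_second(uint64_t iterations)
{
    const double factor = 1.0000001;  /* x stays finite for a few billion iterations */
    double x = 1.0;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (uint64_t i = 0; i < iterations; i++) {
        x *= factor;  /* each multiply depends on the previous result */
    }
    volatile double sink = x;  /* volatile store so the result is actually used */
    (void)sink;
    clock_gettime(CLOCK_MONOTONIC, &end);

    double seconds = (double)(end.tv_sec - start.tv_sec)
                   + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
    return seconds > 0.0 ? (double)iterations / seconds : 0.0;
}
```

Something like `multiplies_per_second(100000000)` gives a ballpark figure. It is worth inspecting the generated assembly to confirm the loop really keeps `x` in a register and has not been rearranged, which is exactly the concern raised in the quoted reply.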

tggzzz · « **Reply #57 on:** September 26, 2017, 08:40:53 am »

Quote from: rstofer on September 25, 2017, 10:44:20 pm

Quote from: tggzzz on September 25, 2017, 07:24:03 am
Quote from: rstofer on September 24, 2017, 11:57:44 pm
Quote from: tggzzz on September 24, 2017, 06:29:38 pm
Do you know of any current processors that have decent performance and do have such timing specs?

No, but given that these are register to register operations, it should be possible to determine how many cycles it takes to add or multiply. Add is complicated due to alignment and would probably be omitted or bounded. I can't see any reason ARM couldn't describe the number of clocks required to multiply two reals.

What guarantees the variables are kept in registers? They probably are, but compiler optimisation algorithms are notoriously fickle and, um, "heuristic".

If the system performance is solely dependent on such an inner-loop, then fine. However in most systems detailed timing is dependent on other factors, e.g. interrupts, memory accesses in other parts of the codebase, etc, etc.

Floating point arithmetic performance is more or less impossible to guarantee, especially if IEEE754 is involved. Not only can operations be short-circuited, but operations involving denormal numbers are often notoriously slow - they often require fixups in software. What happens is very implementation dependent, and therefore highly non-portable.

The RPi uses a Cortex-A53 RISC processor so all variables ARE in registers before they are operated on, even if only temporarily. Typically, the compiler loads a register with a value from memory, loads another if necessary and then does the math operation on the registers leaving the result in some register. The compiler may then emit code to store the result back to memory.

My interest is how long it takes to do a floating point multiply AFTER the values are loaded into registers. This eliminates memory accesses and, given the dedicated CPU, there won't be any interrupts (they are disabled for the dedicated core in the example).

I guess I should quit whining and sniveling and just write a bit more code into the demo program. Just enough ASM to load the registers and then loop on multiply while incrementing a counter. One of the other cores will pick up the loop count and print the value every second. That's what the existing code is doing but it is only incrementing an integer.

Your definition excludes the unresolvable timing problems found in applications in most processors, i.e. memory/cache accesses and interrupts.

Now you have stated that, it is valid, but not very useful.

In contrast, the xCORE tools easily predict without executing that (on a first generation XS1 device) a "float" multiply takes 1.032us max (0.408us min), and a "double" multiply takes 2.632us max (0.560us min). Don't forget that's per core, and there can be many cores executing simultaneously!

rstofer · « **Reply #58 on:** September 26, 2017, 01:26:52 pm »

Quote from: tggzzz on September 26, 2017, 08:40:53 am

Your definition excludes the unresolvable timing problems found in applications in most processors, i.e. memory/cache accesses and interrupts.

Now you have stated that, it is valid, but not very useful.

In contrast, the xCORE tools easily predict without executing that (on a first generation XS1 device) a "float" multiply takes 1.032us max (0.408us min), and a "double" multiply takes 2.632us max (0.560us min). Don't forget that's per core, and there can be many cores executing simultaneously!

My only interest is in seeing how many FLOPS I can get from a dedicated core running in a tight loop - hopefully with no cache misses. This number would only be valid in the very specific case of a dedicated CPU but with 32 sets of floating point registers, a lot of math can be done without going to memory. And, of course, it would only be for a 64 bit multiply. Just a bit of a benchmark, I suppose.

The idea is to compare the FLOPS against the 60 MHz processor the OP is using. There's quite a bit of horsepower in a RPi 3b.

tggzzz · « **Reply #59 on:** September 26, 2017, 04:43:39 pm »

Quote from: rstofer on September 26, 2017, 01:26:52 pm

Quote from: tggzzz on September 26, 2017, 08:40:53 am

Your definition excludes the unresolvable timing problems found in applications in most processors, i.e. memory/cache accesses and interrupts.

Now you have stated that, it is valid, but not very useful.

In contrast, the xCORE tools easily predict without executing that (on a first generation XS1 device) a "float" multiply takes 1.032us max (0.408us min), and a "double" multiply takes 2.632us max (0.560us min). Don't forget that's per core, and there can be many cores executing simultaneously!

My only interest is in seeing how many FLOPS I can get from a dedicated core running in a tight loop - hopefully with no cache misses. This number would only be valid in the very specific case of a dedicated CPU but with 32 sets of floating point registers, a lot of math can be done without going to memory. And, of course, it would only be for a 64 bit multiply. Just a bit of a benchmark, I suppose.

The idea is to compare the FLOPS against the 60 MHz processor the OP is using. There's quite a bit of horsepower in a RPi 3b.

It helps if the memory accesses are all single-cycle, of course.

Are you interested in floating point (FLOPS) or integer (64-bit) arithmetic? Integer arithmetic in C is faster, of course - partly because it ignores all conditions that are the least bit awkward.

rstofer · « **Reply #60 on:** September 26, 2017, 05:07:28 pm »

Quote from: tggzzz on September 26, 2017, 04:43:39 pm

It helps if the memory accesses are all single-cycle, of course.

Are you interested in floating point (FLOPS) or integer (64-bit) arithmetic? Integer arithmetic in C is faster, of course - partly because it ignores all conditions that are the least bit awkward.

I'm looking at whether the RPi is a candidate for the OP's desired speed-up of a 67 MHz single core, software floating point processor. It's not important to me but I was curious about things in general:

What's Ultibo all about?
How does the threading system work?
If I had an application, how do I dedicate a CPU to a particular task?
What kind of single CPU floating point performance could I expect?
Given the multitude of FP registers, is this perhaps an ideal platform for quad copter control/navigation?
Or, given the complexity, is it overkill?

So, really, it's just for giggles...

But it's pretty clear that the RPi with hardware floating point, 4 cores and 1.2 GHz clock (per core) should be adequate for the required speed-up. If, in fact, the OP even uses FP.

Truth be known, just porting the code and running under Linux would probably meet the requirements. Except for the part where Linux lacks any semblance of hard real-time capability.

technix · « **Reply #61 on:** September 26, 2017, 05:31:15 pm »

Quote from: rstofer on September 26, 2017, 05:07:28 pm

Quote from: tggzzz on September 26, 2017, 04:43:39 pm

It helps if the memory accesses are all single-cycle, of course.

Are you interested in floating point (FLOPS) or integer (64-bit) arithmetic? Integer arithmetic in C is faster, of course - partly because it ignores all conditions that are the least bit awkward.

I'm looking at whether the RPi is a candidate for the OP's desired speed-up of a 67 MHz single core, software floating point processor. It's not important to me but I was curious about things in general:

What's Ultibo all about?
How does the threading system work?
If I had an application, how do I dedicate a CPU to a particular task?
What kind of single CPU floating point performance could I expect?
Given the multitude of FP registers, is this perhaps an ideal platform for quad copter control/navigation?
Or, given the complexity, is it overkill?

So, really, it's just for giggles...

But it's pretty clear that the RPi with hardware floating point, 4 cores and 1.2 GHz clock (per core) should be adequate for the required speed-up. If, in fact, the OP even uses FP.

Truth be known, just porting the code and running under Linux would probably meet the requirements. Except for the part where Linux lacks any semblance of hard real-time capability.

Actually in Linux you can set up the kernel and systemd never to schedule any process or route any interrupts to specific cores unless specifically told to. This way you can shove the Linux kernel onto one or two dedicated cores and have all the remaining cores dedicated to your code. This way you get the convenience of Linux kernel and tools without losing too much real time performance.

rstofer · « **Reply #62 on:** September 26, 2017, 07:19:14 pm »

Quote from: technix on September 26, 2017, 05:31:15 pm

Actually in Linux you can set up the kernel and systemd never to schedule any process or route any interrupts to specific cores unless specifically told to. This way you can shove the Linux kernel onto one or two dedicated cores and have all the remaining cores dedicated to your code. This way you get the convenience of Linux kernel and tools without losing too much real time performance.

I didn't know that! I'll have to do some research...
Given 4 CPUs, the system should be fine running on 3 and leaving one for the time critical stuff.
I guess I really need to understand how messages are passed back and forth.

NorthGuy · « **Reply #63 on:** September 26, 2017, 07:26:21 pm »

Quote from: rstofer on September 26, 2017, 07:19:14 pm

I didn't know that! I'll have to do some research...
Given 4 CPUs, the system should be fine running on 3 and leaving one for the time critical stuff.

Even though this is a separate core, it still has shared memory access for both data and program. So, the "dedicated" core is unlikely to be completely unaffected, even in terms of average performance.

technix · « **Reply #64 on:** September 26, 2017, 07:41:40 pm »

Quote from: NorthGuy on September 26, 2017, 07:26:21 pm

Quote from: rstofer on September 26, 2017, 07:19:14 pm
I didn't know that! I'll have to do some research...
Given 4 CPUs, the system should be fine running on 3 and leaving one for the time critical stuff.

Even though this is a separate core, it still has shared memory access for both data and program. So, the "dedicated" core is unlikely to be completely unaffected, even in terms of average performance.

If you code with cache lines in mind and minimize syscalls (as they trigger instruction cache flushes), you can keep your code in the cache most of the time.

legacy · « **Reply #65 on:** September 26, 2017, 08:46:00 pm »

Quote from: technix on September 26, 2017, 05:31:15 pm

Actually in Linux you can set up the kernel and systemd never to schedule any process or route any interrupts to specific cores unless specifically told to.

How can you do it without hacking the kernel?

Marco · « **Reply #66 on:** September 26, 2017, 08:56:41 pm »

This doesn't get you hard real time, the kernel itself is significantly single threaded. So the moment your code has to do I/O you can hit some huge latency spike again.

technix · « **Reply #67 on:** September 27, 2017, 12:02:04 am »

Quote from: legacy on September 26, 2017, 08:46:00 pm

Quote from: technix on September 26, 2017, 05:31:15 pm
Actually in Linux you can set up the kernel and systemd never to schedule any process or route any interrupts to specific cores unless specifically told to.

How can you do it without hacking the kernel?

You can pass it some command line arguments for that. People use this type of core isolation to prepare the Pi to run KVM-backed virtual machines, and you can just isolate the cores but skip the KVM.
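
The post does not name the arguments. One commonly used option is the `isolcpus=` kernel boot parameter, which keeps the general scheduler off the listed cores; whether that is what was meant here is an assumption. Once a core is set aside, the time-critical process still has to move itself onto it. A minimal sketch of that second step, using the Linux-specific `sched_setaffinity()` call, follows; the helper names `first_allowed_cpu` and `pin_to_cpu` are made up for illustration.

```c
#define _GNU_SOURCE 1  /* for cpu_set_t, the CPU_* macros and sched_setaffinity */
#include <sched.h>

/* Hypothetical helper: lowest-numbered CPU the calling thread may run on,
 * or -1 on error. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof set, &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &set))
            return cpu;
    }
    return -1;
}

/* Hypothetical helper: restrict the calling thread to a single CPU.
 * Returns 0 on success, -1 on failure (errno is set by the kernel). */
int pin_to_cpu(int cpu)
{
    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return -1;
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof set, &set);  /* pid 0 = calling thread */
}
```

On a 4-core Pi with core 3 isolated, the real-time process would call something like `pin_to_cpu(3)` early in `main()`. Pinning alone does not stop interrupts or kernel threads from landing on that core, so it is only one part of the setup described above.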

technix · « **Reply #68 on:** September 27, 2017, 12:04:37 am »

Quote from: Marco on September 26, 2017, 08:56:41 pm

This doesn't get you hard real time, the kernel itself is significantly single threaded. So the moment your code has to do I/O you can hit some huge latency spike again.

If you can live without interrupts (only polling) you can implement the drivers directly in user mode operating on /dev/mem.

NorthGuy · « **Reply #69 on:** September 27, 2017, 01:00:51 am »

Quote from: technix on September 27, 2017, 12:04:37 am

If you can live without interrupts (only polling) you can implement the drivers directly in user mode operating on /dev/mem.

And if you can live without Linux ... you get all this plus lots of freed up resources

rstofer · « **Reply #70 on:** September 27, 2017, 01:03:52 am »

Quote from: NorthGuy on September 27, 2017, 01:00:51 am

Quote from: technix on September 27, 2017, 12:04:37 am
If you can live without interrupts (only polling) you can implement the drivers directly in user mode operating on /dev/mem.

And if you can live without Linux ... you get all this plus lots of freed up resources

Thus my interest in Ultibo - it allows me to blow off Linux for a much lighter kernel. Yes, there is still a kernel including a scheduler and various other bits and pieces left behind but it is mostly out of the way (AFAICT).

technix · « **Reply #71 on:** September 27, 2017, 02:41:15 am »

Quote from: NorthGuy on September 27, 2017, 01:00:51 am

Quote from: technix on September 27, 2017, 12:04:37 am
If you can live without interrupts (only polling) you can implement the drivers directly in user mode operating on /dev/mem.

And if you can live without Linux ... you get all this plus lots of freed up resources

But what do you lose? At least you still have access to standard POSIX stuff and IPC on Linux, and through the use of user-mode IPC (shared memory etc) you can split the priorities into RT-intense (dedicated cores) and non-RT (share cores with the kernel) parts.

legacy · « **Reply #72 on:** September 27, 2017, 08:22:27 am »

Quote from: technix on September 27, 2017, 12:02:04 am

You can pass it some command line arguments for that. People use this type of core isolation to prepare the Pi to run KVM-backed virtual machines, and you can just isolate the cores but skip the KVM.

kernel 2.6.* you have to hack in a way that looks like RTAI
kernel 3.* keeps getting better, but it's practically like the above
kernel >=4.* ... oh, there is something really really interesting here

technix · « **Reply #73 on:** September 27, 2017, 09:23:21 am »

Quote from: rstofer on September 27, 2017, 01:03:52 am

Quote from: NorthGuy on September 27, 2017, 01:00:51 am
Quote from: technix on September 27, 2017, 12:04:37 am
If you can live without interrupts (only polling) you can implement the drivers directly in user mode operating on /dev/mem.

And if you can live without Linux ... you get all this plus lots of freed up resources

Thus my interest in Ultibo - it allows me to blow off Linux for a much lighter kernel. Yes, there is still a kernel including a scheduler and various other bits and pieces left behind but it is mostly out of the way (AFAICT).

Well you lose a good selection of features.

You can spawn two processes: one running on dedicated cores doing the hard RT stuff, and a monitor program making syscalls for it. The hard RT process and the monitor communicate through shared memory (which does not incur syscall overhead). When the hard RT program needs something from the kernel, it pings the monitor using the shared memory; the monitor issues the syscalls for it and can even preprocess the data before throwing it into the shared memory region for the hard RT process. This way you get both an almost hard RT environment and full Linux support.
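
A minimal sketch of that RT/monitor split, under several assumptions: the mailbox layout, the state names and the functions `mailbox_create`, `rt_post_write`, `rt_poll_result` and `monitor_service_once` are all invented for illustration, the only service offered is a `write()` on the RT side's behalf, and the mailbox would be created before `fork()` so both processes see the same pages. It also assumes `atomic_int` is lock-free on the target, which is what makes it usable across processes.

```c
#define _DEFAULT_SOURCE 1  /* for MAP_ANONYMOUS */
#include <stdatomic.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

enum { MBOX_IDLE, MBOX_REQUEST, MBOX_RESPONSE };

/* Hypothetical one-slot mailbox living in shared memory. */
struct mailbox {
    atomic_int state;   /* MBOX_IDLE -> MBOX_REQUEST -> MBOX_RESPONSE -> MBOX_IDLE */
    size_t len;         /* number of valid bytes in data */
    ssize_t result;     /* return value of the syscall made by the monitor */
    char data[256];
};

/* Create the mailbox in a shared anonymous mapping; call before fork(). */
struct mailbox *mailbox_create(void)
{
    void *p = mmap(NULL, sizeof(struct mailbox), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
    struct mailbox *mb = p;
    atomic_init(&mb->state, MBOX_IDLE);
    return mb;
}

/* RT side: hand a buffer to the monitor without making a syscall. */
int rt_post_write(struct mailbox *mb, const void *buf, size_t len)
{
    if (len > sizeof mb->data ||
        atomic_load_explicit(&mb->state, memory_order_acquire) != MBOX_IDLE)
        return -1;
    memcpy(mb->data, buf, len);
    mb->len = len;
    atomic_store_explicit(&mb->state, MBOX_REQUEST, memory_order_release);
    return 0;
}

/* RT side: poll for completion; returns 1 and fills *result when done. */
int rt_poll_result(struct mailbox *mb, ssize_t *result)
{
    if (atomic_load_explicit(&mb->state, memory_order_acquire) != MBOX_RESPONSE)
        return 0;
    *result = mb->result;
    atomic_store_explicit(&mb->state, MBOX_IDLE, memory_order_release);
    return 1;
}

/* Monitor side: if a request is pending, make the syscall on the RT side's behalf. */
int monitor_service_once(struct mailbox *mb, int fd)
{
    if (atomic_load_explicit(&mb->state, memory_order_acquire) != MBOX_REQUEST)
        return 0;
    mb->result = write(fd, mb->data, mb->len);
    atomic_store_explicit(&mb->state, MBOX_RESPONSE, memory_order_release);
    return 1;
}
```

The RT process only ever touches memory and spins on `rt_poll_result`; the monitor, running on the cores shared with the kernel, loops on `monitor_service_once`. Both sides still share caches and the memory bus, so the objection about shared memory access raised earlier in the thread still applies.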

NorthGuy · « **Reply #74 on:** September 27, 2017, 06:34:56 pm »

Quote from: technix on September 27, 2017, 02:41:15 am

Quote from: NorthGuy on September 27, 2017, 01:00:51 am
And if you can live without Linux ... you get all this plus lots of freed up resources
But what do you lose?

You lose (in no particular order):

- fast boot times
- quite a bit of memory
- some of the speed
- a little reliability
- almost all predictability
- most of the real time control

