Author Topic: The search for a (CHEAP) supercomputer... (Read 18367 times)

dom0 · « **Reply #25 on:** April 30, 2017, 05:18:26 pm »

"That CISC decoder is so complex and power hungry and needs so many transistors" is pretty much just 90s RISC propaganda these days. Heck, it was the 90s when Intel, AMD and some other companies(!) showed off that the ISA on the frontend doesn't matter and that the popular marketing, even academic, opinion that x86 will never be as high performance as RISC (POWER/SPARC/Alpha), because it's "literally impossible to use pipelining" or make procedure calls fast or any other disproven "fact", is bull*****.

The truth is that it doesn't matter and that's precisely the reason why all RISC architectures were flushed down the drain by 2000 in the mainstream server and workstation markets. A bit later all long-pipeline µarchs were flushed after them (some pun intended). Pipe didn't clog from those, either.

dom0 · « **Reply #26 on:** April 30, 2017, 05:22:56 pm »

Quote from: GeorgeOfTheJungle on April 30, 2017, 02:55:14 pm

Quote from: Fungus on April 30, 2017, 12:06:56 pm
For double precision floating point a desktop CPU will easily win against ARM chips...

Right you are. There you go:

My thousands of freedom kopecks i7 MacBook Pro:
jorge@unibody:~/kk$ gcc -O0 threads.c -o threads
jorge@unibody:~/kk$ time ./threads

real   0m0.719s
user   0m2.238s
sys   0m0.098s

The $8 Orange Pi Zero:
pi@orangepi:~$ gcc -lpthread -O0 threads.c -o threads
pi@orangepi:~$ time ./threads

real   0m9.754s
user   0m7.940s
sys   0m25.370s

Or about 14x the speed at more than 100x the price.

Code: [Select]
#include <pthread.h> #include <stdlib.h> #include <stdio.h> #define BUF_SIZE 16*1024*1024 pthread_t t1, t2, t3, t4; double* nu_buffer () { double* buffer= malloc(BUF_SIZE); int i= BUF_SIZE/8; while (i--) buffer[i]= (random()-random())/(random() + 1.0); return buffer; } void* thread_proc (void* arg) { int i= BUF_SIZE/8; double* buffer= nu_buffer(); while (i--) buffer[i]= (((double*) arg)[i] + buffer[i]) / 2.0; return buffer; } void do_threads_stuff (double* buffer) { pthread_create(&t1, NULL, thread_proc, buffer); pthread_create(&t2, NULL, thread_proc, buffer); pthread_create(&t3, NULL, thread_proc, buffer); pthread_create(&t4, NULL, thread_proc, buffer); void** r; pthread_join(t1, r), free(*r); pthread_join(t2, r), free(*r); pthread_join(t3, r), free(*r); pthread_join(t4, r), free(*r); } int main (void) { double* buffer= nu_buffer(); do_threads_stuff(buffer); //free(buffer); return 0; }

-O0, seriously? Are you trying to make a joke out of yourself?

Pretty sure random() will take a lock on each call, the program probably spends most of the time contending over that lock, not doing the calculations.

Edit: And, indeed, using random_r correctly, not contending one lock from all threads, speeds this up by a factor 15.

dom0 · « **Reply #27 on:** April 30, 2017, 05:28:15 pm »

By the way, all cheap ARM SBCs are based on toy SoCs with toy cores like the A53, which is a low performance low power (low efficiency) core, with a low performance memory system and laughable I/O capability. They're today's bottom-of-the-barrel chips in the ARM world.

GeorgeOfTheJungle · « **Reply #28 on:** April 30, 2017, 06:16:04 pm »

Quote from: dom0 on April 30, 2017, 05:22:56 pm

-O0, seriously? Are you trying to make a joke out of yourself?

Pretty sure random() will take a lock on each call, the program probably spends most of the time contending over that lock, not doing the calculations.

Edit: And, indeed, using random_r correctly, not contending one lock from all threads, speeds this up by a factor 15.

Feel free to fix it! My broken code is free and open source! Hahaha.

Edit: I see no random() in the main thread loop, so ?

Code: [Select]

while (i--) buffer[i]= (((double*) arg)[i] + buffer[i]) / 2.0;

dom0 · « **Reply #29 on:** April 30, 2017, 08:17:04 pm »

All threads execute nu_buffer() which calls into random() three times per i.

Also observe the numbers you posted, on ARM you get "sys 0m25.370s". This thing does no IO at all, no syscalls itself. What could cause this? Note: Locks on Linux typically boils down to futexes, whose slow (contended) path goes through the kernel (ouchies).

Also I wouldn't be at all surprised if "while (i--) buffer[i ]= (random()-random())/(random() + 1.0);" is significantly more expensive than "while (i--) buffer[ i]= (((double*) arg)[ i] + buffer[i ]) / 2.0;". Depending on the PRNG used by the libc, quite some of these are very slow, much slower than 2 FLOP (add, div).

---

Ok. Let's do this.

Your version (bench1.c).

Code: [Select]

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

#define BUF_SIZE 16*1024*1024
pthread_t t1, t2, t3, t4;

double* nu_buffer () {
    double* buffer= malloc(BUF_SIZE);
    int i= BUF_SIZE/8;
    while (i--) buffer[i]= (random()-random())/(random() + 1.0);
    return buffer;
}

void* thread_proc (void* arg) {
    int i= BUF_SIZE/8;
    double* buffer= nu_buffer();
    while (i--) buffer[i]= (((double*) arg)[i] + buffer[i]) / 2.0;
    return buffer;
}

void do_threads_stuff (double* buffer) {
    pthread_create(&t1, NULL, thread_proc, buffer);
    pthread_create(&t2, NULL, thread_proc, buffer);
    pthread_create(&t3, NULL, thread_proc, buffer);
    pthread_create(&t4, NULL, thread_proc, buffer);
    void** r;
    pthread_join(t1, r), free(*r);
    pthread_join(t2, r), free(*r);
    pthread_join(t3, r), free(*r);
    pthread_join(t4, r), free(*r);
}

int main (void) {
    double* buffer= nu_buffer();
    do_threads_stuff(buffer);
    //free(buffer); return 0;
}

Note corruption in do_threads_stuff. Let's fix that first, shall we?

Code: [Select]

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>

#define BUF_SIZE 16*1024*1024
pthread_t t1, t2, t3, t4;

double* nu_buffer () {
    double* buffer= malloc(BUF_SIZE);
    int i= BUF_SIZE/8;
    while (i--) buffer[i]= (random()-random())/(random() + 1.0);
    return buffer;
}

void* thread_proc (void* arg) {
    int i= BUF_SIZE/8;
    double* buffer= nu_buffer();
    while (i--) buffer[i]= (((double*) arg)[i] + buffer[i]) / 2.0;
    return buffer;
}

void do_threads_stuff (double* buffer) {
    pthread_create(&t1, NULL, thread_proc, buffer);
    pthread_create(&t2, NULL, thread_proc, buffer);
    pthread_create(&t3, NULL, thread_proc, buffer);
    pthread_create(&t4, NULL, thread_proc, buffer);
    void* r;
    pthread_join(t1, &r), free(r);
    pthread_join(t2, &r), free(r);
    pthread_join(t3, &r), free(r);
    pthread_join(t4, &r), free(r);
}

int main (void) {
    double* buffer= nu_buffer();
    do_threads_stuff(buffer);
    //free(buffer); return 0;
}

Ok. Let's run that.

Code: [Select]

3.06user 1.63system 0:01.39elapsed 337%CPU (0avgtext+0avgdata 83264maxresident)k
0inputs+0outputs (0major+643minor)pagefaults 0swaps

Ok, pretty much the same bad result you got. Note high system time.

Now let's fix that random() mess.

Code: [Select]

#include <pthread.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define BUF_SIZE 16*1024*1024
pthread_t t1, t2, t3, t4;

double* nu_buffer () {
    double* buffer= (double*)malloc(BUF_SIZE);
    struct random_data rd;
    char sb[64];
    memset(&rd, 0, sizeof(struct random_data));
    initstate_r(random(), sb, 64, &rd);
    int i= BUF_SIZE/8;
    while (i--) {
      int r1, r2, r3;
      random_r(&rd, &r1);
      random_r(&rd, &r2);
      random_r(&rd, &r3);
      buffer[i]= (r1-r2)/(r3 + 1.0);
    }
    return buffer;
}

void* thread_proc (void* arg) {
    int i= BUF_SIZE/8;
    double* buffer= nu_buffer();
    while (i--) buffer[i]= (((double*) arg)[i] + buffer[i]) / 2.0;
    return buffer;
}

void do_threads_stuff (double* buffer) {
    pthread_create(&t1, NULL, thread_proc, buffer);
    pthread_create(&t2, NULL, thread_proc, buffer);
    pthread_create(&t3, NULL, thread_proc, buffer);
    pthread_create(&t4, NULL, thread_proc, buffer);
    void* r;
    pthread_join(t1, &r), free(r);
    pthread_join(t2, &r), free(r);
    pthread_join(t3, &r), free(r);
    pthread_join(t4, &r), free(r);
}

int main (void) {
    double* buffer= nu_buffer();
    do_threads_stuff(buffer);
    free(buffer); return 0;
}

=

Code: [Select]

0.16user 0.00system 0:00.06elapsed 258%CPU (0avgtext+0avgdata 83176maxresident)k
0inputs+0outputs (0major+639minor)pagefaults 0swaps

On this machine the difference is even a bit closer to 20.

dom0 · « **Reply #30 on:** April 30, 2017, 08:19:38 pm »

Btw. I don't see what you're attempting here, it's not only a micro-benchmark but also an markedly bad one. There are much better vetted benchmarks for this. Heck, even ol' LINPACK is better at this. Try STREAM for a quick gloss over the memory subsystem. Take a look at lmbench3 for microbenchmarks that actually microbenchmark interesting properties. Say, lat_ctx.

GeorgeOfTheJungle · « **Reply #31 on:** April 30, 2017, 08:32:51 pm »

I get:

jorge@unibody:~/kk$ gcc -O0 threads.c -o threads
threads.c: In function ‘nu_buffer’:
threads.c:20: error: storage size of ‘rd’ isn’t known

line 20 is:

struct random_data rd;

dom0 · « **Reply #32 on:** April 30, 2017, 08:35:47 pm »

This is a glibc extension.

With other libc's you may have to use their equivalent or if they don't have one bring your own PRNG.

Marco · « **Reply #33 on:** April 30, 2017, 08:45:56 pm »

Quote from: dom0 on April 30, 2017, 05:18:26 pm

"That CISC decoder is so complex and power hungry and needs so many transistors" is pretty much just 90s RISC propaganda these days. Heck, it was the 90s when Intel, AMD and some other companies(!) showed off that the ISA on the frontend doesn't matter and that the popular marketing, even academic, opinion that x86 will never be as high performance as RISC (POWER/SPARC/Alpha), because it's "literally impossible to use pipelining" or make procedure calls fast or any other disproven "fact", is bull*****.

IMO the issue has been that as you add wider superscalar execution and other architectural changes with the same ISA, elegant ISAs with features purpose designed for a given implementation (VLIW, delay slots etc) only get in the way with the next generation of processor trying to wring out a few extra percent of IPC. Even the instruction bloat of RISC becomes a disadvantage. When you have to heap hacks and cludges on a giant pile to apply increasingly speculative execution with an instruction set and software never designed/compiled for it, CISC becomes an advantage.

Quote

The truth is that it doesn't matter and that's precisely the reason why all RISC architectures were flushed down the drain by 2000 in the mainstream server and workstation markets. A bit later all long-pipeline µarchs were flushed after them (some pun intended). Pipe didn't clog from those, either.

AFAICS they just got rid of the trace cache.

GeorgeOfTheJungle · « **Reply #34 on:** April 30, 2017, 08:47:05 pm »

Quote from: dom0 on April 30, 2017, 08:17:04 pm

Note corruption in do_threads_stuff. Let's fix that first, shall we?

Sorry I dont get it, how is

Code: [Select]

    void** r;
    pthread_join(t1, r), free(*r);
    pthread_join(t2, r), free(*r);
    pthread_join(t3, r), free(*r);
    pthread_join(t4, r), free(*r);

Different than:

Code: [Select]

    void* r;
    pthread_join(t1, &r), free(r);
    pthread_join(t2, &r), free(r);
    pthread_join(t3, &r), free(r);
    pthread_join(t4, &r), free(r);

?

Edit: Never mind

dom0 · « **Reply #35 on:** April 30, 2017, 08:59:22 pm »

Quote from: Marco on April 30, 2017, 08:45:56 pm

Yes, ISAs that directly tie to the layout and timing of execution units were never sustainable; and VLIW as a general concept only worked so-so for compilers.

The change away from VLIW in AMDs GPUs is an interesting facet here.

-

Core and Netburst are very different designs overall. The pipelines of the current µarchs are still much more like Core. One could say the design principles just don't work out; high clocks are worthless if the CPU is inefficient. IBM with their POWER µarchs found this as well.

dom0 · « **Reply #36 on:** April 30, 2017, 09:00:36 pm »

Quote from: GeorgeOfTheJungle on April 30, 2017, 08:47:05 pm

Quote from: dom0 on April 30, 2017, 08:17:04 pm
Note corruption in do_threads_stuff. Let's fix that first, shall we?

Sorry I dont get it, how is

Code: [Select]
void** r; pthread_join(t1, r), free(*r); pthread_join(t2, r), free(*r); pthread_join(t3, r), free(*r); pthread_join(t4, r), free(*r);
Different than:

Code: [Select]
void* r; pthread_join(t1, &r), free(r); pthread_join(t2, &r), free(r); pthread_join(t3, &r), free(r); pthread_join(t4, &r), free(r);
?

Above: r is an uninitialised pointer on the stack (likely garbage) and you tell pthread_join to write to that uninitialized pointer (so a write goes somewhere garbage-y in the memory).

Below: r is a variable on the stack (which is also a pointer), and pthread_join writes to that stack variable.

GeorgeOfTheJungle · « **Reply #37 on:** April 30, 2017, 09:30:53 pm »

Cool, now the OPi (with dom0's code above) gives:

pi@orangepi:~$ gcc -lpthread -O0 threads.c -o threads
pi@orangepi:~$ time ./threads

real   0m0.831s
user   0m2.070s
sys   0m0.230s

But OSX does not have random_r, any ideas?

brucehoult · « **Reply #38 on:** April 30, 2017, 09:34:22 pm »

Quote from: NiHaoMike on April 30, 2017, 04:34:21 pm

Quote from: brucehoult on April 30, 2017, 10:59:49 am
Even the accumulated ARM/Thumb/Thumb2/Jazelle/ThumbEE mess hurts you quite a lot. Those Cavium CPUs are 64 bit *only*. No legacy 32 bit ARM instruction sets.
I remember reading that Apple plans to phase out 32 bit support from iOS. About what % advantage would that yield?

I haven't heard that, but I would *expect* it, given that

1) Apple's been designing their own CPUs for a while, and had 64 bit since the iPhone 5s in 2013.

2) iOS apps have by default since mid 2015 been uploaded to the AppStore as llvm bitcode NOT machine code, and Apple delivers 32 bit versions to 32 bit devices and 64 bit versions to 64 bit devices. I believe a developer can still turn off bitcode for iOS, but it is compulsory for WatchOS and TVOS.

Thus Apple have the ability to use a 64 bit-only ARM any time they want to .. or even another CPU type entirely.

GeorgeOfTheJungle · « **Reply #39 on:** April 30, 2017, 09:49:49 pm »

Ok so I simply won't fill the 16MB buffers with random(), just average whatever happens to be there (most likely zeros or garbage) as doubles in parallel in 4 threads:

https://gist.github.com/xk/b8b2ff4ab1455237906a8b13e3f1f02f

The i7 Mac:
unibodySierra:Desktop admin$ gcc -O0 threads.c -o threads
unibodySierra:Desktop admin$ time ./threads

real   0m0.054s
user   0m0.084s
sys   0m0.073s

And the OPi Zero:
pi@orangepi:~$ gcc -lpthread -O0 threads.c -o threads
pi@orangepi:~$ time ./threads

real   0m0.166s
user   0m0.440s
sys   0m0.140s

And the Mac now is 0,166/0,054= 3x times faster, and more than 100x as expensive.

brucehoult · « **Reply #40 on:** April 30, 2017, 09:54:53 pm »

Quote from: dom0 on April 30, 2017, 05:11:36 pm

You have some good info there and some completely wrong.

And what was wrong?

Quote

64-bit ARM cores tend to have _four_ frontends for the different ARM ISAs.

Strangely enough, I know this. I even listed the ISAs.

Quote

Only some specialty cores can afford to implement only one decoder, e.g. Apple and Cavium do that.

Strangely enough, I know this. I even mentioned Cavium. I hadn't heard for sure that Apple has done this but I've been expecting it ever since they required apps to include 64 bit versions.

Quote

Modern x86 designs (Intel, AMD Zen) are massively out-of-order cores (~300 instructions in-flight), usually with an 8-way backend ("ports", "8-issue") and a massive register file, neither of which is done by any ARM core (if you would, you'd need pretty much the same power as everyone else, big surprise). The ISA implemented by a CPU plays a rather minor role in it's efficiency and power consumption today.

Strangely enough, I know all this, and said as much.

Quote

Similarly if you look at a POWER or SPARC core the decoder takes a large part of the logic transistor budget; in almost all CPU cores -regardless of ARM/x86/POWER/SPARC- nowadays the decoder (the parts you can identify as one) is usually the largest logic part (even on the floor plan).

You're making my point for me.

Quote

The reason for the increased ALU efficiency in GPUs comes from putting many ALUs under the control of one frontend and one scheduler, and having exactly no dynamic optimization, be it register renaming, OooE or branch prediction (any branching, actually; GPUs do not branch, they mask) in the scheduler. By running the entire warp in lockstep a lot of circuitry is saved.

Strangely enough, I know this, and said as much.

Except GPUs do actually have branch instructions. Or, at least, the one I'm currently helping write the compiler for does. The branch instruction looks exactly like any other CPUs branch instruction. It sets the "next PC" for that thread to the branch target and masks the thread off. If/when the actual PC becomes equal to the desired PC in the future the thread is masked back on.

You can also work with explicit thread masks in any general register. That's how short nested if/then/else is most efficiently done. But it can be done with branch instructions, same as usual.

Quote

This works very well for some workloads and works absolutely not at all for others. Thus a comparison to CPUs simply makes no sense.

It never works worse than a CPU with the same number of in-order cores as the CPU has warps. Which there can be more than 100 of these days. Sometimes it works up to 32x better than that.

Quote

Quote
But if you want overall throughput on a parallelizable task, or something that is inherently many individual tasks, such as web serving, then it works out better to use the same number of transistors to make a lot of simple pipelined single issue (maybe dual at most) cores. Like in the Xeon Phi. Like in the Cavium ThunderX with 48 Aarch64 cores running at 2 GHz

This maybe works on paper, but has never worked out in practice, be it Niagara / Ultrasparc Tx, or Cavium; the former just had too poor performance, though it actually did work in some niche applications; the latter is plagued by performance issues and bugs all over the chip. Cavium said the next generation will surely fix them all. Notice how a few years ago many vendors made a big noise abou AArch64 in the data center - they're all pretty silent now...

I tried the Cavium 96 core machine for building llvm (one of my primary workloads). It had 65% of the performance of the top EC2 c4.8xlarge machine for 30% of the price. Pretty good.

It beat the next lower c4.4xlarge in absolute terms.

brucehoult · « **Reply #41 on:** April 30, 2017, 10:00:17 pm »

Quote from: dom0 on April 30, 2017, 05:18:26 pm

"That CISC decoder is so complex and power hungry and needs so many transistors" is pretty much just 90s RISC propaganda these days. Heck, it was the 90s when Intel, AMD and some other companies(!) showed off that the ISA on the frontend doesn't matter and that the popular marketing, even academic, opinion that x86 will never be as high performance as RISC (POWER/SPARC/Alpha), because it's "literally impossible to use pipelining" or make procedure calls fast or any other disproven "fact", is bull*****.

You just told me, in the previous post: "Similarly if you look at a POWER or SPARC core the decoder takes a large part of the logic transistor budget; in almost all CPU cores -regardless of ARM/x86/POWER/SPARC- nowadays the decoder (the parts you can identify as one) is usually the largest logic part (even on the floor plan)."

Please do try to choose one or the other.

Quote

The truth is that it doesn't matter and that's precisely the reason why all RISC architectures were flushed down the drain by 2000 in the mainstream server and workstation markets. A bit later all long-pipeline µarchs were flushed after them (some pun intended). Pipe didn't clog from those, either.

As I said in my original post, when you have massively OOO (e.g. 300 instructions in flight, register rename up the wazoo) the decode doesn't matter.

It also, however, means that you're using ten times the transistors and power to get twice the single core performance. This makes sense in desktop systems. It doesn't make sense in either phones or the datacentre.

GeorgeOfTheJungle · « **Reply #42 on:** April 30, 2017, 10:19:23 pm »

Quote from: dom0 on April 30, 2017, 05:22:56 pm

the program probably spends most of the time contending over that lock, not doing the calculations.

Note to the casual reader: that ^^^ is bad and renders the fp benchmark ~ useless (for the intended purpose). Other (hopefully better) tests show that the Intel i7 Mac was only 3x (not 12x) times faster than the $8 Orange Pi Zero in this silly average fp/doubles benchmark.

JuanAG · « **Reply #43 on:** May 01, 2017, 09:14:08 pm »

Quote from: GeorgeOfTheJungle on April 30, 2017, 07:04:30 am

Quote from: JuanAG on April 30, 2017, 03:56:34 am
[..] someone has said that 1 GHz of an arm CPU is less than a 1 GHz of an x86, no, it is not true [..]

I have to disagree there ^^^, the billions of transistors in an Intel CPU are there for a reason: performance. ARM cores are much simpler to save energy (and that's why they are in every smartphone in the planet). If you ever wanted an ARM core to perform as well as an Intel you'd have to put in it nearly as many transistors as an Intel has but then would as well require as many watts.

More transistors doesnt mean more performance (a 4 cilinder car can have more horsepower than a V8)

1/3 or the transistors are for the graphics, 1/5 is for the L3 cache and if you see the core itself only a small amount of transistors are the real power in form of ALUs

So more transistors if they doesnt do anything useful doesnt mean nothing, a supercomputer is doing the same operations again and again, it focuses on few instructions and much of wich will have the same cycles in arm or intel so it performs same clock-to-clock comparation, only in "rare" operations like sqrt or similar exists a difference but arm is more flexible and it will offer more instructions to match better the code that x86 which doesnt have all the operations + increment so in x86 will be two cycles, the op and later the inc.
Or the registers in wich case x86 has few so you has to store and load more from the caches, the pi cpu has 29 64bits (are 32 but three are reserved) Xserires, another 29 (32 really) 32 bits Wseries and 30 if i remember right (of 32) 128 bits Vseries (these are special and can be broken to 64 bits Dseries or 32 bits Sseries or 16 bits Hseries) meaning that you almost dont need to store because you run out of registers and load later as most as you has to do in x86

The compiler is the key, i do asm my self so i can see first hand that optimized code for both performs the same and if you take time to optimize to an insane level arm will do better

ARM is not only for low power applications, intel had to say goodbye to the atoms line because it performs worst drawing more power than an ARM cpu

Mechatrommer · « **Reply #44 on:** May 02, 2017, 12:46:31 am »

isnt graphics/cache transistors are there for performance? try running altium designer, photoshop, autodesk inventor in arm, if arm can do as well as intel, i'll buy arm computer. everybody try to be an engineer here.

GeorgeOfTheJungle · « **Reply #45 on:** May 02, 2017, 07:50:44 am »

This is very interesting: "What does an instruction set designer do when she is given moar transistors?"

And this too:

GeorgeOfTheJungle · « **Reply #46 on:** May 02, 2017, 07:57:35 am »

Quote from: JuanAG on May 01, 2017, 09:14:08 pm

The compiler is the key, i do asm my self so i can see first hand that optimized code for both performs the same and if you take time to optimize to an insane level arm will do better

No doubt if you do that you may find some sweet spots, but that's not how things work. Take a linux performance benchmark instead to test the system as a whole and you'll see what are those billions of transistors there for.

E.g. try to fine tune the benchmarks above and make them run faster in the OPi, I doubt very much you'll find a sweet spot there because the caches are less clever and much smaller, the buses are narrower and much slower, the RAM as well, etc. IOW the ARM has less hardware features (=>less transistors =>less power) of the kind that make a CPU perform better.

dom0 · « **Reply #47 on:** May 02, 2017, 08:58:18 am »

Quote from: blueskull on May 02, 2017, 06:30:51 am

Quote from: cleaningOut on May 02, 2017, 01:19:45 am
Quote from: Mechatrommer on May 02, 2017, 12:46:31 am
isnt graphics/cache transistors are there for performance? try running altium designer, photoshop, autodesk inventor in arm, if arm can do as well as intel, i'll buy arm computer. everybody try to be an engineer here.

None of those programs have been ported to ARM, and the companies involved have basically no incentive to do so.

His point is that iGPU is useful, but that's only true for single CPU desktop users. If you run a server or a multi CPU system, the extra iGPU resource is wasted. Of course, you can use OpenCL on it, but then why not just use a GPU card which is more powerful and much cheaper than the iGPU per TFLOPS.

That's why there is no iGPU on server chips (sockets R and such).

cprobertson1 · « **Reply #48 on:** May 02, 2017, 09:53:05 am »

Crikey! I take my eyes off the thread for two minutes and it runs away without me

For my application, there doesn't appear to be too many FPOs - so a higher MIP rate would probably be advantageous - having said that - there are several nested layers of inefficiencies slapped on top of that, so I'd find it hard to characterise my code in terms of resources actually required - which was why I was fairly generic in my "just need clockspeed" requirement - which is vague and pretty misleading as far as these things go

My assumption here is that "once running it runs as fast as the processor allows" - so a faster processor ? faster computation (negative losses between RAM <-> CPU <-> HDD of course): so rather than building an optimum system, I had figured a more generic system with a high overall clock speed would be adequate (obviously there are diminishing returns here: especially since my code is not optimised to eke the most out the available hardware).

As I mentioned, I've tested this with a beowulf cluster of 4x Raspberry Pi - and I've just done a rough benchmark this weekend to figure out the effects of additional processing units.

[Processors]	[Total]	[Increase]	[% increase]	[% efficiency]
1	252500	252500	0	100
2	497400	244900	97	97
3	484900	240000	98	95
4	472800	232800	97	92

*Average tests run per hour

These ran for three hours on each setup and the average calculated (and rounded to the nearest 100) - as you can see there's an approximate 3% loss with each additional stage (there is a chaotic factor involved since it's a genetic algorithm - sometimes it gets lucky and the code is very effective against the opponent-code, in which case the test proceeds quickly; other times it gets unlucky and they sit there circling each other and resolves in a tie after a certain number of cycles.)

Im not sure if there are additional bottlenecks in the network switch that will cause the losses to escalate further with increasing node count.

I also suspect the control node may at some point struggle to generate sample code - though at the moment it takes 30 seconds to generate 8000 code samples (working out at 960'000 tests, which takes ~2 hours to perform under the beowulf cluster) so I don't think it's too stressed.

I have also discovered that staggering the testing start times on the control nodes reduces the amount of data going back to the control node at any given time: which has a disproportionate effect on overall speed/efficiency (the slave nodes sit idle while waiting for orders - if the control node receives data from all the slave nodes simultaneously it deals with the first one and the rest get buffered until it's finished. The result is that each slave node has to wait for every other slave node before it - time that could be spent processing; by staggering them there is no buffering and no more than one slave is sitting idle at any particular time (unless one finishes late while the next one finishes early)

Anyhoo, I'm going to read through all this again - quite a lot of interesting stuff here!

dom0 · « **Reply #49 on:** May 02, 2017, 11:22:18 am »

Hm, from a software perspective this sounds like an interesting project to write a (small) JIT for your bytecode (I assume). They're great fun to build


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

Author Topic: The search for a (CHEAP) supercomputer... (Read 18367 times)

Share me