Oooft! Lots of great info here!
At the moment the program can be set to compile CoreWars code, or my own system inspired by it. They work in the same way, but mine is a little more modern and requires a little more processing power (i.e. it's slightly less efficient). As a result, the flow chart below uses pMARS as the assembly compiler/tester instead of my own system (there are more efficient MARS compilers out there anyway - I haven't optimised that side of things yet!).
Relating to the program flow:
The code all runs under an overseer application (Python), which starts all the other modules and then acts as a watchdog timer and memory manager, deleting old files as required.
The overseer starts the compiler with a series of initial parameters. This generates a batch of test-code, each piece in its own text file (typically <500 bytes, limited to about 4.2kB). How these are stored depends on the config file - they can go in RAM or on a particular drive.
The overseer then starts the scheduler, which splits up the population for testing - again based on the configuration - deciding which cores to start processes on.
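The splitting itself is simple to picture - here's one plausible shape for it (round-robin batching is my assumption; the real scheduler may weight cores differently):

```python
def split_population(population: list, n_cores: int) -> list:
    """Deal the population out round-robin into n_cores batches,
    one batch per core that will run tests."""
    batches = [[] for _ in range(n_cores)]
    for i, warrior in enumerate(population):
        batches[i % n_cores].append(warrior)
    return batches
```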
On top of that (but still under the overseer) is the controller, which controls the data flow between the master node (which does everything except testing) and any slave nodes. I know that's not the right terminology, but you understand exactly what I mean, so I'm not going to stop using it.
The controller just sends test-orders and any required code-to-be-tested to the slaves, and receives the results back (which it in turn reports to the compiler, which then "breeds" the test-code), and the process restarts.
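The "breed" step is essentially one generation of a genetic algorithm. A sketch under my own assumptions (survivor fraction, single-point crossover, a placeholder mutation - none of this is necessarily how the actual compiler breeds):

```python
import random

def breed(population, scores, mutation_rate=0.05, rng=None):
    """population: list of warriors, each a list of instruction strings;
    scores: parallel list of test results. Keep the top half, then refill
    the generation by single-point crossover with occasional mutation."""
    rng = rng or random.Random()
    # rank best-first by score (key avoids comparing the warriors themselves)
    ranked = [w for _, w in sorted(zip(scores, population), key=lambda p: -p[0])]
    survivors = ranked[: max(2, len(ranked) // 2)]
    children = []
    while len(survivors) + len(children) < len(population):
        a, b = rng.sample(survivors, 2)
        cut = rng.randrange(1, min(len(a), len(b)))
        child = a[:cut] + b[cut:]
        if rng.random() < mutation_rate:
            # placeholder mutation: overwrite one line with a DAT
            child[rng.randrange(len(child))] = "DAT #0, #0"
        children.append(child)
    return survivors + children
```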
The test just involves generating a virtual memory core with assembly commands in it, then executing the code while alternating between a number of pointers. Once a set of conditions is met (usually a cycle count), a scoring algorithm runs and the "winner" is reported back to the controller. This isn't particularly hard on the PC - CoreWars was originally released in 1984 and it ran fine then - and the RAM requirements are minimal: each simulated "core" needs 448kB for the core data and another 50kB for the supporting code. It's tiny, which is what I rely on when I run as many tests as I can!
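That alternating-pointer loop is the heart of a MARS. Here's a deliberately toy sketch of the structure - only a MOV-style "imp" behaviour is modelled, nothing like the full Redcode instruction set, and the core size is just the classic 8000:

```python
CORE_SIZE = 8000  # classic Core War core size (toy value here)

def run_battle(warriors, max_cycles=80000):
    """warriors: {name: start_address}. Each cycle, execute one instruction
    per still-alive warrior in turn; executing DAT kills that warrior.
    Returns the set of survivors, which the scoring step would rank."""
    core = [("DAT", 0)] * CORE_SIZE
    # load a trivial one-instruction "imp" at each start address
    for start in warriors.values():
        core[start] = ("MOV", 1)  # copy self one cell forward
    pointers = dict(warriors)
    alive = set(warriors)
    for _ in range(max_cycles):
        for name in list(alive):
            pc = pointers[name]
            op, arg = core[pc]
            if op == "DAT":
                alive.discard(name)  # process dies
            elif op == "MOV":
                core[(pc + arg) % CORE_SIZE] = core[pc]
                pointers[name] = (pc + 1) % CORE_SIZE
        if len(alive) <= 1:
            break
    return alive
```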
BUT because of this it's quite a lot of high-level stuff - I'm not sure it'd be easy to optimise to run on a GPU =/ I've made some optimisation tweaks so far, and I'll optimise it a little more as I go through it.
Hello,
I'd suggest checking out the NanoPi NEO - Allwinner H3 (quad-core 1.2GHz) - for under $8 (256MB) or $10 (512MB):
http://www.friendlyarm.com/index.php?route=product/product&product_id=132
And the Orange Pi Zero https://aliexpress.com/store/1553371
Ah, that NanoPi might be exactly what I'm after: the power-to-cost ratio is pretty good. The Orange Pi Zero draws more power and costs more, so the NEO may be the better fit for my application.
As for cooling (and noise pollution) - I'm thinking of simply submerging them in a tank of mineral oil (maybe using an aquarium pump to add some flow to the oil). Provided it's low-oxygen oil, it shouldn't cause any corrosion problems.
16x NanoPi NEOs work out at ~16 GHz of processing (allowing for losses - it'll likely be higher), and that should be fairly expandable - so I can start slow and ramp up the power over time.
Not sure what I'll go for yet though - still researching it!
In the meantime I have a MiniPC from 2003 that I'll run 24/7 in the living room while I work on upping the processing power - it's a little slow, but it will at least look pretty!