Author Topic: LTSpice - performance investigation - more threads≠more speed!  (Read 5321 times)


Offline daqq (Topic starter)

  • Super Contributor
  • ***
  • Posts: 2315
  • Country: sk
    • My site
Hi guys,

I've been playing around with LTSpice and trying to speed it up and I thought I'd show you the results.

The tests were done on a reasonably hefty machine (i7-8700 running at ~4.3 GHz (the clock changes dynamically, but this is the max. frequency), 16 GB of RAM, SSD), running Windows 10 with all of the latest updates. Nothing of note aside from LTSpice was running during the test. LTSpice version: XVII (x64), Dec 5 2019. I ran the attached simulation. It is a small-ish simulation of a DC-DC converter, pretty much the test jig for the device, only with saturable inductors (I'm not sure I've done them properly) and a second filter; basically I was just messing around. The circuit itself is not very interesting and was only used to exercise the simulator. Simulation file:
* LTSpiceTest.asc (4.9 kB - downloaded 105 times.)

The total duration of the simulation was taken from the generated .log file, line: Total elapsed time: xx.xxx seconds. The CPU load was eyeballed from Windows Task Manager. The base load was around 1-2%. There is some repeatability there, but it's not exactly 100%. Trying one simulation 3 times yielded 43 to 45 seconds simulation time with the same settings. Not sure why.
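For anyone who wants to repeat the measurement without copying numbers by hand, here is a minimal Python sketch of pulling the duration out of the log text and summarizing repeated runs. Only the "Total elapsed time: xx.xxx seconds" line format comes from the post above; the function names and the sample log excerpt are made up for illustration. Reading the actual .log file is left out on purpose, since its text encoding may differ between LTSpice versions.

```python
import re
import statistics

# Matches the log line quoted in the post: "Total elapsed time: xx.xxx seconds"
ELAPSED_RE = re.compile(r"Total elapsed time:\s*([0-9]+(?:\.[0-9]+)?)\s*seconds")


def elapsed_seconds(log_text):
    """Return the elapsed time in seconds found in the log text, or None."""
    match = ELAPSED_RE.search(log_text)
    return float(match.group(1)) if match else None


def summarize(times):
    """Return (mean, spread) for repeated runs, where spread = max - min."""
    return statistics.mean(times), max(times) - min(times)


# Hypothetical log excerpt; only the "Total elapsed time" line format is from the post.
sample_log = "Some other log line\nTotal elapsed time: 43.911 seconds.\n"
print(elapsed_seconds(sample_log))
```

Running each setting several times and looking at the spread gives a rough idea of how large a difference has to be before it means anything.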

I tried different values of the Max threads setting (Settings->SPICE->Max threads).

Results:
The fastest simulation is with 4 or 5 threads. The CPU I use has 6 cores, each supporting 2 threads, but this is not the same as having 12 cores. At 6 threads the speed starts to drop off and beyond that it slows down further. Interestingly enough, at 8 to 12 threads the simulation speed is comparable to what we get with just 1 thread. I also tried the Alternate solver, but it's roughly 1.5x to 1.8x slower than the normal one.

My guess is that at some point the overhead overtakes the benefit of having more threads available. Since there are only 6 real cores on the chip, adding more threads does not add FLOPS or anything similar. I noticed this with OpenEMS as well: when dealing with serious number crunching, the extra 'thread' on each core is useless, or worse, it adds overhead without actually contributing anything good.

Simulation duration: [chart not preserved; see the raw data below]

CPU load: [chart not preserved; see the raw data below]

I have also tried other options when running @5 threads.
Tweak                           Sim duration (s)
Reference value                 45
Thread priority: High           44.039
Disable 1st order compression   43.399
Matrix compiler: Pseudocode     45.306
Matrix compiler: (off)          46.865
Disable 2nd order compression   44.708
ASCII files                     118.79

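The only tweaks/settings that improved simulation speed were setting a high thread priority and disabling compression, and even those gains (under 2 seconds) are within the 43 to 45 second run-to-run spread mentioned above. Everything else made the result worse, ASCII files dramatically so.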
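To put the tweak table into perspective, here is a small Python sketch that compares each result with the reference and with the run-to-run spread. The durations are copied from the table; treating the 2 second spread (43 to 45 s on repeated identical runs) as a noise threshold is an assumption made for illustration, not a proper statistical test.

```python
REFERENCE_S = 45.0          # reference run at 5 threads, from the table
RUN_TO_RUN_SPREAD_S = 2.0   # assumed noise band: 43 to 45 s on repeated runs

# Durations in seconds, copied from the tweak table
tweaks = {
    "Thread priority: High": 44.039,
    "Disable 1st order compression": 43.399,
    "Matrix compiler: Pseudocode": 45.306,
    "Matrix compiler: (off)": 46.865,
    "Disable 2nd order compression": 44.708,
    "ASCII files": 118.79,
}


def classify(duration_s, reference_s=REFERENCE_S, spread_s=RUN_TO_RUN_SPREAD_S):
    """Return (delta in seconds, verdict) relative to the reference run."""
    delta = duration_s - reference_s
    if abs(delta) <= spread_s:
        return delta, "within noise"
    return delta, "slower" if delta > 0 else "faster"


for name, duration in tweaks.items():
    delta, verdict = classify(duration)
    print(f"{name:32s} {delta:+8.3f} s  {verdict}")
```

With that threshold, only the ASCII files setting clearly stands out; the other differences would need more repeated runs to confirm.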

Conclusion:
Don't assume that more threads will get you the best results. Experiment a lot. I'm going to be doing lots of simulations now, so I have reason to tweak it as much as possible. Please take all of this with a grain of salt.

Best regards,

David

Raw data:
Normal solver:
Threads   Sim duration (s)   CPU load (%)
1         54.632             13
2         49.359             25
3         45.628             36
4         44.055             46
5         43.911             58
6         45.898             70
7         49.625             80
8         54.504             92
9         54.195             93
10        55.585             93
11        52.423             92
12        52.626             91


Alt solver:
Threads   Sim duration (s)   CPU load (%)
1         97.08              14
2         81.327             25
3         73.169             35
4         73.022             46
5         68.785             58
6         72.091             69
7         77.426             80
8         89.132             91
9         85.823             91
10        80.848             91
11        81.095             91
12        84.796             92
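For a quick summary of the raw data, here is a short Python sketch that finds the fastest thread count for each solver and its speedup over a single thread. The durations are copied from the two tables above; the function name is made up for illustration.

```python
# Durations in seconds for 1..12 threads, copied from the raw data tables
normal = [54.632, 49.359, 45.628, 44.055, 43.911, 45.898,
          49.625, 54.504, 54.195, 55.585, 52.423, 52.626]
alt = [97.08, 81.327, 73.169, 73.022, 68.785, 72.091,
       77.426, 89.132, 85.823, 80.848, 81.095, 84.796]


def best_thread_count(durations):
    """Return (threads, seconds, speedup vs. 1 thread) for the fastest run."""
    best_index = min(range(len(durations)), key=durations.__getitem__)
    best = durations[best_index]
    return best_index + 1, best, durations[0] / best


for name, data in (("normal", normal), ("alt", alt)):
    threads, seconds, speedup = best_thread_count(data)
    print(f"{name}: best at {threads} threads, {seconds:.3f} s, "
          f"{speedup:.2f}x vs 1 thread")
```

Both solvers are fastest at 5 threads, with a speedup of only about 1.24x (normal) and 1.41x (alternate) over a single thread.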

« Last Edit: January 04, 2020, 09:55:13 pm by daqq »
Believe it or not, pointy haired people do exist!
+++Divide By Cucumber Error. Please Reinstall Universe And Reboot +++
 
The following users thanked this post: MagicSmoker, thm_w

Offline daqq (Topic starter)

  • Super Contributor
  • ***
  • Posts: 2315
  • Country: sk
    • My site
Re: LTSpice - performance investigation - more threads≠more speed!
« Reply #1 on: January 04, 2020, 09:45:44 pm »
Note: There are a few lines in the LTSpice help on Threads:
Quote
The maximum number of threads is set to the maximum number of concurrently executing threads your OS and CPU hardware supports. The actual number used in any given simulation depends on the nature of the circuit. While LTspice introduces stochastically cooled threads to define the state of the art of multi-threaded SPICE simulation, there are some circuits that cannot benefit from multiple threads. LTspice will not tie up additional threads that don't in the end make the simulation run faster.
Basically, the solver limits the number of threads to a useful amount, but it probably does not take into account the actual number of real cores on your CPU.
 

Offline SilverSolder

  • Super Contributor
  • ***
  • Posts: 6126
  • Country: 00
Re: LTSpice - performance investigation - more threads≠more speed!
« Reply #2 on: January 04, 2020, 09:54:06 pm »

It is fairly normal to see the best results on servers when using all cores minus one for processing, leaving the last core to handle operating system requests.  Your findings appear to be par for the course.
 
The following users thanked this post: tooki

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15432
  • Country: fr
Re: LTSpice - performance investigation - more threads≠more speed!
« Reply #3 on: January 04, 2020, 10:07:40 pm »
Yes, this is a common issue for multi-threaded code. Several factors here:
1/ Resource contention (especially RAM access): beyond some point, more threads can lead to worse performance if the code makes heavy access to RAM, especially on shared memory;
2/ Code simply not well written for multi-threading: there are many, many ways of multi-threading simulation (/heavy number crunching) code. Not all will actually distribute work evenly across threads;
3/ As you also noted, hyper-threading in Intel CPUs is cool but is certainly no replacement for more cores. The FPU, for instance (and if I'm not mistaken), will be shared by all threads running on the same core, and simulation programs use FPUs intensively.

LTSpice seems to select the max number of threads by default, which I think (and you proved) is silly. Detecting the real number of cores is not hard, so they should probably add this in a future release.
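As a rough illustration of picking a thread count from the real core count, here is a minimal Python sketch. Note that `os.cpu_count()` reports logical CPUs, so the physical core count is estimated here by assuming 2 hardware threads per core, which is an assumption and not true of every CPU (the third-party `psutil.cpu_count(logical=False)` is one way to query it directly). Reserving one core follows the "all cores minus one" rule of thumb mentioned earlier in the thread. The function name and its parameters are made up for illustration and have nothing to do with how LTSpice picks its default.

```python
import os


def suggest_max_threads(logical_cpus, threads_per_core=2, reserve_cores=1):
    """Suggest a thread count: estimated physical cores minus a reserve.

    threads_per_core=2 is an assumption (typical for hyper-threaded CPUs).
    """
    physical = max(1, logical_cpus // threads_per_core)
    return max(1, physical - reserve_cores)


# os.cpu_count() returns the number of logical CPUs, or None if unknown
logical = os.cpu_count() or 1
print(f"{logical} logical CPUs -> try Max threads = {suggest_max_threads(logical)}")
```

For the 6-core, 12-thread i7-8700 in the first post this gives 5, which matches the fastest measured setting, but that is a single data point for a single circuit.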

« Last Edit: January 04, 2020, 10:09:29 pm by SiliconWizard »
 

Online magic

  • Super Contributor
  • ***
  • Posts: 7243
  • Country: pl
Re: LTSpice - performance investigation - more threads≠more speed!
« Reply #4 on: January 04, 2020, 10:42:51 pm »
Clearly the workload isn't multiprocessing-friendly. For the faster solver, you see right away that enabling two cores reduces execution time by 10%, which isn't even close to the 50% you would expect in the ideal case. It's creating a lot of additional work for the CPU just to split the original work in half.

A secondary factor is probably turbo boost slowing down the CPU as more cores become active. Good to know, at any rate.
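One way to put a number on this is Amdahl's law, which models the run time on n threads as t1 * ((1 - p) + p / n), where p is the fraction of the work that parallelizes. The Python sketch below solves that for p from the 1-thread and 2-thread timings of the normal solver in the first post. This is a deliberately simple model chosen for illustration: it ignores threading overhead, hyper-threading and turbo boost, so treat the result as a rough estimate.

```python
def parallel_fraction(t1, tn, n):
    """Solve Amdahl's law, tn = t1 * ((1 - p) + p / n), for p."""
    return (1 - tn / t1) * n / (n - 1)


def predicted_time(t1, p, n):
    """Run time on n threads predicted by Amdahl's law."""
    return t1 * ((1 - p) + p / n)


def asymptotic_time(t1, p):
    """Run time limit for an unlimited number of threads."""
    return t1 * (1 - p)


# Normal solver timings in seconds for 1 and 2 threads, from the raw data
t1, t2 = 54.632, 49.359
p = parallel_fraction(t1, t2, 2)
print(f"parallel fraction: {p:.3f}")
print(f"predicted at 5 threads: {predicted_time(t1, p, 5):.1f} s")
print(f"limit for many threads: {asymptotic_time(t1, p):.1f} s")
```

Under this model only about 19% of the work parallelizes, and the run time could never drop below roughly 44 s no matter how many cores are added, which is close to the best measured result of 43.911 s.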
 

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: LTSpice - performance investigation - more threads≠more speed!
« Reply #5 on: January 04, 2020, 11:19:45 pm »
Quote
The FPU, for instance (and if I'm not mistaken), will be shared by all threads running on the same core, and simulation programs use FPUs intensively.
I am afraid that you are mistaken. The time when the FPU was separate from the CPU is long gone. Today each core is capable of computing floating point instructions on its own.
 
The following users thanked this post: tooki

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15432
  • Country: fr
Re: LTSpice - performance investigation - more threads≠more speed!
« Reply #6 on: January 04, 2020, 11:20:02 pm »
Also note that not all simulations multi-thread equally well in LTSpice. I haven't really made experiments as you did, but I've certainly noticed that simulating some circuits would only take like 20% CPU or less (on a 6-core, 12-thread Core i7 here), meaning most threads must be idle most of the time waiting for the others to complete, whereas other circuits would take more like 70-80%. I would suspect that the internal LT models for some of their components, especially the switching-mode converters, scale well, whereas the base solver doesn't (which wouldn't be surprising at all).
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15432
  • Country: fr
Re: LTSpice - performance investigation - more threads≠more speed!
« Reply #7 on: January 04, 2020, 11:22:07 pm »
Quote
Quote
The FPU, for instance (and if I'm not mistaken), will be shared by all threads running on the same core, and simulation programs use FPUs intensively.
I am afraid that you are mistaken. The time when the FPU was separate from the CPU is long gone. Today each core is capable of computing floating point instructions on its own.

You didn't understand what I said then. We were discussing "hyper-threading", and I was saying that each core has only 1 FPU (not that they all shared only 1). Meaning that if you go for more threads than the number of cores, you'll get into contention with the FPUs. So on a 6-core, 12-thread hyper-threaded CPU, using more than 6 threads if dealing with heavy FPU stuff is counter-productive.
 
The following users thanked this post: ogden

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: LTSpice - performance investigation - more threads≠more speed!
« Reply #8 on: January 04, 2020, 11:30:35 pm »
Quote
So on a 6-core, 12-thread hyper-threaded CPU, using more than 6 threads if dealing with heavy FPU stuff is counter-productive.
Yes, I managed to miss that you were talking about hyperthreading rather than multicore. Sorry. We obviously agree then. Hyperthreading is kind of a "context switch accelerator"; it does not do any good for computing (floating or integer).
 

Offline daqq (Topic starter)

  • Super Contributor
  • ***
  • Posts: 2315
  • Country: sk
    • My site
Re: LTSpice - performance investigation - more threads≠more speed!
« Reply #9 on: January 06, 2020, 10:24:15 am »
Thanks for the opinions and advice, I'll try those settings, could be useful.

I would be interested, though, in how the speed scales with more cores than I have available. Anyone with a RYZEN?  ;D
 

Online magic

  • Super Contributor
  • ***
  • Posts: 7243
  • Country: pl
Re: LTSpice - performance investigation - more threads≠more speed!
« Reply #10 on: January 06, 2020, 10:33:04 am »
It doesn't :P

Not in your test case, at any rate.
 

Online Berni

  • Super Contributor
  • ***
  • Posts: 5031
  • Country: si
Re: LTSpice - performance investigation - more threads≠more speed!
« Reply #11 on: January 06, 2020, 12:51:13 pm »
You might be misunderstanding the point of hyperthreading.

Having two threads on one CPU core doesn't suddenly mean that the core can execute twice as much work. There is no FPU sharing; all modern CPUs typically have a dedicated FPU built into each core (though AMD did share an FPU between two cores on the old FX series right before the radio silence leading up to Ryzen). The core is still doing the same amount of work per clock cycle, except that it is now jumping back and forth between the two threads, pausing the inactive thread in the meantime.

The only reason that hyperthreading exists is to provide an extra 'waiting line' for work for when one thread stalls (due to waiting for memory or an overburdened CPU resource). Without hyperthreading, waiting for memory access can stall the core and force it to wait, but with hyperthreading the CPU will switch to executing the other thread in that time, then switch back to the first thread once it's ready to continue work. So hyperthreading only improves performance when the particular workload stalls the core often enough.

However, both threads still use the same core cache. So by running two threads the workload can outgrow the small fast cache and force data to be moved up into the larger but slower L2 and L3 caches. This causes a lot of memory access stalls for the core and hurts performance (even hyperthreading won't help when both threads stall at the same time).

Also, FPUs are becoming much more powerful in modern CPUs, but not in an easily scalable way. These days they can do things like 16 floating point MAC calculations per cycle, but only if you use the new AVX512 instructions that feed these 16 math operations through the math core in parallel. The existing old x86 multiply instructions that feed the FPU one operation at a time still run at the same speed as they did 15 years ago. The FPU could calculate faster but can't if it's not fed the data fast enough by the rest of the core logic.

You could solve the problem by giving each of the two threads not only separate registers but also separate execution logic and separate cache... but at that point you have essentially doubled almost all of the core, so you might as well make a 12-core CPU instead. You could skip duplicating the big powerful AVX512 FPU to save a decent number of transistors and share it between two cores (it's clearly fast enough to keep up if smaller floating point instructions are used), and that's how you get to the AMD FX design (which was the first reasonably priced 8-core x86 CPU). Why Intel didn't do that with AVX512 might be to avoid getting crippled in multithreaded math benchmarks (like the AMD FX did), or they just didn't care about transistor count so much.
 

Offline ogden

  • Super Contributor
  • ***
  • Posts: 3731
  • Country: lv
Re: LTSpice - performance investigation - more threads≠more speed!
« Reply #12 on: January 06, 2020, 03:18:33 pm »
Quote
You might be misunderstanding the point of hyperthreading.
Which post are you talking about? Please be specific, use quotes. I re-read every post that mentions hyperthreading and they all in general agree with what you say.  :-// :-// :-//
 

Online Berni

  • Super Contributor
  • ***
  • Posts: 5031
  • Country: si
Re: LTSpice - performance investigation - more threads≠more speed!
« Reply #13 on: January 06, 2020, 06:36:56 pm »
Quote
Quote
You might be misunderstanding the point of hyperthreading.
Which post are you talking about? Please be specific, use quotes. I re-read every post that mentions hyperthreading and they all in general agree with what you say.  :-// :-// :-//

It was mostly about the FPU being shared between threads being a factor in the slowdown. Technically true since both threads use the same FPU but has nothing to do with performance.

Adding more cores/threads is a pretty brute-force way of pumping more calculations through the FPU, while using modern AVX instructions is more efficient at that. But I have no idea how well SPICE lends itself to using those, nor how much existing SPICE tools use the more advanced math instructions.
 

