Author Topic: EEVblog #726 - Dual Xeon Video Editing Machine Build (Read 76696 times)

TiN · « **Reply #50 on:** March 23, 2015, 04:48:02 pm »

Just to be picky, don't count CPU cores with HT

The 3770K is a 4-core/8-thread part, and the 2630s are 6-core/12-thread.

Also it's best to use all the blue slots on LGA2011, since it's a quad-channel platform. But to be honest, the gain going from two channels to four is usually not much, and you can add memory later.

I'd rather recommend looking into GPU-aided video editing software, since software that does its processing on the GPU (e.g. NVIDIA CUDA) is usually way faster than any CPU. Vegas seems to have supported GPGPU since version 11 Pro, but I am not sure exactly, as I have only used Adobe stuff for editing/encoding.

v81 · « **Reply #51 on:** March 23, 2015, 04:51:53 pm »

Quote from: dexters_lab on March 23, 2015, 04:02:29 pm

Quote from: v81 on March 23, 2015, 03:46:39 pm

RE: Hyperthreading, it has no significant benefit in this scenario.
You will likely find it has no effect on the i7 if you turn it off there too.

really? on my i7 vegas is 30% slower rendering with HT turned off.

Doesn't sound right, though i guess it depends on the encoder being used.
Most encoders show very little benefit from HT.

Grapsus · « **Reply #52 on:** March 23, 2015, 05:15:30 pm »

the quantity of BS and ignorance I see in this thread is shocking. Before comparing carrots and oranges please take a look at the motherboard manual :

http://www.supermicro.com/products/motherboard/Xeon/C600/X9DAE.cfm

Now look page 1-10 "System block diagram", see how half of the memory is connected to CPU1, the other half to CPU2 and a bus called QPI between CPU1 and CPU2 ?

Then you should read about how this link between the CPUs works :

http://www.intel.com/content/dam/doc/white-paper/quick-path-interconnect-introduction-paper.pdf

How do you think it works when CPU2 is trying to access a location that is physically connected to CPU1? It has to send a request to CPU1 through QPI; CPU1 then decodes the request and routes it to its memory controller, which fetches the data from memory, and the response is sent back through QPI. QPI is a complete networking protocol with routing, OSI layers etc. Add to this cache handling, resource sharing, contention. Therefore there is a performance penalty when a CPU tries to use a location that's not directly attached to it.

A single consumer CPU like a Core i7 has only one memory controller, it doesn't have this interconnect problem.

On the other side this dual processor Xeon with QPI is a server architecture whose aim is to go way beyond 128 GB of RAM. This architecture called NUMA is a trade-off in order to fit so much RAM in a single machine. The gain is memory quantity and the drawback is non-uniformity in access times. However on servers, the penalty is minimal. The typical workload is 100s of processes in parallel, each serving a separate client or a request. A NUMA-aware OS pins each task to a particular CPU and then allocates pages of memory attached to the same CPU. So a process scheduled like this only sees local memory and it's all nice.

This is all very well described even on wikipedia:

http://en.wikipedia.org/wiki/Non-uniform_memory_access

Quote

The benefits of NUMA are limited to particular workloads, notably on servers where the data are often associated strongly with certain tasks or users.

Imagine you try to run a task trying to take advantage of all the cores and all the memory at the same time, the OS will be forced to give away RAM from different physical CPUs to the same process. If the process is not aware that accessing location A from thread X is faster than from thread Y performance will be shit.

But how can you blame consumer video rendering software for not being aware of this server architecture ? This is nonsense.

The real solution is either to find video software that explicitly supports non-uniform access architectures or to find a way to share the workload among several separate processes. The latter might be possible for video, with one process rendering the first half of the movie and another one rendering the second half.
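A minimal sketch of the "several separate processes" idea, assuming the project can be cut at arbitrary frame numbers. The function name `split_frames` is made up for illustration, and a real encoder would probably need the cuts to land on keyframes and the segments to be joined afterwards; none of that is handled here.

```python
def split_frames(total_frames: int, n_workers: int) -> list[tuple[int, int]]:
    """Split [0, total_frames) into n_workers contiguous half-open ranges.

    Hypothetical helper: each (start, end) pair would be handed to a
    separate render process, ideally one per CPU socket.
    """
    if total_frames < 0 or n_workers < 1:
        raise ValueError("need total_frames >= 0 and n_workers >= 1")
    base, extra = divmod(total_frames, n_workers)
    ranges = []
    start = 0
    for i in range(n_workers):
        # The first `extra` workers take one additional frame each.
        length = base + (1 if i < extra else 0)
        ranges.append((start, start + length))
        start += length
    return ranges


if __name__ == "__main__":
    # Example: a 10-minute video at 50 fps, one process per socket.
    for worker, (first, last) in enumerate(split_frames(10 * 60 * 50, 2)):
        print(f"process {worker}: frames {first}..{last - 1}")
```

Each process then only touches its own part of the timeline, so a NUMA-aware OS has a chance to keep its memory local to one socket.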

victor · « **Reply #53 on:** March 23, 2015, 05:32:46 pm »

At this point, maybe you should get another camera that shoots at 60fps or another that shoots at 50fps and render without changing the frame rates (you said a few times that this is the problem). I'm not sure if it is viable money-wise, but maybe it's your best option, if you are doing this for the sole purpose of rendering speed.

Or maybe it's time to look for another editing software and re-learn your workflow. Almost any editing software will do the job you need, since you only need the basics: just trim and join clips looking at waveforms, and text overlay.

DanielS · « **Reply #54 on:** March 23, 2015, 06:03:00 pm »

Quote from: v81 on March 23, 2015, 04:51:53 pm

Quote from: dexters_lab on March 23, 2015, 04:02:29 pm
really? on my i7 vegas is 30% slower rendering with HT turned off.
Doesn't sound right, though i guess it depends on the encoder being used.
Most encoders show very little benefit from HT.

Poorly written and threaded encoders scale poorly. Video transcoding is one of those algorithms that should scale quite well with processing power since the problem can be split into multiple mostly independent streams and stitched back together. During the Handbrake/x264 part of Dave's benchmarks, Handbrake appeared to be doing a fairly good job at keeping all 24 threads busy.

With 24 threads in a dual-CPU NUMA configuration and only half the memory channels available on each socket, many things can go wrong unless the software has NUMA-aware optimizations, such as setting affinity for threads to specific CPUs, making sure all the RAM they need comes from that CPU as much as possible, duplicating shared tables between CPUs, etc. to avoid snooping and data transfers over the QPI bus.

There are already tons of multi-threading performance gotchas under normal single-socket circumstances and I imagine many more than the few above come in when you add multi-socket NUMA in the equation.
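To make the "setting affinity" part concrete, here is a rough Linux-only sketch. It assumes the kernel exposes NUMA nodes under `/sys/devices/system/node` and uses Python's `os.sched_setaffinity`; the helper names are invented for illustration. Pinning only restricts where the process runs; that its memory then comes from the same node relies on the OS's usual local-allocation behaviour, which is an assumption here, not something the sketch enforces.

```python
import glob
import os


def parse_cpulist(text: str) -> set[int]:
    """Parse a sysfs CPU list such as '0-5,12-17' into a set of CPU numbers."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty list, e.g. a node without CPUs
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def cpus_by_node(sysfs_root: str = "/sys/devices/system/node") -> dict[int, set[int]]:
    """Map NUMA node id -> logical CPUs on that node (empty dict if not available)."""
    nodes: dict[int, set[int]] = {}
    for path in sorted(glob.glob(os.path.join(sysfs_root, "node[0-9]*", "cpulist"))):
        node_dir = os.path.basename(os.path.dirname(path))  # e.g. 'node0'
        with open(path) as f:
            nodes[int(node_dir[len("node"):])] = parse_cpulist(f.read())
    return nodes


def pin_to_node(node_id: int, nodes: dict[int, set[int]]) -> None:
    """Restrict the calling process (pid 0 = self) to the CPUs of one node."""
    os.sched_setaffinity(0, nodes[node_id])


if __name__ == "__main__":
    # Only report the topology here; a worker process would call pin_to_node().
    for node_id, cpus in cpus_by_node().items():
        print(f"node {node_id}: CPUs {sorted(cpus)}")
```

A worker started for each socket would call `pin_to_node()` once at startup, before allocating its buffers.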

coppice · « **Reply #55 on:** March 23, 2015, 06:14:32 pm »

EATX would be an odd size for a modern server board. Most follow the SSI forum outlines.

Razor512 · « **Reply #56 on:** March 23, 2015, 06:26:29 pm »

The issues you are having with hyperthreading are due to how the OS manages the threads. Normally for proper use of hyperthreading, the OS will make sure that all physical cores are in use before a logical core is used, and the most demanding threads should never share a physical core unless there is no better alternative.
If the OS does not fully support this, then you end up with cases where 1 physical core is stuck having to handle 2 heavy threads while there are other available physical cores being underutilized.

A good example of this (while not intel), is the crappy AMD FX where they went with the core module crap. The core modules work better than hyperthreading as most of the processing hardware is duplicated, thus fewer components are shared. (hyperthreading shares all of the processing hardware of the core, but allows it to load up 2 threads thus all of the cores processing components can be better utilized)

http://www.extremetech.com/computing/138394-amds-fx-8350-analyzed-does-piledriver-deliver-where-bulldozer-fell-short/2

As you can see, if 2 threads fall on the same core module (thus some processing components are being shared), then you get an overall drop in performance. If 1 core in each core module is disabled, then the overall performance of a 4-threaded workload increases. If all 8 cores are turned on, and the OS is left to decide which cores are used, then you get a non-optimal use of the 4 threads, and some end up sharing a core module, and overall performance suffers.

The performance impact of 2 threads falling on the same core in a hyperthreaded environment is far higher since no processing components are duplicated, so the worst case scenario is an application not being able to fully use all of the physical and logical cores, and then mistakenly assigning 2 important threads to the same physical core.
Lucky for Intel, they did not try to hide the fact that the logical cores were not true cores, thus it is easy to identify which is a physical core and which is a logical core (provided they release updated drivers to properly inform the OS). On the AMD side, they did not do the same; instead they count all of them as full cores, thus there are many cases where the FX chips perform far slower than the Phenom II (older gen).

The Core i7 3770k is 4 physical cores, and 4 logical cores (which share all of the resources of the physical core, thus they only offer a benefit if there are parts of the core not being fully used)

On a side note, if you use Adobe Premiere Pro, then 32GB will be the minimum that you need to fully use each core, since it allocates memory for each core in order to make sure that you do not run into memory limits. (64GB is the bare minimum for 4K video)
Adobe premiere pro is designed around supporting a large number of cores, and is regularly used in dual and quad CPU environments.

Razor512 · « **Reply #57 on:** March 23, 2015, 06:36:07 pm »

Quote from: dexters_lab on March 23, 2015, 04:02:29 pm

Quote from: v81 on March 23, 2015, 03:46:39 pm

RE: Hyperthreading, it has no significant benefit in this scenario.
You will likely find it has no effect on the i7 if you turn it off there too.

really? on my i7 vegas is 30% slower rendering with HT turned off.

Hyperthreading on the 3rd and 4th gen Intel CPUs offers a 20-30% performance boost on certain multithreaded workloads by allowing the physical core to be more fully utilized. There are some workloads where it will lower performance (e.g., in some games such as Far Cry 4 or Second Life).

Hyperthreading offers its full benefit pretty much only in video editing and software-based 3D modeling (with lesser improvements in other workloads), provided the application is optimized for a hyperthreaded environment (e.g., Adobe Premiere Pro will get a very good performance boost from it).

eV1Te · « **Reply #58 on:** March 23, 2015, 06:40:41 pm »

Does anyone have real-life experience with other codecs for mastering and intermediate storage? I have used Avid's DNxHD previously but I never benchmarked it and I never checked if it had good multithreaded performance.

The main conclusion from these tests is that Handbrake can utilize the 12 cores much better than Vegas' built-in codecs. Hence the best thing is to move away from Vegas' own codecs and use one that can handle 12 cores.

FYI: this motherboard has a 128 Gb/s QPI interconnect for the memory between the 2 cpu's, more than enough for this task.

Grapsus · « **Reply #59 on:** March 23, 2015, 07:11:31 pm »

Quote from: eV1Te on March 23, 2015, 06:40:41 pm

FYI: this motherboard has a 128 Gb/s QPI interconnect for the memory between the 2 cpu's, more than enough for this task.

Yeah, sure, all subtle multi-processing issues boil down to bandwidth... Like when one CPU screws the other CPU local cache (1ns latency) so it has to perform a new RAM access (200 ns).

rodcastler · « **Reply #60 on:** March 23, 2015, 07:16:52 pm »

And Dave keeps experimenting with our brains by adding the scope-flipping guy into these videos just in case anyone notices.... Well played Dave!

FHR · « **Reply #61 on:** March 23, 2015, 07:17:44 pm »

Hey Dave,

So to begin with: Hyperthreading IS SUPPORTED in Linux and Windows XP+! It would be stupid if Win 7 wouldn't support HT.
The second thing I noticed and cringed at when you built the PC: the memory configuration. You are mixing RDIMMs and UDIMMs. That just won't work (very well). The RDIMMs (or Registered ECC) are basically "the ones with many chips rotated in weird directions on them".
According to SuperMicro document: "* Mixing of Registered and Unbuffered DIMMs is not allowed. * Mixing of ECC and non-ECC is not allowed."

I think your main problem lies in software. The thing you are using for your video editing obviously wasn't designed to run on 12 threads, let alone 24 threads. That's why when you turned HT off it ran slightly faster. The handbrake was blazing fast, because it was designed to use multiple threads.

Sources:
Windows 7 hyperthreading http://www.informationweek.com/software/operating-systems/windows-7-boosts-hyper-threading-support/d/d-id/1079573?
Mixing UDIMMs, RDIMMs http://www.supermicro.com/support/resources/memory/X9_DP_memory_config.pdf

eV1Te · « **Reply #62 on:** March 23, 2015, 08:26:22 pm »

Quote from: Grapsus on March 23, 2015, 07:11:31 pm

Quote from: eV1Te on March 23, 2015, 06:40:41 pm
FYI: this motherboard has a 128 Gb/s QPI interconnect for the memory between the 2 cpu's, more than enough for this task.

Yeah, sure, all subtle multi-processing issues boil down to bandwidth... Like when one CPU screws the other CPU local cache (1ns latency) so it has to perform a new RAM access (200 ns).

Sure you are correct, there is much more to how bottlenecking occurs.

What I tried to say was simply that video encoding is not a memory-bandwidth intensive operation (about 300 MB/s if encoded in real-time for 50 fps full HD).

This was confirmed by the good performance seen in Handbrake on the Xeon setup (more than twice as fast as the i7), which would otherwise bottleneck as well.
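For reference, the "about 300 MB/s" figure can be reproduced with a back-of-the-envelope calculation. The sketch below assumes uncompressed 8-bit RGB frames (3 bytes per pixel); that pixel format is an assumption made for the arithmetic, not something stated above, and a real pipeline will read and write each frame more than once.

```python
def raw_video_rate(width: int, height: int, bytes_per_pixel: int, fps: int) -> int:
    """Bytes per second needed to stream uncompressed frames once."""
    return width * height * bytes_per_pixel * fps


if __name__ == "__main__":
    # Full HD at 50 fps, assuming 3 bytes per pixel (8-bit RGB).
    rate = raw_video_rate(1920, 1080, 3, 50)
    print(f"raw frame traffic: {rate / 1e6:.0f} MB/s")

    # The quoted QPI figure is 128 Gb/s, i.e. 16 GB/s.
    qpi_bytes_per_s = 128e9 / 8
    print(f"share of the quoted QPI bandwidth: {100 * rate / qpi_bytes_per_s:.1f} %")
```

Even with a generous multiplier for intermediate buffers this stays well below the quoted link bandwidth, which fits the point that latency and cache effects, not raw throughput, are the more likely suspects.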

Fungus · « **Reply #63 on:** March 23, 2015, 09:06:13 pm »

Quote from: 3roomlab on March 23, 2015, 07:21:23 pm

if you use multiple edit sources eg: sample clips, old clips, sources from diff kinds of cameras, etc, you will find that VEGAS is more tolerant of being able to drop almost all kinds of clips into the edit window. PD cannot do that, or maybe i did not explore much when i tried it. but if you do try dropping all these onto the timeline and scrubbing, 25fps, 30fps, 30DF, 60DF, 50fps, etc ... some mobile phones shoot at 30fps + 1 frame (dont ask me why) and it seems vegas have a soft buffer that accomodates all these very well. in PD, once you select a frame rate/size, you have a hard time dropping in edit material which do not conform to its edit base frame size, esp weird size pictures

Never seen that, but it wouldn't be a problem for most people - everything is at 1080p.

It does warn you if the frame rate of a clip doesn't match the frame rate of your project (for video quality reasons I assume) but it's a preference that you can turn off.

m100 · « **Reply #64 on:** March 23, 2015, 09:48:45 pm »

Not that I've done much of this before but if it is an acceptable configuration i'd take one of the processors out, then reinstall W7 and the app from scratch and see what the video rendering performance is. It could be quicker.

(added to save Dave the trouble!)

cbmuser · « **Reply #65 on:** March 23, 2015, 09:58:20 pm »

Quote from: Grapsus on March 23, 2015, 05:15:30 pm

the quantity of BS and ignorance I see in this thread is shocking. Before comparing carrots and oranges please take a look at the motherboard manual

And, yet, you are absolutely wrong. This is not a NUMA issue at all but a typical showcase of Hyperthreading having a negative impact on compute-intensive software. This is neither new nor unexpected unless you never worked in an HPC environment. The effect is observed quite often and even before Dave mentioned it in the video, my first thought was to disable Hyperthreading.

Hyperthreading can sometimes have negative impact in cases where all cores are already saturated by parallel tasks because it interferes with the scheduler of the operating system. We have seen similar problems on the SGI UV-1000 (512 cores, 1024 threads, 2 TiB RAM with external NUMAlink5): A process that was taking around 500 cores would run much slower when Hyperthreading was enabled since the scheduler of the kernel created a much higher overhead which was visible in a high system load. Simply disabling Hyperthreading on-the-fly through sysfs (echo 0 > /sys/devices/system/node/node*/cpu${cpu}/online) immediately resolved the issue and all cores were 100% user load. And the code that was used actually used numactl to specifically bind CPUs and memory to avoid NUMA performance impacts when accessing memory on a remote node.

Adrian
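As an illustration of what "disabling Hyperthreading on-the-fly" targets, here is a read-only Linux sketch that works out which logical CPUs are the second thread of a core. It assumes the kernel exposes `/sys/devices/system/cpu/cpuN/topology/thread_siblings_list`; the function names are made up for this example. It does not write to any `online` file, it only lists the candidates.

```python
import glob
import os


def parse_cpulist(text: str) -> set[int]:
    """Parse a sysfs CPU list such as '0,12' or '0-1' into a set of CPU numbers."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def secondary_siblings(sibling_lists: dict[int, str]) -> set[int]:
    """Return every logical CPU except the lowest-numbered thread of each core."""
    secondary: set[int] = set()
    for text in sibling_lists.values():
        siblings = sorted(parse_cpulist(text))
        secondary.update(siblings[1:])  # keep the first thread, flag the rest
    return secondary


def read_sibling_lists(root: str = "/sys/devices/system/cpu") -> dict[int, str]:
    """Read thread_siblings_list for every CPU that exposes topology info."""
    result: dict[int, str] = {}
    pattern = os.path.join(root, "cpu[0-9]*", "topology", "thread_siblings_list")
    for path in glob.glob(pattern):
        cpu_dir = path.split(os.sep)[-3]  # e.g. 'cpu12'
        with open(path) as f:
            result[int(cpu_dir[len("cpu"):])] = f.read()
    return result


if __name__ == "__main__":
    extra = sorted(secondary_siblings(read_sibling_lists()))
    print("logical CPUs that are second threads of a core:", extra)
```

On a machine with Hyperthreading off (or on a non-Linux system) the list simply comes back empty.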

FHR · « **Reply #66 on:** March 23, 2015, 10:04:46 pm »

Video rendering is not HPC. HT is useful for video rendering. It's not like he's running Hadoop on his machine. But I agree that this has nothing to do with NUMA. It's badly designed software.

I have dual L5520 machine at home (HT ON) and I don't have any problems with "normal" software (Adobe AE, FFMpeg, Chrome, Mass Effect, Maya) running swiftly and without hiccups. I run linux, though AE refused to work so I have Win 7 on second SSD.

warp_foo · « **Reply #67 on:** March 23, 2015, 10:56:46 pm »

Quote from: DanielS on March 23, 2015, 04:28:17 pm

Another thing to keep in mind: LGA2011 CPUs have a quad-channel memory controller so unless you put four similar DIMMs on each CPU, you are only enabling half of the RAM bandwidth each CPU is capable of. This could be a massive bottleneck when all 24 threads are enabled.

I was going to post a similar comment. While I agree with Dave that you may not *need* a huge amount of RAM, filling in all of the RAM channels (and equivalent quantity of RAM per CPU...) will be a win. 32GB using 8x 4GB DIMMs across both CPUs is probably reasonable. I doubt going from 1333 to 1600 will actually make a noticeable difference, but quad channel access will.

m

Grapsus · « **Reply #68 on:** March 23, 2015, 11:24:32 pm »

Quote from: cbmuser on March 23, 2015, 09:58:20 pm

Quote from: Grapsus on March 23, 2015, 05:15:30 pm
the quantity of BS and ignorance I see in this thread is shocking. Before comparing carrots and oranges please take a look at the motherboard manual

And, yet, you are absolutely wrong. This is not a NUMA issue at all but a typical showcase of Hyperthreading having a negative impact on compute-intensive software. This is neither new nor unexpected unless you never worked in an HPC environment. The effect is observed quite often and even before Dave mentioned it in the video, my first thought was to disable Hyperthreading.

Hyperthreading can sometimes have negative impact in cases where all cores are already saturated by parallel tasks because it interferes with the scheduler of the operating system. We have seen similar problems on the SGI UV-1000 (512 cores, 1024 threads, 2 TiB RAM with external NUMAlink5): A process that was taking around 500 cores would run much slower when Hyperthreading was enabled since the scheduler of the kernel created a much higher overhead which was visible in a high system load. Simply disabling Hyperthreading on-the-fly through sysfs (echo 0 > /sys/devices/system/node/node*/cpu${cpu}/online) immediately resolved the issue and all cores were 100% user load. And the code that was used actually used numactl to specifically bind CPUs and memory to avoid NUMA performance impacts when accessing memory on a remote node.

Adrian

First, I never said that HT wasn't a problem. I demonstrated why it was foolish to compare the number of cores and the amount of RAM between a single CPU and a NUMA architecture, and also that it was equally foolish to state that some software sucks because it can't fully take advantage of such a machine all by itself.

Second, where do you actually provide facts to prove that the problem is HT only and not NUMA or HT+NUMA ? Bro, that's so cool you worked in HPC environment where HT was a problem on a totally different workload, good for you.

FHR · « **Reply #69 on:** March 23, 2015, 11:33:34 pm »

And yet the software "sucks". It can't use that many threads. It wasn't designed to run on that many threads (the "640kB of RAM would be enough for everyone" approach?). Even if the software was NUMA-optimized, there is no guarantee it will provide better performance.

elgonzo · « **Reply #70 on:** March 24, 2015, 02:07:29 am »

Quote from: warp_foo on March 23, 2015, 10:56:46 pm

Quote from: DanielS on March 23, 2015, 04:28:17 pm
Another thing to keep in mind: LGA2011 CPUs have a quad-channel memory controller so unless you put four similar DIMMs on each CPU, you are only enabling half of the RAM bandwidth each CPU is capable of. This could be a massive bottleneck when all 24 threads are enabled.

I was going to post a similar comment. While I agree with Dave that you may not *need* a huge amount of RAM, filling in all of the RAM channels (and equivalent quantity of RAM per CPU...) will be a win. 32GB using 8x 4GB DIMMs across both CPUs is probably reasonable. I doubt going from 1333 to 1600 will actually make a noticeable difference, but quad channel access will.

m

While i would not speak about a massive bottleneck, allowing the CPU to utilize all 4 memory channels (4 or 8 DIMMs per CPU) will provide some performance gain (around the 10% mark, i speculate), which is going to be a better performance improvement than swapping the two existing DIMMs per CPU with two faster DIMMs.

By the way, Dave has 3 DIMMs of each kind. He could actually use all of them (3 DIMMs per CPU), which would allow the CPUs to interleave across 3 memory channels. While not as fast as 4-channel interleave, it would still be a little faster than just 2-channel interleave. So, why did Dave not use all the DIMMs he already has?
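To put rough numbers on 2 vs 3 vs 4 channels, here is a sketch of the theoretical peak bandwidth per CPU. It assumes DDR3-1333 (1333 MT/s) and a 64-bit (8-byte) data bus per channel; these are upper bounds only, and the real-world gain for a given workload will be much smaller than the ratio suggests.

```python
def peak_bandwidth_gbs(mega_transfers_per_s: float, channels: int, bus_bytes: int = 8) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return mega_transfers_per_s * 1e6 * bus_bytes * channels / 1e9


if __name__ == "__main__":
    # DDR3-1333 with 2, 3 and 4 populated channels per CPU.
    for channels in (2, 3, 4):
        print(f"{channels} channels: {peak_bandwidth_gbs(1333, channels):.1f} GB/s peak")

    # For comparison: two channels of faster DDR3-1600.
    print(f"2 channels @1600: {peak_bandwidth_gbs(1600, 2):.1f} GB/s peak")
```

On paper, adding a third channel at 1333 gives more headroom than swapping two DIMMs for 1600 parts, which is the comparison being made above.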

warp_foo · « **Reply #71 on:** March 24, 2015, 02:26:29 am »

Quote from: elgonzo on March 24, 2015, 02:07:29 am

Quote from: warp_foo on March 23, 2015, 10:56:46 pm
Quote from: DanielS on March 23, 2015, 04:28:17 pm
Another thing to keep in mind: LGA2011 CPUs have a quad-channel memory controller so unless you put four similar DIMMs on each CPU, you are only enabling half of the RAM bandwidth each CPU is capable of. This could be a massive bottleneck when all 24 threads are enabled.

I was going to post a similar comment. While I agree with Dave that you may not *need* a huge amount of RAM, filling in all of the RAM channels (and equivalent quantity of RAM per CPU...) will be a win. 32GB using 8x 4GB DIMMs across both CPUs is probably reasonable. I doubt going from 1333 to 1600 will actually make a noticeable difference, but quad channel access will.

m

While i would not speak about a massive bottleneck, allowing the CPU to utilize all 4 memory channels (4 or 8 DIMMs per CPU) will provide some performance gain (around the 10% mark, i guess), which is more than just keep running with two faster DIMMs.

By the way, Dave has 3 DIMMs of each kind. He could actually use all of them (3 DIMMs per CPU), which would allow the CPUs to interleave across 3 memory channels. While not as fast as 4-channel interleave, it would still be a little faster than just 2-channel interleave. So, why did Dave not use all the DIMMs he already has?

I don't think this is correct, three channel RAM went out with the X58 chipset. The E5 Xeon is quad channel, so populating all of the blue or black DIMM slots would be appropriate. Intel's website does say a 'maximum' of four channels, but all of the data I've dug up suggests the C604 chipset works best with quad channels.

YMMV...

m

Skimask · « **Reply #72 on:** March 24, 2015, 02:43:15 am »

Look at all the anecdotal evidence and sample sizes of one!
Overall, a highly educated crowd

elgonzo · « **Reply #73 on:** March 24, 2015, 02:58:06 am »

Quote from: warp_foo on March 24, 2015, 02:26:29 am

I don't think this is correct, three channel RAM went out with the X58 chipset.

The chipset does not contain the memory controller anymore. The memory controller is integrated in the CPU.
And the Xeon E5-26xx v2 family (Ivy Bridge-based) does indeed support interleaving memory accesses across 3 channels. read here page 16 and following; (The linked PDF is from Fujitsu, but other server/workstation vendors will likely also have similar documentation. Note that for dual CPU configurations, using 3 channels per CPU is also sometimes called "6 channels"; HP uses/d this nomenclature, for example).

(Side note: 3-way interleave is not possible with the 12-core CPU family members. Since those CPUs have two memory controllers, you will always end up with an even number of channels for interleaving...)

Quote

The E5 Xeon is quad channel, so populating all of the blue or black DIMM slots would be appropriate. Intel's website does say a 'maximum' of four channels, but all of the data I've dug up suggests the C604 chipset works best with quad channels.

As said before, the memory controller is inside the CPU. The C602/C604 chipset has nothing to do with memory...
Anyway, of course using all 4 memory channels works best. But that was not my point. The point was that Dave has already three of the 8GB DIMMs and three of the 4 GB DIMMs. So why did he only use two of each type?

~~(There might be a hypothetical chance that the SuperMicro BIOS blocks such configurations - i don't know-, but what would be the purpose of such?)~~ EDIT: As expected, the BIOS of the SuperMicro board does indeed support interleaving across 3 channels. From the board manual, page 4-13: "This feature selects from the different channel interleaving methods. The options are: Auto, 1 Way, 2 Way, 3 Way, and 4 Way."...

plexus · « **Reply #74 on:** March 24, 2015, 03:55:57 am »

It could be your Sony software. One benefit of software aimed at pros is a focus on performance. Not always, granted. But I would suspect the Sony software because Handbrake is >2x as fast. That is a clue that the machine is performing as it should but the Sony software is not taking advantage of it. I am not a coder, but from what I understand the software has to be coded to take advantage of CPU efficiencies and options. For $40 US you can sign up for a month of Adobe CC and download and try the latest version of Premiere. It would be worth it just to troubleshoot.

