I'll cut to the chase with a question: which operations, aside from the intensity grading operation, require processing of all the data in the buffer? Even the FFT doesn't require that.
An FFT on the entire memory depth is extremely useful and desirable,
Perhaps so. But I didn't ask about that. There is a massive difference between desirable (even if extremely so) and
required.
I asked my question the way I did because I want to know which operations of the scope you cannot properly do at all unless they operate on the entire capture buffer. Glitch detection is a good example of that. Min/max processing is another. And triggering is clearly another.
The Agilent/Keysight X series don't do it, and it's a limitation of theirs. As I keep repeating endlessly, take some of the simple display examples and do your own maths on the required computational load. Just the min/max needed to draw the envelope of the deep memory onto a smaller display is computationally expensive enough to prove the point. Moving that sort of data throughput in and out of a CPU is not viable; it requires dedicated hardware resources to achieve the peak rates that current products are doing.
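To put rough numbers on it, here is a back-of-envelope sketch in C++. The sample rate, bit depth, record length, and refresh rate are illustrative assumptions of mine, not the specs of any particular product:

```cpp
// Back-of-envelope only: the rates below are illustrative assumptions,
// not the specs of any particular scope.
#include <cstdio>

int main() {
    const double sample_rate_sps  = 1e9;    // assumed 1 GS/s ADC
    const double bytes_per_sample = 1.0;    // assumed 8-bit samples
    const double record_points    = 100e6;  // assumed 100 Mpt deep record
    const double display_updates  = 30.0;   // assumed 30 Hz waveform redraw

    // Sustained write bandwidth into acquisition memory while capturing.
    const double write_bw = sample_rate_sps * bytes_per_sample;
    // Read bandwidth needed to re-scan the whole record on every redraw,
    // before any min/max or histogram arithmetic is even performed.
    const double read_bw = record_points * bytes_per_sample * display_updates;

    std::printf("acquisition writes:      %.1f GB/s\n", write_bw / 1e9);
    std::printf("full-depth redraw reads: %.1f GB/s\n", read_bw / 1e9);
    return 0;
}
```

Even with those modest assumptions you are moving several gigabytes per second through memory just to look at every sample once per redraw, which is exactly why the plotting path ends up in dedicated hardware.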
Which current products? And what kind of CPU? You'll get no argument from me that as you increase the sample rate, the demands will eventually exceed what even the fastest CPU is capable of. And you'll get no argument from me that an FPGA or ASIC clocked at the same rate as the CPU would be is going to be a faster solution if done right.
The measurement signal enters the oscilloscope at the channel input and is conditioned by attenuators or amplifiers in the vertical system. The analog-to-digital converter (ADC) samples the signal at regular time intervals and converts the respective signal amplitudes into discrete digital values called “sample points”. The acquisition block performs processing functions such as filtering and sample decimation. The output data are stored in the acquisition memory as “waveform samples”.
Why decimate the data before storing it to the acquisition memory instead of after?
That said, the block diagram they supply for the RTO architecture (page 13) is very much like what I'm envisioning here, but the memory implementation I have in mind would be double buffered so as to ensure that display processor reads from acquisition memory never collide with the writes coming from the acquisition engine.
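A minimal sketch of the ping-pong arrangement I have in mind (a hypothetical C++ model; in a real scope these banks would be dedicated acquisition RAM managed by the capture hardware, and guarding against a reader still being inside the retired bank needs more care than shown here):

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Ping-pong acquisition memory: the acquisition engine always writes into
// one bank while the display processor only ever reads the other, so
// reads and writes never collide on the same bank.
struct DoubleBufferedAcqMem {
    std::array<std::vector<uint8_t>, 2> bank;
    std::atomic<int> display_bank{0};          // bank the display may read

    explicit DoubleBufferedAcqMem(std::size_t depth) {
        bank[0].resize(depth);
        bank[1].resize(depth);
    }

    // Acquisition side: call once a capture has completely filled the
    // write bank. Publishes that bank to the display and returns the
    // other bank as the next write target.
    std::vector<uint8_t>& publish_and_swap() {
        const int just_filled = 1 - display_bank.load(std::memory_order_acquire);
        display_bank.store(just_filled, std::memory_order_release);
        return bank[1 - just_filled];
    }

    // Display side: read-only view of the most recently published capture.
    const std::vector<uint8_t>& latest_capture() const {
        return bank[display_bank.load(std::memory_order_acquire)];
    }
};
```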
It's almost like you've never used a scope....
You can decimate before writing to the acquisition memory for modes such as min/max or hi-res, where the ADC samples at a higher rate than you're writing to the sample memory (because memory is limited and you want a longer record).
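A minimal sketch of those two decimation flavours, written as a plain C++ model of what the acquisition hardware would do on the raw 8-bit ADC stream (the function names and types are just for illustration):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hi-res decimation: sum each group of n raw samples into one stored value.
// Keeping the sum (a fixed-point average) retains roughly log2(n) extra bits
// of vertical resolution; n up to 256 keeps the sum within int16_t.
std::vector<int16_t> hires_decimate(const std::vector<int8_t>& adc, int n) {
    std::vector<int16_t> out;
    out.reserve(adc.size() / n);
    for (std::size_t i = 0; i + n <= adc.size(); i += n) {
        int32_t sum = 0;
        for (int k = 0; k < n; ++k) sum += adc[i + k];
        out.push_back(static_cast<int16_t>(sum));
    }
    return out;
}

// Peak-detect (min/max) decimation: keep the extremes of each group so that
// narrow glitches survive the reduction in stored sample rate.
std::vector<std::pair<int8_t, int8_t>>
peak_decimate(const std::vector<int8_t>& adc, int n) {
    std::vector<std::pair<int8_t, int8_t>> out;
    out.reserve(adc.size() / n);
    for (std::size_t i = 0; i + n <= adc.size(); i += n) {
        auto [lo, hi] = std::minmax_element(adc.begin() + i, adc.begin() + i + n);
        out.emplace_back(*lo, *hi);
    }
    return out;
}
```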
Well, okay, sure, but that's not normal acquisition, is it? I was interpreting the paper as talking about the baseline hardware architecture, not just some optional path that would be useful in certain cases.
Losing information for display purposes is unavoidable here. What matters is what the decimation process did to transform the acquired data into something for the display, and that is very operation-specific.
YES, and some simplifications are appropriate for some applications. But 2D histograms are the benchmark, and almost all scopes today provide them.
Right. And frankly, it seems to me that the intensity grading processing would be ideal for an FPGA, because it's something the scope would be doing essentially all of the time and it's something that (at least on the surface) seems simple enough to implement.
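Something along these lines is what I have in mind -- a naive software model of the per-pixel hit counting (the grid size and the 16-bit saturating counters are arbitrary choices of mine), which an FPGA pipeline would perform one sample per clock:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Naive model of intensity grading: every acquired sample increments a hit
// counter at its (column, row) cell; the display later maps hit counts to
// brightness or colour. An FPGA would do the same increment per sample at
// the acquisition rate.
struct IntensityGrid {
    int cols, rows;
    std::vector<uint16_t> hits;   // column-major hit counts

    IntensityGrid(int c, int r) : cols(c), rows(r), hits(std::size_t(c) * r, 0) {}

    // One acquisition record of 8-bit samples, stretched so that the whole
    // record spans the full width of the grid.
    void accumulate(const std::vector<uint8_t>& samples) {
        for (std::size_t i = 0; i < samples.size(); ++i) {
            const int col = int(i * std::size_t(cols) / samples.size());
            const int row = int(samples[i]) * rows / 256;
            uint16_t& h = hits[std::size_t(col) * rows + row];
            if (h != UINT16_MAX) ++h;   // saturate rather than wrap
        }
    }
};
```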
You don't know in advance which of the samples in the waveform have the interesting information in them, so the scope needs to parse (process) ALL of them. If you knew where the interesting information was, then you'd have captured just that part alone.
You're conflating captured data with processed data. The triggering mechanism is what defines the points of interest, and that absolutely has to keep up with the sampling rate. I've never argued otherwise for that. The rest is a matter of how the UI interacts with the rest of the system.
No, that's the LeCroy sleight of hand, where they have a banner spec for dead time that only applies to segmented captures, where the data is not going to the screen at that throughput consistently (a tiny fast burst). If the trigger defines what you're interested in, then what are you doing with the deep memory?
Capturing the rest of the data so that it can be looked at and even analyzed after the trigger fires, of course. The trigger defines what you're primarily interested in, not the
only thing you're interested in. In some situations, the trigger might actually not be what you're primarily interested in at all, but instead some sort of indicator that what you're interested in is nearby.
Look, ultimately the point of all this processing is to put something on the display that's useful to the operator, right? But the nature of what you're displaying is
fleeting, there for only an instant of time and then it's gone, to be replaced by the next refresh. It's why we have persistence settings at all -- we otherwise would miss events of interest.
The events that are truly of interest are ones you'll not only want to be able to see on the screen; they're events that you'll somehow want to (if only indirectly) trigger the scope with. Glitch detection and display is one of those cases where you have no choice but to process all of the relevant points, but what else demands that? Even the FFT doesn't, as long as your FFT implementation preserves the (frequency-domain) statistical quality of the capture.
You could wind it down to an acquisition depth the same as the screen size which is a trivial display case.
Sure, you could, but if you did that without retaining the original data, you wouldn't be able to stop the scope (most especially automatically, as a result of a trigger condition, a mask detection condition, etc.) and then zoom out to see the bigger picture, pan to see additional parts of the waveform, decode trace data that isn't within the time window of the display, or do a myriad of other things that wouldn't be practical on a realtime basis but are easily accomplished once the scope is stopped.
You could even perform an FFT on the entire buffer at that point.
Preserving data gives you options later that you otherwise wouldn't have. More options are better than fewer, all other things being equal. Since you can always use a subset of the captured data for any operation (even intensity-graded display, if it came down to it) in order to preserve operational speed, I see no downside whatsoever to providing large buffers.
Put another way, if you have a choice between two scopes that have equal processing capability, but one has more memory than the other, why in the
world would you choose the scope that has less memory???
A: you have long memory and capture something interesting to look at later/slower/in depth
B: you want to capture as much information as possible and display it on the screen
Again, what exactly does (B) even mean here? Your display is limited in size. You
have to reduce the data in some way just to show anything
at all.
A is easy: the interface is slow because drawing the XXMpts of memory to the screen takes a long time. Seriously, sit down and consider how long a computer would take to read a 100,000,000-point record and just apply the trivial min/max envelope to it for display at 1000 px across.
That's an interesting idea, and proved to be an interesting exercise.
It took 6 milliseconds for the min/max computation on that number of points (in other words, just to determine whether each point was within the min/max envelope), with what amounts to a naive approach to the problem (no fancy coding tricks or anything). But once you add in the output buffer code, it rises to about 135 milliseconds. This is on a single core of a 2.6GHz Intel i5. On the one hand, this was just on a single core, and the nature of the operation is such that I'm writing to the same RAM as I'm reading from (which may or may not matter). But on the other hand, the CPU I tested this on, while not the most modern (2014 vintage), is certainly going to be relatively expensive for embedded use.
So operations like this definitely call for an FPGA or execution on multiple CPU cores. They obviously also call for highly optimized approaches to the problem, which my code most certainly wasn't.
For those 100M points to span a capture window of 1/30th of a second or less, the sample rate has to be at least 3 GS/s. So that's reasonable, but already outside the sample rate of entry-level scopes. Nevertheless, the above is instructive in showing the limits of a naive approach to the min/max processing problem on relatively modern off-the-shelf hardware.
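For reference, a naive version of that exercise looks roughly like the following. This is a reconstruction for illustration -- not the exact program that produced the numbers above -- and it times only the min/max reduction, not the output-buffer stage:

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Naive min/max envelope: reduce a deep record to one (min, max) pair per
// display column. Record length and column count match the exercise above.
int main() {
    const std::size_t record_len = 100'000'000;  // 100 Mpt record
    const std::size_t columns    = 1000;         // display width in pixels
    const std::size_t per_col    = record_len / columns;

    std::vector<uint8_t> record(record_len);
    for (std::size_t i = 0; i < record_len; ++i)
        record[i] = uint8_t((i * 2654435761u) >> 24);  // arbitrary test pattern

    std::vector<uint8_t> col_min(columns), col_max(columns);

    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t c = 0; c < columns; ++c) {
        uint8_t lo = 255, hi = 0;
        const uint8_t* p = record.data() + c * per_col;
        for (std::size_t k = 0; k < per_col; ++k) {
            lo = std::min(lo, p[k]);
            hi = std::max(hi, p[k]);
        }
        col_min[c] = lo;
        col_max[c] = hi;
    }
    const auto t1 = std::chrono::steady_clock::now();

    const auto ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(t1 - t0).count();
    std::printf("min/max of %zu points into %zu columns: %lld ms\n",
                record_len, columns, (long long)ms);
    return 0;
}
```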
B is where the magic hardware comes in to do all that work in hardware, but it can also run on stopped traces to speed up case A. Thus it's a better scope for all uses, because it has hardware-accelerated plotting.
No doubt. It's more efficient at a lot of things, but again, the processing is the bottleneck.
It looks to me like we're saying roughly the same thing (now especially, after the results of the test you had me perform), but for some reason you seem to believe that if you have more memory, then you
have to scale the processing to match. I disagree with that position, if indeed that is your position in the first place. Keep in mind that the OP's question was about why we don't see scopes with big piles of modern off-the-shelf memory.