Keep an array of 250 entries holding the number of visible entries in each million-entry block of the main array. Call it the block size array, BSA (though preferably you would come up with a better name). So if the main array has 130 million entries, all expanded, the first 130 entries of the BSA will hold the value 1 million and the rest zero.
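In C the bookkeeping might look something like this (a sketch only; the names and the 250-million cap are my assumptions for illustration):

    #include <stdint.h>

    #define MAX_ENTRIES 250000000ULL                 /* assumed cap on the main array */
    #define BLOCK_SIZE  1000000ULL                   /* main entries per BSA slot     */
    #define BSA_SIZE    (MAX_ENTRIES / BLOCK_SIZE)   /* = 250 slots                   */

    static uint32_t bsa[BSA_SIZE];     /* visible entries in each block */
    static uint64_t total_visible;     /* running total, kept in step   */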
When a set of entries is collapsed or expanded, the relevant BSA entry is adjusted; if the change straddles a block boundary, two BSA entries may need adjusting.
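A sketch of the toggle handler, continuing the declarations above and assuming a collapsed set hides the count entries that follow its head entry:

    /* sign = +1 on expand, -1 on collapse; the 'count' entries after
       the head entry at index 'head' are the ones shown or hidden.
       Each affected entry bumps the slot of the block it lies in, so
       a set straddling a block boundary naturally touches two slots. */
    static void bsa_adjust(uint64_t head, unsigned count, int sign)
    {
        for (unsigned k = 1; k <= count; k++) {
            if (sign > 0) bsa[(head + k) / BLOCK_SIZE]++;
            else          bsa[(head + k) / BLOCK_SIZE]--;
        }
        if (sign > 0) total_visible += count;
        else          total_visible -= count;
    }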
Calculating the scrollbar position now takes, in the worst case, 250 iterations through the BSA (summing its entries) plus up to 1 million iterations through the relevant block of the main array. By keeping track of the total number of visible entries you can cap the BSA search at 125 iterations, starting from whichever end of the BSA is nearer and working forwards from the beginning or backwards from the end.
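For instance, to get the visible rank of the entry at the top of the view (which is what the scrollbar position is derived from), with entry_is_visible an assumed helper and the other names as in the sketches above:

    extern int entry_is_visible(uint64_t index);   /* assumed helper */

    /* Visible rank of main-array entry 'pos': sum whole blocks via
       the BSA - forwards, or backwards from total_visible, whichever
       end is nearer (<= 125 slots) - then scan the entry's own block. */
    static uint64_t visible_rank(uint64_t pos)
    {
        uint64_t block = pos / BLOCK_SIZE, rank = 0;

        if (block <= BSA_SIZE / 2) {
            for (uint64_t b = 0; b < block; b++)
                rank += bsa[b];
        } else {
            rank = total_visible;
            for (uint64_t b = block; b < BSA_SIZE; b++)
                rank -= bsa[b];
        }
        for (uint64_t i = block * BLOCK_SIZE; i < pos; i++)
            if (entry_is_visible(i))     /* the up-to-1M-entry scan */
                rank++;
        return rank;
    }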
One million iterations is still quite a lot; coded sensibly it should be fast enough on any non-Neolithic PC, but it might not make for a very smooth scrollbar. Increasing the BSA to 2500 entries cuts the block search range to 100K entries but raises the BSA search to 1250 iterations maximum.
Searching a block of the main array requires iterating through the currently visible entries, so it would be reasonable for every entry to have a size field: an expanded entry has size 1, the head of a collapsed set holds the number of entries in the set, and the hidden members hold 0. So if three associated entries are collapsed into a single visible entry and surrounded by expanded entries, the size fields would read ...1,1,1,3,0,0,1,1,1...
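With that encoding, hopping from one visible entry to the next is just i += size[i]. For example, to find the v-th visible entry of a block (a sketch, assuming the block starts on a visible entry):

    #include <stdint.h>

    /* Main-array offset (within its block) of the v-th visible entry:
       an expanded entry has size 1, the head of a collapsed set
       carries the set size, and the hidden members are skipped over. */
    static uint64_t nth_visible(const uint8_t *size, uint64_t v)
    {
        uint64_t i = 0;                 /* assumes size[0] != 0 */
        while (v--)
            i += size[i];
        return i;
    }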
Extracting these size fields into a separate array would improve cache behaviour. Since a size is always 7 or less, this can be an array of 64-bit words, each packing 8 x 8-bit fields, each field holding a size value of 0 to 7. Since the object of the search is to count the number of entries from the start of the block, speed can be improved by processing the size array in 64-bit chunks: 36 64-bit additions can be performed before any of the summed size fields can overflow its 8-bit lane (36 x 7 = 252, which still fits in 8 bits), followed by 8 steps to add up the 8 partial sums. This reduces the 1 million iterations to 10^6/8 * (36+8)/36 ≈ 153k operations.
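A sketch of that chunked summation (assuming the fields are packed 8-to-a-uint64_t):

    #include <stdint.h>
    #include <stddef.h>

    /* Sum packed 8-bit size fields, 8 per word.  Because every field
       is at most 7, up to 36 words can be added lane-wise before any
       8-bit lane can overflow (36 x 7 = 252 <= 255); the 8 lane
       totals are then folded into the running sum. */
    static uint64_t sum_size_fields(const uint64_t *packed, size_t nwords)
    {
        uint64_t total = 0;

        while (nwords) {
            size_t   run = nwords < 36 ? nwords : 36;
            uint64_t acc = 0;

            nwords -= run;
            while (run--)
                acc += *packed++;            /* 8 lane sums at once  */

            for (int lane = 0; lane < 8; lane++) {
                total += acc & 0xff;         /* fold the 8 partials  */
                acc >>= 8;
            }
        }
        return total;
    }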
Reducing the number of bits per size field saves memory but increases the total number of operations (the little program after this list reproduces the figures):
7-bit fields => 10^6/9 * (18+9)/18 ≈ 167k ops,
6-bit fields => 10^6/10 * (9+10)/9 ≈ 211.1k ops,
5-bit fields => 10^6/12 * (4+12)/4 ≈ 333.3k ops,
4-bit fields => 10^6/16 * (2+16)/2 = 562.5k ops.
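Here fields per word = 64/bits and the safe addition count = (2^bits - 1)/7, since each field is at most 7:

    #include <stdio.h>

    /* Estimated operations to process 10^6 size fields: each group of
       'adds' lane-wise additions is followed by 'fields' folding steps. */
    static double ops(int bits)
    {
        int fields = 64 / bits;              /* fields packed per word  */
        int adds   = ((1 << bits) - 1) / 7;  /* additions before a lane
                                                holding 0..7 overflows  */
        return 1e6 / fields * (adds + fields) / adds;
    }

    int main(void)
    {
        for (int bits = 8; bits >= 4; bits--)
            printf("%d-bit fields: %.1fk ops\n", bits, ops(bits) / 1000.0);
        return 0;
    }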
More complex data structures, B-trees etc., can obviously be used, at the cost of considerably more complexity. I would start with a BSA of 2500 entries, which brings a block search down to 15.3k operations (10^5/8 * (36+8)/36) in exchange for a little more memory.
If that is still too slow, a BSA of 250,000 entries would probably still be acceptable memory-wise compared to the size of the main array; a block search is then reduced to approximately 153 operations maximum, but searching the BSA itself grows to 125,000 iterations. This can be addressed quite simply with multi-level BSA arrays: e.g. a top-level, 1000-entry array holding the total number of visible entries in each 1000th of the main array reduces the search to at most 500 iterations of the top level (again working from the nearer end) plus 250 iterations of the relevant part of the BSA, with not much extra complexity. 1000 probably wouldn't be the optimum size - it's just an illustration.
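A sketch of the two-level lookup with those illustrative sizes (finding the block that contains the v-th visible entry, assuming v is less than the total visible count):

    #include <stdint.h>

    #define N_TOP   1000                  /* illustrative sizes only    */
    #define N_BSA   250000
    #define PER_TOP (N_BSA / N_TOP)       /* 250 BSA slots per top slot */

    extern uint32_t top_level[N_TOP];     /* visible per 1000th of array  */
    extern uint32_t second_level[N_BSA];  /* visible per 1000-entry block */

    /* Return the BSA slot holding the v-th visible entry; *rem gets
       v's offset among that block's visible entries.  (For brevity
       this scans the top level forwards only; working from the nearer
       end halves its worst case, as before.) */
    static unsigned long find_block(uint64_t v, uint64_t *rem)
    {
        unsigned long t = 0, b;

        while (v >= top_level[t])         /* <= 1000 iterations */
            v -= top_level[t++];
        b = t * PER_TOP;
        while (v >= second_level[b])      /* <= 250 iterations  */
            v -= second_level[b++];
        *rem = v;
        return b;
    }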
I hope the above isn't too confusing but I did try to keep the algorithm as simple as possible.