Author Topic: Dictionary Coding for Embedded  (Read 2253 times)


Offline T3sl4co1l (Topic starter)

  • Super Contributor
  • ***
  • Posts: 22436
  • Country: us
  • Expert, Analog Electronics, PCB Layout, EMC
    • Seven Transistor Labs
Dictionary Coding for Embedded
« on: December 21, 2020, 11:13:33 am »
Okay, think I've tried it enough ways and got it working enough to discuss.

Source: https://github.com/T3sl4co1l/fontenc
Live: https://htmlpreview.github.io/?https://github.com/T3sl4co1l/fontenc/blob/master/font.html

Usage: recommend downloading one of the test images,
https://raw.githubusercontent.com/T3sl4co1l/fontenc/master/FontLucidia.png
https://raw.githubusercontent.com/T3sl4co1l/fontenc/master/Lucidia_16px.png
and loading that.  The formatting is described below.  (The rest of the notes and stuff at the bottom of the page still isn't updated; never mind that.)

Most encodings produce results quickly; the odd one out is the dictionary encoder, which is still quite slow (for a couple of reasons) and is the subject of this thread.

So, on the page, it's not very interesting: it cranks out an image which is hopefully a copy of the original, and dumps some text showing the output used to create it (in plaintext or C format; binary is unimplemented).

The dictionary codec is implemented starting on line 1850.


So, with introductions out of the way --

Motivation:
- Store compressed graphics data in Flash
- Using minimal RAM buffering
- Supports random access of many small blocks (characters of a font)
- The data tend to be sparse (low entropy) and use relatively few colors (my main interest is 16 colors, grayscale).

I expect this requires a fixed overhead: the character widths, height(s), and where to find them (data lengths or offsets).  These data could be compressed again of course, but subject to the same restrictions -- that seems unlikely to pay off, so I'm not worrying about it for now.

So, Huffman and RLE perform well (try it!).  The decoders are... somewhat complex, but nothing too bad on most platforms.  The output can simply be streamed to whatever destination, no buffering needed.
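For reference, a streaming RLE decoder really is tiny -- something like this sketch, which assumes a hypothetical (count, symbol) byte-pair format rather than whatever the page actually emits:

```javascript
// Minimal streaming RLE decoder.  The (count, symbol) byte-pair layout is
// an assumption for illustration, not necessarily the tool's actual format.
function rleDecode(data, emit) {
  for (var i = 0; i + 1 < data.length; i += 2) {
    var count = data[i], sym = data[i + 1];
    for (var j = 0; j < count; j++) emit(sym); // stream straight to the destination
  }
}
```

The point being: no output buffer at all, just a callback to whatever sink you have.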

I wanted to test a dictionary coding as well.

Stock LZ77 codes require a RAM buffer, and perform terribly on short data.  Or if the whole data set could be compressed, that would be great, but how are you supposed to access that without just expanding the whole damn thing into a buffer?

It would be good if the blocks shared a common dictionary.  That makes access harder -- keeping the RLE (back-reference) functionality would mean specifying which memory space it's reading from, or something.  The blocks are short; maybe not many runs would be used anyway.  It's probably acceptable to just drop that feature and stick with standalone dictionary words -- decoding is then a handful of lookups and can stream directly from the dictionary buffer to the output, no buffering required; easy enough.

Indeed, this decoder only needs a couple dozen lines of JS (lines 2077-2099), and probably not much more in C (I haven't implemented it there yet; low priority).
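The decode loop amounts to something like this sketch (my own illustrative version, not the actual lines from the page; it assumes each block is an array of word indices into dict, with the trivial one-symbol words occupying the low indices):

```javascript
// Word-lookup decoder sketch: each code indexes the dictionary, whose
// entries are arrays of symbols; symbols stream straight out, no buffer.
function dictDecode(dict, block, emit) {
  for (var i = 0; i < block.length; i++) {
    var word = dict[block[i]];                            // one table lookup per code
    for (var j = 0; j < word.length; j++) emit(word[j]);  // stream from dictionary storage
  }
}
```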

That leaves the encoder.  Which, I suspect, is an NP-complete problem?  It can be restated as a partition problem subject to constraints (minimizing a cost function), and it's not obvious at any step of the way whether a move will be profitable overall (shrinking or growing one word might give better use in one place, but cause a cascade of poor word choices elsewhere).

So, it could take a very long time indeed to encode something perfectly.  But anyway, we have algorithms to deal with that, don't let perfect be the enemy of good and all that.

What I've written is a special case taking advantage of the many blocks provided, just sorting across those and finding likely candidates.  (It doesn't compress single blocks at all -- easily enough solved by cutting the input into smaller blocks (up to some maximum length) and pasting them back together later.)  It's a greedy algorithm, picking the largest common prefix across all blocks with each pass.  It runs in, erm... something like O(N^2) or O(N^3), I guess; much optimization is possible.

Alas, it seems to do quite poorly, beating none of the others (besides packed nibbles and raw).

Commented out, I also tried a suffix array sorting method, which didn't seem to do much different.  Possibly the dictionary matching step was too greedy?  It would be nice to assign words at the suffix array step, but it's not possible to avoid collisions (and sorting them out is the tradeoff problem once again).

I've also tried very little post-optimization.  It would be nice to look at words with low usage and break them into other combinations, perhaps looking at their neighbors and allowing them to grow if possible -- but again, it quickly grows to exponential complexity.  I don't know how to manage that, on top of keeping all the operations consistent (I've had enough trouble getting correct, non-hanging results as it is :palm: ... )

Also, it does seem like there's less entropy in the result than there ought to be -- I can see why LZ77 (LZ77 exactly, or one of the derivatives? I'm not sure) uses Huffman encoding on top of everything.  In particular, I could simply use the 1-2-6 code to compress the dictionary words -- to no benefit on the shortest words, but the longer ones would stand to save some bytes.  Unfortunately the decoder's code space will outweigh that, so...

Tim
Seven Transistor Labs, LLC
Electronic design, from concept to prototype.
Bringing a project to life?  Send me a message!
 
The following users thanked this post: mrflibble, I wanted a rude username

Offline T3sl4co1l (Topic starter)
Re: Dictionary Coding for Embedded
« Reply #1 on: December 22, 2020, 04:31:52 am »
Hah, by complete coincidence, this was just posted on Reddit:
https://github.com/cwida/fsst

According to the video, it's driven by a Markov chain process.  That sounds promising...

Tim
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 4559
  • Country: nz
Re: Dictionary Coding for Embedded
« Reply #2 on: December 22, 2020, 04:43:05 am »
What are you actually trying to minimize?

The size of the font? Just this font, or any font?

The font itself, or the encoded font + the program to expand/access it?
 

Offline T3sl4co1l (Topic starter)
Re: Dictionary Coding for Embedded
« Reply #3 on: December 22, 2020, 05:06:17 am »
Yes, encoded data plus decoder, to the extent that the decoder size is known -- for example, the Huffman decoder was around 100 words in AVR Flash, and as mentioned, I haven't implemented this one yet, so I don't know how much less it might be.  So if the encoded data happen to be larger by about that much, it would be at least a wash.  With the decoder size not known right now, that's fine; the data are much larger (~kword).

Like, arithmetic coding isn't bad, but it'll take a lot more code (including muls or even divs) and especially execution time, so I'm not worrying about it.  It's there more for completeness than for practicality.

Tim
« Last Edit: December 22, 2020, 05:10:25 am by T3sl4co1l »
 

Offline I wanted a rude username

  • Frequent Contributor
  • **
  • Posts: 645
  • Country: au
  • ... but this username is also acceptable.
Re: Dictionary Coding for Embedded
« Reply #4 on: December 22, 2020, 08:57:18 am »
Bit off-topic, but if you only need to encode ASCII you can reassign codepoints 128-254 as an index into a 127-entry substitution dictionary. Nowhere near optimal compression, but trivial to implement the decoder (encoding is still hard though).
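A sketch of how trivial that decoder is (the substitution table contents here are purely illustrative):

```javascript
// Codepoints 0-127 pass through as plain ASCII; 128-254 index a
// 127-entry substitution table of expansion strings.
function subDecode(bytes, table) {
  var out = "";
  for (var i = 0; i < bytes.length; i++) {
    var b = bytes[i];
    out += (b < 128) ? String.fromCharCode(b) : table[b - 128];
  }
  return out;
}
```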
 

Offline T3sl4co1l (Topic starter)
Re: Dictionary Coding for Embedded
« Reply #5 on: December 22, 2020, 12:32:50 pm »
Yup, that's mentioned in the FSST paper; though they're doing full byte-valued strings, so they opt to escape plain bytes.  Which means they get to weight word compression even more heavily, so that even single-byte words are worth considering.

Since my images use few colors, I intend to do exactly that.  Essentially, codes [0, N) (N = 16 for a typical case) are trivial 1-symbol words, while the rest are actual generated words in the dictionary.  I also trimmed off trailing [most-common-symbol]s, in some cases giving an "infinite" compression ratio (i.e., the space character is all white).  My data are quite low entropy (~1.5 bits/symbol), so this is pretty reasonable; for string data it probably wouldn't be as effective (unless, say, you happen to have a lot of space- or null-padded text?).
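On the decoder side, the trailing trim just means padding the output back up to the block's full size with the designated fill symbol -- a sketch, assuming the fill symbol and full block length come from the fixed header data:

```javascript
// Pad a decoded (possibly short) block back out to its nominal size with
// the most-common symbol -- e.g. a space glyph decodes from zero symbols.
function padTrimmed(symbols, totalLen, fillSym) {
  var out = symbols.slice();                        // decoded symbols, possibly short
  while (out.length < totalLen) out.push(fillSym);  // pad with background
  return out;
}
```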

Also also, that's kind of what Unicode does, effectively 7-bit ASCII "compresses" in UTF-8 at 8 bits/character, while basic [non-ASCII] code points take 16 or more bits; which makes UTF-8 rather convenient in the western world, while putting it at a slight disadvantage to UTF-16 in the rest of the world.  USA, USA? :P

Tim
 
The following users thanked this post: I wanted a rude username

Offline bson

  • Supporter
  • ****
  • Posts: 2474
  • Country: us
Re: Dictionary Coding for Embedded
« Reply #6 on: December 22, 2020, 11:35:25 pm »
The main problem with encoding at all is that you can't point a DMA controller at the data and the bus interface to a display controller...  Which makes it hard to burst out characters to the display controller at bus speed.  For more complex encodings, like Huffman, you probably want a font cache in SRAM, as it would be rather expensive to decode each glyph while outputting it.  RLE is hard to beat if you don't want a cache but want some flash savings.  If flash is plentiful, storing the font uncompressed lets you dump it straight from flash with DMA, at least if you have a suitable bus interface and a controller that can write to an arbitrary rectangle (like the Solomon Systech parts, etc.).  Of course, with two buffers you can decode into one while the DMA dumps out the previous one, especially if they're in two different SRAM blocks on separate buses.  Not sure if there's any savings doing that over just decoding directly to the display controller, though.

« Last Edit: December 22, 2020, 11:42:26 pm by bson »
 

Offline T3sl4co1l (Topic starter)
Re: Dictionary Coding for Embedded
« Reply #7 on: December 23, 2020, 12:14:52 am »
Yeah, DMA makes for some interesting opportunities, especially complex ones that can control themselves as a crude state machine (depending on how many channels are available, and how many you wish to commit to the task).

Most LCDs are 16+ bits, which means my 16px font takes up a massive 26.6-40kB just sitting there -- raw and fully expanded.  And that's just one color of it.  (I plan to use palettized grayscale, with a live palette generated based on foreground and background colors -- this takes e.g. 32-48 bytes RAM to store the live palette, far less than caching even a single character.)
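For illustration, the live palette generation might look like this sketch (RGB565 output and linear interpolation between the two colors are my assumptions here):

```javascript
// Build a 16-entry RGB565 palette by linearly interpolating from
// background to foreground, so 4-bit grayscale glyph data can be
// recolored on the fly.  16 entries x 2 bytes = 32 bytes of RAM.
function makePalette565(bg, fg) {
  // bg, fg: {r, g, b} with 8-bit components
  var pal = new Uint16Array(16);
  for (var i = 0; i < 16; i++) {
    var r = Math.round(bg.r + (fg.r - bg.r) * i / 15);
    var g = Math.round(bg.g + (fg.g - bg.g) * i / 15);
    var b = Math.round(bg.b + (fg.b - bg.b) * i / 15);
    pal[i] = ((r >> 3) << 11) | ((g >> 2) << 5) | (b >> 3); // pack as RGB565
  }
  return pal;
}
```

(A 24-bit controller would want 16 x 3 = 48 bytes instead, hence the 32-48 byte range.)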

If you can use several DMA channels to create a lookup-table function, that's effortless, and you can put it directly into RAM (internal controller) or SPI (external).  Maybe not so easy for parallel external controller, depends if you can access it as a bus device (memory interface or parallel port?) or have to bit-bang it yourself through plain GPIOs.

And the dictionary decode process is simple enough to do, at least in part, via DMA; not sure about the whole thing.  It won't be worth calling up the CPU to queue each and every new word, but if the whole thing can be done in DMA, sure.

And GPIOs may still be DMA-able; of course, every added function demands more channels, which are usually not that abundant.

In that case, buffering words or a character or a line into RAM, then off to the display, would be a reasonable compromise.  Throughput isn't going to be any better than doing it in software, but it keeps the CPU free.

But that's a very late optimization, and only available on some systems; if the base case is suitable for, say, an ATmega328p, it'll work on pretty much anything else too.

Tim
« Last Edit: December 23, 2020, 12:17:37 am by T3sl4co1l »
 

Offline I wanted a rude username

Re: Dictionary Coding for Embedded
« Reply #8 on: December 28, 2020, 03:41:00 am »
effectively 7-bit ASCII "compresses" in UTF-8 at 8 bits/character, while basic [non-ASCII] code points take 16 or more bits; which makes UTF-8 rather convenient in the western world, while putting it at a slight disadvantage to UTF-16 in the rest of the world.

I recall an analysis concluded this would probably still be a net win for CJK Web content, because of all the HTML, CSS, etc. markup which would benefit from UTF-8's compactness.

On devices that only need a small subset of characters, you can extend this even further, e.g. to a 5-bit Baudot-like encoding. For English text with a minimal subset of punctuation, this performs not much worse than Huffman, and is simpler to decode even taking into account the need to read unaligned 5-bit characters. Lacking a distinction between uppercase and lowercase is pretty oldskool though ...
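Reading those unaligned 5-bit characters is only a few lines -- a sketch, assuming LSB-first packing (the bit order is a free choice):

```javascript
// Extract `count` 5-bit codes from a packed byte array, LSB-first.
function unpack5(bytes, count) {
  var out = [];
  for (var i = 0; i < count; i++) {
    var bit = i * 5, byteIdx = bit >> 3, shift = bit & 7;
    var v = bytes[byteIdx] >> shift;
    if (shift > 3) v |= bytes[byteIdx + 1] << (8 - shift); // code spans two bytes
    out.push(v & 31);
  }
  return out;
}
```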
 

Offline T3sl4co1l (Topic starter)
Re: Dictionary Coding for Embedded
« Reply #9 on: February 01, 2021, 07:45:02 am »
So I think I've figured something out.  Not much of a theoretical basis or anything, more circumstantial, but satisfactory enough to share, I feel.

Two things:
1. The permutation problem isn't a problem
2. Long words suck


1. Permutations

So, the FSST paper brings up this problem.  It's described... and then completely forgotten.  So it's, uh, still a poorly written paper, in this respect...  But I see why they didn't concentrate on it.  (Still, it would've been nice to know this just from reading.)

Suppose you have a string with regions of differing entropy (assuming we have some way to define this*).  Given a dictionary over that string, we are unlikely to find many good matches (if any at all) in the high-entropy regions; and while we are likely to find many matches in the low-entropy regions, most of them will be short, or misaligned, so it turns out there are only a few good matches we really need to be concerned with.

*For example, we might generate the string using different sets of probabilities for those regions.  In this way, we might know the exact probabilities of each symbol in each position; of course, this will only be apparent from inspecting the string, if it's relatively long compared to the chances of symbols appearing in the respective regions.

To test this, I wrote an algorithm to exhaustively search for matches of a given length (in this case, a certain number of words, or until end of block).  This is, understandably, quite computation intensive (presumably, exponential in match length or number of words -- though I think the base is relatively low, because again, not actually all that many matches can be made at a given point).  So I've kept the limit low, and seen how performance varies.

This is that algorithm:

Code: [Select]
function codecDictTranslateWordsDynprog(dict, blocks) {

    const TEST_WORDS = 3; // Number of words to test

    var outbuf = []; outbuf.length = blocks.length;
    outbuf.use = []; outbuf.use.length = dict.length;
    for (var i = 0; i < dict.length; i++) outbuf.use[i] = 0;
    for (var i = 0; i < blocks.length; i++) {
        var totalLength = 0;
        outbuf[i] = [];
        do {
            var w = findBestPrefixWord(dict, blocks[i], totalLength, TEST_WORDS).word;
            outbuf[i].push(w);
            outbuf.use[w]++;
            totalLength += dict[w].length;
        } while (totalLength < blocks[i].length);
    }
    return outbuf;

    function findBestPrefixWord(dict, block, start, words) {
        // Try all combinations of words, matching from block[start] onward,
        // and return the first word of the combination which covers the
        // most total symbols, or which fills the block, from start to
        // block.length - 1, using the fewest words

        var best = {word: 0, len: 0, unused: 0};
        var f;
        // Easy out: it's already done
        if (start >= block.length || words == 0) {
            return best;
        }
        // Also easy out, if there's only one symbol remaining: use trivial word
        if (start == block.length - 1) {
            best.word = block[start]; best.len = 1; best.unused = words - 1;
            return best;
        }
        // Find all matching words, longest first
        // (dict is sorted by length, except for trivial words at the start, skip over them)
        var s = block.slice(start);
        for (var i = dict.findIndex(a => a.length > 1); i < dict.length; i++) {
            if (codecDictArrayMatch(s, dict[i])) {
                if (words == 1 || start + dict[i].length == block.length) {
                    // Used all words, or exact fit -- we're done
                    best.word = i; best.len = dict[i].length; best.unused = words - 1;
                    return best;
                }
                f = findBestPrefixWord(dict, block, start + dict[i].length, words - 1);
                f.len += dict[i].length;
                if (f.len > best.len || f.unused > best.unused) {
                    best.word = i; best.len = f.len; best.unused = f.unused;
                }
            }
        }
        // Finally, check trivial word
        f = findBestPrefixWord(dict, block, start + 1, words - 1);
        if (f.len + 1 > best.len || f.unused > best.unused) {
            best.word = block[start]; best.len = f.len + 1; best.unused = f.unused;
        }
        return best;
    }
}

function codecDictArrayMatch(block, word) {
    if (block.length < word.length) return 0;
    for (var i = 0; i < word.length; i++) {
        if (block[i] != word[i]) return 0;
    }
    return word.length;
}

codecDictTranslateWordsDynprog is the "helper" function for the codec (hence the naming convention).  It takes in an array of blocks, so handles looping over those blocks.  In turn, findBestPrefixWord works on a given block, looking from the starting offset, until either all TEST_WORDS words have been checked, or end of block.  It returns the first word with the best match, hence, the best matching prefix.

Note that each block is translated a word at a time, using that one word returned from findBestPrefixWord.  So, not only is it [probably] exponential, it's repeatedly so!  (But wait, there's more -- see pruning later.)

Regarding computing and data structures here -- building a matching string from words, can be viewed as a tree process.  At the root is the starting offset, and each node has a set of branches, being the words which match at that position.  The breadth of the tree presumably grows exponentially with depth, hence our exponential growth.  We are searching for the nearest leaf element, i.e., a least-depth search.  Using a stack (as plain old recursion does), we end up doing a depth-first search, and so have to scan from left to right, the entire tree.  Because, who knows, the next branch over might have our least-depth leaf.  I wrote it this way, because it's easier, and because it can be rewritten to use an explicit stack object.  And then the stack can be converted to a queue (minding to include state variables that are covered implicitly by the stack method -- in this case, the path from root to each queued node, I think has to be stored), and finally we can do a breadth-first search.

I also wrote it this way, because a breadth-first search won't save much, most of the time: the number of words (nodes) is fixed, so unless we bang into the end of the block (and thus save one more recursion, one additional word: the best savings is not writing more data at all), we have to search the entirety of the space anyway.  So in the average case, in the middle of a block, this can't save anything, anyway.  (If the process can be sped up so that an entire block at a time can be efficiently searched, this would be the search for the shortest sequence of words that fills the block.  That's not something I'm going to develop just yet, though...)

(Some time could be saved by realizing it's a directed acyclic graph, i.e., some nodes will coincide, between branches.  These don't need to be retested (i.e., the sub-problem of matching words at a given offset, is common to several branches, at various points in the graph).  Alternately, matches could be cached; which would be annoying to implement in explicit recursive form I think?, but easily done in a loop form.  Which might also be solved with say, a priority queue, so for every word matched at a given offset, the total length matched is recorded; and as we build sequences of words, we find some with the same total length.  We use length as priority, and all items of given priority can be merged together into the same subsequent path.  I wonder if this is strong enough to counter the exponential growth?  Hmm.)

The translation is combined with a dictionary seeding and pruning method.  The main encoding loop is:

Code: [Select]
var progress;
do {
    progress = false;

    // Dictionary-ify it
    outbuf = codecDictTranslateWordsGreedy(state.dict, state.blocks);
    //outbuf = codecDictTranslateWordsDynprog(state.dict, state.blocks);
    // Prune poorly used words
    for (var i = state.syms; i < state.dict.length; i++) {
        if ((outbuf.use[i] == 1) ||
            (state.dict[i].length == 2 && outbuf.use[i] <= 3) ||
            (state.dict[i].length == 3 && outbuf.use[i] <= 2)) {
            progress = true;
            state.dict.splice(i, 1);
            outbuf.use.splice(i, 1);
            i--;
        }
    }
} while (progress);

// Finally, prune unused words
for (var i = state.syms; i < state.dict.length; i++) {
    if (outbuf.use[i] == 0) {
        state.dict.splice(i, 1);
        outbuf.use.splice(i, 1);
        i--;
        for (var j = 0; j < outbuf.length; j++) {
            for (var k = 0; k < outbuf[j].length; k++) {
                if (outbuf[j][k] > i)
                    outbuf[j][k]--;
            }
        }
    }
}

(Again, note that dict[0 ... state.syms - 1] is the set of trivial (length-1) words.  So those are skipped over in operations on the set of nontrivial words.)

The dictionary is initially seeded from a suffix array, taking every prefix that appears more than once.  This doesn't quite populate it with every possible substring, but it gets close.  (Sometimes, maybe a word of length 7 is present, and substrings of length 5, but not 6.  Such words seem very unlikely to be useful anyway -- they don't appear more than once at all!)

(FYI: a suffix array, is an array of offsets into a given string, sorted lexicographically.  Here, the string is generated by concatenating all blocks, separated with unique EOL symbols.  So, finding all possible candidate words, is simply a matter of counting how many characters match, for substrings pointed to by adjacent elements.  Note that the suffix array will overcount some words, for example the sequence ABABAB will have suffixes AB, ABAB, BAB and BABAB, while we might only make use of AB in the end.)
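A toy version of that seeding step, using a naively-sorted suffix array over a plain string (fine for illustration, far too slow for real use, and working on strings rather than the tool's symbol arrays):

```javascript
// Seed candidate words: build a suffix array by naive sort, then take
// every prefix (length >= 2) common to an adjacent pair of suffixes --
// i.e. every substring that appears at least twice.
function candidateWords(s) {
  var sa = [];
  for (var i = 0; i < s.length; i++) sa.push(i);
  sa.sort((a, b) => s.slice(a) < s.slice(b) ? -1 : 1); // lexicographic order
  var words = new Set();
  for (var i = 1; i < sa.length; i++) {
    var a = sa[i - 1], b = sa[i], l = 0;
    while (s[a + l] !== undefined && s[a + l] === s[b + l]) l++; // common prefix length
    for (var k = 2; k <= l; k++) words.add(s.slice(b, b + k));
  }
  return words;
}
```

Note the overcounting mentioned above: ABABAB yields candidates like BAB that a final encoding may never use.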


Main insight for this section:
This dictionary pruning process, seems to have much more promise than the translating step.


The intent behind gently pruning the dictionary is that, as lightly-used words disappear, formerly-unused words may become chosen, and so on.  So I don't clear unused words right away.  Completely unused words can be cleared in a separate step, with no re-translation necessary; that's shown at the bottom.

I also tried it with total pruning (unused and poorly used words).  See below.

On to experimental.  Here are the measurements:

Setup:

Using a smaller file, FontLucidiaBoldNumbers.png -- this has 20 characters, 2.431 bits/pixel, 1950 pixels total.  As you might guess from the name, it's denser in black/white ratio, but has about the same (0 and 100%) / (all others) ratio.  It's antialiased 16-color grayscale text, like the other testing examples.

Results are bytes total (dictionary + encoded blocks), and the three encoding directions / blocking methods in the encoder tool have each been tested.  Rows gives many small blocks (200 total); Blocks uses the blocks as given (per glyph), scanning them in the named direction (row or column).

Results:

Greedy without dictionary pruning (remove zeroes only):
Rows: 1735, Blocks, row: 2169, Blocks, col: 2021

Greedy translation with dictionary pruning (remove zeroes and poorly-used):
Rows: 1251, Blocks, row: 1448, Blocks, col: 1352

Dynamic programming translation, without pruning, TEST_WORDS = 4:
Rows: 1735, Blocks, row: 2162, Blocks, col: 2034

Dynamic programming translation, with pruning (remove zeroes and poorly-used), TEST_WORDS = 2:
Rows: 1230, Blocks, row: 1443, Blocks, col: 1298

Dynamic, pruning, TEST_WORDS = 3:
Rows: 1229, Blocks, row: 1426, Blocks, col: 1276

Dynamic, pruning, TEST_WORDS = 4:
Rows: 1229, Blocks, row: 1426, Blocks, col: 1263

Dynamic, pruning, TEST_WORDS = 5:
Rows: 1229, Blocks, row: 1426, Blocks, col: 1256

Dynamic, pruning, TEST_WORDS = 6:
Rows: 1229, Blocks, row: 1426, Blocks, col: 1256

Dynamic programming translation, pruning poorly used words only (leaving unused until cleanup), TEST_WORDS = 5:
Rows: 1238, Blocks, row: 1246, Blocks, col: 1142

Dynamic, pruning poorly used words only, TEST_WORDS = 3:
Rows: 1236, Blocks, row: 1250, Blocks, col: 1172

Dynamic, pruning poorly used words only, entropy based word length limit, TEST_WORDS = 3:
Rows: 1182, Blocks, row: 1143, Blocks, col: 1080

Greedy, pruning poorly used words only, entropy based word length limit:
Rows: 1172, Blocks, row: 1164, Blocks, col: 1084


For reference, the 1-2-6 Huffman code gives:
Rows: 647, Blocks, row: 623, Blocks, col: 611

And RLE gives:
Rows: 744, Blocks, row: 663, Blocks, col: 579

So, the dictionary method still has a long way to go before it's competitive with these simpler methods (if it ever can be).

I will note that, because the dictionary is stored raw, there's a "free" almost half size savings, from simply packing it in nibbles rather than bytes.  This is contingent on word length and symbol set being 0..15 of course (not a great assumption in general, but fully applicable to my test cases).
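That packing step is a short sketch (low nibble first is an arbitrary choice):

```javascript
// "Free" near-halving of the raw dictionary: two 4-bit symbols per byte.
// Valid only while symbols are 0..15, as noted; odd lengths pad with 0.
function packNibbles(syms) {
  var out = [];
  for (var i = 0; i < syms.length; i += 2) {
    out.push((syms[i] & 15) | ((syms[i + 1] || 0) << 4)); // low nibble first
  }
  return out;
}
```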


2. Word Length

I think I can somewhat justify this.  Consider a sequence of (measured per-symbol) entropy e bits/symbol.  We're encoding that sequence with 8-bit bytes, so to maximize the value of encoded data:
a. The dictionary should be near 256 words long
b. The expected length of a given word will be around 8/e symbols long

To this end, I added a simple pruning step: exclude words of length >= Math.ceil(12 / e).  This is the "entropy based word length limit" in the last few measurements.
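A sketch of that limit, measuring the per-symbol entropy e of the input and filtering the dictionary (the function names are mine, not the tool's):

```javascript
// Shannon entropy of a symbol array, in bits per symbol.
function entropyBits(syms) {
  var counts = {};
  for (var i = 0; i < syms.length; i++) counts[syms[i]] = (counts[syms[i]] || 0) + 1;
  var e = 0;
  for (var k in counts) {
    var p = counts[k] / syms.length;
    e -= p * Math.log2(p);
  }
  return e;
}

// Drop dictionary words of length >= ceil(12 / e), per the rule above.
function limitWordLength(dict, e) {
  var max = Math.ceil(12 / e);
  return dict.filter(w => w.length < max);
}
```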


Discussion / Conclusion

As you can see, word length alone accounts for a bigger gain than any degree of word searching!  This explains why the FSST authors concentrate on tuning the dictionary, sticking to a simple greedy encoder.

Markov table: still haven't done it.  Not quite sure how to direct decisions between merging and splitting words, or how to express, measure, or optimize, entropy given those data.  Like, should all elements be evenly distributed (similar magnitude)?  It should be sparse, certainly, but what kind of margins should be placed on that?*  The FSST authors didn't describe their decision process, and I couldn't make sense of their C++ code...

*I have a bit of an idea; for sure, the gain of a given dictionary word is <= 1, if it's not used enough at all (specifically for my case: <= 3 uses for length 2, <= 2 for length 3, and < 2 for any other length).  We could bet on higher gains, given the entropy -- but I'm not sure at this time what relation should be used to turn that into a use threshold.  Hmm, it might even be a curve: instead of removing long words, I could simply weight long words more heavily; that way, they're still a possibility, but only if they're really effective, much more effective than expected.  Hmm..

Also, since dictionary size equals encoding efficiency, it may simply be that this method cannot be efficient for the particular data (and strategies, to the extent that I'm reluctant to further encode the data, e.g. adaptive Huffman encode the output data, and fixed or adaptive Huffman, or RLE, or nibble-pack, the dictionary and header) I'm considering.

Certainly, gain is to be had, on a larger data set that's able to fill up the dictionary, and for a symbol set that better fills out the dictionary encoding.  Give or take how to encode trivial symbols; the approach taken by FSST is fine for their plain text, or for, say, palettized color graphics.

Tim
 
The following users thanked this post: I wanted a rude username

Offline T3sl4co1l (Topic starter)
Re: Dictionary Coding for Embedded
« Reply #10 on: March 04, 2021, 06:00:32 am »
This is probably related, too, if perhaps oddly so:



I mean, most generally, they are all NP-complete problems (or, I at least presume the OP is), so that's not much of an accident.  But I hadn't realized there was a cellular automaton type analysis that applied to this game, that's cool as hell.  Anyway, the video's solution greatly resembles the solution I employed above; I guess my exhaustive search is more correctly called backtracking, then, though the proposed optimization I guess makes it dynamic programming.  Or something.  (Or wait, are those nested, not exclusive -- is backtracking a subset of dynamic programming?  Whichever..)  Unfortunately for me (but fortunately for the author!), there is no slam-dunk "this is guaranteed, mark it off and continue" for matching words by some unknowable fitness score.  But it's interesting to think in terms of that, and maybe it wouldn't be as awkward to try, as I think it is.

It also makes me think of my approach to PCB layout (also an NP-complete problem).  I line up the most likely candidates and route them first (small nets, cliques), then reconcile local and global routing, leading to the consideration of violating the earlier assumption (i.e., routing traces such that small nets or cliques are partially or wholly broken up), and finalizing the design.

Tim
 

