So I think I've figured something out. Not much of a theoretical basis or anything, more circumstantial, but satisfactory enough to share, I feel.
Two things:
1. The permutation problem isn't a problem
2. Long words suck
1. Permutations
So, the FSST paper brings up this permutation problem. It's described... and then completely forgotten about. So it's, uh, still a poorly written paper in this respect... But I can see why they didn't concentrate on it. (Still, it would've been nice to know this just from reading.)
Suppose you have a string with regions of differing entropy (assuming we have some way to define this*). Given a dictionary over that string, we are unlikely to find many good matches (if any at all) in the high-entropy regions; and while we are likely to find many matches in the low-entropy regions, most of them will be short or misaligned, so it turns out there are only a few good matches we really need to be concerned with.
*For example, we might generate the string using a different set of symbol probabilities for each region. That way, we know the exact probability of each symbol at each position; of course, this will only be apparent from inspecting the string if each region is long relative to the probabilities of the symbols appearing in it, so the statistics actually show up.
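(As a concrete sketch of what I mean by "generating" such a string -- the function and probability tables below are just made up for illustration; the actual blocks elsewhere in this post come from image data:)

// Sketch only: build a test block from regions with different symbol probabilities.
function sampleSymbol(probs) {
    // probs: per-symbol probabilities, summing to 1
    var r = Math.random(), acc = 0;
    for (var s = 0; s < probs.length; s++) {
        acc += probs[s];
        if (r < acc) return s;
    }
    return probs.length - 1;
}

function makeTestBlock(regions) {
    // regions: [{len: symbols to generate, probs: probability table}, ...]
    var out = [];
    for (var i = 0; i < regions.length; i++)
        for (var j = 0; j < regions[i].len; j++)
            out.push(sampleSymbol(regions[i].probs));
    return out;
}

// e.g. a low-entropy region (mostly 0s) followed by a near-uniform one
var block = makeTestBlock([
    {len: 100, probs: [0.90, 0.05, 0.03, 0.02]},
    {len: 100, probs: [0.25, 0.25, 0.25, 0.25]}
]);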
To test this, I wrote an algorithm to exhaustively search for matches of a given length (in this case, up to a certain number of words, or until end of block). This is, understandably, quite computationally intensive (presumably exponential in match length or number of words -- though I think the base is relatively low, because, again, not all that many matches can actually be made at a given point). So I've kept the limit low and seen how performance varies.
This is that algorithm:
function codecDictTranslateWordsDynprog(dict, blocks) {
    const TEST_WORDS = 3; // Number of words to test
    var outbuf = []; outbuf.length = blocks.length;
    outbuf.use = []; outbuf.use.length = dict.length;
    for (var i = 0; i < dict.length; i++) outbuf.use[i] = 0;
    for (var i = 0; i < blocks.length; i++) {
        var totalLength = 0;
        outbuf[i] = [];
        do {
            var w = findBestPrefixWord(dict, blocks[i], totalLength, TEST_WORDS).word;
            outbuf[i].push(w);
            outbuf.use[w]++;
            totalLength += dict[w].length;
        } while (totalLength < blocks[i].length);
    }
    return outbuf;

    function findBestPrefixWord(dict, block, start, words) {
        // Try all combinations of words, matching from block[start] onward,
        // and return the first word of the combination which covers the
        // most total symbols, or which fills the block, from start to
        // block.length - 1, using the fewest words
        var best = {word: 0, len: 0, unused: 0};
        var f;
        // Easy out: it's already done
        if (start >= block.length || words == 0) {
            return best;
        }
        // Also easy out, if there's only one symbol remaining: use trivial word
        if (start == block.length - 1) {
            best.word = block[start]; best.len = 1; best.unused = words - 1;
            return best;
        }
        // Find all matching words, longest first
        // (dict is sorted by length, except for trivial words at the start, skip over them)
        var s = block.slice(start);
        for (var i = dict.findIndex(a => a.length > 1); i < dict.length; i++) {
            if (codecDictArrayMatch(s, dict[i])) {
                if (words == 1 || start + dict[i].length == block.length) {
                    // Used all words, or exact fit -- we're done
                    best.word = i; best.len = dict[i].length; best.unused = words - 1;
                    return best;
                }
                f = findBestPrefixWord(dict, block, start + dict[i].length, words - 1);
                f.len += dict[i].length;
                if (f.len > best.len || f.unused > best.unused) {
                    best.word = i; best.len = f.len; best.unused = f.unused;
                }
            }
        }
        // Finally, check trivial word
        f = findBestPrefixWord(dict, block, start + 1, words - 1);
        if (f.len + 1 > best.len || f.unused > best.unused) {
            best.word = block[start]; best.len = f.len + 1; best.unused = f.unused;
        }
        return best;
    }
}
function codecDictArrayMatch(block, word) {
    if (block.length < word.length) return 0;
    for (var i = 0; i < word.length; i++) {
        if (block[i] != word[i]) return 0;
    }
    return word.length;
}
codecDictTranslateWordsDynprog is the "helper" function for the codec (hence the naming convention). It takes in an array of blocks, and so handles looping over those blocks. In turn, findBestPrefixWord works on a given block, looking from the starting offset until either all TEST_WORDS words have been used up or the end of the block is reached. It returns the first word of the best matching combination -- hence, the best matching prefix.
Note that each block is translated a word at a time, using that one word returned from findBestPrefixWord. So, not only is it [probably] exponential, it's repeatedly so! (But wait, there's more -- see pruning later.)
Regarding computing and data structures here -- building a matching string from words can be viewed as a tree process. At the root is the starting offset, and each node has a set of branches, being the words which match at that position. The breadth of the tree presumably grows exponentially with depth, hence our exponential growth. We are searching for the nearest leaf, i.e., a least-depth search. Using a stack (as plain old recursion does), we end up doing a depth-first search, and so have to scan the entire tree from left to right. Because, who knows, the next branch over might have our least-depth leaf. I wrote it this way because it's easier, and because it can be rewritten to use an explicit stack object. The stack can then be converted to a queue (minding to include state that the stack method covers implicitly -- in this case, I think the path from root to each queued node has to be stored), and finally we can do a breadth-first search.
I also wrote it this way because a breadth-first search won't save much, most of the time: the number of words (nodes) is fixed, so unless we bang into the end of the block (and thus save one more recursion, one additional word -- the best savings being not writing more data at all), we have to search the entirety of the space anyway. So in the average case, in the middle of a block, this can't save anything. (If the process could be sped up so that an entire block at a time can be searched efficiently, this would be the search for the shortest sequence of words that fills the block. That's not something I'm going to develop just yet, though...)
(Some time could be saved by realizing it's really a directed acyclic graph, i.e., some nodes will coincide between branches. These don't need to be retested (the sub-problem of matching words at a given offset is common to several branches, at various points in the graph). Alternatively, matches could be cached, which would be annoying to implement in the recursive form, I think, but easily done in a loop form. It might also be handled with, say, a priority queue: for every word matched at a given offset, record the total length matched; as we build sequences of words, some will reach the same total length, so we use length as the priority, and all items of a given priority can be merged into the same subsequent path. I wonder if this is strong enough to counter the exponential growth? Hmm.)
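(For what it's worth, here's a rough sketch of the queue version, merging coinciding offsets as described -- not something I've plugged into the codec, just to illustrate the idea. It reuses codecDictArrayMatch, and assumes every trivial length-1 word is in dict so a cover always exists:)

// Sketch only: breadth-first search for the fewest words covering
// block[start..end]. Offsets already reached are merged (the DAG insight),
// so each offset is expanded at most once.
function shortestCoverBFS(dict, block, start) {
    var queue = [{offset: start, path: []}]; // path = word indices used so far
    var seen = {}; seen[start] = true;
    while (queue.length > 0) {
        var node = queue.shift();
        if (node.offset >= block.length) return node.path; // first hit = fewest words
        var s = block.slice(node.offset);
        for (var i = 0; i < dict.length; i++) {
            var next = node.offset + dict[i].length;
            if (codecDictArrayMatch(s, dict[i]) && !seen[next]) {
                seen[next] = true;
                queue.push({offset: next, path: node.path.concat([i])});
            }
        }
    }
    return null; // unreachable if all trivial words are present
}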
The translation is combined with a dictionary seeding and pruning method. The main encoding loop is:
var progress;
do {
    progress = false;
    // Dictionary-ify it
    outbuf = codecDictTranslateWordsGreedy(state.dict, state.blocks);
    //outbuf = codecDictTranslateWordsDynprog(state.dict, state.blocks);
    // Prune poorly used words
    for (var i = state.syms; i < state.dict.length; i++) {
        if ((outbuf.use[i] == 1) ||
                (state.dict[i].length == 2 && outbuf.use[i] <= 3) ||
                (state.dict[i].length == 3 && outbuf.use[i] <= 2)) {
            progress = true;
            state.dict.splice(i, 1);
            outbuf.use.splice(i, 1);
            i--;
        }
    }
} while (progress);
// Finally, prune unused words
for (var i = state.syms; i < state.dict.length; i++) {
    if (outbuf.use[i] == 0) {
        state.dict.splice(i, 1);
        outbuf.use.splice(i, 1);
        i--;
        // Renumber codes in the already-translated output
        // (i was just decremented, so "> i" means ">= the removed index")
        for (var j = 0; j < outbuf.length; j++) {
            for (var k = 0; k < outbuf[j].length; k++) {
                if (outbuf[j][k] > i)
                    outbuf[j][k]--;
            }
        }
    }
}
(Again, note that dict[0 ... state.syms - 1] -- the first state.syms entries -- is the set of trivial (length 1) words. So those are skipped over in operations on the set of nontrivial words.)
The dictionary is initially seeded from a suffix array, taking every prefix that appears more than once. This doesn't quite populate it with every possible repeated substring, but it gets close. (Sometimes, maybe a word of length 7 is present, and substrings of length 5, but not 6. Indeed, it seems such words would be very unlikely to be useful anyway -- they don't appear more than once, at all!)
(FYI: a suffix array is an array of offsets into a given string, sorted so the suffixes starting at those offsets are in lexicographic order. Here, the string is generated by concatenating all blocks, separated with unique EOL symbols. So, finding all possible candidate words is simply a matter of counting how many symbols match between the suffixes pointed to by adjacent elements. Note that this will overcount some words; for example, the sequence ABABAB yields candidate words AB, ABAB and BAB, while we might only make use of AB in the end.)
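(A rough sketch of that seeding step, as I understand it -- taking the longest common prefix of each adjacent suffix pair. The real seeding code may differ in details, and buildSuffixArray() is assumed to exist elsewhere:)

// Sketch only: collect candidate words (length >= 2) from adjacent
// suffix-array entries; each such prefix appears at least twice.
function seedCandidates(str, sa) {
    var seen = {}, words = [];
    for (var i = 1; i < sa.length; i++) {
        var a = sa[i - 1], b = sa[i], len = 0;
        // Longest common prefix of the two suffixes
        while (str[a + len] != undefined && str[a + len] == str[b + len]) len++;
        if (len >= 2) {
            var w = str.slice(a, a + len);
            var key = w.join(',');
            if (!seen[key]) { seen[key] = true; words.push(w); }
        }
    }
    return words;
}
// var dictSeed = seedCandidates(str, buildSuffixArray(str)); // buildSuffixArray assumed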
Main insight for this section:
This dictionary pruning process seems to have much more promise than the translating step.
The intent behind gently pruning the dictionary is that, as lightly-used words disappear, formerly-unused words may get chosen instead, and so on. So I don't clear unused words right away. Completely unused words can be cleared in a separate step with no re-translation necessary, and that's shown at the bottom of the listing above.
I also tried it with total pruning (unused and poorly used words). See below.
On to the experimental part. Here are the measurements:
Setup:
Using a smaller file, FontLucidiaBoldNumbers.png -- this has 20 characters, 2.431 bits/pixel, 1950 pixels total. As you might guess from the name, it's denser in black/white ratio, but has about the same (0 and 100%) / (all others) ratio. It's antialiased 16-color grayscale text, like the other testing examples.
Results are total bytes (dictionary + encoded blocks), and the three encoding directions / blocking methods in the encoder tool have each been tested. Rows gives many small blocks (200 total); the two Blocks variants use the blocks as given (one per glyph), scanned in row or column order respectively.
Results:
Greedy without dictionary pruning (remove zeroes only):
Rows: 1735, Blocks, row: 2169, Blocks, col: 2021
Greedy translation with dictionary pruning (remove zeroes and poorly-used):
Rows: 1251, Blocks, row: 1448, Blocks, col: 1352
Dynamic programming translation, without pruning, TEST_WORDS = 4:
Rows: 1735, Blocks, row: 2162, Blocks, col: 2034
Dynamic programming translation, with pruning (remove zeroes and poorly-used), TEST_WORDS = 2:
Rows: 1230, Blocks, row: 1443, Blocks, col: 1298
Dynamic, pruning, TEST_WORDS = 3:
Rows: 1229, Blocks, row: 1426, Blocks, col: 1276
Dynamic, pruning, TEST_WORDS = 4:
Rows: 1229, Blocks, row: 1426, Blocks, col: 1263
Dynamic, pruning, TEST_WORDS = 5:
Rows: 1229, Blocks, row: 1426, Blocks, col: 1256
Dynamic, pruning, TEST_WORDS = 6:
Rows: 1229, Blocks, row: 1426, Blocks, col: 1256
Dynamic programming translation, pruning poorly used words only (leaving unused until cleanup), TEST_WORDS = 5:
Rows: 1238, Blocks, row: 1246, Blocks, col: 1142
Dynamic, pruning poorly used words only, TEST_WORDS = 3:
Rows: 1236, Blocks, row: 1250, Blocks, col: 1172
Dynamic, pruning poorly used words only, entropy based word length limit, TEST_WORDS = 3:
Rows: 1182, Blocks, row: 1143, Blocks, col: 1080
Greedy, pruning poorly used words only, entropy based word length limit:
Rows: 1172, Blocks, row: 1164, Blocks, col: 1084
For reference, the 1-2-6 Huffman code gives:
Rows: 647, Blocks, row: 623, Blocks, col: 611
And RLE gives:
Rows: 744, Blocks, row: 663, Blocks, col: 579
So, the dictionary method still has a long way to go before it's competitive with these simpler methods (if it ever can be).
I will note that, because the dictionary is stored raw, there's a "free" savings of almost half its size, from simply packing it in nibbles rather than bytes. This is contingent on word lengths and symbol values fitting in 0..15, of course (not a great assumption in general, but fully applicable to my test cases).
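(For illustration, one possible nibble layout -- a length nibble followed by the symbol nibbles. This is just a hypothetical packing, not the codec's actual dictionary format:)

// Sketch only: pack one word as nibbles, assuming its length and every
// symbol fit in 0..15.
function packWordNibbles(word) {
    var nibbles = [word.length].concat(word); // length nibble, then symbols
    var bytes = [];
    for (var i = 0; i < nibbles.length; i += 2)
        bytes.push((nibbles[i] << 4) | (nibbles[i + 1] || 0));
    return bytes;
}
// e.g. packWordNibbles([3, 0, 15, 7]) -> [0x43, 0x0F, 0x70]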
2. Word Length
I think I can somewhat justify this. Consider a sequence with (measured per-symbol) entropy e bits/symbol. We're encoding that sequence with 8-bit bytes, so to maximize the value of the encoded data:
a. The dictionary should be near 256 words long (so the byte code space is fully used)
b. The expected length of a given word will be around 8/e symbols (each output byte carries at most 8 bits of information, which at e bits/symbol covers about 8/e symbols)
To this end, I added a simple pruning step: exclude words of length >= Math.ceil(12 / e). This is the "entropy based word length limit" in the last few measurements.
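(A sketch of that limit, as I'd write it -- measuring e over all blocks and filtering the seeded dictionary before translation; the helper names here are made up. Note the limit uses 12/e rather than 8/e, presumably leaving some headroom above the expected word length. For this file, if e is taken as the overall 2.431 bits/pixel, 8/e is about 3.3 symbols and Math.ceil(12 / e) = 5, so words of length 5 and up get excluded:)

// Sketch only: per-symbol entropy over all blocks, then drop over-long words.
function entropyBitsPerSymbol(blocks, syms) {
    var counts = [], total = 0;
    for (var s = 0; s < syms; s++) counts[s] = 0;
    for (var i = 0; i < blocks.length; i++)
        for (var j = 0; j < blocks[i].length; j++) {
            counts[blocks[i][j]]++; total++;
        }
    var e = 0;
    for (var s = 0; s < syms; s++) {
        if (counts[s] == 0) continue;
        var p = counts[s] / total;
        e -= p * Math.log2(p);
    }
    return e;
}

var e = entropyBitsPerSymbol(state.blocks, state.syms);
var maxLen = Math.ceil(12 / e);
state.dict = state.dict.filter(w => w.length < maxLen); // exclude length >= limit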
Discussion / Conclusion
As you can see, the word length limit alone accounts for a bigger gain than any degree of word searching! This explains why the FSST authors concentrate on tuning the dictionary, sticking with a simple greedy encoder.
Markov table: still haven't done it. I'm not quite sure how to direct decisions between merging and splitting words, or how to express, measure, or optimize entropy given those data. Like, should all elements be evenly distributed (similar magnitude)? The table should be sparse, certainly, but what kind of margins should be placed on that?* The FSST authors didn't describe their decision process, and I couldn't make sense of their C++ code...
*I have a bit of an idea: for sure, the gain of a given dictionary word is <= 1 if it isn't used enough (specifically, for my case: <= 3 uses for length 2, <= 2 uses for length 3, and < 2 uses for any other length). We could bet on higher gains, given the entropy -- but I'm not sure at this time what relation should be used to turn that into a use threshold. Hmm, it might even be a curve: instead of removing long words outright, I could simply weight long words more heavily; that way, they're still a possibility, but only if they're really effective, much more effective than expected. Hmm...
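(One way to read those thresholds -- just an interpretation, not something derived above -- is a gain of roughly uses * (length - 1) - length: each use saves length - 1 output bytes, while storing the word in the raw dictionary costs about length bytes. The pruning cutoffs in the main loop land exactly where that expression stays <= 1:)

// Hypothetical gain measure consistent with the pruning thresholds above
function wordGain(length, uses) {
    return uses * (length - 1) - length;
}
// wordGain(2, 3) == 1, wordGain(3, 2) == 1, wordGain(4, 1) == -1
// -- all at or below the "gain <= 1" cutoff.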
Also, since dictionary size equals encoding efficiency, it may simply be that this method cannot be efficient for the particular data I'm considering, and for the particular strategies (to the extent that I'm reluctant to further encode things -- e.g. adaptive Huffman coding the output data, and fixed or adaptive Huffman, RLE, or nibble-packing for the dictionary and header).
Certainly, gains are to be had on a larger data set that can fill up the dictionary, and for a symbol set that better fills out the dictionary encoding. Give or take how to encode trivial symbols; the approach taken by FSST is fine for their plain text, or for, say, palettized color graphics.
Tim