Production code?
No, just wrote it offhand and failed. I tried to think of a good enough test case without digging through my archives. Production code uses 8 or 10 bits per component and 64-bit arithmetic (or two pairs of 32-bit values). Correct code would be
#include <stdint.h>

uint_fast16_t blend555(uint_fast16_t r5g5b5_0, uint_fast16_t r5g5b5_1, int_fast8_t phase)
{
    /* Expand each RGB555 value into a 32-bit "vector" with the 5-bit
       components at bits 0, 10, and 20, leaving 5 empty bits above
       each component so the products cannot overflow into each other. */
    const uint_fast32_t rgb0 = (r5g5b5_0 & 0x001F)
                             | ((r5g5b5_0 & 0x03E0) << 5)
                             | ((r5g5b5_0 & 0x7C00) << 10);
    const uint_fast32_t rgb1 = (r5g5b5_1 & 0x001F)
                             | ((r5g5b5_1 & 0x03E0) << 5)
                             | ((r5g5b5_1 & 0x7C00) << 10);

    /* Clamp the blend factor to 0..32. */
    if (phase < 0)
        phase = 0;
    if (phase > 32)
        phase = 32;

    const uint_fast32_t rgb = (32 - phase) * rgb0 + phase * rgb1 // Blending
                            + UINT32_C(0x01004010);              // Rounding: +16 ("0.5") per component

    /* Divide each component by 32 and pack back into RGB555. */
    return ((rgb >> 5) & 0x001F)
         | ((rgb >> 10) & 0x03E0)
         | ((rgb >> 15) & 0x7C00);
}
as you surmised. (This one I actually verified, you see. It produces the same results as using floating-point arithmetic for red, green, and blue separately, then rounding the results in the normal fashion.)
The 33 is actually a funky detail: \$(2^n + 1)(2^n - 1) = 2^{2 n} - 1\$. It means that scaling a value from \$n\$ bits to \$2 n\$ bits is done via multiplication by a factor between \$0\$ and \$2^n + 1\$, inclusive. Here we do a weighted sum instead of that kind of scaling, so adding "0.5" per component gives proper "rounding".
How does this relate to FPU?
Not directly: it was a sidetrack about the fundamental difference in paradigm between the Cortex-M and RISC-V architectures. It happened because I mis-chose the link to Compiler Explorer, and then suggested examining pretty bad code there.
The intent behind it was to show that although RISC-V is easy to generate code for, compiler optimizations are not as mature as for e.g. Cortex-M. You can see this if you simply compare the code generated by different versions of the same compiler; I suggest using -O2 -Wall -march=armv7e-m -mtune=cortex-m4 -mthumb for Cortex-M4. Between GCC 10.2.1 and 14.2.0 (none-eabi), the same assembly is produced for Cortex-M4, with -O2 and -Os having only trivial differences.
However, there was a tie. Let me explain (and apologies for the long post):
I picked the blend function as an example because it is something that is often done with floating-point arithmetic instead. The corrected version produces exactly the same values as the standard RGB blending, including rounding. The underlying idea is SIMD: expanding the argument(s) into a single integer "vector", with room between components so that they can never overflow into each other. Indeed, only two multiplications are needed using this scheme, one for each coefficient ("phase" and "inverse phase"). In practice, most of the operation cost is in expanding the "vector" from the source values, and packing the result "vector" back into a returnable value.
The technique can be used with up to 10 bits per color component using 64-bit arithmetic, but at some point, depending on the architecture, saving on the number of multiplications is just not worth the effort of unpacking and re-packing the "vector". It also does not work optimally for e.g. the RGB565 format, where the components have different sizes. For example, with 8 bits per color component, you can save multiplication operations using this method, but the packing and unpacking operations have a fixed "cost" (depending only on the number of components, not the size of each component). It may be less costly to do at least some of the multiplications separately, and avoid some or most of the packing and unpacking work.
This is analogous to FPU use in an IRQ with lazy stacking enabled. When lazy FPU stacking is enabled, any use of the FPU in an IRQ has a fixed cost compared to no FPU use at all, but additional FPU use beyond that incurs no extra cost. So, one must weigh the cost of implementing the calculation in fixed-point or other integer arithmetic against how much faster or easier to maintain the code would be if the FPU were used – including all possible uses in the IRQ, not just the key part that triggered the consideration. The details of how the lazy stacking of FPU registers (or any other registers) works do not matter at all, because we do not have a kernel-userspace type of interface at hand: it will just work.