In general, examining the operation to be done and dividing it up in different ways can yield much better solutions than just mapping machine instructions from one architecture to another.
If you do not consider the capabilities of the target instruction set architecture, and only look for equivalent instructions or equivalent instruction sequences, you won't find the most efficient ways to implement the underlying sequence of operations.
For ARM Cortex-M4 and -M7, you want the ARMv7-M Architecture Reference Manual. (Both the M4 and the M7 have the DSP extension mentioned built in, i.e. SMLAD and such.)
Even such a simple operation as blending two 15-bit RGB pixel values together, i.e.

    p = 0 (0%) to 33 (100%)
    r' = (r * p + R * (33 - p)) >> 5 = (R*33 + p*(r - R)) >> 5
    g' = (g * p + G * (33 - p)) >> 5 = (G*33 + p*(g - G)) >> 5
    b' = (b * p + B * (33 - p)) >> 5 = (B*33 + p*(b - B)) >> 5

can be implemented in many different ways on 32-bit architectures, depending on exactly what kind of machine instructions are available and efficient.
You definitely do not need to do six multiplications per pixel as the above might suggest. And yes, 100% blend of rgb, 0% of RGB, is actually p = 2^n+1 for n-bit color components, like n=5 here, not 2^n. Many disagree, but they're wrong. It is trivial to verify. (It is because (2^n-1)*(2^n+1) = 2^(2n)-1, and the right shift is a truncation/rounding towards zero.)
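To make this concrete, here is a minimal C sketch of one possible way to get from six multiplications down to two. It is only an illustration under assumptions, not necessarily the best choice for any particular core. It assumes the three 5-bit components sit in bits 0-4, 5-9 and 10-14 of a 16-bit pixel, with bit 15 ignored and cleared in the result. The names blend15_ref, spread15, blend15_packed and blend15_selftest are made up for this example. The trick relies on each weighted sum being at most 31*33 = 1023, so it fits in a 10-bit slot and three slots fit in one 32-bit word without carries between them.

```c
#include <assert.h>
#include <stdint.h>

/* Reference version: each 5-bit component is blended separately,
   which costs six multiplications per pixel.
   p = 33 yields a unchanged, p = 0 yields b unchanged. */
static uint16_t blend15_ref(uint16_t a, uint16_t b, uint32_t p)
{
    uint32_t q  = 33u - p;
    uint32_t c0 = (( a        & 31u) * p + ( b        & 31u) * q) >> 5;
    uint32_t c1 = (((a >> 5)  & 31u) * p + ((b >> 5)  & 31u) * q) >> 5;
    uint32_t c2 = (((a >> 10) & 31u) * p + ((b >> 10) & 31u) * q) >> 5;
    return (uint16_t)(c0 | (c1 << 5) | (c2 << 10));
}

/* Move the three 5-bit fields into 10-bit-wide slots at bits 0, 10 and 20,
   leaving five spare bits above each field for the weighted sum. */
static uint32_t spread15(uint32_t x)
{
    return (x & 0x001Fu) | ((x & 0x03E0u) << 5) | ((x & 0x7C00u) << 10);
}

/* Packed version: two multiplications blend all three components at once.
   Each slot holds at most 31*33 = 1023, so no slot overflows into the next. */
static uint16_t blend15_packed(uint16_t a, uint16_t b, uint32_t p)
{
    uint32_t v = spread15(a) * p + spread15(b) * (33u - p);
    v = (v >> 5) & 0x01F07C1Fu;  /* keep the top five bits of each slot */
    return (uint16_t)((v & 0x001Fu) | ((v >> 5) & 0x03E0u) | ((v >> 10) & 0x7C00u));
}

/* Checks that p = 33 and p = 0 reproduce the inputs exactly, and that both
   versions agree for every p over a sample of pixel pairs. */
static void blend15_selftest(void)
{
    for (uint32_t a = 0; a < 0x8000u; a += 37u) {
        for (uint32_t b = 0; b < 0x8000u; b += 1021u) {
            assert(blend15_packed((uint16_t)a, (uint16_t)b, 33u) == a);
            assert(blend15_packed((uint16_t)a, (uint16_t)b, 0u) == b);
            for (uint32_t p = 0; p <= 33u; p++)
                assert(blend15_ref((uint16_t)a, (uint16_t)b, p) ==
                       blend15_packed((uint16_t)a, (uint16_t)b, p));
        }
    }
}
```

Calling blend15_selftest() from a test program on a PC is a cheap way to check the p = 33 endpoint claim before looking at what the compiler generates for the target. Whether the packed form actually wins on a given core depends on its multiplier and shifter timings, so measure it.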