Author Topic: Can I increment an integer faster if 'step' is a power of 2? (Read 4320 times)

741 · « **on:** July 12, 2021, 12:00:33 pm »

This kind of thing - of course the compiler is free to lose any 'help' I am trying to give it with these powers of two, but anyway...

#define INCREMENT 64
uint16_t u;

for(u = 0; u < LIMIT; u += INCREMENT)
{
//Do whatever
}

tggzzz · « **Reply #1 on:** July 12, 2021, 12:18:00 pm »

With questions like this, the answer will depend on the what's inside the loop, target processor, the compiler, the compiler version, the compilation flags - and what is/isn't in any cache.

The contents of the loop would have to be small for it to make a significant difference - if any. Don't microoptimise!

If you want to play around, insert your code into https://godbolt.org/

T3sl4co1l · « **Reply #2 on:** July 12, 2021, 12:27:30 pm »

I suppose an increment of 1 might be slightly faster on some platforms. Difference being there's usually an INC instruction, whereas for others it's LDI reg, INCREMENT / ADD accum, reg, or ADD imm, more or longer instructions in either case.

I would not worry about it, at all. There are far more important things to think about. Loop optimization is a late, late stage of development.

The loop check may also vary, for example an increasing loop uses CMP accum, LIMIT which may be longer than the TST accum (or none at all when the ADD/SUB instruction returns zero) for decreasing to zero. Mind that comparisons to zero are only equal when the initial value is a multiple of the increment.

Loops can be on pointers as well. Instead of

Code: [Select]

for (i = 0; i < size; i++) {
	q[i] = p[i];
}

you might have something like,

Code: [Select]

void* qq = q;
void* pp = p;
void* ppend = pp + size;
for (; pp < ppend; ) {
	*q++ = *p++;
}

In certain (simple) cases, the compiler may elucidate this for you -- check the output listings. This tends to be more compact, but may not necessarily be faster, for example on 8-bit platforms the 16-bit pointer compare requires at least two instructions (compare, compare with carry), usually four (load immediate for each byte of the comparison).

And in such cases, you could go even further and think about page-aligned accesses, for example copying a buffer aligned to 256-byte start and length, so that the low-byte compare can be trivial (0x00). The compiler won't know about this; this only applies once you're deep in the assembler, desperate for CPU cycles, pushing around object allocations for lucky breaks like this.

Tim

gbaddeley · « **Reply #3 on:** July 12, 2021, 12:34:16 pm »

Integer Increment is basically the same low level logic complexity as add, whether it’s a power of 2 or other integer. Some cpus have a shorter instruction size for increment than add (1), to make a small program size saving.

We’re you hoping to make the C for loop run with lower overhead?

741 · « **Reply #4 on:** July 12, 2021, 01:23:17 pm »

Quote from: gbaddeley on July 12, 2021, 12:34:16 pm

Integer Increment is basically the same low level logic complexity as add

Thanks, that is what I suspected. This is on a PIC, and it is ages since I used assembly. The loop speed matters, since I want low overhead to be sure that N fixed delays add up to what I intended (N * fixed delay). I have to take the ADC sample at the correct time relative to initiating the readings.

TM4 screen dumps show Microchip's RE46C200 smoke detector in a test mode. Yellow is integrator, blue is DAC.

Integrator slope is proportional to illumination. First the chip takes DARK readings, then LIT readings.

Once the fixed-time (here 200us) integration completes, the DAC ramps up at a fixed rate until is change in value matches the preceeding integration (in it's day job it is part on an on-chip ADC).

---> The time taken by the DAC slope varies with illumination.

Note 1: DAC limit is 5.3V.
Note 2: I can't easily use the datum of the small dip in IR LED voltage (start of LIT integration, IR LED is on) with the hardware I have in place. I'd like to use timings from initial start-test only.

coppice · « **Reply #5 on:** July 12, 2021, 01:35:54 pm »

Quote from: T3sl4co1l on July 12, 2021, 12:27:30 pm

I suppose an increment of 1 might be slightly faster on some platforms. Difference being there's usually an INC instruction, whereas for others it's LDI reg, INCREMENT / ADD accum, reg, or ADD imm, more or longer instructions in either case.

A number of machines allow fast incrementing by 1, 2, 4 and perhaps 8, as these are the increments needed for incrementing the addresses of the various sizes of variable supported by the machine.

T3sl4co1l · « **Reply #6 on:** July 12, 2021, 02:51:25 pm »

Quote from: coppice on July 12, 2021, 01:35:54 pm

Quote from: T3sl4co1l on July 12, 2021, 12:27:30 pm
I suppose an increment of 1 might be slightly faster on some platforms. Difference being there's usually an INC instruction, whereas for others it's LDI reg, INCREMENT / ADD accum, reg, or ADD imm, more or longer instructions in either case.
A number of machines allow fast incrementing by 1, 2, 4 and perhaps 8, as these are the increments needed for incrementing the addresses of the various sizes of variable supported by the machine.

Ah, neat. I haven't seen any like that (that I know of). Though I do know of an analogous feature. On 80386+, the extended address calculation allows addition, immediate offset, and multiplication by several values (depending on whether you use the multiplicand in the addition), all done in separate hardware at nearly no expense. This is one case where the loop,

Code: [Select]

char *p;
for (i = 0; i < MAX; i++) {
	p[i * STEP]++;
}

is one of the better ways to do it (assuming it's followed literally, and not optimized to another form like as mentioned above). Mind, this multiplication is exactly implicit in non-void pointers: "*(p+1)" aka "p[1]" is the next element in the array, of whatever size p is typed as; it's not the raw address! I use char above, to force a size of 1 so the array access is obviously spaced out. I could've written it for any sized object, but it may not be as apparent unless one is aware of this. Usual caveats, mind alignment and padding, depending on platform, if you need to save memory or optimize for accesses like these.

Tim

vad · « **Reply #7 on:** July 12, 2021, 03:44:09 pm »

PIC assembly language has INC and INC2 instructions, but they only increment register value by 1 or 2 correspondingly. Compilers will handle adding of any other constant value to a register using instruction ADDLW or similar.

Check your exact PIC model’s programming manual to see the number of cycles it takes ADDLW instruction to execute. If it is 1 cycle (the most likely answer), then you cannot go any faster than that. INC and INC2 would not be faster.

If the number of cycles is greater than 1, you are still out of luck, unless you are incrementing by 1 or 2…

SiliconWizard · « **Reply #8 on:** July 12, 2021, 05:22:22 pm »

Quote from: tggzzz on July 12, 2021, 12:18:00 pm

With questions like this, the answer will depend on the what's inside the loop, target processor, the compiler, the compiler version, the compilation flags - and what is/isn't in any cache.

Of course. One obvious point here - the OP's question sort of assumes that regular integer addition could be more expensive than some other instruction on a given target. This is very uncommon. I don't think I have ever seen such a target myself, or it's so old that I don't remember.

At least... for addition of 'native' width. I haven't seen the OP mentioning anything about the target, so... we can just guess. All they said was "PIC" unless I missed something else, and there are 8-bit to 32-bit MCUs in the PIC line. So...

Obviously, in the OP's example, if you want to increment 16-bit integers on an 8-bit target (with 8-bit addition only), you can optimize the incrementation if the increment is all zeroes in the lower half, because it will make no difference to the lower half. So you can just add the upper half with the increment. That would work for any increment that is all zeroes in its lower half, not just powers of two. Now that's good if the increment is known in advance of course; if it has to be tested at each iteration, the the cost of the test (and branching) will very likely counterbalance any benefit.

This is just a general approach here. I'm not considering possible specific instructions that could exist on a given target. Some mentioned address incrementing, if I got it right, and again that it be more efficient would all depend on the exact architecture. If you're incrementing a 16-bit integer on at least a 16-bit processor, then there is really nothing to gain whatsoever (except maybe on a very crippled 16-bit CPU that would use more cycles to issue a 16-bit addition than other operations, which would be pretty rare.) But for an 8-bit target (except possibly one that would still have 16-bit addition using register pairs, yes that has existed), the only generic optimization I can think of would be what I said above: the lower half being ZERO.

Now if we take that question outside of the software world and consider pure digital implementation, then certainly, incrementing by a power of 2 will be much less expensive in terms of resources than incrementing by a number that has more than 1 bit set.

T3sl4co1l · « **Reply #9 on:** July 12, 2021, 05:31:56 pm »

Quote from: SiliconWizard on July 12, 2021, 05:22:22 pm

Now if we take that question outside of the software world and consider pure digital implementation, then certainly, incrementing by a power of 2 will be much less expensive in terms of resources than incrementing by a number that has more than 1 bit set.

Heh, would an FPGA even make that optimization?

I guess so? Not sure which way exactly it'd be done. For example, using adder builtins, but right/left aligning them, and strapping the extra (unused or default) bits to zero.

Or if there's no adder fabric (I forget if anyone bothers to implement that actually? multipliers definitely) then it doesn't matter, it's just a handful of gates, hopefully utilizing their carry-ins.

Tim

SiliconWizard · « **Reply #10 on:** July 12, 2021, 05:45:25 pm »

Quote from: T3sl4co1l on July 12, 2021, 05:31:56 pm

Quote from: SiliconWizard on July 12, 2021, 05:22:22 pm
Now if we take that question outside of the software world and consider pure digital implementation, then certainly, incrementing by a power of 2 will be much less expensive in terms of resources than incrementing by a number that has more than 1 bit set.

Heh, would an FPGA even make that optimization?

I guess so? Not sure which way exactly it'd be done. For example, using adder builtins, but right/left aligning them, and strapping the extra (unused or default) bits to zero.

Or if there's no adder fabric (I forget if anyone bothers to implement that actually? multipliers definitely) then it doesn't matter, it's just a handful of gates, hopefully utilizing their carry-ins.

Tim

On FPGAs specifically (which I was not addressing in particular), I guess it will all depend on the underlying architecture. But as with my optimization tip above, the minimum optimization it will issue is that it will NOT require adders for all the least significant bits that are zero. That still gives more opportunities for optimization. Now the end result will of course all depend on the width of the number to increment, the exact implementation of built-in adders, and so on. And the increment itself, of course. The more LS bits cleared (of course if the increment is constant), and the less expensive it is, in general.

TheCalligrapher · « **Reply #11 on:** July 12, 2021, 09:03:59 pm »

Quote from: 741 on July 12, 2021, 12:00:33 pm

This kind of thing - of course the compiler is free to lose any 'help' I am trying to give it with these powers of two, but anyway...

#define INCREMENT 64

If an optimization exists on your platform, it is definitely an "easy and obvious" optimization, which will be applied by the compiler itself. Nothing needs to be done on your side, unless you know for sure that your compiler is deficient in this regard.

rstofer · « **Reply #12 on:** July 12, 2021, 09:57:44 pm »

Quote from: SiliconWizard on July 12, 2021, 05:22:22 pm

Of course. One obvious point here - the OP's question sort of assumes that regular integer addition could be more expensive than some other instruction on a given target. This is very uncommon. I don't think I have ever seen such a target myself, or it's so old that I don't remember.

Some designs (like ARM) have an 'add immediate' instruction where the increment is encoded with the instruction for small values on increment (0..4095). Otherwise, the operand is required to be in a register and this may require an indexed fetch from a pool area of constants. The value 'may' be in a register but I don't think you can assume it is unless you play with assembly coding.

https://developer.arm.com/documentation/dui0802/a/A64-General-Instructions/ADD--immediate-

'Premature optimization is the root of all evil!' - Sir Tony Hoare

https://ubiquity.acm.org/article.cfm?id=1513451

SiliconWizard · « **Reply #13 on:** July 12, 2021, 10:23:56 pm »

As to optimizing increments for larger-than-native integers, as I mentioned, a decent compiler will also do it on its own. Again as long as the increment is known statically (thus is a constant).

As an example with GCC and RV32:

Code: [Select]

#include <stdint.h>

uint64_t Inc1(uint64_t n)
{
	return n + 1;
}

uint64_t Inc2(uint64_t n)
{
	return n + 0x100000000;    // 2^32
}

yields: (parameter in {a1, a0}; return value in {a1, a0})

Code: [Select]

Inc1:
	mv	a5,a0
	addi	a0,a0,1
	sltu	a5,a0,a5
	add	a1,a5,a1
	ret
	
Inc2:
	addi	a1,a1,1
	ret

TheCalligrapher · « **Reply #14 on:** July 12, 2021, 10:25:49 pm »

Quote from: SiliconWizard on July 12, 2021, 05:22:22 pm

Of course. One obvious point here - the OP's question sort of assumes that regular integer addition could be more expensive than some other instruction on a given target. This is very uncommon. I don't think I have ever seen such a target myself, or it's so old that I don't remember.

There is not such thing as "regular old addition". Or, more precisely, there are many kinds of "regular old addition" even on fairly archaic platforms. The obvious example is various flavors of "load address" instructions (`lea` in x86 parlance, for example), which are widely used for efficient constant arithmetic, then "immediate constant operand" additions, then "constant in a register" additions and, finally, "constant in memory" additions.

And even on a platform that has no fancy alternatives to a plain-vanilla `add` instruction, it might still end up being as efficient as it gets if the delta is pre-loaded in a register before the cycle (assuming a register can be reserved for the entire duration of the cycle).

SiliconWizard · « **Reply #15 on:** July 12, 2021, 10:49:08 pm »

Quote from: rstofer on July 12, 2021, 09:57:44 pm

Quote from: SiliconWizard on July 12, 2021, 05:22:22 pm

Of course. One obvious point here - the OP's question sort of assumes that regular integer addition could be more expensive than some other instruction on a given target. This is very uncommon. I don't think I have ever seen such a target myself, or it's so old that I don't remember.

Some designs (like ARM) have an 'add immediate' instruction where the increment is encoded with the instruction for small values on increment (0..4095). Otherwise, the operand is required to be in a register and this may require an indexed fetch from a pool area of constants. The value 'may' be in a register but I don't think you can assume it is unless you play with assembly coding

Many ISAs have "add immediate" instructions. Obviously if the increment can't fit an immediate, you need to load it first in a register. Depending on surrounding code, this register will hold this constant every time the increment is done (in a loop), or it may have to reload it every time (could be also done 'indirectly' by register save and load if you are calling functions in between, or all registers are used up, or interrupts kick in.) Note that in cases it has to load a register every time with the constant, that usually means a significant number of instructions have been executed in between, and then one instruction more or less won't make a fricking difference. That's not even micro-optimizing, at this level. It's nano-optimizing?

Just a thought.

Of course for "add immediate" instructions to be used, the increment must be known at compile-time.

mikerj · « **Reply #16 on:** July 13, 2021, 10:04:04 pm »

Quote from: 741 on July 12, 2021, 01:23:17 pm

Quote from: gbaddeley on July 12, 2021, 12:34:16 pm
Integer Increment is basically the same low level logic complexity as add

Thanks, that is what I suspected. This is on a PIC, and it is ages since I used assembly. The loop speed matters, since I want low overhead to be sure that N fixed delays add up to what I intended (N * fixed delay). I have to take the ADC sample at the correct time relative to initiating the readings.

On the lower end PICs at least, decrementing the loop counter can shave a few cycles because the decfsz instruction is pretty much all you need.


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

EEVblog Electronics Community Forum

Author Topic: Can I increment an integer faster if 'step' is a power of 2? (Read 4320 times)

741

Can I increment an integer faster if 'step' is a power of 2?

tggzzz

Re: Can I increment an integer faster if 'step' is a power of 2?

T3sl4co1l

Re: Can I increment an integer faster if 'step' is a power of 2?

gbaddeley

Re: Can I increment an integer faster if 'step' is a power of 2?

741

Re: Can I increment an integer faster if 'step' is a power of 2?

coppice

Re: Can I increment an integer faster if 'step' is a power of 2?

T3sl4co1l

Re: Can I increment an integer faster if 'step' is a power of 2?

vad

Re: Can I increment an integer faster if 'step' is a power of 2?

SiliconWizard

Re: Can I increment an integer faster if 'step' is a power of 2?

T3sl4co1l

Re: Can I increment an integer faster if 'step' is a power of 2?

SiliconWizard

Re: Can I increment an integer faster if 'step' is a power of 2?

TheCalligrapher

Re: Can I increment an integer faster if 'step' is a power of 2?

rstofer

Re: Can I increment an integer faster if 'step' is a power of 2?

SiliconWizard

Re: Can I increment an integer faster if 'step' is a power of 2?

TheCalligrapher

Re: Can I increment an integer faster if 'step' is a power of 2?

SiliconWizard

Re: Can I increment an integer faster if 'step' is a power of 2?

mikerj

Re: Can I increment an integer faster if 'step' is a power of 2?

Share me