Author Topic: How are micros programmed in real world situations? (Read 31867 times)

Howardlong · « **Reply #50 on:** April 17, 2015, 10:14:03 am »

The difficulty comes a tiny bit further along, when you want to start using a peripheral and need to find out what's been initialised and what hasn't, and which clocks have been used for what. For example, you don't want to accidentally start messing with any clocking on flashless NXP devices until you know where the SPIFI clock is coming from. The only way to do that is to look at, and understand, the startup code. On some ARM devices there's a lot of it, including boot ROM, CRT startup and board-specific initialisation. Some devices, like the PIC, have a very limited amount of startup: just the generic CRT startup, which in most cases you don't need to know anything about at all. It does mean, though, that one of the first tasks many of us need to do is to fanny about getting the system clock running at a reasonable speed, but at least it's transparent.

But in general, yes, you should be able to get at least a ready-to-go Blinky project going reasonably quickly on any platform, assuming the toolchains, including debugging tools, are documented in an accurate and reasonably succinct way, preferably in a single, concise "getting started" document. What happens after that can be rather more time consuming; I doubt there are any platforms without nonsense that you have to overcome through failure first.

atferrari · « **Reply #51 on:** April 17, 2015, 11:40:33 am »

Quote from: Howardlong on April 12, 2015, 09:00:41 pm

PICs have their own frustrations, even on the bottom-end 8-bit devices. Some design decisions, such as GPIOs defaulting to analogue pins, I still don't understand, but it becomes a natural reaction to switch them to GPIO as one of the first things you do. For a beginner this kind of thing is just another barrier. Spending an hour or two just getting the core clock going at the right frequency with a mixture of fuse and functional programming is not unheard of.

In line with that, I always think that, if I had to teach how to use those micros (16F and 18F families), then after a cursory reading of the datasheet I would first explain how to initialize the micro (to blink an LED, so to speak), which was my major stumbling block. Immediately after that I would move on to interrupts (I was really scared of them). The way I see it, after that it is all peripherals and imagination.

But then, that is me.

VEGETA · « **Reply #52 on:** April 17, 2015, 02:06:54 pm »

What I understood from the last discussion is that working with ARM is not very different from, or more difficult than, PIC or the like. However, the setup is more difficult, and there is stuff you need to do or set up in code before you get to write your own code.

Now, is there any good resource to learn working with ARM MCUs? A book, for example?

I am interested in the Renesas RX62N board (the 99$ one, I guess), which has a good course here (https://www.youtube.com/playlist?list=PLPIqCiMhcdO7TJiOvupVWuVCEsSWMKuWJ) based on it. And it has a whole book on embedded systems too (download: http://webpages.uncc.edu/~jmconrad/ECGR4101-2012-01/notes/All_ES_Conrad_Final_Soft_Proof_Blk.pdf ).

I have searched Amazon for embedded books but got confused: many books are pricey and I don't know whether they are any good. Maybe getting that dev board and starting with its book is the best way?

Howardlong · « **Reply #53 on:** April 17, 2015, 03:19:14 pm »

Quote from: VEGETA on April 17, 2015, 02:06:54 pm

Now, is there any good resource to learn working with ARM MCUs? A book, for example?

In my opinion, the best way is by doing. As I mentioned earlier, the stumbling block with ARM is finding the starting point, because there are so many options, often obfuscated by both ARM's and the vendors' marketing. I would say that if you already have a relationship with a particular vendor who also has ARM devices, then go with them, as you will at least know your way around their website and documentation standards.

I do have the Joseph Yiu books on the various Cortex cores, but I very, very rarely refer to them.

I would pick a vendor's entry-level ARM dev board and debugger as required, based on, say, an M0, that has a reasonable-looking toolchain with a getting started document, and go with that. Avoid cheapskating on third-party debuggers unless you like a challenge that might well end in tears, particularly if it's FTDI based.

AndyC_772 · « **Reply #54 on:** April 17, 2015, 03:34:25 pm »

I agree that a good tool chain is probably the best place to start.

I wasted a lot of time trying to set up Eclipse, and then even longer trying to work around bugs in CoIDE.

With hindsight, I should have simply bought CrossWorks on day one. For a non-commercial licence it's ridiculously cheap, and even the 'full' version is much more cost-effective than most competing products.

westfw · « **Reply #55 on:** April 17, 2015, 07:19:03 pm »

Quote

clock distribution?

Most 8bit microcontrollers default to providing clock to most of the peripherals. To start using a peripheral, you start by writing values to the peripheral's control registers. "low power" microcontrollers may have clock control circuitry that disables peripherals to save power.
Most ARM chips default to NOT providing clock to most of the peripherals; if you write values to the peripheral control register, you get a fault condition (one that you probably haven't set up a handler for, so the chip mysteriously halts).
It's not REALLY that much more complicated than "you have to disable the a2d and analog comparator before you can use those pins as digital IO" (PIC8), but ... it IS different. Most people aren't even particularly aware that GPIO ports NEED a clock.
And then the problem is compounded by poor documentation. The STM32F103 has gotten a fair amount of discussion here: if you read the GPIO section of its reference manual, the need to enable the clock is not mentioned. If you SEARCH the entire 1100-page manual for "GPIO", you won't find anything about the clock, because for some reason the clock control bits are named xxx_IOPxxx. (Perhaps this is supposed to make it more obvious that you need to provide that clock if you use the non-GPIO functions that might be available on those port pins as well? Right.) (I know. Just use the GPIO_init() function provided by the vendor! What? It doesn't turn on the clock either?! Grr.)
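
To make the "clock first, then configure" order concrete, here is a minimal sketch for port C of an STM32F103. The helper name `gpioc_output_init` and the style of passing registers in as pointers are assumptions made for illustration, so that the same logic can be run against plain variables on a PC. The bit positions (IOPCEN as bit 4 of RCC_APB2ENR, four configuration bits per pin in GPIOx_CRH) are as I understand the reference manual and should be checked against your copy.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed from the STM32F103 reference manual; verify before use. */
#define RCC_APB2ENR_IOPCEN    (1u << 4)  /* clock enable for GPIO port C */
#define GPIO_MODE_OUT_2MHZ_PP 0x2u       /* CNF=00 push-pull, MODE=10 2 MHz */

/* Hypothetical helper: configure one pin of port C (pins 8..15) as an output.
 * Registers are passed as pointers so this can be tested off-target. */
static void gpioc_output_init(volatile uint32_t *rcc_apb2enr,
                              volatile uint32_t *gpioc_crh,
                              unsigned pin)
{
    assert(pin >= 8u && pin <= 15u);     /* CRH only covers pins 8..15 */
    unsigned shift = (pin - 8u) * 4u;    /* 4 configuration bits per pin */

    *rcc_apb2enr |= RCC_APB2ENR_IOPCEN;  /* 1. clock the port first */

    uint32_t crh = *gpioc_crh;           /* 2. only then touch the port */
    crh &= ~(0xFu << shift);
    crh |= (uint32_t)GPIO_MODE_OUT_2MHZ_PP << shift;
    *gpioc_crh = crh;
}
```

On a real part the two pointers would be the register addresses, which I believe are `(volatile uint32_t *)0x40021018` for RCC_APB2ENR and `(volatile uint32_t *)0x40011004` for GPIOC_CRH; treat both as assumptions until you have confirmed them in the memory map.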

I suppose it could all boil down to the fact that many 8-bit microcontrollers have now had several decades of documentation refinement and 3rd-party contributions to "overall community understanding." Including documents and projects aimed at clueless hobbyists. (I mean, how else can you explain that the PIC16F84, with all its architectural ugliness and low functionality, was THE hobbyist microcontroller of choice for ... quite a long time.)

ARM and other 32bit chips might eventually reach that "maturity." If the "churn rate" slows down, and it doesn't all get buried beneath an oversimplified and/or over-complicated abstraction layer (like Arduino or Linux.)

VEGETA · « **Reply #56 on:** April 17, 2015, 09:04:14 pm »

Quote from: AndyC_772 on April 17, 2015, 03:34:25 pm

I agree that a good tool chain is probably the best place to start.

I wasted a lot of time trying to set up Eclipse, and then even longer trying to work around bugs in CoIDE.

With hindsight, I should have simply bought CrossWorks on day one. For a non-commercial licence it's ridiculously cheap, and even the 'full' version is much more cost-effective than most competing products.

This is another hardship. I like Renesas MCUs and I really want to learn how to use them. However, look at their programmers, for example... there are about three of them, and you don't know which one to buy for your needs, as each supports some MCUs and not others. And their IDEs, for example: one based on Eclipse and another one called HEW... It is a total mess!

Why wouldn't they have one programmer (with higher versions for production) and one IDE? And then there is this JTAG debugger from Segger and other stuff like it!

^
That drove me to decide to buy the Renesas dev board for the RX62N (the 99$ one) when I have the money and time, as I would learn to deal with the MCU without the hassle of the tools and stuff.

Now, when it comes to the point where one should use an RTOS, is it hard (assuming that I know how to program the MCU normally)? I see that these RTOSes have some .h and .c files that get compiled along with the main.c that the user writes (sometimes more than just main.c, as I see)... does dealing with them involve a different procedure, or is it just coding?

Thanks, and sorry for so many questions xD

Mechanical Menace · « **Reply #57 on:** April 20, 2015, 01:13:14 pm »

Quote from: VEGETA on April 17, 2015, 09:04:14 pm

Now, when it comes to the point where one should use an RTOS, is it hard (assuming that I know how to program the MCU normally)?

If you actually need an RTOS they can make things easier: they (should) do all the hard work of making sure your time-critical processes get cycles and access to peripherals exactly when needed, while still doing a decent job of scheduling non-time-sensitive processes lol.

Quote

I see that these RTOSes have some .h and .c files that get compiled along with the main.c that the user writes (sometimes more than just main.c, as I see)... does dealing with them involve a different procedure, or is it just coding?

RTOS's can be a little more demanding on the userspace programmer than normal OS's, but at the end of the day they're still presented to you as Yet Another Library.

Howardlong · « **Reply #58 on:** April 20, 2015, 04:07:42 pm »

Quote from: Mechanical Menace on April 20, 2015, 01:13:14 pm

Quote from: VEGETA on April 17, 2015, 09:04:14 pm
Now, when it comes to the point where one should use an RTOS, is it hard (assuming that I know how to program the MCU normally)?

If you actually need an RTOS they can make things easier: they (should) do all the hard work of making sure your time-critical processes get cycles and access to peripherals exactly when needed, while still doing a decent job of scheduling non-time-sensitive processes lol.

Quote
I see that these RTOSes have some .h and .c files that get compiled along with the main.c that the user writes (sometimes more than just main.c, as I see)... does dealing with them involve a different procedure, or is it just coding?

RTOS's can be a little more demanding on the userspace programmer than normal OS's, but at the end of the day they're still presented to you as Yet Another Library.

In general I agree, although you generally use a different mindset at the outset when going for an RTOS rather than a superloop style.

Superloops are much more deterministic, although if you can make your RTOS design non-preemptive then the non-deterministic nature can be controlled better than in a completely pre-emptive solution. Non-preemptive solutions also remove context switching overhead.
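
For readers who haven't met the term, a superloop can be sketched as below: a table of tasks, each run to completion when its period has elapsed. All names here (`task_t`, `superloop_pass`, the two example tasks) are hypothetical, and the tick source is left to the caller; it is only meant to show why the timing is easy to reason about.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef void (*task_fn)(void);

typedef struct {
    task_fn  run;       /* must return quickly: it runs to completion */
    uint32_t period;    /* in ticks */
    uint32_t last_run;  /* tick at which the task last ran */
} task_t;

/* One pass of the loop: run every task whose period has elapsed.
 * Unsigned subtraction keeps the test correct across tick wraparound. */
static void superloop_pass(task_t *tasks, size_t count, uint32_t now)
{
    for (size_t i = 0; i < count; i++) {
        assert(tasks[i].run != NULL);
        if ((uint32_t)(now - tasks[i].last_run) >= tasks[i].period) {
            tasks[i].last_run = now;
            tasks[i].run();
        }
    }
}

/* Example tasks that only count how often they were called. */
static unsigned fast_count, slow_count;
static void fast_task(void) { fast_count++; }
static void slow_task(void) { slow_count++; }
```

In a real program `main()` would call `superloop_pass()` forever with the current tick; tasks always run in table order, which is where the determinism comes from.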

Generally, if at all possible, I avoid generic heap management in any resource-limited embedded application, as it's another thing to control. You can dynamically allocate, but if I do so I generally just use statically allocated pools of fixed-size blobs for each requirement: generic malloc-type allocation leads to fragmentation and unexpected consequences. This applies whether you use an RTOS or not, though.
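
A statically allocated pool of fixed-size blocks can be sketched in a few lines. This is an illustration rather than anyone's production code: the block size, block count and function names are assumptions, and it is not interrupt-safe as written (a real one would mask interrupts or use a lock around the list updates).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_BLOCK_SIZE  32u  /* bytes per block (assumed) */
#define POOL_BLOCK_COUNT 8u   /* number of blocks (assumed) */

typedef union pool_block {
    union pool_block *next;                  /* valid only while free */
    uint8_t           data[POOL_BLOCK_SIZE]; /* payload while allocated */
} pool_block_t;

static pool_block_t  pool_storage[POOL_BLOCK_COUNT]; /* no heap involved */
static pool_block_t *pool_free_list;

/* Thread every block onto the free list. */
static void pool_init(void)
{
    pool_free_list = NULL;
    for (size_t i = 0; i < POOL_BLOCK_COUNT; i++) {
        pool_storage[i].next = pool_free_list;
        pool_free_list = &pool_storage[i];
    }
}

/* O(1) allocation; returns NULL when the pool is exhausted. */
static void *pool_alloc(void)
{
    pool_block_t *b = pool_free_list;
    if (b != NULL) {
        pool_free_list = b->next;
    }
    return b;
}

/* O(1) release; the block goes back on the front of the free list. */
static void pool_free(void *p)
{
    assert(p != NULL);
    pool_block_t *b = p;
    b->next = pool_free_list;
    pool_free_list = b;
}
```

Because every block is the same size, the pool cannot fragment, and running out is an explicit `NULL` you can plan for at design time.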

You can do an awful lot without an RTOS, but sometimes the complexity of a requirement makes an RTOS a more reasonable choice overall.

Mechanical Menace · « **Reply #59 on:** April 20, 2015, 04:53:57 pm »

Quote from: Howardlong on April 20, 2015, 04:07:42 pm

Non-preemptive solutions also remove context switching overhead.

Are you sure? Even with cooperative multitasking, when a process hands off back to the OS there's still a context switch to kernel mode and back to user mode; the stack (or its location, depending on architecture), registers and the instruction pointer still have to be read, stored, read from RAM and restored...

zirlou21 · « **Reply #60 on:** April 20, 2015, 05:10:28 pm »

...good day to all.

...getting back with programming aspect; c or assembly.

...assembly is good when C cannot provide adequate control over hardware, e.g. bit manipulation.
There are times when you need to shift or rotate a byte or word of data but your MCU-compatible C compiler lacks a command for it. Some MCUs need an outside library to support it,
so inline assembly is the solution for this.

...for example, implementing a WDT program needs inline assembly coding... just what the Zilog Encore does more about.

...what do you think, guys?

Howardlong · « **Reply #61 on:** April 20, 2015, 06:27:59 pm »

Quote from: Mechanical Menace on April 20, 2015, 04:53:57 pm

Quote from: Howardlong on April 20, 2015, 04:07:42 pm
Non-preemptive solutions also remove context switching overhead.

Are you sure? Even with cooperative multitasking, when a process hands off back to the OS there's still a context switch to kernel mode and back to user mode; the stack (or its location, depending on architecture), registers and the instruction pointer still have to be read, stored, read from RAM and restored...

I thought about that a little bit before writing it, but not hard enough; you are right. There is indeed a context switch, but only when you allow it, so you don't have to maintain things like separate stacks. Apologies for my brain fart!

Even in non-preemptive systems, there's almost certainly some pre-emption from ISRs, but the trick is to minimise the resources used in those ISRs to things like a simple semaphore, mutex or queue operation.
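
One common way to keep an ISR down to "a simple queue operation" is a single-producer, single-consumer ring buffer: the ISR only stores a byte, and everything else happens later in the main loop or a task. The sketch below is an assumption-laden illustration (names, sizes and the byte payload are invented), and it relies on the ISR being the only writer of `head` and the main loop the only writer of `tail` on a single core.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RB_SIZE 16u  /* must be a power of two for the index masking */

typedef struct {
    volatile uint8_t  buf[RB_SIZE];
    volatile uint32_t head;  /* written only by the ISR */
    volatile uint32_t tail;  /* written only by the main loop */
} ringbuf_t;

/* Called from the ISR: store one byte and return, no processing here. */
static bool rb_put(ringbuf_t *rb, uint8_t byte)
{
    if ((uint32_t)(rb->head - rb->tail) == RB_SIZE) {
        return false;                        /* full: caller drops the byte */
    }
    rb->buf[rb->head & (RB_SIZE - 1u)] = byte;
    rb->head++;                              /* publish after the data write */
    return true;
}

/* Called from the main loop or a task. */
static bool rb_get(ringbuf_t *rb, uint8_t *out)
{
    assert(out != NULL);
    if (rb->head == rb->tail) {
        return false;                        /* empty */
    }
    *out = rb->buf[rb->tail & (RB_SIZE - 1u)];
    rb->tail++;
    return true;
}
```

On a multi-core part, or with a DMA engine as producer, memory barriers would also be needed; that is outside this sketch.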

Howardlong · « **Reply #62 on:** April 20, 2015, 06:58:48 pm »

Quote from: zirlou21 on April 20, 2015, 05:10:28 pm

...good day to all.

...getting back with programming aspect; c or assembly.

...assembly is good when C cannot provide adequate control over hardware, e.g. bit manipulation.
There are times when you need to shift or rotate a byte or word of data but your MCU-compatible C compiler lacks a command for it. Some MCUs need an outside library to support it,
so inline assembly is the solution for this.

...for example, implementing a WDT program needs inline assembly coding... just what the Zilog Encore does more about.

...what do you think, guys?

I think it depends on the environment, but these days it's extremely rare for me to write any assembler.

The most common for me is a Nop() for some ARM implementations which lack such a thing, allowing me to place breakpoints in specific places in code, particularly when the optimiser's on. It can be as simple as this:

#define NOP() asm volatile("mov r0, r0")

The last time I did any serious assembler was about four years ago when trying to squeeze the last drop out of a PIC24, performing some very tight DSP operations within a 1ms USB frame time. But there was a time when assembler, or more correctly machine code, was all I wrote.

For readability I'd much rather do it in C; doing assembler requires a lot of comments, often at least one every line or two, as much for my sanity as anyone else's when I come back to it a year later.

You can often use __builtin intrinsic functions to improve speed. The plus side of this is that they are more readable than assembler and they do what you tell them to, such as multiply two 16 bit variables and give a 32 bit answer. The bad news is that if, say, an addressing mode isn't well supported under the hood, you can end up generating more instructions than you thought, and just like assembler it's hardly portable. But then we are in the embedded world, if you get this far and decide you've chosen the wrong MCU, there's a lot more to be concerned about than a few __builtins and a couple of dozen lines of assembler.

For many implementations these days, you can do things like WDT resets with other intrinsic functions. The same applies to doing shifts and rotates, where often there's a __builtin intrinsic function. I'd far rather use that than resort to assembler, which is very hard to maintain.
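
As an illustration of the intrinsic-versus-assembler trade-off, the sketch below shows a 16 x 16 to 32 bit multiply written in plain C, plus a count-leading-zeros helper that uses the GCC/Clang `__builtin_clz` where it applies and a portable loop elsewhere. The function names are made up for this example, and vendor compilers have their own differently named builtins, so take it as a pattern rather than as any particular toolchain's API.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Widening one operand before multiplying is usually enough for a compiler
 * to pick a single widening multiply where the core has one. */
static int32_t mul16x16(int16_t a, int16_t b)
{
    return (int32_t)a * b;
}

/* Count leading zeros of a 32-bit value. __builtin_clz(0) is undefined,
 * so zero is handled before the builtin is reached. */
static unsigned clz32(uint32_t x)
{
    if (x == 0u) {
        return 32u;
    }
#if defined(__GNUC__) && (UINT_MAX == 0xFFFFFFFFu)
    return (unsigned)__builtin_clz(x);       /* GCC/Clang, 32-bit int only */
#else
    unsigned n = 0u;                         /* portable fallback */
    while ((x & 0x80000000u) == 0u) {
        x <<= 1;
        n++;
    }
    return n;
#endif
}

/* Optional debug-build check that builtin and fallback agree on basics. */
static void intrinsics_selfcheck(void)
{
    assert(mul16x16(-2, 3) == -6);
    assert(clz32(1u) == 31u);
}
```

Hiding the builtin behind a small wrapper like this keeps the non-portable part in one place if the MCU or compiler ever changes.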

But I guess there are still some implementations that lack a lot of these intrinsics, in which case you have little choice.

andersm · « **Reply #63 on:** April 20, 2015, 07:38:51 pm »

Quote from: zirlou21 on April 20, 2015, 05:10:28 pm

...assembly is good when C cannot provide adequate control over hardware, e.g. bit manipulation.
There are times when you need to shift or rotate a byte or word of data but your MCU-compatible C compiler lacks a command for it. Some MCUs need an outside library to support it,
so inline assembly is the solution for this.

A decent compiler will synthesize most operations for you, as long as you express clearly what you want. If you use a less-known compiler or less-known architecture you may have to resort to intrinsics or inline assembly.

Code: [Select]

unsigned int rol(unsigned int a)
{
  return (a << 1)|(a >> (sizeof(a)*8-1));
}

unsigned int ror(unsigned int a)
{
  return (a >> 1)|(a << (sizeof(a)*8-1));
}


00000000 <rol>:
   0:	ea4f 70f0 	mov.w	r0, r0, ror #31
   4:	4770      	bx	lr
   6:	bf00      	nop

00000008 <ror>:
   8:	ea4f 0070 	mov.w	r0, r0, ror #1
   c:	4770      	bx	lr
   e:	bf00      	nop
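
The functions above rotate by exactly one bit. A rotate by a variable count needs a little more care, because shifting a 32-bit value by 32 is undefined behaviour in C. The following is a sketch of the usual masking idiom, with made-up function names; whether a given compiler turns it into a single rotate instruction is something to confirm in the disassembly, as was done above.

```c
#include <assert.h>  /* for the checks shown in the usage note */
#include <stdint.h>

/* Rotate left by n bits. Masking both shift counts keeps them in 0..31,
 * so n == 0 or n >= 32 never produces a shift by the full word width. */
static uint32_t rotl32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x << n) | (x >> ((32u - n) & 31u));
}

/* Rotate right by n bits, same idea mirrored. */
static uint32_t rotr32(uint32_t x, unsigned n)
{
    n &= 31u;
    return (x >> n) | (x << ((32u - n) & 31u));
}
```

For example, `assert(rotl32(0x80000001u, 1) == 0x00000003u);` holds, and `rotr32(rotl32(v, k), k)` gives back `v` for any `k`.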

nctnico · « **Reply #64 on:** April 20, 2015, 11:43:39 pm »

I agree. Thinking you can code better assembly than a C compiler is so 1990.

Howardlong · « **Reply #65 on:** April 27, 2015, 01:48:11 am »

Hmm, I might change my mind - slightly - on this. I've just spent the last two days optimising some DSP code on an LPC4370 (ARM M4F), and there are rare occasions when it can make sense.

I have two pieces of code I'm trying to squeeze the last cent out of, one is a polyphase decimator (the ARM CMSIS-DSP decimator is _not_ polyphase but the interpolator is, go figure) and the other is a quadrature NCO oscillator and mixer.

I first spent some time trying to make the C code work reasonably fast. Keep in mind that my CPU cycle budget was already over by over 200% when I started looking at this on Friday, short of overclocking the 200MHz part to 600MHz, something had to give. With carefully crafted idioms and some data restructuring, I got that down to about 50% over budget, an improvement, but not enough. The environment is LPCXpresso 7.7.2, so gcc is the compiler. I have now recoded both bottlenecks in inline assembler, and after several hours have managed to improve optimised C code for the polyphase decimator from 3.3Msps (input samples) to 6.25Msps. The quadrature oscillator and mixer is improved from 8.5Msps to 12.5Msps. For both, I am pretty sure I can squeeze out another 5-10%, there are some stalls that as yet I'm unable to explain away.

The approach on the C side was to move from flash to RAM, then carefully code up compiler idioms, and examine what's generated. There was at this point a fair bit of loop unrolling, and some restructuring of data to allow the use of LDM/STM/VLDM/VSTM multiple register loads and stores.

Then I rolled up my sleeves and embarked on the most assembler I've written for probably a couple of decades.

On the assembler side, the human has the benefit that they know the nature of the parameters and how functions are called. The compiler just doesn't know that. With this knowledge, the code can be carefully crafted to avoid pipeline stalling by interleaving operations by making adjacent instructions have non-dependent registers, combining previously separate processes into one, avoiding excessive load and store operations, and keeping as much data as possible in registers (for processors like ARM especially). Again, being able to adjust your data structures and unrolling loops to make use of the LDM/STM/VLDM/VSTM instructions and minimise loop overhead helped.

In addition, for the NCO, there was some register moving going on for a delay line (it's a pair of IIR filters), and by unrolling the loop and recoding it, the register moving could be avoided. Combining the quadrature mixer and the quadrature NCO into a single process also made some savings.

So in short, you can sometimes improve things with assembler fairly significantly, but it's a bit of a last resort as maintenance is almost always going to be a head scratcher. Irrespective, to make any headway, even if you don't ever write any assembler, eventually you may well find that you need to understand the way the CPU works at least to the extent of being able to roughly follow the disassembled version in the debugger to be able to resolve a performance related problem.

Don't try this at home kids...

Code: [Select]

		__asm__ __volatile__
		(
			"\n\t"

			"vldm %[NCO],{s0-s7}	\n\t" // Initialise NCOs into registers

			"mov r6,#3				\n\t" // Divide by 3
			"udiv r5,%[NUMSAMPLES],r6	\n\t" // Result in R5 // ***Check this udiv, takes 160ns, needs work
			"mls r6,r6,r5,%[NUMSAMPLES]	\n\t" // Remainder in R6

			"cbz r5,loopexit1	\n\t"

			// Calculate NCOs, three at a time to avoid shuffling registers about
			"\nloop1:			\n\t"

			"vldm %[IN]!,{s10-s12}	\n\t" // Get next three samples

			"vmul.f32 s4,s6,s0	\n\t" // Iy0=A1*Iy1 // s2=Iy0, s0=Iy1, s1=Iy2 // NCO
			"vmul.f32 s5,s6,s1	\n\t" // Qy0=A1*Qy1
			"vmul.f32 s8,s7,s2	\n\t" // s8=A2*Iy2
			"vmul.f32 s9,s7,s3	\n\t" // s9=A2*Qy2
			"vadd.f32 s4,s4,s8	\n\t" // Iy0=A1*Iy1 + A2*Iy2
			"vadd.f32 s5,s5,s9	\n\t" // Qy0=A1*Qy1 + A2*Qy2

			"vmul.f32 s13,s4,s10	\n\t" // s13=Iy0*In[0] // Mixer
			"vmul.f32 s16,s5,s10	\n\t" // s16=Qy0*In[0]

			"vmul.f32 s2,s6,s4	\n\t" // Iy0=A1*Iy1 // s1=Iy0, s2=Iy1, s0=Iy2 // NCO
			"vmul.f32 s3,s6,s5	\n\t" // Qy0=A1*Qy1
			"vmul.f32 s8,s7,s0	\n\t" // s8=A2*Iy2
			"vmul.f32 s9,s7,s1	\n\t" // s9=A2*Qy2
			"vadd.f32 s2,s2,s8	\n\t" // Iy0=A1*Iy1 + A2*Iy2
			"vadd.f32 s3,s3,s9	\n\t" // Qy0=A1*Qy1 + A2*Qy2

			"vmul.f32 s14,s2,s11	\n\t" // s14=Iy0*In[1] // Mixer
			"vmul.f32 s17,s3,s11	\n\t" // s17=Qy0*In[1]

			"vmul.f32 s0,s6,s2	\n\t" // Iy0=A1*Iy1 // s0=Iy0, s1=Iy1, s2=Iy2 // NCO
			"vmul.f32 s1,s6,s3	\n\t" // Qy0=A1*Qy1
			"vmul.f32 s8,s7,s4	\n\t" // s8=A2*Iy2
			"vmul.f32 s9,s7,s5	\n\t" // s9=A2*Qy2
			"vadd.f32 s0,s0,s8	\n\t" // Iy0=A1*Iy1 + A2*Iy2
			"vadd.f32 s1,s1,s9	\n\t" // Qy0=A1*Qy1 + A2*Qy2

			"vmul.f32 s15,s0,s12	\n\t" // s15=Iy0*In[2] // Mixer
			"vmul.f32 s18,s1,s12	\n\t" // s18=Qy0*In[2]

			"vstm %[OUTI]!,{s13-s15}	\n\t" // Write the complex downconverted sample out
			"vstm %[OUTQ]!,{s16-s18}	\n\t"

			"subs r5,#1		\n\t" // Loop counter
			"bne loop1			\n\t"

			"\nloopexit1:		\n\t"
			"cbz r6,loopexit2	\n\t"

			"\nloop2:			\n\t" // This is the non-unrolled version for stragglers when num samples isn't divisible by 3

			"vldm %[IN]!,{s10}	\n\t" // Get next sample

			"vmov.f32 s4,s2		\n\t" // Iy2=Iy1 // Interleave I and Q instructions to prevent stalling // NCO
			"vmov.f32 s5,s3		\n\t" // Qy2=Qy1
			"vmov.f32 s2,s0		\n\t" // Iy1=Iy0
			"vmov.f32 s3,s1		\n\t" // Qy1=Qy0
			"vmul.f32 s0,s6,s2	\n\t" // Iy0=A1*Iy1
			"vmul.f32 s1,s6,s3	\n\t" // Qy0=A1*Qy1
			"vmul.f32 s8,s7,s4	\n\t" // s8=A2*Iy2
			"vmul.f32 s9,s7,s5	\n\t" // s9=A2*Qy2
			"vadd.f32 s0,s0,s8	\n\t" // Iy0=A1*Iy1 + A2*Iy2
			"vadd.f32 s1,s1,s9	\n\t" // Qy0=A1*Qy1 + A2*Qy2

			"vmul.f32 s8,s0,s10	\n\t" // s8=Iy0*In[0] // Mixer
			"vmul.f32 s9,s1,s10	\n\t" // s9=Qy0*In[0]

			"vstm %[OUTI]!,{s8}		\n\t" // Write the complex downconverted sample out
			"vstm %[OUTQ]!,{s9}	\n\t"

			"subs r6,#1			\n\t" // Loop counter
			"bne loop2				\n\t"

			"\nloopexit2:			\n\t"

			"vstm %[NCO],{s0-s5}	\n\t" // Store the NCO variables back

				: [OUTI]"+r" (pstOutI), [OUTQ]"+r" (pstOutQ), [IN]"+r" (pstIn)
				: [NCO]"r" (pncos), [NUMSAMPLES]"r" (nNumSamples)
		 : "r5","r6","s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9","s10","s11","s12","s13","s14", "s15", "s16", "s17", "s18"

		);
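
For anyone who doesn't read Thumb-2, the one-sample-at-a-time path (`loop2`) corresponds roughly to the plain C below. This is a reference sketch reconstructed from the register comments, not the original C code: the struct layout, the names and the use of `float` for the sample type are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout, inferred from the register comments (s0..s7). */
typedef struct {
    float iy0, qy0;  /* newest oscillator outputs (s0, s1) */
    float iy1, qy1;  /* previous outputs          (s2, s3) */
    float iy2, qy2;  /* outputs before that       (s4, s5) */
    float a1, a2;    /* IIR coefficients          (s6, s7) */
} nco_t;

static void downconvert_ref(nco_t *nco, const float *in,
                            float *out_i, float *out_q, size_t n)
{
    assert(nco != NULL && in != NULL && out_i != NULL && out_q != NULL);
    for (size_t k = 0; k < n; k++) {
        /* Shift the delay line (the vmov instructions in loop2). */
        nco->iy2 = nco->iy1;  nco->qy2 = nco->qy1;
        nco->iy1 = nco->iy0;  nco->qy1 = nco->qy0;
        /* Two-tap recurrence for each oscillator: y0 = A1*y1 + A2*y2. */
        nco->iy0 = nco->a1 * nco->iy1 + nco->a2 * nco->iy2;
        nco->qy0 = nco->a1 * nco->qy1 + nco->a2 * nco->qy2;
        /* Mixer: multiply the real input by the complex oscillator. */
        out_i[k] = nco->iy0 * in[k];
        out_q[k] = nco->qy0 * in[k];
    }
}
```

A version like this is handy as a golden model: feed the same input to both and compare outputs whenever the assembler is touched. The unrolled `loop1` computes the same recurrence but rotates the roles of the registers instead of moving data between them.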

nctnico · « **Reply #66 on:** April 27, 2015, 02:00:01 am »

Did you look into the SIMD instructions? Those should reduce the instruction fetch/decode/execution overhead.

Howardlong · « **Reply #67 on:** April 27, 2015, 02:09:00 am »

Quote from: nctnico on April 27, 2015, 02:00:01 am

Did you look into the SIMD instructions? Those should reduce the instruction fetch/decode/execution overhead.

Yes, they were of no use cycle-wise in this case, at the end of the day you're limited to 32 bits of data per cycle on the M4, and my samples are 32 bit.

I even coded it up in fixed point, but due to the lack of registers compared to the fpu, there was almost no cycle benefit going that way.

I did learn one thing, the floating point MAC instruction takes longer than carefully crafted separate multiply and add because the MAC stalls itself.

westfw · « **Reply #68 on:** April 27, 2015, 07:32:01 am »

Where are you getting the detailed info about register stalls and such? I would have thought that a MAC instruction would be carefully designed NOT to stall itself...

andersm · « **Reply #69 on:** April 27, 2015, 08:00:07 am »

The Cortex-M4 Technical Reference Manual gives one cycle each for VADD.F32 and VMUL.F32, and three cycles for VMLA.F32 and VFMA.F32.

Howardlong · « **Reply #70 on:** April 27, 2015, 09:49:56 am »

Quote from: westfw on April 27, 2015, 07:32:01 am

Where are you getting the detailed info about register stalls and such? I would have thought that a MAC instruction would be carefully designed NOT to stall itself...

You'd have thought wouldn't you? It is single cycle for fixed point. Not sure what the point of having an FP MAC instruction is TBH if it's not faster than separate instructions.

I've learned a lot in the last 48 hours.

Howardlong · « **Reply #71 on:** April 27, 2015, 10:31:52 am »

So here's a test for you ARM assembler aficionados that's had me scratching my head for a day or so now. Why does the code segment below, between *** START *** and *** END ***, take 231ns/4.9ns = 47 cycles? I count 38, inclusive of one of the two GPIO twiddles: the three 3-element VLDM/VSTM take 4 cycles each (12 total), the 24 VFP instructions take 24 cycles total, and one STRB for one of the two GPIOs takes 2 cycles, so 38 total.

Code is in RamLoc72, data is in RamLoc128, scope shows it to be consistent 231ns, no jitter, there are no ISRs or DMA or M0 cores running, this is it.

Edit: data is word (32 bit) aligned.

Code: [Select]

__RAMFUNC(RAM2) static void DownConvert3(NCOSTRUCT *pncos,SAMPLETYPE *pstIn,SAMPLETYPE *pstOutI,SAMPLETYPE *pstOutQ, int nNumSamples)
{

	int x=0x400F4000; // For twiddling diagnostic bits

	__asm__ __volatile__
	(
		"\n\t"
		"movs r2,#0					\n\t" // Literals for GPIO performance diagnostics
		"movs r3,#1					\n\t"

		"vldm %[NCO],{s0-s7}		\n\t" // Initialise NCOs into registers

		"mov r6,#3					\n\t" // Divide by 3: we try to do three samples unrolled
		"udiv r5,%[NUMSAMPLES],r6	\n\t" // Result in R5 // Check this udiv, takes 160ns
		"mls r6,r6,r5,%[NUMSAMPLES]	\n\t" // Remainder in R6


		"cbz r5,loopexit1			\n\t" // If there are less than three samples, do it one at a time	

		// Calculate NCOs, three at a time to avoid shuffling registers about
		"\nloop1:					\n\t"
			
"strb.w r3,[%[X],#100]	\n\t" // GPIO on
/****** START ******/

		"vldm %[IN]!,{s10-s12}		\n\t" // Load up next three samples (oscillators are IIRs with three sample delay)

		"vmul.f32 s4,s6,s0			\n\t" // Iy0=A1*Iy1 // s2=Iy0, s0=Iy1, s1=Iy2
		"vmul.f32 s5,s6,s1			\n\t" // Qy0=A1*Qy1
		"vmul.f32 s8,s7,s2			\n\t" // s8=A2*Iy2
		"vmul.f32 s9,s7,s3			\n\t" // s9=A2*Qy2
		"vadd.f32 s4,s4,s8			\n\t" // Iy0=A1*Iy1 + A2*Iy2
		"vadd.f32 s5,s5,s9			\n\t" // Qy0=A1*Qy1 + A2*Qy2

		"vmul.f32 s13,s4,s10		\n\t" // s13=Iy0*In[0]
		"vmul.f32 s16,s5,s10		\n\t" // s16=Qy0*In[0]

		"vmul.f32 s2,s6,s4			\n\t" // Iy0=A1*Iy1 // s1=Iy0, s2=Iy1, s0=Iy2
		"vmul.f32 s3,s6,s5			\n\t" // Qy0=A1*Qy1
		"vmul.f32 s8,s7,s0			\n\t" // s8=A2*Iy2
		"vmul.f32 s9,s7,s1			\n\t" // s9=A2*Qy2
		"vadd.f32 s2,s2,s8			\n\t" // Iy0=A1*Iy1 + A2*Iy2
		"vadd.f32 s3,s3,s9			\n\t" // Qy0=A1*Qy1 + A2*Qy2

		"vmul.f32 s14,s2,s11		\n\t" // s14=Iy0*In[1]
		"vmul.f32 s17,s3,s11		\n\t" // s17=Qy0*In[1]

		"vmul.f32 s0,s6,s2			\n\t" // Iy0=A1*Iy1 // s0=Iy0, s1=Iy1, s2=Iy2
		"vmul.f32 s1,s6,s3			\n\t" // Qy0=A1*Qy1
		"vmul.f32 s8,s7,s4			\n\t" // s8=A2*Iy2
		"vmul.f32 s9,s7,s5			\n\t" // s9=A2*Qy2
		"vadd.f32 s0,s0,s8			\n\t" // Iy0=A1*Iy1 + A2*Iy2
		"vadd.f32 s1,s1,s9			\n\t" // Qy0=A1*Qy1 + A2*Qy2

		"vmul.f32 s15,s0,s12		\n\t" // s15=Iy0*In[2]
		"vmul.f32 s18,s1,s12		\n\t" // s18=Qy0*In[2]

		"vstm %[OUTI]!,{s13-s15}	\n\t" // Store results
		"vstm %[OUTQ]!,{s16-s18}	\n\t"
			
"strb.w r2,[%[X],#100]	\n\t" // GPIO off
/****** END ******/

		"subs r5,#1					\n\t" // loop until no more
		"bne loop1					\n\t"

		// Here we deal with the "stragglers", ie samples beyond those divisible by three
		"\nloopexit1:				\n\t"
		"cbz r6,loopexit2			\n\t" // Nothing left to do, jump over

		"\nloop2:					\n\t"

		"vldm %[IN]!,{s10}			\n\t" // Load up next sample

		"vmov.f32 s4,s2				\n\t" // Iy2=Iy1 // Interleave I and Q instructions to prevent stalling
		"vmov.f32 s5,s3				\n\t" // Qy2=Qy1
		"vmov.f32 s2,s0				\n\t" // Iy1=Iy0
		"vmov.f32 s3,s1				\n\t" // Qy1=Qy0
		"vmul.f32 s0,s6,s2			\n\t" // Iy0=A1*Iy1
		"vmul.f32 s1,s6,s3			\n\t" // Qy0=A1*Qy1
		"vmul.f32 s8,s7,s4			\n\t" // s8=A2*Iy2
		"vmul.f32 s9,s7,s5			\n\t" // s9=A2*Qy2
		"vadd.f32 s0,s0,s8			\n\t" // Iy0=A1*Iy1 + A2*Iy2
		"vadd.f32 s1,s1,s9			\n\t" // Qy0=A1*Qy1 + A2*Qy2

		"vmul.f32 s8,s0,s10			\n\t" // s8=Iy0*In[0]
		"vmul.f32 s9,s1,s10			\n\t" // s9=Qy0*In[0]

		"vstm %[OUTI]!,{s8}			\n\t" // Store results
		"vstm %[OUTQ]!,{s9}			\n\t"

		"subs r6,#1					\n\t" // loop until no more
		"bne loop2					\n\t"

		"\nloopexit2:				\n\t"

		"vstm %[NCO],{s0-s5}		\n\t" // Save NCO state

			: [OUTI]"+r" (pstOutI), [OUTQ]"+r" (pstOutQ), [IN]"+r" (pstIn)
			: [NCO]"r" (pncos), [NUMSAMPLES]"r" (nNumSamples), [X]"r" (x)
			: "r2","r3","r5","r6","s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8", "s9","s10","s11","s12","s13","s14", "s15", "s16", "s17", "s18"

	);
}

nctnico · « **Reply #72 on:** April 27, 2015, 02:04:15 pm »

I'd start by checking the clock frequency and then remove most of the instructions.

AndyC_772 · « **Reply #73 on:** April 27, 2015, 02:13:05 pm »

How long does it take if you duplicate all the instructions between setting the GPIO and clearing it? Does executing the code twice take an extra 47 cycles, or 36, or somewhere in between?

Jeroen3 · « **Reply #74 on:** April 27, 2015, 03:28:24 pm »

There are several reasons why assembler code that is assumed to take 38 cycles takes 47 clocks. A few of them are:
- Bus wait states, for when you're accessing slower-clocked domains. Such as GPIO or anything on lower-clocked APB.
- Flash wait states, remember flash isn't 32 bit wide, but mostly 128 bit, so multiple instructions fit one flash fetch, and you can get out-of-sync with your gpio set/reset. Refer to memory barriers for this.
- Flash prefetching/caching. Characteristics are highly hardware dependent.

Measuring execution time using GPIO is ambiguous. Compare it with the CCNT to see what happens.
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0211i/Bihcgfcf.html (demo url, not sure what hardware you have, but this register exists in most arms)

If your main goal is to program a fast assembly routine, create a test kit in a simulator with an ideal environment: no bus waits, no flash waits, no prefetching, no pipelining. This way you only test your code, without also measuring hardware effects that change along with your code.
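
On Cortex-M3/M4 cores the cycle counter suggested above is, as far as I know, the DWT block's CYCCNT register rather than the CCNT in the linked ARM11 document. A hedged sketch of using it follows. The struct and function names are invented for illustration, the registers are passed in as pointers so the logic can be exercised on a PC, and the 204 MHz core clock used in the usage note is an assumption based on the 4.9ns figure in the question.

```c
#include <assert.h>
#include <stdint.h>

/* Cortex-M3/M4 debug registers, addresses as I understand them:
 *   DEMCR      0xE000EDFC, bit 24 = TRCENA
 *   DWT_CTRL   0xE0001000, bit 0  = CYCCNTENA
 *   DWT_CYCCNT 0xE0001004 */
typedef struct {
    volatile uint32_t *demcr;
    volatile uint32_t *dwt_ctrl;
    volatile uint32_t *dwt_cyccnt;
} cyc_regs_t;

/* Enable tracing, zero the counter and start it. */
static void cyc_enable(const cyc_regs_t *r)
{
    assert(r && r->demcr && r->dwt_ctrl && r->dwt_cyccnt);
    *r->demcr     |= 1u << 24;  /* TRCENA: enable the DWT block */
    *r->dwt_cyccnt = 0u;
    *r->dwt_ctrl  |= 1u << 0;   /* CYCCNTENA: count core clock cycles */
}

/* Cycles between two CYCCNT reads; unsigned subtraction absorbs one wrap. */
static uint32_t cyc_elapsed(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a scope reading in ns to core cycles, rounded to nearest. */
static uint32_t ns_to_cycles(uint32_t ns, uint32_t core_hz)
{
    return (uint32_t)(((uint64_t)ns * core_hz + 500000000u) / 1000000000u);
}
```

On the target, reading `*regs.dwt_cyccnt` immediately before and after the block under test and passing both values to `cyc_elapsed()` gives a count that does not depend on how long the GPIO write itself takes. As a cross-check on the scope figure, `ns_to_cycles(231u, 204000000u)` evaluates to 47. Not every Cortex-M implementation includes CYCCNT, so check the part's documentation first.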
