Author Topic: Learning the STM32H745ZI dual core microcontroller (Read 3052 times)

PDP-1 · « **on:** June 22, 2024, 02:46:51 am »

Intro
I'm working on bringing up a dual core STM32H745 chip on my own custom PCB, starting with developing code on a Nucleo-H745ZI-Q dev board where I hope I can trust that the hardware works and then porting over to my board once the code is in a workable state. I'm not really a software guy by training and have only muddled around with the F429 series of STM before so I thought it might be useful to post the Nucleo code on GitHub and talk about it here in case anyone wants to follow along and give advice/feedback. I've also never used GitHub before so that will be a learning project too!

My toolchain consists of Visual Studio with the VisualGDB plug-in to run the programmer/debugger. I'm working mostly at a bare metal register level because I find that I often spend as much time trying to figure out what the auto-generated HAL code is doing as I would spend by just RTFM and working it out on my own. I do sometimes have STMCube spit out some code and then comb through it and reduce it down from pages of text to the few lines that actually do something.

Anyway, I got the Nucleo board to Blinky state on both cores today and made my first ever two GitHub repos, one for the M4 core and one for the M7 core. So far so good!

Processor Overview
The STM32H745ZI has three main parts:

A Cortex M7 core that can run up to 480MHz with double-precision floating point math
A Cortex M4 core that can run up to 240MHz with single-precision floating point math
A region of shared memory that both cores can access to talk to each other

My reason for choosing this processor is that I'd like to have the M7 core run in almost interrupt-less mode running a hardware control algorithm in a very deterministic way while the M4 core takes care of all of the messy and unpredictable stuff like talking to the outside world and monitoring the data coming out of the M7 to make sure the system is running as intended. I was a bit worried about if I could pull that off with a single core F429 processor, plus the double precision floating point unit on the M7 really eases some concerns about loss of precision when the control loop is running fast.

Getting to Blinky x2
Sysprogs, the developers of VisualGDB, provide a guide on how to work with dual core processors. Basically it involves running two instances of Visual Studio, each containing a project for one of the cores. This works but doesn't feel like the best solution since (a) it's a bit cumbersome swapping between the instances all the time and (b) when we get to the point where the two processors start talking to each other they'll need to agree on what the messages being passed back and forth mean, and that will require somehow coordinating a shared set of code files between the projects. I tried to make one Visual Studio solution containing three projects for the M4/M7/common stuff and it kind-of worked but the debugger often got confused between the two cores and crashed. So two instances of Visual Studio it is for now.

I made two instances of Visual Studio, made two empty code projects, and pulled in all of the startup files/linker scripts/etc. into each for the relevant core. (I really hate relying on outside libraries or referenced files to maintain that stuff, I like to have it all under my control.) I pulled in my own GPIO drivers from another project and had each core drive a different LED on the Nucleo on and off in a very simple way. It worked!

Next up will be starting the chip up for real, ramping up the clock tree, etc.

jnk0le · « **Reply #1 on:** June 22, 2024, 01:12:07 pm »

Quote from: PDP-1 on June 22, 2024, 02:46:51 am

My reason for choosing this processor is that I'd like to have the M7 core run in almost interrupt-less mode running a hardware control algorithm in a very deterministic way while the M4 core takes care of all of the messy and unpredictable stuff like talking to the outside world and monitoring the data coming out of the M7 to make sure the system is running as intended. I was a bit worried about if I could pull that off with a single core F429 processor,

Jitter wise cortex-m7 is less predictable/deterministic than cortex-m4.
Also, interrupts on cortexm are quite deterministic with a bit of jitter from tail chaining, late arrival, etc. and of course, more from memory subsystem. (caches, waitstates etc.)

Quote from: PDP-1 on June 22, 2024, 02:46:51 am

plus the double precision floating point unit on the M7 really eases some concerns about loss of precision when the control loop is running fast.

https://eprint.iacr.org/2022/405.pdf

Quote

The non-constant timeness was clearly observed when generating two random
double-precision values for addition, with an average runtime of 16 clock cycles
and standard deviation of 4.1. However, when we generated random values in
the same range such they had the same exponents, the runtimes were constant
and consistant at 10 clock cycles. Moreover, when we mixed randomness from
two fixed exponent ranges we observed constant and consistant runtimes of 19
clock cycles.

Now your determinism goes out of the window.

PDP-1 · « **Reply #2 on:** June 22, 2024, 04:02:54 pm »

Quote from: jnk0le on June 22, 2024, 01:12:07 pm

Jitter wise cortex-m7 is less predictable/deterministic than cortex-m4.

Interesting, is there any known reason why that is? Maybe just the extra complexity of the M7 requiring more clock cycles to coordinate across all the different clock domains inside the chip?

Quote from: jnk0le on June 22, 2024, 01:12:07 pm

Also, interrupts on cortexm are quite deterministic with a bit of jitter from tail chaining, late arrival, etc. and of course, more from memory subsystem. (caches, waitstates etc.)

I had concerns about the memory latency, this chip has a semi complex memory layout with a bunch of potential bus masters on each one. If you got a cache miss and had to go get some data and someone else owned the bus you'd get stuck waiting for a while.

The M7 core does have a reasonably large DTCM and ITCM area at 64k each. My guess at this early stage in development is that they will be more than large enough to hold all of the time critical runtime code which should help a lot. There will always be times when you need to go access the other shared memory areas though.

Quote from: jnk0le on June 22, 2024, 01:12:07 pm

[floating point timing] Now your determinism goes out of the window.

This is good info, thanks! I did a quick estimate assuming the M7 is running at 480MHz, the loop is iterating at 100kHz, and we do 100 floating point operations per iteration. With those numbers if you got all 'good' calculations at 10 cycles, vs getting all 'bad' calculations at 19 cycles you have a range of spending 20-40% of your loop iteration time in the FPU. In the realistic case you'd get some good and some bad exponents so you'd be jittering around within that range, but since the calculations you'd be doing every cycle would be the same the real jitter would likely be over a narrower band. Still, that isn't insignificant and is definitely a thing to watch.
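
The estimate above can be written out as a few lines of arithmetic. This is only a sketch of the calculation; the function and constant names are made up for illustration, and it assumes every floating point operation costs the same number of cycles:

```cpp
#include <cassert>

// Fraction of each control-loop period spent in the FPU, assuming every
// floating point operation costs 'cyclesPerOp' cycles.
// Illustrative helper only, not from any library.
constexpr double fpuLoadFraction(double coreHz, double loopHz,
                                 unsigned opsPerLoop, unsigned cyclesPerOp)
{
    // 480e6 / 100e3 = 4800 core cycles available per loop iteration
    return (opsPerLoop * cyclesPerOp) / (coreHz / loopHz);
}

// 100 operations at 10 cycles each: 1000 of 4800 cycles
constexpr double bestCase  = fpuLoadFraction(480e6, 100e3, 100, 10);
// 100 operations at 19 cycles each: 1900 of 4800 cycles
constexpr double worstCase = fpuLoadFraction(480e6, 100e3, 100, 19);

static_assert(bestCase  > 0.20 && bestCase  < 0.21, "about 21% of the loop");
static_assert(worstCase > 0.39 && worstCase < 0.40, "about 40% of the loop");
```

The two cycle counts are the 10 and 19 cycle figures from the quoted paper; everything else is the 480MHz / 100kHz / 100 operations assumption stated above.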

jnk0le · « **Reply #3 on:** June 22, 2024, 05:28:42 pm »

Quote from: PDP-1 on June 22, 2024, 04:02:54 pm

Quote from: jnk0le on June 22, 2024, 01:12:07 pm
Jitter wise cortex-m7 is less predictable/deterministic than cortex-m4.
Interesting, is there any known reason why that is? Maybe just the extra complexity of the M7 requiring more clock cycles to coordinate across all the different clock domains inside the chip?

Mostly branch predictor.

Accessing uncached peripherals in lower clocked domains increases the (clock relative) latency.

Quote from: PDP-1 on June 22, 2024, 04:02:54 pm

Quote from: jnk0le on June 22, 2024, 01:12:07 pm
Also, interrupts on cortexm are quite deterministic with a bit of jitter from tail chaining, late arrival, etc. and of course, more from memory subsystem. (caches, waitstates etc.)
I had concerns about the memory latency, this chip has a semi complex memory layout with a bunch of potential bus masters on each one. If you got a cache miss and had to go get some data and someone else owned the bus you'd get stuck waiting for a while.

The M7 core does have a reasonably large DTCM and ITCM area at 64k each. My guess at this early stage in development is that they will be more than large enough to hold all of the time critical runtime code which should help a lot. There will always be times when you need to go access the other shared memory areas though.

M4 can use D2 memories as its own ITCM/DTCM

Quote from: PDP-1 on June 22, 2024, 04:02:54 pm

Quote from: jnk0le on June 22, 2024, 01:12:07 pm
[floating point timing] Now your determinism goes out of the window.
This is good info, thanks! I did a quick estimate assuming the M7 is running at 480MHz, the loop is iterating at 100kHz, and we do 100 floating point operations per iteration. With those numbers if you got all 'good' calculations at 10 cycles, vs getting all 'bad' calculations at 19 cycles you have a range of spending 20-40% of your loop iteration time in the FPU. In the realistic case you'd get some good and some bad exponents so you'd be jittering around within that range, but since the calculations you'd be doing every cycle would be the same the real jitter would likely be over a narrower band. Still, that isn't insignificant and is definitely a thing to watch.

Note that this paper didn't explicitly state anything about denormals, which are usually expected to be even worse than "different exponents".

Sal Ammoniac · « **Reply #4 on:** June 22, 2024, 06:37:11 pm »

Have you looked into ST's own development tool, STM32CubeIDE? I'd suspect they have a better dual core debugging solution than VisualGDB.

I haven't looked into this option yet myself, but I do have a Nucleo-H745ZI-Q sitting around waiting for me to get to it and I'll be following this thread closely.

PDP-1 · « **Reply #5 on:** June 24, 2024, 05:57:50 pm »

Quote from: Sal Ammoniac on June 22, 2024, 06:37:11 pm

Have you looked into ST's own development tool, STM32CubeIDE? I'd suspect they have a better dual core debugging solution than VisualGDB.

I haven't looked into this option yet myself, but I do have a Nucleo-H745ZI-Q sitting around waiting for me to get to it and I'll be following this thread closely.

STM32CubeIDE does have a way to do dual core debugging. I haven't tried it but there is a Controllers Tech YouTube video on how to do it. I just generally like using Visual Studio more than CubeIDE, so once I learned that VisualGDB existed I chose to use it as my main tool.

Coordinated Core Startup
By default both the M4 and M7 cores start at power-on, but you may want to turn them on in a more controlled way like having only the M4 core start so it can initialize the power, clock tree, etc., before enabling the M7. This type of control can be accomplished via the non-volatile option byte SYSCFG->UR1 and the bits BCM4 and BCM7 that it contains (Boot Cortex M4/7) which are zero by default. If only one of these bits is set to one and the other is zero, only the core with its BCMx bit set to one will have a clock at power-on. If both of the bits are set to one or both are set to zero then both cores have clocks at power-on (according to AN5557, Figure 13).

There is a second mechanism to control the core clocks in RCC->GCR, bits BOOT_C2 and BOOT_C1, both zero by default (C2=M4, C1=M7). If the bits are zero control of the core clocks is given to the BCMx bits, but if one of the BOOT_Cx bits is set to one the associated core clock is turned on regardless of the state of BCMx.

So basically if you want to only have the M4 running at power-on, set BCM4=1 and BCM7=0 in the option bytes and reboot the chip. Let the M4 configure everything and then set the BOOT_C1 bit to 1 to enable the M7 when you are ready for it.

I tried doing it this way and didn't really care for the result too much because blocking the core clocks interfered with the debugger ports on my system. Instead I just mis-used the BOOT_Cx bits with both core clocks on. The M7 wakes up and just spin-waits until it sees that BOOT_C1 goes from the default value of zero to one. Meanwhile the M4 wakes up and configures everything, sets BOOT_C1 to one and then it spin-waits while looking at the value of BOOT_C2. The M7 can now do whatever configuration it needs and then sets BOOT_C2 to one to release the M4 and end the initialization phase with both cores configured and running.
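
The spin-wait handshake described above can be stepped through on a PC with a stand-in for RCC->GCR. This is a rough sketch, not the code from the repos: the FakeRcc struct, the Phase enum and the step functions are invented for illustration, and the two bit positions are assumptions that should be replaced by the masks from the device header on real hardware.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for RCC->GCR so the sequence can be followed without hardware.
struct FakeRcc { volatile uint32_t GCR = 0; };
constexpr uint32_t BOOT_C1 = 1u << 2;   // assumed bit position, used as "M7 may proceed"
constexpr uint32_t BOOT_C2 = 1u << 3;   // assumed bit position, used as "M4 may proceed"

enum class Phase { Waiting, Configuring, Running };

// One polling step of the M4 start-up: configure first, release the M7,
// then wait for the M7 to signal back.
inline Phase stepM4(FakeRcc& rcc, Phase phase)
{
    switch (phase) {
    case Phase::Configuring:              // power, flash and clock set-up would go here
        rcc.GCR = rcc.GCR | BOOT_C1;      // release the M7
        return Phase::Waiting;
    case Phase::Waiting:
        return (rcc.GCR & BOOT_C2) ? Phase::Running : Phase::Waiting;
    default:
        return Phase::Running;
    }
}

// One polling step of the M7 start-up: wait for the M4, configure, signal back.
inline Phase stepM7(FakeRcc& rcc, Phase phase)
{
    switch (phase) {
    case Phase::Waiting:
        return (rcc.GCR & BOOT_C1) ? Phase::Configuring : Phase::Waiting;
    case Phase::Configuring:              // M7-specific set-up would go here
        rcc.GCR = rcc.GCR | BOOT_C2;      // release the M4
        return Phase::Running;
    default:
        return Phase::Running;
    }
}
```

The M4 starts in Phase::Configuring and the M7 in Phase::Waiting; calling the two step functions in any interleaving ends with both cores in Phase::Running, and the M7 never configures before the M4 has finished.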

PDP-1 · « **Reply #6 on:** June 24, 2024, 08:19:20 pm »

Turning up the clock speed
I want to turn the clocks to both cores up as far as they are designed for, and in general this involves three steps:
   1) Configure the internal power regulator to run at the largest voltage possible.
   2) Configure the flash memory wait states for the clock frequency you plan to use
   3) Actually turn up the clock speed.
The clock tree on this chip is huge so I'm going to just focus on the first two items for now.

Power Regulator
The chip has four possible power regulator voltage output settings, VOS3 is the lowest voltage and VOS0 is the highest. At startup the chip defaults to VOS3 at about 1.0V which can support a maximum core clock of 200MHz (see data sheet DS12923, Table 23. p. 108). VOS2 is 1.1V and 300MHz, VOS1 is 1.2V and 400MHz, and VOS0 is 1.35V and 480MHz but comes with some caveats.
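
As a quick reference, the voltage scale needed for a given core clock can be looked up like this. It is a sketch that only encodes the four frequency limits quoted above; the enum and helper names are invented for illustration and are not part of any ST library.

```cpp
#include <cassert>
#include <cstdint>

// Voltage scale levels, lowest core voltage first.
enum class Vos : uint8_t { VOS3, VOS2, VOS1, VOS0 };

// True if the requested core clock is within the 480MHz limit quoted for VOS0.
constexpr bool isSupportedClock(uint32_t coreMHz) { return coreMHz <= 480u; }

// Returns the lowest-voltage scale that supports the requested core clock,
// using the limits quoted from DS12923, Table 23.
constexpr Vos lowestVosFor(uint32_t coreMHz)
{
    return (coreMHz <= 200u) ? Vos::VOS3     // about 1.0V
         : (coreMHz <= 300u) ? Vos::VOS2     // 1.1V
         : (coreMHz <= 400u) ? Vos::VOS1     // 1.2V
         :                     Vos::VOS0;    // 1.35V, up to 480MHz
}

static_assert(lowestVosFor(400) == Vos::VOS1, "400MHz needs VOS1");
```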

The chip also has two power regulation systems that can be used either together or separately, one is an onboard switch mode regulator and the other is an onboard linear LDO regulator. You can also disable both and run off of an external regulator, bypassing the internal regulators altogether. Figure 22 in reference manual RM0399 summarizes all of the possible configurations.

The Nucleo board is shipped configured to run with the SMPS only, no LDO (option 2 in RM0399, Figure 22). There are solder bridges that can be moved around to get other configurations. Unfortunately this default setup limits us to being able to get to VOS1 on the Nucleo board since we need the LDO on to get to VOS0. This will limit the max clock speed to 400MHz unless we modify the board. Set up the code to turn LDO off, go to VOS1, and wait until everything is stable:

Code: [Select]

void pwr_init(void)
{
	/* The nucleo board is connected with a direct SMPS step down converter supply (option 2 in RM0399 Figure 22), and this 
	 * causes the hardware to ignore attempts to set some register values, e.g. LDOEN. This has the effect of also limiting 
	 * the clock speed to 400MHz max at voltage VOS1 (DS12923, Table 23). */
	
	uint32_t timeout = 0xFFFF;

	CLEAR_BIT(PWR->CR3, PWR_CR3_LDOEN);									// turn off LDO, SMPS only
	MODIFY_REG(PWR->D3CR, PWR_D3CR_VOS_Msk, PWR_D3CR_VOS_1 | PWR_D3CR_VOS_0);// set VOS scale 1
	while ((!READ_BIT(PWR->D3CR, PWR_D3CR_VOSRDY)) && (timeout>0)) { timeout--; }	// wait for the voltage to stabilize	
	if (timeout == 0) { SYS_ERROR("VOSRDY timeout"); }
}
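
The wait-with-timeout pattern used here comes up again for every clock and regulator flag, so it could be pulled into a small helper. A minimal sketch, assuming a hypothetical waitForBitSet function (not something from the repo or the ST headers):

```cpp
#include <cassert>
#include <cstdint>

// Polls a register until any bit in 'mask' reads as set, or until 'maxPolls'
// reads have been made. Returns true if the bit was seen, false on timeout.
// On target 'reg' would be a peripheral register such as PWR->D3CR.
inline bool waitForBitSet(const volatile uint32_t& reg, uint32_t mask, uint32_t maxPolls)
{
    for (uint32_t i = 0; i < maxPolls; ++i) {
        if ((reg & mask) != 0u) { return true; }
    }
    return false;
}
```

On target the call would look something like if (!waitForBitSet(PWR->D3CR, PWR_D3CR_VOSRDY, 0xFFFF)) { SYS_ERROR("VOSRDY timeout"); }. Returning a bool avoids having to infer success from the leftover counter value.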

Flash setup
Table 16 in RM0399 shows the number of flash wait states and the programming delay parameter settings for different VOS and AXI bus frequency ranges. If the AXI bus runs at half of the 400MHz max core clock (i.e. 200MHz) at VOS1, the table suggests that we use WRHIGHFREQ=0b10 and 2 wait states.

Code: [Select]

void flash_init()
{
	/* This section configures the flash memory wait states and write/read times for the main
	 * AXI data bus. This bus runs at a max of 240MHz (usually half of the core clock speed) and 
	 * the settings depend on Vcore setpoint (VOS0-3) as shown in RM0399 Table 16.
	 * 
	 * Anticipate running at VOS1 with a 400MHz core clock when the clock chain setup is complete
	 * then the AXI bus will be at 200MHz and Table 16 says to use WRHIGHFREQ = 0b10 and 
	 * latency = 2 wait states */
	
	MODIFY_REG(FLASH->ACR, FLASH_ACR_WRHIGHFREQ_Msk, FLASH_ACR_WRHIGHFREQ_1);
	MODIFY_REG(FLASH->ACR, FLASH_ACR_LATENCY_Msk, FLASH_ACR_LATENCY_2WS);		
}

Next we can get ready to tackle the clock tree settings.

Code at: https://github.com/ews0404/STM32H745ZI_M4/blob/main/M4/Code/sys/system.cpp

edit: My custom PCB was set up to go all the way to VOS0 and 480MHz by using the LDO and no SMPS, the code for that is

Code: [Select]

void pwr_init()
{
	/* The incoming power is set up as shown in RM0399, Figure 22, example 1 with Vdd directly feeding the low drop out (LDO) 
	 * on-chip power regulator and the SMPS turned off. 
	 * 
	 * Go to the highest core voltage regulation mode, VOS0, which enables running at the maximum clock speed 
	 * (DS12923, Table 23) by following the VOS0 activation sequence outlined in (RM0399, 7.6.2)  
	 */
	
	uint32_t timeout = 0xFFFF;
	
	SYS_TRACE("\tpwr");
	
	// configure the power management system to run on the LDO regulator only
	CLEAR_BIT(PWR->CR3, PWR_CR3_SMPSEN);				// disable SMPS controller (called SDEN in RM0399)		
	SET_BIT(PWR->CR3, PWR_CR3_LDOEN);					// enable low drop out voltage regulator
	CLEAR_BIT(PWR->CR3, PWR_CR3_BYPASS);				// disable power management unit bypass
	
	// go to power mode VOS1 and wait for voltages to stabilize
	MODIFY_REG(PWR->D3CR, PWR_D3CR_VOS_Msk, PWR_D3CR_VOS_1 | PWR_D3CR_VOS_0);	// set VOS scale 1, max voltage
	while ((!READ_BIT(PWR->D3CR, PWR_D3CR_VOSRDY)) && (timeout>0)) { timeout--; }	// wait for it to stabilize
	if (timeout == 0) { SYS_ERROR("VOS1 timeout"); }
	
	// go to power mode VOS0 (VOS1 + ODEN) and wait for voltages to stabilize
	timeout = 0xFFFF;									// restart the count so the second wait gets a full timeout
	SET_BIT(RCC->APB4ENR, RCC_APB4ENR_SYSCFGEN);		// enable clock to the sysconfig system
	SET_BIT(SYSCFG->PWRCR, SYSCFG_PWRCR_ODEN);			// enable VOS0 overdrive mode
	while ((!READ_BIT(PWR->D3CR, PWR_D3CR_VOSRDY)) && (timeout>0)) { timeout--; }
	if (timeout == 0) { SYS_ERROR("VOS0 timeout"); }
}

PDP-1 · « **Reply #7 on:** June 25, 2024, 02:55:17 pm »

Clock Tree
Given the power restrictions of the Nucleo-H745ZI-Q we have a maximum supportable clock frequency of 400MHz on the M7 and 200MHz on the M4 core. To get there, first map out the clock tree settings in STM32CubeIDE as it makes things much easier to visualize.

General steps are:
1) Turn on the high speed external crystal (HSE) and wait for it to stabilize (8MHz)
2) Configure PLL1 to use the HSE as an input source and convert it to a 400MHz output, let it stabilize.
3) Set up the core clock prescalers (D1CPRE, HPRE, etc.) to give correct output at 400MHz while still running on the HSI clock
4) Move the core clock input from HSI over to the 400MHz PLL1 output, let it stabilize
5) Optional test: route the new core clock (SYSCLK) out of the MCO2 pin with a division ratio of 8, measure the 50MHz output on a scope to verify the 400MHz signal.

Code: [Select]

void hse_clock_init(void)
{
	/* initialize High Speed External (HSE) clock tree, see RM0399 Figure 51 
	 * (8MHz HSE clock input) * (/1 DIVM1) * (x100 DIVN1) * (/2 DIVP1) = (400MHz PLLCLK output) */
	
	uint32_t timeout = 0xFFFF;
	
	// turn on the High Speed External (HSE) 8MHz crystal oscillator for the main system clock, wait for it to stabilize
	SET_BIT(RCC->CR, RCC_CR_HSEON);														// turn on HSE drive
	while ((!READ_BIT(RCC->CR, RCC_CR_HSERDY)) && (timeout>0)) { timeout--; }			// wait for HSE to start
	if (timeout == 0) { SYS_ERROR("HSE timeout"); }
	
	// configure PLL1 to convert 8MHz HSE input to 400MHz DIVP1 output (RM0399 figure 54)
	MODIFY_REG(RCC->PLLCKSELR, RCC_PLLCKSELR_PLLSRC_Msk, RCC_PLLCKSELR_PLLSRC_HSE);		// configure PLL source mux to use HSE
	MODIFY_REG(RCC->PLLCKSELR, RCC_PLLCKSELR_DIVM1_Msk, 1 << RCC_PLLCKSELR_DIVM1_Pos);	// DIVM1, divide by 1
	MODIFY_REG(RCC->PLLCFGR, RCC_PLLCFGR_PLL1RGE, 0b10 << RCC_PLLCFGR_PLL1RGE_Pos);		// PLL1 input frequency range 4-8MHz
	CLEAR_BIT(RCC->PLLCFGR, RCC_PLLCFGR_PLL1VCOSEL); 									// PLL1 wide-range VCO (192-960MHz)
	CLEAR_BIT(RCC->PLL1FRACR, RCC_PLL1FRACR_FRACN1);									// do not use fractional mode
	MODIFY_REG(RCC->PLL1DIVR, RCC_PLL1DIVR_N1_Msk, 99 << RCC_PLL1DIVR_N1_Pos);			// DIVN1, multiply by 100
	MODIFY_REG(RCC->PLL1DIVR, RCC_PLL1DIVR_P1_Msk, 1 << RCC_PLL1DIVR_P1_Pos);			// DIVP1, divide by 2
	SET_BIT(RCC->PLLCFGR, RCC_PLLCFGR_DIVP1EN);											// enable DIVP1 output (400MHz SYSCLK)
	CLEAR_BIT(RCC->PLLCFGR, RCC_PLLCFGR_DIVQ1EN);										// disable DIVQ1 output
	CLEAR_BIT(RCC->PLLCFGR, RCC_PLLCFGR_DIVR1EN);										// disable DIVR1 output
	SET_BIT(RCC->CR, RCC_CR_PLL1ON);													// enable PLL1
	while ((!READ_BIT(RCC->CR, RCC_CR_PLL1RDY)) && (timeout>0)) { timeout--; }			// wait for PLL1 to start
	if (timeout == 0) { SYS_ERROR("PLL1 timeout"); }
	
	// configure system clock prescalers (RM0399 Figure 55)
	MODIFY_REG(RCC->D1CFGR, RCC_D1CFGR_D1CPRE_Msk, 0b0000 << RCC_D1CFGR_D1CPRE_Pos);	// D1 domain core prescaler = divide by 1 (400MHz, D1CPRE)
	MODIFY_REG(RCC->D1CFGR, RCC_D1CFGR_HPRE_Msk, 0b1000 << RCC_D1CFGR_HPRE_Pos);		// D1 domain AHB  prescaler = divide by 2 (200MHz, HPRE)
	MODIFY_REG(RCC->D1CFGR, RCC_D1CFGR_D1PPRE_Msk, 0b100 << RCC_D1CFGR_D1PPRE_Pos);		// D1 domain APB3 prescaler = divide by 2 (100MHz, D1PPRE)
	MODIFY_REG(RCC->D2CFGR, RCC_D2CFGR_D2PPRE1_Msk, 0b100 << RCC_D2CFGR_D2PPRE1_Pos);	// D2 domain APB1 prescaler = divide by 2 (100MHz, D2PPRE1)
	MODIFY_REG(RCC->D2CFGR, RCC_D2CFGR_D2PPRE2_Msk, 0b100 << RCC_D2CFGR_D2PPRE2_Pos);	// D2 domain APB2 prescaler = divide by 2 (100MHz, D2PPRE2)
	MODIFY_REG(RCC->D3CFGR, RCC_D3CFGR_D3PPRE_Msk, 0b100 << RCC_D3CFGR_D3PPRE_Pos);		// D3 domain APB4 prescaler = divide by 2 (100MHz, D3PPRE)
	
	// set system clock mux input to use PLL1, DIVP1 as a source (400MHz) (RM0399 9.7.6)
	MODIFY_REG(RCC->CFGR, RCC_CFGR_SW_Msk, RCC_CFGR_SW_PLL1 << RCC_CFGR_SW_Pos);		// set system clock mux input to PLL1, DIVP1 
	while ((RCC->CFGR & RCC_CFGR_SWS_Msk) != (RCC_CFGR_SW_PLL1 << RCC_CFGR_SWS_Pos) && (timeout > 0)) { timeout--; } 
	if (timeout == 0) { SYS_ERROR("system clock mux timeout"); }
	
//	// test code - route SYSCLK/8 out of MCO2 pin (RM0399 9.7.6) to verify the frequency (expected 50MHz, measured 50.05MHz)
//	static pinDef mco2Pin	= { .port = GPIOC, .pin = PIN_9,  .mode = Alternate, .type = PushPull, .speed = High, .pull = None, .alternate = AF0 };	
//	MODIFY_REG(RCC->CFGR, RCC_CFGR_MCO2_Msk, 0b000 << RCC_CFGR_MCO2_Pos);				// select SYSCLK as input 
//	MODIFY_REG(RCC->CFGR, RCC_CFGR_MCO2PRE_Msk, 8 << RCC_CFGR_MCO2PRE_Pos);				// divide input by 8 to get 50MHz on MCO2 pin
//	configurePin(mco2Pin);																// configure and enable pin as MCO2 output
}
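
The numbers in the function above can be double-checked on a PC before touching the hardware. The following is a sketch of the integer-mode PLL arithmetic only; the struct and helper names are invented for illustration, and the VCO limits are simply the 192-960MHz range given in the code comment above.

```cpp
#include <cassert>
#include <cstdint>

// Integer-mode PLL arithmetic: f_out = f_in / DIVM * DIVN / DIVP.
struct PllConfig { uint32_t inputHz, divM, divN, divP; };

constexpr uint32_t vcoHz(const PllConfig& c)    { return c.inputHz / c.divM * c.divN; }
constexpr uint32_t pllOutHz(const PllConfig& c) { return vcoHz(c) / c.divP; }

// The code above writes 99 into the N1 field for x100 and 1 into the P1 field
// for /2, i.e. those fields hold (divider - 1).
constexpr uint32_t dividerToField(uint32_t divider) { return divider - 1u; }

// Wide-range VCO limits as given in the code comment above.
constexpr bool vcoInWideRange(const PllConfig& c)
{
    return vcoHz(c) >= 192000000u && vcoHz(c) <= 960000000u;
}

// The Nucleo settings: 8MHz HSE, /1, x100, /2
constexpr PllConfig nucleo{ 8000000u, 1u, 100u, 2u };

constexpr uint32_t sysclkHz = pllOutHz(nucleo);   // 400MHz core clock (D1CPRE = /1)
constexpr uint32_t hclkHz   = sysclkHz / 2u;      // HPRE /2     -> 200MHz AXI/AHB
constexpr uint32_t apbHz    = hclkHz / 2u;        // APB /2      -> 100MHz
constexpr uint32_t mco2Hz   = sysclkHz / 8u;      // MCO2PRE /8  -> 50MHz test output

static_assert(sysclkHz == 400000000u, "expected a 400MHz SYSCLK");
static_assert(vcoInWideRange(nucleo), "VCO frequency out of range");
```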

That was honestly a lot less work than I thought it was going to be when I first looked at the massive clock tree in STM32CubeIDE, but now I realize that most of the options are for peripherals that we are not using yet so we really only need to work on a fraction of what is there.

As long as I'm in clock tree land I might as well enable the low speed external (LSE) 32.768kHz crystal in case I ever feel like enabling the real time clock (RTC) in the future.

Code: [Select]

void lse_clock_init(void)
{
	uint32_t timeout = 0x00FFFFFF;														// assumed value: a 32kHz crystal starts far more slowly than the HSE
	
	// turn on the Low Speed External (LSE) 32.768kHz crystal oscillator for the real time clock, wait for it to stabilize
	SET_BIT(PWR->CR1, PWR_CR1_DBP);														// allow writes to the backup domain, where the LSE control bits live
	SET_BIT(RCC->BDCR, RCC_BDCR_LSEON);													// turn on LSE drive
	while ((!READ_BIT(RCC->BDCR, RCC_BDCR_LSERDY)) && (timeout>0)) { timeout--; }		// wait for LSE to start
	if (timeout == 0) { SYS_ERROR("LSERDY timeout"); }
}

Code at: https://github.com/ews0404/STM32H745ZI_M4/blob/main/M4/Code/sys/system.cpp

PDP-1 · « **Reply #8 on:** June 26, 2024, 06:23:12 pm »

I added some boring but necessary small items - enabled the NVIC, enabled the FPU, and added SysTick. These all work pretty much the way they did on the F4xx chips.

I also did some much-needed file re-arranging so that there is now a Common file in the M4 repo that the M7 references, it contains code that is used by both cores so we don't need to manage two copies anymore.

With that housekeeping done it's time to add a fun new (to me, anyway) peripheral:

Hardware Semaphores (hsem)
Hardware semaphores provide a mechanism to safely share resources between the two processor cores or between different processes on the same core. There are 32 hsem modules on this chip, and you can assign them to represent ownership of whatever resources you want. If a resource is not in use by anyone, the corresponding hsem value should be zero. Then if core/process A requests a lock the hsem hardware will immediately grant it, making the process atomic. The value read back from the hsem will have the lock bit set to 1 and a coreID and processID equal to the IDs of the core/process that just requested exclusive access to the resource. If core/process B requests the same hsem it reads back that core/process A already owns it, so B's request is denied. Core/process A can then unlock the hsem when it is done, returning the hsem to the zero/unowned state.

There is a two-step locking process that depends on knowing the coreID and the processID where you write a request for the lock and read back the hsem value to see if you got it. If you don't have more than one process running on each core there is a one-step locking process where you just read the hsem value and the hsem hardware infers which core you are by looking at who the bus master asking for its value is.

You can set up interrupts that go off when a hsem is unlocked so you can get notified when the corresponding resource becomes available. There is also a mechanism by which a core can clear all of the hsems that it owns at once.

I don't plan to have more than one process running on each core, so I just implemented the one-step lock/unlock process for now (hsem.h, hsem.cpp).

This is covered in reference manual RM0399, section 11.

edit: I forgot to mention an interesting bug that happened. The hsem grants a lock when its memory location is read, but it doesn't matter WHO is doing the reading. If you try to look at the hsem registers with a debugger, that counts as a read and the debugger gets granted a lock on all the (unlocked) hsems that it looks at!
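
To make the one-step lock and the debugger surprise easier to reason about, here is a small PC-side model of a single semaphore. It is only a sketch of the behaviour described above, not the hsem.cpp driver: the FakeHsem class is invented, and the lock bit position, the core ID values and the ID used for the debugger are assumptions for illustration (check RM0399 section 11 for the real register layout).

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t HSEM_LOCK_BIT = 1u << 31;   // assumed position of the lock flag
constexpr uint32_t COREID_SHIFT  = 8;          // assumed position of the core ID field
constexpr uint32_t CORE_M4 = 1;                // assumed ID values
constexpr uint32_t CORE_M7 = 3;
constexpr uint32_t CORE_DEBUGGER = 0;          // made-up ID for the debug port

class FakeHsem {
public:
    // Models a bus read of the lock register by 'busMaster': if the semaphore
    // is free, the read itself takes the lock for whoever is reading.
    uint32_t readLock(uint32_t busMaster)
    {
        if ((value_ & HSEM_LOCK_BIT) == 0u) {
            value_ = HSEM_LOCK_BIT | (busMaster << COREID_SHIFT);
        }
        return value_;
    }

    // Models the unlock write: only the current owner can release the lock.
    void unlock(uint32_t busMaster)
    {
        if (value_ == (HSEM_LOCK_BIT | (busMaster << COREID_SHIFT))) { value_ = 0u; }
    }

private:
    uint32_t value_ = 0u;   // zero means unowned
};

// True if 'core' owns the semaphore after the read (newly taken or already held).
inline bool tryLock(FakeHsem& sem, uint32_t core)
{
    return sem.readLock(core) == (HSEM_LOCK_BIT | (core << COREID_SHIFT));
}
```

In this model a readLock(CORE_DEBUGGER) on a free semaphore leaves it owned by the debugger, so a later tryLock from either core fails, which is the effect described in the edit above.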

jnk0le · « **Reply #9 on:** June 30, 2024, 08:29:31 pm »

Quote from: jnk0le on June 22, 2024, 05:28:42 pm

Quote from: PDP-1 on June 22, 2024, 04:02:54 pm
Quote from: jnk0le on June 22, 2024, 01:12:07 pm
[floating point timing] Now your determinism goes out of the window.
This is good info, thanks! I did a quick estimate assuming the M7 is running at 480MHz, the loop is iterating at 100kHz, and we do 100 floating point operations per iteration. With those numbers if you got all 'good' calculations at 10 cycles, vs getting all 'bad' calculations at 19 cycles you have a range of spending 20-40% of your loop iteration time in the FPU. In the realistic case you'd get some good and some bad exponents so you'd be jittering around within that range, but since the calculations you'd be doing every cycle would be the same the real jitter would likely be over a narrower band. Still, that isn't insignificant and is definitely a thing to watch.

Note that this paper didn't explicitly state anything about denormals, which are usually expected to be even worse than "different exponents".

Denormals will add another cycle to addition, multiplication and FMAs are taking longer so those "20-40%" will exceed 100%
https://github.com/jnk0le/random/tree/master/pipeline%20cycle%20test#fpu-double-precision, or other source with similar numbers: https://www.quinapalus.com/cm7cycles.html

PDP-1 · « **Reply #10 on:** July 23, 2024, 02:23:25 pm »

I had to take a break from this project to work on other things, back now.

Message Queues
We need to have a way for the M4 and M7 processors to talk to each other and for now that will be done via a pair of message queues, one for sending data M4->M7 and the other for M7->M4. We can use the hardware semaphore mechanism to control access so that only one processor is reading/writing to a queue at a time.

files: messageQueue.h, messageQueue.cpp

Each queue can contain multiple Message objects. A Message is a 32 bit header followed by its data; the header is divided into a 16 bit MessageID enum that tells us what kind of data is being sent and a 16 bit data length field stating how many bytes are being sent (including the header). Just to bound the problem a bit the max Message size is 0x600 (1536d) bytes and the MessageQueue is sized to hold multiple max-length Messages.
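
A header with that layout could look roughly like the sketch below. The MessageID values and the makeHeader helper are hypothetical and only stand in for whatever messageQueue.h actually defines; the 0x600 limit is the one stated above.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t MQ_MAX_MESSAGE_BYTES = 0x600;   // 1536 bytes, header included

// Hypothetical IDs for illustration only.
enum class MessageID : uint16_t { BlinkLed = 1, PrintText = 2 };

// 32 bit header: 16 bit message ID plus 16 bit total length in bytes.
struct MessageHeader {
    MessageID id;
    uint16_t  length;   // header + payload
};
static_assert(sizeof(MessageHeader) == 4, "header must be exactly 32 bits");

// Fills in 'out' for a payload of 'payloadBytes'. Returns false and leaves
// 'out' untouched if the complete message would exceed the maximum size.
inline bool makeHeader(MessageID id, uint32_t payloadBytes, MessageHeader& out)
{
    if (payloadBytes > MQ_MAX_MESSAGE_BYTES - sizeof(MessageHeader)) { return false; }
    out = MessageHeader{ id, static_cast<uint16_t>(payloadBytes + sizeof(MessageHeader)) };
    return true;
}
```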

The MessageQueues track how many unread messages they contain, how many bytes are used, and the max values either of those fields have ever seen. Data is read FIFO from a circular buffer. Finally, each MessageQueue is initialized with the ID of the hardware semaphore that controls access to its buffer:

Code: [Select]

struct MessageQueue {
	uint32_t pendingMessages;				// the number of messages in the queue waiting to be processed
	uint32_t maxPendingMessages;				// the largest number of pending messages ever in the queue at once
	uint32_t bytesInQueue;					// number of bytes currently in the queue
	uint32_t maxBytesInQueue;				// the largest number of bytes ever contained in the queue
	uint32_t head;						// buffer index where the next byte should be written
	uint32_t tail;						// buffer index where the next byte should be read
	HSEM_ID hsemID;						// hardware semaphore controlling access to this queue
	uint8_t buffer[MQ_MESSAGE_QUEUE_SIZE];			// the queue data buffer
};

The way the MessageQueues get created feels a bit janky to me, first I go to the linker file for each core and carve out part of the SRAM4 memory that is accessible to both processors and then export _sram4_mq as the start of that address space:

Code: [Select]

MEMORY
{
	FLASH (RX)   : ORIGIN = 0x08100000, LENGTH = 1M
	RAM_D2 (RWX) : ORIGIN = 0x10000000, LENGTH = 288K
	RAM_D3 (RWX) : ORIGIN = 0x18000000, LENGTH = 64K
	SRAM4 (RWX)  : ORIGIN = 0x38000000, LENGTH = 32K
	SRAM4_MQ (RWX) : ORIGIN = 0x38008000, LENGTH = 32K		/* reserve half of SRAM4 for message queue */
}

_estack = ORIGIN(RAM_D2) + LENGTH(RAM_D2);  /* 0x10048000 */
_sram4_mq = ORIGIN(SRAM4_MQ);				/* 0x38008000 */

Next, each processor imports _sram4_mq and then decides that an array of MessageQueue objects lives there:

Code: [Select]

extern void* _sram4_mq;
static MessageQueue* mq = (MessageQueue*)&_sram4_mq;

Then there is a MessageQueueID enum that is zero for the M4->M7 queue and one for M7->M4, and that value is used to index our imaginary array of MessageQueues living at the _sram4_mq address. This feels like a strange way to set things up and I'm very open to other ideas, but it does have the advantage that each processor agrees on exactly where the MessageQueues live in memory and the linker for each processor knows not to put anything else in that memory range.
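
One cheap safeguard for this arrangement is a compile-time check that the queues actually fit in the reserved region. The sketch below assumes a queue buffer of four max-length messages, a placeholder HSEM_ID type and hypothetical enum value names; only the struct layout and the 32K region size come from the listings above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t MQ_MESSAGE_QUEUE_SIZE = 4 * 0x600;   // assumed: four max-length messages
constexpr std::size_t SRAM4_MQ_BYTES = 32 * 1024;          // LENGTH of SRAM4_MQ in the linker script
using HSEM_ID = uint32_t;                                  // placeholder for the real type

struct MessageQueue {
    uint32_t pendingMessages;
    uint32_t maxPendingMessages;
    uint32_t bytesInQueue;
    uint32_t maxBytesInQueue;
    uint32_t head;
    uint32_t tail;
    HSEM_ID  hsemID;
    uint8_t  buffer[MQ_MESSAGE_QUEUE_SIZE];
};

enum MessageQueueID : uint32_t { MQ_M4_TO_M7 = 0, MQ_M7_TO_M4 = 1, MQ_COUNT = 2 };

// Fails the build on either core if the queues outgrow the reserved region.
static_assert(MQ_COUNT * sizeof(MessageQueue) <= SRAM4_MQ_BYTES,
              "message queues do not fit in the reserved SRAM4 region");

// Both cores compute the same address for a given queue from the shared base.
inline MessageQueue* queueAt(void* regionStart, MessageQueueID id)
{
    assert(id < MQ_COUNT);
    return static_cast<MessageQueue*>(regionStart) + id;
}
```

On target the base pointer would be &_sram4_mq from the linker script, e.g. queueAt(&_sram4_mq, MQ_M7_TO_M4).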

The MessageQueue send operation spin-waits until it can obtain a hardware semaphore lock on the queue, then writes in the message header and data, increments the number of unread messages and updates the byte count, then unlocks the queue and exits.

The MessageQueue read operation is pretty similar - spin-wait until it gets a hardware semaphore lock, copy the header and data into a private buffer, decrement the number of unread messages and update the byte count, then release the lock and exit.

Finally, there is a method to quickly check the number of unread messages so you know if you need to do a read or not that does not involve waiting for a lock.
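
The send and read steps described above can be sketched on a PC with a cut-down queue. This is not the code in messageQueue.cpp: the MiniQueue struct, the tiny 64 byte buffer, the bool used in place of the hardware semaphore and the function names are all stand-ins chosen so the wrap-around is easy to follow. Where the real functions spin-wait for the lock, this sketch simply returns if the lock is taken.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr uint32_t QUEUE_BYTES = 64;   // small on purpose so wrap-around shows up quickly

struct MiniQueue {
    uint32_t pendingMessages = 0;
    uint32_t bytesInQueue = 0;
    uint32_t head = 0;                 // next write index
    uint32_t tail = 0;                 // next read index
    bool     locked = false;           // stand-in for the hardware semaphore
    uint8_t  buffer[QUEUE_BYTES] = {};
};

inline bool tryLock(MiniQueue& q) { if (q.locked) { return false; } q.locked = true; return true; }
inline void unlock(MiniQueue& q)  { q.locked = false; }

// Appends one message: 16 bit ID, 16 bit total length, then the payload.
// Returns false if the lock is taken or there is not enough free space.
inline bool sendMessage(MiniQueue& q, uint16_t id, const void* payload, uint16_t payloadBytes)
{
    const uint32_t total = 4u + payloadBytes;
    if (!tryLock(q)) { return false; }
    const bool fits = total <= QUEUE_BYTES - q.bytesInQueue;
    if (fits) {
        const uint16_t len = static_cast<uint16_t>(total);
        uint8_t header[4];
        std::memcpy(header, &id, 2);
        std::memcpy(header + 2, &len, 2);
        const uint8_t* src = static_cast<const uint8_t*>(payload);
        for (uint32_t i = 0; i < total; ++i) {            // byte-wise copy with wrap-around
            q.buffer[q.head] = (i < 4u) ? header[i] : src[i - 4u];
            q.head = (q.head + 1u) % QUEUE_BYTES;
        }
        q.bytesInQueue += total;
        q.pendingMessages++;
    }
    unlock(q);
    return fits;
}

// Copies the oldest message (header included) into 'out'. Returns its total
// length, or 0 if the lock is taken, the queue is empty, or 'out' is too small.
inline uint32_t readMessage(MiniQueue& q, uint8_t* out, uint32_t outBytes)
{
    if (!tryLock(q)) { return 0; }
    uint32_t total = 0;
    if (q.pendingMessages > 0) {
        // the length field sits 2 bytes after the tail and may itself wrap
        const uint8_t lenBytes[2] = { q.buffer[(q.tail + 2u) % QUEUE_BYTES],
                                      q.buffer[(q.tail + 3u) % QUEUE_BYTES] };
        uint16_t len = 0;
        std::memcpy(&len, lenBytes, 2);
        if (len <= outBytes) {
            total = len;
            for (uint32_t i = 0; i < total; ++i) {
                out[i] = q.buffer[q.tail];
                q.tail = (q.tail + 1u) % QUEUE_BYTES;
            }
            q.bytesInQueue -= total;
            q.pendingMessages--;
        }
    }
    unlock(q);
    return total;
}
```

Checking pendingMessages without taking the lock, as described above, is just a plain read of that field.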

I tested the queue in both directions with some dummy Message types that let one processor ask the other to blink an LED or send some text out the debug port, it ran for long enough to wrap around the data buffer multiple times with no problem. There are plenty of improvements to be made (switching from reading/writing one byte at a time to 32 bit data transfers is the obvious one) but it's good enough for now.

betocool · « **Reply #11 on:** July 24, 2024, 01:14:16 pm »

If you're using message queues and two processors and semaphores and a bunch of other stuff, why would you not use FreeRTOS? At 480MHz I doubt latency would be an issue...

Cheers,

Alberto

PDP-1 · « **Reply #12 on:** July 28, 2024, 08:35:44 pm »

Quote from: betocool on July 24, 2024, 01:14:16 pm

If you're using message queues and two processors and semaphores and a bunch of other stuff, why would you not use FreeRTOS? At 480MHz I doubt latency would be an issue...

Cheers,

Alberto

In my future application I'm hoping to eventually use the M7 processor as a kind of mini-FPGA, sampling an external ADC and doing some light signal processing at ~100kHz rates with a hardware timer kicking off each cycle. For that task an RTOS would be pretty heavyweight and wouldn't be able to handle the timing.

The M4 processor on the other hand is intended to handle all of the 'messy' stuff like IO to the outside world, etc. and could possibly benefit from an RTOS type setup. That might be an interesting thing to look in to, thanks!

gpr · « **Reply #13 on:** July 29, 2024, 11:26:30 am »

"mini-FPGA" does not describe your requirements well. Probably you have tight timing/jitter requirements for processing time, from sample being collected from ADC until it is processed and ready for some kind of output.

If you use an RTOS, it doesn't prevent you from using all the hardware directly; you can still have your interrupts, ADC, DMA and everything else. But it adds the possibility of offloading some less critical work to background tasks. You can offload batch processing of your samples to a separate task, for example, or communicate via UART and blink an LED effortlessly, without affecting your realtime processing.

dietert1 · « **Reply #14 on:** July 29, 2024, 06:43:27 pm »

On youtube there is a "Controllerstech" series of videos about the dual core STM32H7. Look for "controllerstech dual core". I found it interesting.
Certainly FreeRTOS needs some time to acquire experience, but there are enough examples around and it may easily remove the need for the second CPU. At first one may think that a 1 msec task preemption cycle is much too long at signal rates of 100 kHz, but the 1 msec isn't a hard limit, as FreeRTOS gives you a lot of control.
I do understand the safety aspect of having the second CPU as a supervisor. But it can also result in a more difficult and eventually unstable design.

Regards, Dieter

