Here's the timing table. I shortened the loop to:

loop:
    ldr r1, [r0, #GPIO_ODR]   /* read DATA reg */
    eor r1, r2                /* complement bit */
    str r1, [r0, #GPIO_ODR]   /* write */
    b.n loop
(NOT the shortest possible loop, BTW.) I also added some symbolic indirection to make it easier to change the values (code attached). The no-wait-state code looks like it's taking about 9 clocks per pass through the 4-instruction loop (8 MHz / 0.444 MHz = 18 clocks per full pin cycle, and each cycle takes two passes), which is not as good as I thought it would be. I wouldn't have found 6 surprising: 1 extra for the 32-bit instruction and 1 extra for the branch. Maybe I've left something out?
Impressive Overclocking success!
         ---------------(MHz)----------------
PLLMULT  F(cpu)   F(0WS)    F(1WS)    F(2WS)
   -        8     0.444     0.400     0.364
   3       24     1.330     1.200     1.091
   4       32     1.779
   5       40     2.225*              1.818
   6       48     2.669*
   7       56     3.115*
   8       64     fail      3.200*    2.91
   9       72               3.600*    3.27
  10       80**             4.000*
  11       88**             4.400*
  12       96**             4.800*
  13      104**             5.200*    4.73
  14      112**             5.600*
  15      120**             6.000*
  16      128**             6.400*    5.8
 * exceeds documented memory speed
** exceeds documented cpu max freq
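As a rough cross-check of the table, the sketch below converts a few of the measured pin frequencies back into CPU clocks per pass through the loop. It assumes each pass flips the pin once, so one full output period takes two passes. The helper name clocks_per_pass and the choice of sample rows are mine, not part of the attached code.

```python
# Hypothetical helper (not from the attached code): convert a measured
# pin frequency back into CPU clocks per pass through the toggle loop.
# Assumption: each pass flips the pin once, so one output period = 2 passes.

def clocks_per_pass(f_cpu_mhz, f_pin_mhz):
    return f_cpu_mhz / (2.0 * f_pin_mhz)

# (wait states, F(cpu) in MHz, measured pin frequency in MHz), from the table
rows = [
    (0, 8, 0.444), (1, 8, 0.400), (2, 8, 0.364),
    (0, 24, 1.330), (1, 24, 1.200), (2, 24, 1.091),
    (1, 72, 3.600), (2, 72, 3.27),
    (1, 128, 6.400), (2, 128, 5.8),
]

for ws, f_cpu, f_pin in rows:
    clocks = clocks_per_pass(f_cpu, f_pin)
    print(f"{f_cpu:3d} MHz, {ws}WS: {clocks:5.2f} clocks per pass")
```

For these rows the results land close to 9, 10 and 11 clocks per pass for 0, 1 and 2 wait states, regardless of clock speed, which suggests each added wait state costs about one clock per pass in this loop.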
Update: using a sequence of stores, I can get "about" a 2MHz pin toggle running at 8MHz/0WS, or 18MHz at 72MHz/2WS. That works out to ~2 cycles per store, which is still slower than I expected, but the flash prefetch unit seems to be helping: the 72MHz version runs 9x as fast as the 8MHz version, despite increasing the wait states to be "legal."

    ldr r1, [r0, #GPIO_ODR]   /* read DATA reg */
    eor r1, r2, #(1<<mybit)   /* r1 = r2 ^ (1<<mybit) */
loop:
    str r1, [r0, #GPIO_ODR]   /* 1 write */
    str r2, [r0, #GPIO_ODR]   /* write */
    str r1, [r0, #GPIO_ODR]   /* 2 write */
    str r2, [r0, #GPIO_ODR]   /* write */
    str r1, [r0, #GPIO_ODR]   /* 3 write */
    str r2, [r0, #GPIO_ODR]   /* write */
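The cycles-per-store figure for the store sequence can be checked with a little arithmetic. This sketch assumes consecutive stores write alternating values, so one output period spans two stores, and it takes the "about" 2MHz and 18MHz readings at face value. The function name is made up for illustration.

```python
# Hypothetical helper: CPU cycles per str, given the measured pin frequency.
# Assumption: consecutive stores write alternating values, so one output
# period spans two stores.

def cycles_per_store(f_cpu_mhz, f_pin_mhz):
    return f_cpu_mhz / (2.0 * f_pin_mhz)

slow = cycles_per_store(8, 2)     # 8 MHz, 0WS, "about" 2 MHz on the pin
fast = cycles_per_store(72, 18)   # 72 MHz, 2WS, "about" 18 MHz on the pin
speedup = 18 / 2                  # pin-frequency ratio between the two runs

print(slow, fast, speedup)
```

Both configurations come out at the same cycles-per-store figure, and the pin-frequency ratio equals the 72/8 clock ratio, so within the precision of these readings the two wait states add no visible cost to the straight-line stores.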