Author Topic: Bizarre arduino behaviour, I actually don't understand microcontrollers.  (Read 2278 times)


Offline John B (Topic starter)

  • Frequent Contributor
  • **
  • Posts: 824
  • Country: au
I've spent the last few days debugging a sketch on an LED controller. Buttons attached through an I2C IO expander allow manual control of the PWM outputs, e.g. an on/off switch, and increasing or decreasing the duty cycle in 20 discrete steps. The device can also be read and written over the serial interface using the modbus protocol. Instead of a master device (a Raspberry Pi) polling the micro at regular intervals, the micro has an "interrupt" output that signals the RPi to perform a read operation. This output is activated when you press any button on the micro.

This is an abbreviated version of the sketch:

Code: [Select]
void loop() {

    if (ModbusRTUServer.poll()) {
        // If there's a modbus transaction matching the slave address, process the registers for updated values
        process_modbus_stuff();
        digitalWrite(INTERRUPT_PIN, LOW); // Clear the read trigger
    }

    if (button_press() == true) {
        process_buttons(); // Call any functions associated with each button
        write_modbus_registers(); // Update the modbus registers so they can be read externally
        digitalWrite(INTERRUPT_PIN, HIGH); // Triggers a read operation from an RPi
    }
}

I noticed some odd behaviour when trying to read via modbus. Take for example when pressing the dimmer increase/decrease buttons. After pressing one of those buttons the following would happen:

  • Next time through loop(), button_press() would return true
  • process_buttons() would check which button was pressed and call the dimmer increase/decrease functions inside a light Class
  • Inside the light Class, a dimmer_increase() function would increase a counter between 0-20, similar for dimmer_decrease()
  • That 0-20 value would be processed into a 0-255 PWM value and written to the output pins using analogWrite()
  • The values inside of the light Class would then be copied to modbus registers.
  • INTERRUPT_PIN would be then written high
  • Typically 1.5ms later, the RPi would send a read request, and would allow me to see the values of the light Class
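
The counter and scaling steps in the list above can be modelled roughly as follows. This is only a host-compilable sketch, not the actual light Class: the names `Light`, `step()`, `pwm()` and `kTotalSteps` are made up for illustration, a plain linear scaling is assumed, and the real sketch would pass the result to analogWrite().

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of the light class; every name here is an assumption.
constexpr int kTotalSteps = 20;

class Light {
public:
    // Move one step up, saturating at kTotalSteps.
    void dimmer_increase() { if (step_ < kTotalSteps) ++step_; }

    // Move one step down, saturating at 0.
    void dimmer_decrease() { if (step_ > 0) --step_; }

    int step() const { return step_; }

    // Scale the 0-20 step to a 0-255 duty value (integer division truncates).
    std::uint8_t pwm() const {
        assert(step_ >= 0 && step_ <= kTotalSteps);
        return static_cast<std::uint8_t>(255 * step_ / kTotalSteps);
    }

private:
    int step_ = 0;
};
```

With this model one press of the increase button from zero gives step 1 and a PWM value of 12, and further presses saturate at step 20 and PWM 255.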

This would all work perfectly fine without connecting the RPi (so the interrupt pin would just stay high after the first button press, because there would be no modbus transactions to reset the pin). I was also able to confirm that the functions were working correctly by just controlling a light, but also by manually performing a read operation after pressing a button to see that the registers are showing the correct values.

However, things started behaving strangely once the RPi was also performing a read operation in response to the interrupt pin. As previously mentioned, this was around 1.5ms after the pin goes high. The dimmer would only allow me to cycle between steps 0 and 1, equating to a PWM value of 0 or 12 (which is 255 * 1 / 20 in integer math). This made no sense to me, as the counter variable was internal to the light Class, and my understanding is that process_buttons() would run to completion before write_modbus_registers() or digitalWrite(INTERRUPT_PIN, HIGH), and even then, there's a whopping 1.5ms after the digitalWrite before any modbus transaction occurs.

As soon as I stopped the RPi from doing any read operations, everything would work fine again.

On a hunch, I simplified my code so that it bypassed the 0-20 to 0-255 PWM processing, so a button press would perform a simple count += 1; or count -= 1; operation to increment or decrement the counter. Strangely, everything now worked: I could read the counter going between 0-20 using the RPi.

The cause of this is beyond my understanding of what the microcontrollers are doing. My understanding is that my functions should execute to completion before moving to the next function. Yet all I changed was removing some mathematical operations to speed up process_buttons().

Confusing? It gets worse.

I know that division operations are generally undesirable as they are quite slow. The 0-20 counter to 0-255 pwm value was given using:
  • dimmer_0_255 =  255 * current_step / total_steps;

So I had the idea of removing all the division operations from my light Class. I could just precalculate the values I needed and store them in an array, where a read operation would be much faster than division. So a 0-20 counter could return a 0-255 conversion by simply reading an array with the stored values {0, 12, 25, 38, 51 ........}. So I made the replacements to my first example, and tried it without the RPi doing any automatic reads. It worked fine independently, and I could see that it was showing the correct values by doing manual reads after pressing the buttons.
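
For reference, the two conversions described here can be put side by side. This is a sketch under assumptions, not the original code: the function names and `kPwmTable` are invented, and the table entries are simply 255 * i / 20 truncated, which matches the first few values quoted above.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kTotalSteps = 20;

// Division version: multiply first, then divide (the result truncates).
inline std::uint8_t pwm_by_division(int step) {
    assert(step >= 0 && step <= kTotalSteps);
    return static_cast<std::uint8_t>(255 * step / kTotalSteps);
}

// Precomputed version: entry i holds 255 * i / 20.
constexpr std::uint8_t kPwmTable[kTotalSteps + 1] = {
    0,   12,  25,  38,  51,  63,  76,  89,  102, 114, 127,
    140, 153, 165, 178, 191, 204, 216, 229, 242, 255};

inline std::uint8_t pwm_by_table(int step) {
    assert(step >= 0 && step <= kTotalSteps);
    return kPwmTable[step];
}
```

Both give identical results for every step from 0 to 20. The largest intermediate product is 255 * 20 = 5100, which still fits in a 16-bit int, so the division version does not overflow on an 8-bit AVR either.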

So I started the RPi automatic reader... and now the microcontroller would count between 0 and about 9 steps  |O |O |O  :-DD :-DD Like bro what. Again, the second I stopped the RPi script, the micro would function as normal. How is any of this possible??

At this point I had completely optimised my light class functions, there were no divisions, only basic math, memory accessing and bit shifting. But even then, how is it possible that something which should occur ~1.5ms after completing all functions could have any effect?

In the end, I "fixed" the problem on a completely different hunch. write_modbus_registers() involves calling the ModbusRTUServer.holdingRegisterWrite() function, and process_modbus_stuff() involves calling the ModbusRTUServer.holdingRegisterRead() function.

Because ModbusRTUServer.poll() doesn't distinguish between read and write transactions, my function was re-writing the modbus registers to the light class even on read requests. I guessed that the modbus registers were not being updated before ModbusRTUServer.poll() was called, thereby resetting the previous dimmer change. Again, that shouldn't matter, since it looks like ModbusRTUServer.holdingRegisterWrite() just writes to an array and should therefore be complete before loop() starts again. Not to mention the completely aberrant behaviour of the internal counter in the light Class not being incremented or decremented properly.

It should be mentioned that throughout all of this, I could still write to the micro just fine via modbus, i.e. set the dimmer levels by writing to the registers, so I don't think there are any issues with my modbus processing functions.

This has really shaken up my understanding of what is going on under the hood. I'm not a low level programmer. I know the arduino environment is highly abstracted and hides away nearly all the underpinning code. I've guessed as much in looking at how the serial interfaces operate in the background. But from this experience, I can surmise that when I write code in order, I can't guarantee that the code will be executed or completed in order?
« Last Edit: October 04, 2024, 11:34:12 pm by John B »
 

Offline AussieBruce

  • Regular Contributor
  • *
  • Posts: 61
  • Country: au
Re: Bizarre arduino behaviour, I actually don't understand microcontrollers.
« Reply #1 on: October 05, 2024, 12:21:57 pm »
Based on my experience, it's time to back out and reimplement, or re-enable your functionality incrementally, testing thoroughly at every stage. Note also that whilst arduino does a fine job of hiding the complexities of 'the metal', in the process it utilises a large amount of middleware code, which AFAIAA is all open source. That's particularly the case if you're using libraries. More than once when I've mentioned a specific library in a post, people with a lot more nous than me have come back and cautioned me on its use, and often suggested a better alternative.

Unexpected or hidden blocking within libraries can cause problems when you have multiple timed functions in play. Unfortunately, in some cases the only way to find out is to dig into the open source files (Copilot will tell you how to do that, but unless you're pretty platform literate it may be a hard slog). Also, if you're enabling interrupts yourself without a lot of due care, that's a likely cause.

I presume you have debounced your buttons?
 

Offline hans

  • Super Contributor
  • ***
  • Posts: 1689
  • Country: nl
Re: Bizarre arduino behaviour, I actually don't understand microcontrollers.
« Reply #2 on: October 05, 2024, 01:02:54 pm »
Which Arduino are you using to run this sketch?

If a certain code path runs in the background and suddenly has side effects in your part of the code, then it's time to start looking beyond C's theoretical safe garden and think more about platform specifics.

For example, if you use the Arduino Uno, that part only has 2kB of RAM. If that includes a heap, a bunch of globals, and then a growing stack that could introduce problems.

One trigger for growing stack usage could be the modbus library running inside the interrupt routine. You would have to add the maximum stack usage of main() and of the interrupt routine together to get the worst case. If you can guarantee that main() is back in loop() within that 1.5ms, there will probably be some headroom. But it could also be that the interrupt routine alone is simply too big.

Adafruit has an article about this: https://learn.adafruit.com/memories-of-an-arduino/measuring-free-memory

I've been browsing around in the Arduino modbus library, and it has plenty of functions that use tons of stack space:

https://github.com/arduino-libraries/ArduinoModbus/blob/master/src/libmodbus/modbus.c#L766

Here MAX_MESSAGE_LENGTH is 260 bytes.

I'm not sure if this is your problem, but it's the first thing that comes to mind if running a seemingly unrelated codepath messes with other code.
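
The worst-case addition described above is just arithmetic, and a rough sketch of it may help. Everything here is an assumption for illustration: the struct, the field names and any numbers fed into it are placeholders, not measurements from the sketch in question.

```cpp
#include <cassert>

// Placeholder RAM budget; none of these figures come from the real sketch.
struct RamBudget {
    int total_ram;    // e.g. 2048 bytes on an ATmega328P
    int static_data;  // globals and statics, as reported by the build
    int heap;         // worst-case dynamic allocation
    int main_stack;   // deepest call chain starting from loop()
    int isr_stack;    // deepest call chain starting from an interrupt handler
};

// Bytes left when both worst cases coincide; a negative result means the
// stack would run into the other regions.
inline int headroom(const RamBudget& b) {
    assert(b.total_ram > 0);
    return b.total_ram - b.static_data - b.heap - b.main_stack - b.isr_stack;
}
```

As a made-up example, 2048 bytes of RAM with 1200 bytes of globals, no heap, 400 bytes of main stack and 300 bytes of interrupt stack leaves 148 bytes. A single extra 260-byte buffer on the stack would be enough to push that negative.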
 

Offline John B (Topic starter)

  • Frequent Contributor
  • **
  • Posts: 824
  • Country: au
Re: Bizarre arduino behaviour, I actually don't understand microcontrollers.
« Reply #3 on: October 05, 2024, 11:44:52 pm »
After thinking it was fixed, it was in fact not fixed.

But now I think I've fixed it. Firstly, yes the buttons are analogue debounced (in a rather cool way, with a fast on time constant, but slow off time constant). Anyways I digress.

I'm using a genuine Arduino Nano Every. As an aside, I did previously have issues with the Arduino Modbus library on a clone Nano using the ATmega328p. As the sketch grew, I implemented some timed functions using the millis() call. Once I implemented those, garbled data started being written to the modbus registers. Those issues went away when moving to the Nano Every.

Back to the current sketch: in my abbreviated example, process_modbus_stuff() writes the modbus registers to the light classes, while write_modbus_registers() writes the light class data to the modbus registers.

I tried disabling process_modbus_stuff() to make the micro a read-only device via modbus, and sure enough all the counter issues were solved. I did look at my code and noticed that process_modbus_stuff() would be rewriting back into the light classes any time the light classes wrote into the modbus registers. I added a copy of the registers in order to do a comparison, and only update the light classes if there is a difference in the modbus registers (i.e. they have been changed in response to an external write register command).
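
The compare-against-a-copy idea can be sketched like this. It is a host-compilable illustration only, not the code from the sketch: `RegisterShadow`, its method names and `kNumRegs` are hypothetical, and the current register value is passed in rather than read from the modbus library.

```cpp
#include <cassert>
#include <cstdint>

constexpr int kNumRegs = 4;  // hypothetical number of holding registers

class RegisterShadow {
public:
    // Record a value the device itself has just written to a register
    // (for example after a button press).
    void note_local_write(int reg, std::uint16_t value) {
        assert(reg >= 0 && reg < kNumRegs);
        shadow_[reg] = value;
    }

    // Returns true only if `current` differs from the last known value,
    // i.e. something external changed the register. Updates the shadow.
    bool externally_changed(int reg, std::uint16_t current) {
        assert(reg >= 0 && reg < kNumRegs);
        if (shadow_[reg] == current) return false;
        shadow_[reg] = current;
        return true;
    }

private:
    std::uint16_t shadow_[kNumRegs] = {};
};
```

The light classes would then be updated only for registers where externally_changed() returns true, so a plain read request from the master never writes anything back into them.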

Once I re-implemented process_modbus_stuff(), the counters work again. In theory, this shouldn't have made a difference, as it would be something equivalent to:

Code: [Select]
light_variable = 128;
modbus_array[0] = light_variable;
light_variable = modbus_array[0];
//light_variable should still be equal to 128

From the source code, ModbusRTUServer.holdingRegisterWrite() just looks like it writes to an array, but I'm not an experienced C programmer, so I don't understand the complexities of the classes it belongs to, and how they interact with interrupts, like from the serial interface for example.

Also, the function can just be called as is with a register number and value as arguments ModbusRTUServer.holdingRegisterWrite(0, 128); for example.

But it can also return 1 for success, 0 for failure. So, even though this wasn't necessary to fix the program, I did experiment with replacing all the calls with:

Code: [Select]
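while (!ModbusRTUServer.holdingRegisterWrite(reg, value)) {
    // Retry until the write reports success.
    // Named "reg" because "register" is a reserved word in C++.
}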

It worked fine too.

If there is some quirk with the functions, would the while loop force the program to stay at that point in the program until the holdingRegisterWrite function completes successfully?

In this case, the register values are hard coded, so the function can never be passed invalid register numbers, which would halt the program.
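
On the while loop question: an ordinary C++ function call does not return until the function body has finished, so the loop does not "wait for completion", it only repeats the call when the call reports failure. A bounded retry avoids the endless loop on a bad register number. The sketch below assumes holdingRegisterWrite() is an ordinary synchronous call, and it uses a stand-in called `fake_register_write` (hypothetical, not the library function) that only copies the 1-for-success, 0-for-failure convention.

```cpp
#include <cassert>

constexpr int kNumRegs = 4;  // hypothetical register count
int g_regs[kNumRegs] = {};

// Stand-in for a register write: returns 1 on success, 0 on failure.
int fake_register_write(int address, int value) {
    if (address < 0 || address >= kNumRegs) return 0;  // invalid address
    g_regs[address] = value;
    return 1;
}

// Try the write up to max_attempts times; true if one attempt succeeded.
bool write_with_retry(int address, int value, int max_attempts) {
    assert(max_attempts > 0);
    for (int i = 0; i < max_attempts; ++i) {
        if (fake_register_write(address, value)) return true;
    }
    return false;  // give up instead of spinning forever
}
```

With a valid address the first attempt succeeds and the loop body never repeats. With an invalid address an unbounded while loop would hang, whereas this version returns false after the last attempt.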
« Last Edit: October 05, 2024, 11:49:12 pm by John B »
 

Offline John B (Topic starter)

  • Frequent Contributor
  • **
  • Posts: 824
  • Country: au
Re: Bizarre arduino behaviour, I actually don't understand microcontrollers.
« Reply #4 on: October 06, 2024, 03:57:03 am »
Certainly... the RPi and multiple micros all communicate over an opto-isolated RS485 interface.
 

Online Siwastaja

  • Super Contributor
  • ***
  • Posts: 8864
  • Country: fi
Re: Bizarre arduino behaviour, I actually don't understand microcontrollers.
« Reply #5 on: October 06, 2024, 09:55:56 am »
You just have bugs. Probably nothing special or confusing related to the microcontroller per se* ... just that software development is difficult. You are confused because you have not developed a systematic approach to finding bugs and are working at too high a level, seeing the wrong high-level behavior but not the origin of the problem.

You need to start by adding visibility into the system. A generally good idea is to add logging to functions; the simplest is to print some short messages to the UART that show which function was called with which arguments. Start coarse until you find the place where something goes wrong.
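
A minimal sketch of that kind of call logging, written so it compiles on a PC. The names `trace`, `g_trace` and the instrumented `dimmer_increase` are invented for the example. On the target, each line would be printed to a serial port instead of being stored, and if the only UART is busy carrying modbus traffic the trace needs some other output.

```cpp
#include <cassert>
#include <cstdio>
#include <string>
#include <vector>

// Host-side stand-in for the UART: collected lines can be inspected later.
std::vector<std::string> g_trace;

// Record "name(arg)" so the order of calls can be read back afterwards.
void trace(const char* name, long arg) {
    assert(name != nullptr);
    char line[48];
    std::snprintf(line, sizeof line, "%s(%ld)", name, arg);
    g_trace.push_back(line);
}

// Hypothetical instrumented function: logs its argument on entry.
int dimmer_increase(int step) {
    trace("dimmer_increase", step);
    return step < 20 ? step + 1 : step;
}
```

Calling dimmer_increase(3) returns 4 and leaves the line dimmer_increase(3) in the log, which is the kind of coarse breadcrumb that shows where the counter stops following the button presses.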

*) an exception to that would be shared context between ISR(s) and the main code, which requires atomic access and the volatile qualifier, something that doesn't hit you in single-threaded desktop applications.
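
A rough illustration of that footnote, assuming a 16-bit variable shared between an ISR and loop() on an 8-bit CPU. The variable and function names are made up. noInterrupts() and interrupts() are provided by the Arduino core on the target, and the two stubs below exist only so the sketch compiles off-target.

```cpp
#include <cassert>
#include <cstdint>

// Host stubs only; on an Arduino the core supplies these two functions.
int g_irq_disable_depth = 0;
void noInterrupts() { ++g_irq_disable_depth; }
void interrupts() { --g_irq_disable_depth; }

// Written by a (hypothetical) ISR, read by loop(): volatile stops the
// compiler from caching the value in a register.
volatile std::uint16_t g_shared_count = 0;

// Copy the value with interrupts masked, so an ISR cannot change it
// between the two single-byte reads an 8-bit CPU needs for 16 bits.
std::uint16_t read_shared_count() {
    noInterrupts();
    std::uint16_t copy = g_shared_count;
    interrupts();
    assert(g_irq_disable_depth == 0);  // masking must be balanced
    return copy;
}
```

The main code then works on the local copy only. Reading g_shared_count directly in several places is where torn or inconsistent values can come from.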
 
The following users thanked this post: nctnico, thm_w

Offline John B (Topic starter)

  • Frequent Contributor
  • **
  • Posts: 824
  • Country: au
Re: Bizarre arduino behaviour, I actually don't understand microcontrollers.
« Reply #6 on: October 07, 2024, 12:08:36 am »
That doesn't help. I've been working on the code over a couple of years now in various forms, and the fact is that there are some quirks when it comes to setting up the I2C interface and writing bytes throughout the setup function, and reproducible bugs in the ArduinoModbus library when used on more limited devices like the 328 that can't be hand-waved away as errors in the top level code. Something which I had documented previously:

......
I also ran into some weird issues on an arduino nano, despite the compiler showing sufficient memory and ram, where coil and register memory addresses were being overwritten by other functions, so you may be limited in sketch complexity.
......

As with any other problem, finding a post from someone who has experienced and/or solved an issue can save a lot of time and headache.
 

Online Siwastaja

  • Super Contributor
  • ***
  • Posts: 8864
  • Country: fi
Re: Bizarre arduino behaviour, I actually don't understand microcontrollers.
« Reply #7 on: October 07, 2024, 04:24:45 pm »
Quote
like the 328 that can't be hand-waved away as errors in the top level code.

Not errors in top level code, errors in bottom level code! You have to get your hands dirty and dive into the code until you find what is broken.

Quote
I also ran into some weird issues on an arduino nano, despite the compiler showing sufficient memory and ram, where coil and register memory addresses were being overwritten by other functions, so you may be limited in sketch complexity.

Memory corruption issues are nasty, but still not undebuggable. Like, create a function which compares a corrupted location to a known-good test value you inject, and prints an error (maybe with a number argument like if(val != expected) printf("error happened at location %d\n", argument);). Call this all over your codebase until you find the spot where memory corrupts. Then add more calls between the known-good and known-bad call until you find the exact place in the code which causes the corruption.
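
Spelled out as a complete function, the check might look like the sketch below. The names `g_watched`, `kExpected` and `check_watched` are placeholders. In a real hunt `g_watched` would be the register or variable that gets corrupted, and on the target the message would go to a serial port rather than through printf.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

constexpr std::uint16_t kExpected = 0xA55A;  // known-good value you inject
std::uint16_t g_watched = kExpected;         // stands in for the corrupted location

// Returns true while the watched location is intact; otherwise reports
// which call site noticed the damage.
bool check_watched(int location) {
    assert(location >= 0);  // call-site ids are non-negative in this sketch
    if (g_watched != kExpected) {
        std::printf("error happened at location %d\n", location);
        return false;
    }
    return true;
}
```

Sprinkle check_watched(1), check_watched(2) and so on through the code. The corruption happened somewhere between the last call that returned true and the first one that returned false.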

There are other ways too, and sometimes bugs are Heisenbugs but systematic approach is the best way forward.

Quote
As with any other problem, finding a post from someone who has experienced and/or solved an issue can save a lot of time and headache.

Yes, we all have struggled with memory corruption. "Why modbus is not working" is the wrong question, because memory corruption might not have anything to do with modbus, so all attempts to fix the issue by looking at superficially similar modbus problems are likely doomed to fail*. Right question is "how do I debug memory corruption", and to that I have offered some ideas (and many more are available to you).

*) unless it's a very well known library bug, in which case it's weird it's not fixed
« Last Edit: October 07, 2024, 04:29:23 pm by Siwastaja »
 
The following users thanked this post: thm_w

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3929
  • Country: us
Re: Bizarre arduino behaviour, I actually don't understand microcontrollers.
« Reply #8 on: October 07, 2024, 04:59:23 pm »
Siwastaja is correct.  Your code has bugs.  Just because it has worked before doesn't mean it doesn't have bugs.  One possibility based on a brief reading of your symptoms is writing to stack variables after the function has returned.  For instance if you allocate something on the stack and store a pointer to it in a global variable, then access later, possibly from an ISR.
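
To make the stack-variable case concrete, here is a minimal sketch with invented names. Nothing suggests the sketch under discussion contains this exact code. The buggy form is shown only as a comment, because running it is undefined behaviour.

```cpp
#include <cassert>
#include <cstdint>

// Global pointer that later code (possibly an ISR) dereferences.
const std::uint8_t* g_tx_buffer = nullptr;

// BUGGY pattern, comment only:
//   void start_send_bad() {
//       std::uint8_t frame[4] = {1, 2, 3, 4};  // lives on the stack
//       g_tx_buffer = frame;                   // dangles after return
//   }

// Safer pattern: the buffer has static storage and outlives the call.
void start_send() {
    static std::uint8_t frame[4] = {1, 2, 3, 4};
    g_tx_buffer = frame;
}

// Unrelated work that reuses stack space after start_send() has returned.
int churn_stack() {
    volatile int scratch[16];
    for (int i = 0; i < 16; ++i) scratch[i] = i;
    return scratch[15];
}

// Later access through the global pointer.
std::uint8_t peek_tx(int index) {
    assert(g_tx_buffer != nullptr && index >= 0 && index < 4);
    return g_tx_buffer[index];
}
```

With the static buffer, peek_tx() still sees the original bytes after other functions have run. With the buggy form, calls such as churn_stack() are free to overwrite the memory the pointer refers to, which shows up as "unrelated" code changing the data.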

The "change something to see if it makes the code work" is exactly the wrong way to develop code.  If you do this, you will only fix the bug by chance.  You will just as often hide the bug temporarily, then it will come back after unrelated changes and you will be playing whack-a-mole forever.

Instead, the cycle you need to follow is: reproduce, isolate, and analyze.  First, find a way to reproduce your bug.  This can be either finding a test case that always fails, or finding instrumentation that at runtime detects the signature of the problem.  Then, isolate the bug by reducing the amount of code involved to the minimum possible.  Then use a combination of logging, online debuggers, and even mentally stepping through your code with pen and paper, figure out exactly what is happening -- not just where it is going wrong, but understand why the behavior you observe is happening.  Then you fix the problem such that it can never happen again, and keep going.

Making seemingly random changes and testing whether that changes the behavior can be useful for helping to isolate or analyze the problem, but unless you drill down to the exact place where the runtime state is corrupted, it's not a bug fix.

Another area where microcontrollers are sadly lacking is sanitizers.  These are helpful for short circuiting a lot of this cycle and finding obvious bugs like using an out-of-scope memory buffer.  Unfortunately, microcontrollers generally don't have the hardware resources to do this in hardware, and few people have an emulation platform that is good enough to run on a computer that does.
 
The following users thanked this post: Siwastaja

Offline thm_w

  • Super Contributor
  • ***
  • Posts: 7223
  • Country: ca
  • Non-expert
Re: Bizarre arduino behaviour, I actually don't understand microcontrollers.
« Reply #9 on: October 08, 2024, 12:20:23 am »
Quote
Another area where microcontrollers are sadly lacking is sanitizers.  These are helpful for short circuiting a lot of this cycle and finding obvious bugs like using an out-of-scope memory buffer.  Unfortunately, microcontrollers generally don't have the hardware resources to do this in hardware, and few people have an emulation platform that is good enough to run on a computer that does.

Yeah, static analysis can help here but the good options are paid.
Best if OP either posts their full code to be analyzed, or follows the advice above.
 

Offline Sacodepatatas

  • Regular Contributor
  • *
  • Posts: 105
  • Country: es
Re: Bizarre arduino behaviour, I actually don't understand microcontrollers.
« Reply #10 on: October 08, 2024, 01:45:15 am »
Quote
Another area where microcontrollers are sadly lacking is sanitizers.  These are helpful for short circuiting a lot of this cycle and finding obvious bugs like using an out-of-scope memory buffer.

Yeah, it's very disappointing when you are working out a severe bug, that is, trap exceptions, dissect the code into small pieces , replicate the failure in another piece of code by grafting the sensitive parts of your code into a working setup until it fails, ...

Till you notice that your code is fine but the memory boundaries are wrong at compile time, because the MCU shares memory or not depending on whether a certain device is enabled, as it happened to me here...
 

Offline bson

  • Supporter
  • ****
  • Posts: 2464
  • Country: us
Re: Bizarre arduino behaviour, I actually don't understand microcontrollers.
« Reply #11 on: October 08, 2024, 08:24:21 pm »
Another option is to mock out the hardware specific functions and compile to run from the command line on Linux.  Then valgrind it to the bone - this will catch lots and lots and lots of problems you never even imagined existed.  Lots of false positives too, but generally those arise from dubious practices to begin with.
« Last Edit: October 08, 2024, 08:28:15 pm by bson »
 

Offline ejeffrey

  • Super Contributor
  • ***
  • Posts: 3929
  • Country: us
Re: Bizarre arduino behaviour, I actually don't understand microcontrollers.
« Reply #12 on: October 08, 2024, 08:54:55 pm »
Yes.  This can be difficult if you don't design with that in mind: you want to keep the IO functions isolated and not have random explicit register writes all over.  Which is good practice but can seem daunting to clean up if you didn't start like that.

One thing that can help this is to replace the main event loop with a sequential program that reproduces your test case.
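
As a sketch of what such a sequential program could look like for the case in this thread (press, read, press, read): the `Device` struct below is a hypothetical reduction of the controller to the state that matters. It is not the real sketch, and the real handlers would be called in its place.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical reduction of the device to the state the test case touches.
struct Device {
    int step = 0;           // dimmer counter, 0-20
    std::uint16_t reg = 0;  // holding register mirrored to the master
    bool irq = false;       // "please read me" output

    // Button handler: bump the counter, publish it, raise the request line.
    void press_up() {
        if (step < 20) ++step;
        reg = static_cast<std::uint16_t>(step);
        irq = true;
    }

    // The master services the request: clear the line, return the register.
    std::uint16_t master_read() {
        irq = false;
        return reg;
    }
};

// The scripted scenario, with no event loop involved.
int run_scenario() {
    Device d;
    d.press_up();
    d.master_read();
    d.press_up();
    assert(d.irq);  // second press must raise the request again
    d.master_read();
    return d.step;
}
```

In this model run_scenario() returns 2. If the same script driven through the real handlers ends at 1, the failure is reproduced without any timing or hardware in the picture.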
 

Offline bson

  • Supporter
  • ****
  • Posts: 2464
  • Country: us
Re: Bizarre arduino behaviour, I actually don't understand microcontrollers.
« Reply #13 on: October 15, 2024, 11:55:45 pm »
This is one time when C++ is handy... if all code for say a UART or RTC is in a single class it's pretty easy to swap it out for a mock class that implements some sort of send-expect or other scripted behavior and then outputs a log to help track down functional problems.  As long as it implements all the same public functions it's an easy substitution.
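
A minimal sketch of that substitution, with illustrative names only (`Uart`, `MockUart` and `report_level` are not from any real driver). The code under test sees just the interface, so the test build hands it the mock and the firmware build would hand it the real driver.

```cpp
#include <cassert>
#include <string>

// Interface shared by the real driver and the mock.
class Uart {
public:
    virtual ~Uart() = default;
    virtual void write(const std::string& data) = 0;
};

// Mock that records everything written so a test can inspect it.
class MockUart : public Uart {
public:
    void write(const std::string& data) override { log_ += data; }
    const std::string& log() const { return log_; }

private:
    std::string log_;
};

// Code under test depends only on the interface.
void report_level(Uart& port, int step) {
    assert(step >= 0 && step <= 20);
    port.write("LEVEL " + std::to_string(step) + "\n");
}
```

After report_level(mock, 7), the mock's log holds the text LEVEL 7 followed by a newline, which a test can compare against the expected output.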
 

Offline pqass

  • Frequent Contributor
  • **
  • Posts: 918
  • Country: ca
Re: Bizarre arduino behaviour, I actually don't understand microcontrollers.
« Reply #14 on: October 16, 2024, 12:38:00 am »
Or... you could just wrap select variables/functions (whichever file they may be found in) with:

Code: [Select]
#ifdef USEDUMMY
int func(...) { return 0; }   /* dummy stand-in, same signature as the real one */
#else
int func(...) {
    ...real code...
}
#endif

$ gcc ... -DUSEDUMMY ...

No, it's not elegant but can be done now without reordering/relocating the existing code.
 

Online Siwastaja

  • Super Contributor
  • ***
  • Posts: 8864
  • Country: fi
Re: Bizarre arduino behaviour, I actually don't understand microcontrollers.
« Reply #15 on: October 16, 2024, 09:41:27 am »
Quote
This is one time when C++ is handy... if all code for say a UART or RTC is in a single class it's pretty easy to swap it out for a mock class that implements some sort of send-expect or other scripted behavior and then outputs a log to help track down functional problems.  As long as it implements all the same public functions it's an easy substitution.

I don't understand how this swapping is any easier in C++ than, for example, in C. As long as the implementation is a separate compilation unit and uses the same interface, it is indeed very easy to swap, and swapping happens exactly the same way in C or C++ (replacing the file to be compiled/linked). Swapping isn't the problem; creating and maintaining two sets of modules, one for actual HW and one for simulation, is all the work.

The practice of abstracting HW-dependent parts out and creating "simulation" versions of those code modules is a very good idea, and probably saves back the time spent, plus results in a higher quality, more reliable product. But this requires some self-discipline, and sometimes it's simply too time-consuming to do properly, e.g. how do you simulate an ADC and create analog input values with correct timing etc.?

But yeah, at least pick any low-hanging fruits. e.g. ASCII UART interfaces are easy to test using stdin & stdout and there is no excuse not to do that bare minimum.
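
For the stdin/stdout idea, a sketch along these lines is enough to start with. The tiny "SET n" / "GET" protocol is invented for the example. The point is only that the handler takes generic streams, so the same code can be fed from a terminal, a file or a test string.

```cpp
#include <cassert>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical line protocol: "SET <n>" stores a level 0-20, "GET" prints it.
void serve(std::istream& in, std::ostream& out) {
    int level = 0;
    std::string line;
    while (std::getline(in, line)) {
        std::istringstream words(line);
        std::string cmd;
        words >> cmd;
        int value = 0;
        if (cmd == "SET" && (words >> value) && value >= 0 && value <= 20) {
            level = value;
            out << "OK\n";
        } else if (cmd == "GET") {
            out << level << "\n";
        } else {
            out << "ERR\n";  // unknown command or out-of-range value
        }
    }
}
```

On a PC, serve(std::cin, std::cout) gives an interactive session, and a test can pass string streams instead and compare the complete output.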
 

Online Siwastaja

  • Super Contributor
  • ***
  • Posts: 8864
  • Country: fi
Re: Bizarre arduino behaviour, I actually don't understand microcontrollers.
« Reply #16 on: October 16, 2024, 09:44:40 am »
Quote
Or... you could just wrap select variables/functions (whichever file they may be found in) with:

Code: [Select]
#ifdef USEDUMMY
int func(...) { return 0; }   /* dummy stand-in, same signature as the real one */
#else
int func(...) {
    ...real code...
}
#endif

$ gcc ... -DUSEDUMMY ...

No, it's not elegant but can be done now without reordering/relocating the existing code.

This is a good start and good enough for many purposes, especially if you were not able to completely "layer" the functionality so that your "HW module" after all has some common logic which needs to be the same in the SW version. Then doing it as you show gives you the same swappability without you either copy-pasting some common parts (yuck, much worse than what you show), or doing a rewrite with yet one more layer added (which might have performance implications, too, and too much layering isn't always very easy to read/maintain either).
 

