Of stack overflows on embedded systems

This last week I spent some intensive time investigating a random, pretty much non-reproducible bug on an embedded system (RL78/G14). It had maybe been seen a few times before and dismissed. It has now been seen a few more times, and in a form where the fault stuck around once it occurred.

This fault caused quite a few hours of staring at code, trying to find an edge case where the state machine had gone wrong and got stuck in a bad state. A *LOT* of staring. And then some more.

This code ran pretty much all the time; it had a common function and duplicated logic for two states.

Nothing was suggesting an actual software bug, nothing was suggesting a compiler bug (of which we’ve had a few on this project!).

Soooo… I had noted as a risk earlier that we had used 76% of ROM and 52% of RAM on this chip. Sounds like a bit left, but then we only had 5.4KB of RAM. That means only ~2000 bytes of Stack space left. That could be as low as 15 function calls deep or as high as 30. Neither is a big number.

Investigating the generated map file, I noted that the data for the area where we had seen most issues was near the top of working memory (not the very first thing, though). I also noted, as I kind of already knew, that the stack was allocated from the top of RAM and grows downwards, so if we went too far we would trash the working memory.

This isn’t your stack overflow of the security type; this is literally running out of stack space and writing over your working variables. Why does it happen on such a random basis? Well, perhaps we are operating with 1 or 2 levels spare at the very deepest calls, which happen only under certain logic and only for short periods. But if you get an interrupt (or even an interrupt within that one), it could well be the straw that breaks the camel’s back.
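To make the "1 or 2 levels spare" reasoning concrete, here is a rough budget check in C. It is only a sketch: the struct, the function names and every number in the usage note are illustrative assumptions, not measurements from this project.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stack budget; all values are supplied by the caller. */
typedef struct {
    uint16_t stack_bytes;        /* total stack available */
    uint16_t frame_bytes;        /* assumed average cost of one call level */
    uint16_t deepest_call_depth; /* deepest call chain in normal code */
    uint16_t isr_bytes;          /* assumed cost of one interrupt, including its calls */
    uint8_t  isr_nesting;        /* interrupts that can be stacked at the same time */
} stack_budget_t;

/* Worst case: the deepest call chain plus every nested interrupt on top of it. */
uint32_t worst_case_bytes(const stack_budget_t *b)
{
    return (uint32_t)b->frame_bytes * b->deepest_call_depth
         + (uint32_t)b->isr_bytes * b->isr_nesting;
}

/* True when the worst case no longer fits in the stack. */
bool stack_overflows(const stack_budget_t *b)
{
    return worst_case_bytes(b) > b->stack_bytes;
}
```

With an assumed 2000 bytes of stack, 100 bytes per call level and a deepest chain of 19 calls, the code alone needs 1900 bytes and fits. Add a single 120-byte interrupt at that moment and the total is 2020 bytes, which does not.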

I guess I got complacent. On big systems the MMU would have saved us (well, it would have crashed us) if we even managed to get there. Most of the time you run out of ROM before you run out of RAM.

But the question remains: what can we do about it? Worse still, all we have is a lot of evidence pointing to this as the cause of an error, but it’s not repeatable via steps or timing.

The obvious answer is get a bigger chip, and it’s pretty likely this will happen, however our pipeline means we may have to work with what we have for a period.

We can reduce RAM/Stack usage. I’ve looked at this a few times and there isn’t much in the way of reduction without removing functionality.

The next thought is that if this happens so rarely that we have seen maybe 4 or 5 of these in normal usage in 6 months then perhaps we can fix up the routine to break the state machine deadlock. Except, if we trashed the working memory we can’t be sure that everything else is actually OK. It’s a house of cards, once that has occurred we can’t trust anything.

So, I’m left with trying to detect and rectify on the run. 

Step 1 – move the stack below the working memory

On this platform the RAM starts at a non-zero address and attempting to write at addresses below this will cause a reset with an illegal memory access. This gets us back to safe conditions and also allows us to log the reset cause – we should get an idea of just how frequently this is actually happening.
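Logging the reset cause could look roughly like the sketch below. The names are hypothetical, and the bit mask for the illegal memory access flag is an assumption (I believe it is bit 1 of the RL78 reset flag register, but check the hardware manual for your device). The function takes the raw flag byte as a parameter so it does not depend on any particular register header.

```c
#include <stdint.h>

/* ASSUMPTION: mask of the "illegal memory access" bit in the reset flag
   register. Take the real value from your device's hardware manual. */
#define RESET_FLAG_ILLEGAL_ACCESS 0x02u

/* Hypothetical counters. On a real target these must live somewhere that
   survives a reset, e.g. a RAM section the startup code does not clear. */
typedef struct {
    uint16_t illegal_access; /* resets caused by running off the end of RAM */
    uint16_t other;          /* every other reset cause, including power-on */
} reset_log_t;

/* Call once at startup with the raw reset flag value. */
void reset_log_update(reset_log_t *log, uint8_t reset_flags)
{
    if ((reset_flags & RESET_FLAG_ILLEGAL_ACCESS) != 0u) {
        log->illegal_access++;
    } else {
        log->other++;
    }
}
```

Read the flag register once, early in startup, and pass the value in. If the illegal access counter climbs during normal usage, that is fairly direct evidence of how often the stack is running out.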

Luckily, this build toolchain uses LD, the GNU linker. Unfortunately, it turns out you can’t get the size of a section without defining it.

This matters because we want:

  • 0xF0000 – end of stack
  • 0xF???? – start of stack
  • 0xF???? + 1 – start of working RAM
  • 0xFFFFF – end of working RAM

In other words, we want to allocate the working memory based on its size from the end of RAM.

A long story short here, the linker allocation pointer can only be moved forwards, so you need to know the .data and .bss data sizes, then allocate .stack, then allocate the .data and .bss sections.

My first guess was to list a fake RAM duplicate section, allocate a fake .data and .bss (named .fdata and .fbss since I was feeling creative) with a NOLOAD attribute.

I now had the size, except when I now allocated the actual .data and .bss their size was zero.

Turns out the linker is clever, it could see I had already included the objects for .data, .bss and COMMON into the two fake sections and so didn’t/wouldn’t include them again.

As far as I can see there is no trick to doing this, I tried DSECT, KEEP and a bunch of other attributes but it would still not include them.

I then tried calculating the sizes they needed to be and moving the allocation pointer (the ‘.’ ) on by the needed sizes. That worked.

Well, it worked for making the right size, but I kept getting a reset and debugging showed what I believe to be the symbol table not being relocated, so it wasn’t actually correct. Probably because all those symbols were in the discarded (NOLOAD) section of .fdata and .fbss.

I admit defeat at this point. I can’t see a way to automatically size the stack and working RAM like this without something clever such as a two-pass link: link once to get the section sizes, extract them, and relink with the non-fake block.

Step 1 – move the stack below working memory (again!)

Step 1, attempt 1 failed, but I should at least be able to statically size the stack and do this right?

Yes, this is actually pretty easy:

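stack_size = 0x900;             /* fixed stack size, 2304 bytes */
    /* The stack goes first, at the bottom of RAM, so running off the
       end of it hits invalid addresses instead of working memory. */
    .stack 0xFE900 (NOLOAD) : 
    {
        . += stack_size;
        _stack = .;             /* top of stack; it grows down from here */
    } > RAM
    /* Initialised data: lives in RAM, initial values loaded from _mdata. */
    .data : AT(_mdata)
    {
        . = ALIGN(2);
        _data = .;
        *(.data)
        *(.data.*)
        . = ALIGN(2);
        _edata = .;
    } > RAM
    PROVIDE(__romdatacopysize = SIZEOF(.data));
    /* Zero-initialised data follows .data. */
    .bss :
    {
        . = ALIGN(2);
        _bss = .;
        *(.bss)
        *(.bss.*)
        . = ALIGN(2);
        *(COMMON)
        . = ALIGN(2);
        _ebss = .;
        _end = .;
        /* Fail the link if working memory no longer fits above the stack. */
        ASSERT((_end < 0xFFEDC),"Not enough space in RAM for working Memory");
    } > RAM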

The only things to note here are:

  • You need to define the max stack size, which has to take into account the memory needed for the working RAM. That’s pretty much trial and error.
  • You need to ensure that ‘AT (_mdata)’ (in my case) is used to get the right size and linkage.
  • You really want to include that ASSERT at the end to know when it goes wrong.
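The trial and error can be shortened a little by working out the largest stack that still passes the ASSERT. The sketch below is a host-side helper and not part of the build. It assumes the two addresses from the script above (0xFE900 and 0xFFEDC) and takes the working memory size, i.e. _end minus _data, as read from the map file. The function name is made up.

```c
#include <stdint.h>

#define RAM_BASE  UINT32_C(0xFE900) /* start address given to .stack */
#define RAM_LIMIT UINT32_C(0xFFEDC) /* _end must stay strictly below this */

/* Largest even stack size that leaves room for `working_bytes` of
   .data/.bss/COMMON. Returns 0 if the working memory alone does not fit. */
uint32_t max_stack_size(uint32_t working_bytes)
{
    uint32_t ram_bytes = RAM_LIMIT - RAM_BASE;
    /* Round up to an even size, as the ALIGN(2) statements would. */
    uint32_t working_even = (working_bytes + 1u) & ~(uint32_t)1;

    /* The ASSERT is a strict '<' and _end is 2-byte aligned, so the last
       usable value of _end is RAM_LIMIT - 2. */
    if (working_even + 2u > ram_bytes) {
        return 0u;
    }
    return ram_bytes - working_even - 2u;
}
```

For example, 3292 bytes of working memory leaves room for a 2302-byte (0x8FE) stack, so a stack_size of 0x900 would trip the ASSERT in that case.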

I don’t like this solution, it’s an extra step that the customer will need to know about.

Summary

We did manage to get our stack to sit below working memory, and a quick test with a stack-eating function showed that we do get our reset as planned.
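A stack-eating function for this kind of test can be as simple as a bounded recursion. This is a sketch and not the exact function we used; PAD_BYTES and the function name are made up, and the real stack cost per level depends on the compiler and its settings.

```c
#include <stdint.h>

#define PAD_BYTES 32u /* hypothetical extra stack burned per level */

/* Recurse `levels` deep, burning at least PAD_BYTES of stack per level.
   Returns the number of levels of recursion performed below this call. */
uint16_t eat_stack(uint16_t levels)
{
    volatile uint8_t pad[PAD_BYTES];
    uint16_t reached = 0u;

    /* Touch both ends of the buffer so it really occupies stack. */
    pad[0] = (uint8_t)levels;
    pad[PAD_BYTES - 1u] = pad[0];

    if (levels > 0u) {
        reached = (uint16_t)(1u + eat_stack((uint16_t)(levels - 1u)));
    }

    /* Reading the buffer after the call keeps it live across the recursion
       and discourages the compiler from turning this into a loop.
       Both bytes hold the same value, so the difference is always 0. */
    return (uint16_t)(reached + (pad[PAD_BYTES - 1u] - pad[0]));
}
```

On a desktop, eat_stack(10) simply returns 10. On the target, calling it with a depth far beyond what the stack can hold should never return, and with the stack at the bottom of RAM it should end in the illegal memory access reset.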

I think the vendor of our toolset and platform would be advised to provide this as a default linker script, with the stack maybe at 50% or maybe even 90% to force the developer to examine the settings early.

The result of this change would be that, during development, a stack overflow (is this really an underflow since it was going down?) would be immediately obvious.

The RL78/G14 is not exactly a hobby micro, so an unfriendly stack setting of 90% and placed below working RAM would seem reasonable to me, a stack overflow can have unpredictable effects otherwise.

We still aren’t convinced this was our problem, but it was potentially A problem, so we’ve fixed it. The good side is that I had an interesting issue to diagnose and fix and a challenging fight with a new tool I hadn’t paid much attention to before now.

Lastly, googling showed that this was a common enough issue, and I’d single out this blog post and this one from Embedded Gurus for confirming what I suspected and for the consolation that it was not just me who had found this one of the harder bugs to verify.
