What makes the stack concept useful in embedded programming

Tags: data, embedded, stack

[Figure: an illustration of the stack concept]

I have read many times about the stack pointer and how some variables are stored in SRAM on a stack.

Many tutorials explain how it works, but not why this concept exists at all.

Imagine I am the program and I want to store my cup, my watch and my hat in my room. Why would I store them on top of each other instead of putting them in random places in my room?

A program can reach data as long as it has the data's address. But if the stack pointer puts x on top of y, how will the program reach y without removing x first? So to me, a stack must make things (data access) even harder.

Can you give a very simple example with registers or assembly code or C code to illustrate the benefit of this stacking concept, namely "Last in, first out" (LIFO)?

Best Answer

Conclusion

Yes, you want stacks in embedded programming. It's a good idea that has emerged from long experience.

Short History

I'm old enough to have worked on computers that didn't possess hardware stack support. (You can always manufacture your own stacks in software, obviously, if you have the necessary instructions for some kind of indirect memory references, anyway.) So perhaps I have a few things to say about the topic.

The HP 21xx series processor family is a good example. Back in the day this processor family was commonly used in dual-processor configurations by school systems as a time-sharing system for administrative and/or student use. The last software edition using the HP 21xx processor family was the HP 2000F Timesharing System. (They switched to the HP 21MX going from "F" to "G".)

The instruction set did not support stacks for making a subroutine call. Instead, a subroutine call poked the return address into the subroutine's first memory location and started execution at the next address. At the end of the subroutine, an indirect jump instruction would reference that first memory location and jump back to the caller, just after the call, in that fashion.

Here's some actual code from the BASIC operating system (note the UPPER CASE being used -- lower case wasn't always available):

STACH NOP
      STB LTEMP+8      SAVE B-REG.
      AND B177         CLEAR TOP BITS.
      LDB STABF        GET BUFFER POINTER.
      CLE,ERB
      SEZ,RSS
      ALF,SLA,ALF      CHARACTER ON LEFT.
      IOR 1,I          CHARACTER ON RIGHT.
      STA 1,I
      ISZ STABF        BUMP POINTER.
      LDA LTEMP+8      RESTORE OLD B-REG.INTO A.
      JMP STACH,I

The last line is the return from subroutine instruction. Note that a NOP is always used as the first instruction of a subroutine. That's because it is always "blown away" and replaced by the caller's return address when the subroutine is called.

The RSS instruction is a modifier that "reverses the skip sense" of whatever it modifies. In this case, a "skip next instruction if E=0" instruction (SEZ, which tests the Extend register.) So RSS reverses that sense and changes it to a "skip next instruction if E!=0" instruction.

Here's a nearby example that actually has to play with the return address because it wants a different routine to directly return to this routine's caller:

STAPR NOP
      LDA STAPR        SAVE RETURN ADDRESS.
      STA T35SP
      LDA T35B2        COMPUTE # OF CHARS.
      CMA,INA
      ADA STABF
      LDB T35B2        RESET BUFFER
      STB STABF         POINTER.
      LDB T35B1
      JMP T35SP+1      OUTPUT.

Note that the last instruction jumps to T35SP+1. That's one instruction after T35SP. (T35SP is yet another function.) Note also that this routine had to first copy its own return address and stuff it into the first word of the routine it will be jumping to at the end. This allows that routine (which uses a JMP,I instruction to return) to return directly to this routine's caller. There's no stack, so this is kind of how it was done back then.

(The CMA,INA is the same thing as NEG A. It just means complement A and then increment A. You could do one, the other, or both in a single instruction.)

Note also that the assembler did NOT support local labels. All labels were completely global. This means those HP 21xx programmers were continually having to come up with new, slightly modified names for things to avoid conflicts. And no lower-case!

Yup. You got it. Times were hard back then.

Of course, an entire BASIC interpreter providing time-sharing for 32 simultaneous users, plus all the usual transcendental functions, matrices, and the usual matrix operations (including matrix inversion and matrix determinant operators), fit completely inside 6k words (16-bit words.) Yes -- the whole thing.

Now ask yourself, "What could I do with only 12k-byte of code space to work with?"

Moving on,

An immediately obvious problem with this arrangement is that you cannot call any other routines that might need to call the routine you are currently in because, if that were to happen, you'd lose the original caller's return address. And you certainly cannot recursively call yourself!

So if that's a requirement, you have to write a bunch of code in the routine to copy the address to some "software stack" that you create. And you have to do that everywhere it may be an issue.
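
To make that concrete, here's a minimal C sketch of such a hand-made software stack (the names and the fixed depth of 16 are my own, purely for illustration):

void *ret_stack[16];              /* the software stack storage itself */
int ret_top = 0;                  /* index of the next free slot */

void ret_push( void *ret_addr ) { /* prologue: save the poked-in return address */
    ret_stack[ret_top++] = ret_addr;   /* (no overflow check, for brevity) */
}

void *ret_pop( void ) {           /* epilogue: most recent address comes off first */
    return ret_stack[--ret_top];
}

Push on entry, pop just before the indirect jump back, and the LIFO order automatically hands each routine the correct return address, even under recursion. That bookkeeping is exactly what hardware stacks give you for free.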

It can become a serious pain. Been there, done that.

During the '60s, there was a learning process going on about useful mechanisms.

There's a step-wise process of learning that's sometimes called 2nd system syndrome. Or at least it was called that among my peers at the time. This is where the 1st system just fails to meet the necessary targets in a variety of ways and there's a strong push to use what's been learned in order to create a brighter 2nd system. But the 2nd system goes overboard with features and, as a result, consumers complain that they can't find the main things they really need because those are buried among so many different options. So, after some feature-trimming, right about on the 3rd system attempt the design is "about right." (I suppose this could also be the story of Goldilocks and the three bears, where the last bed tried was "just right.")

When the PDP-11 was designed, it was a "2nd system" approach. They provided every possible way of having hardware-supported stacks. There is nothing like it today. Nothing quite that good. If you ever get a chance, look over its instruction set. It's a marvel of ways to call other routines. You can call code and have your return address placed in a register. This is great for fast co-routines. You can call code and use almost any other register as a stack pointer for the return address. You're not stuck with just one. You want two hardware stacks? No problem. Three? That's okay, too. Every register can be used for: Register, Register Indirect, Auto-increment Pointer, Auto-decrement Pointer, Auto-increment Pointer Indirect, Auto-decrement Pointer Indirect, Indexed Pointer, and Indexed Pointer Indirect.

The PDP-11 was stack-heaven. Life seemed good.

But then... the 3rd system syndrome hit. Now, we are stuck with a more "balanced" view of instruction sets.

Oh, well.

The Stack Frame aka Activation Frame

A stack isn't just a stack for holding return addresses. It's also an important tool for compiling code from source languages and for assembly programmers who want an easier life than having to keep track of stuff strewn about the room, in random locations.

I drew this up a long time ago:

[Figure: diagram of a typical activation frame]

The figure shows a typical activation frame.

The above diagram does not define all possible variations. Only one of the simpler ones. For example, Pascal requires support for nested functions and these nested routines must have access to local variables in their enclosing code bodies, so the activation frames for compiled Pascal code may be a little more complicated than what's above.
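
As a rough illustration (the names here are hypothetical), the usual trick is to hand the nested routine a hidden pointer, the "static link," to the enclosing routine's locals:

struct outer_locals { int count; };          /* the enclosing routine's locals */

void inner( struct outer_locals *link ) {    /* 'link' is the hidden static link */
    link->count++;                           /* access a local of the enclosing routine */
}

int outer( void ) {
    struct outer_locals locals = { 0 };
    inner( &locals );                        /* the compiler supplies the static link */
    inner( &locals );
    return locals.count;                     /* returns 2 */
}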

The caller pushes parameters onto "a [thread] stack" and then makes a call, which automatically pushes the return address. Note that in the above case there is an "optional" return value parameter. This is because sometimes the return value might be a vector instead of something that can easily fit in a register. (Registers are often used as return values, when the data result readily fits -- such as integers.) So the called routine can use this slot to stuff the vector return value, if that's appropriate. But it is optional in the sense that each circumstance determines if it is needed, or not.
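
Here's a sketch of how a compiler might use that optional slot (the names and the lowering are illustrative, not any particular compiler's ABI):

typedef struct { int v[4]; } Vec4;      /* a result too big for a register */

Vec4 make_vec( int a ) {                /* what the programmer writes */
    Vec4 r = { { a, a, a, a } };
    return r;
}

/* Roughly what the compiler emits instead: the caller reserves stack
   space for the result and passes a hidden pointer to that slot. */
void make_vec_lowered( Vec4 *ret_slot, int a ) {
    ret_slot->v[0] = a;  ret_slot->v[1] = a;
    ret_slot->v[2] = a;  ret_slot->v[3] = a;
}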

Once the called routine starts, the first thing it does (using prologue code) is save special registers that must be preserved across calls. These registers are ones that the calling routine rightly assumes won't be destroyed when it calls the subroutine. They were sometimes called "preserve registers"; today they're usually called callee-saved registers. The point is if the called subroutine needs to play with those registers, then it must save them so that it can return them, unmolested, to the caller.

Separately, a different register called the "frame pointer" or "activation frame pointer" must also be preserved. This is just one added register that is reserved for this special activation frame purpose by a compiler (or assembly coder.) The activation frame pointer always points at the "current frame" on the stack and it provides the currently executing subroutine with access to everything it needs to get its job done.

(The order in which the above preservation takes place isn't terribly important. There are pluses and minuses, regardless of the choice.)

After the necessary prologue is over, the subroutine gets about its business. Once its work is done, the subroutine is required to restore the preserved registers, restore the activation frame pointer, and then return to its caller. (The epilogue code.)

This allows a subroutine to have its own local variables that are local to it, even in the face of recursion (calling itself.)

Let's take two un-optimized C routines and the old 16-bit x86 instruction set to make some examples for clarity.

Suppose:

int f1( int a ) {
    if ( a > 2 ) return a * f1( a-1 );
    return a;
}

int f2( int a ) {
    int r = a;
    if ( a > 2 ) r = a * f2( a-1 );
    return r;
}

I've used recursive forms of computing a factorial from a supplied value. I don't want to focus on the quality or clarity of the above code. That's not relevant. I just want to focus on how this might be compiled, and why. Also, since it has been so long since I regularly coded x86 routines, forgive any "pidgin-ness" in my coding. It's possible I've forgotten a syntax detail.

       ; Caller expects the function result in AX.
f1:    push si               ; save SI because caller assumes it is preserved.
       push bp               ; save caller's activation frame pointer.
       mov  bp, sp           ; set up our own activation frame pointer.
       mov  ax, [bp+6]       ; fetch parameter value into AX.
                             ;    the +6 part is there to skip the saved BP, SI,
                             ;    and the caller's return address.
       cmp  ax, 2            ; compare it against 2.
       jle  t1               ; exit the routine if <= 2, AX is already set.
       mov  si, ax           ; save parameter value in SI where it is safe.
       dec  ax               ; subtract 1 from the given parameter value.
       push ax               ; push this value as a parameter to a call.
       call f1               ; now call ourselves to compute its factorial.
       add  sp, 2            ; discard the pushed parameter (caller cleans up.)
       imul si               ; multiply this result (in AX) by the saved parameter.
                             ;    the result will be in DX:AX, but we ignore DX.
t1:    pop  bp               ; restore caller's activation frame pointer.
       pop  si               ; restore SI.
       ret                   ; result is already in AX so just return.


       ; Caller expects the function result in AX.
f2:    push bp               ; save caller's activation frame pointer.
       mov  bp, sp           ; set up our own activation frame pointer.
       sub  sp, 2            ; reserve 2 bytes for local variable 'r'.
       mov  ax, [bp+4]       ; fetch parameter value into AX.
                             ;    the +4 part is there to skip the saved BP
                             ;    and the caller's return address.
       mov  [bp-2], ax       ; set local variable 'r' to parameter 'a'.
       cmp  ax, 2            ; compare 'a' against 2.
       jle  t2               ; exit the routine if <= 2, result 'r' is already set.
       dec  ax               ; subtract 1 from the given parameter value.
       push ax               ; push this value as a parameter to a call.
       call f2               ; now call ourselves to compute its factorial.
       add  sp, 2            ; discard the pushed parameter (caller cleans up.)
       imul word ptr [bp-2]  ; multiply this result (in AX) by 'r'.
       mov  [bp-2], ax       ; save result in 'r' (ignore DX.)
t2:    mov  ax, [bp-2]       ; put 'r' into the function result register.
       mov  sp, bp           ; release local variable storage.
       pop  bp               ; restore caller's activation frame pointer.
       ret                   ; just return.

These are intentionally left unoptimized in order to maintain coherency with the source code. A good optimizer would have a field day with the above code.

I wanted to provide two different cases: the first highlights preserving a register that a caller expects not to be changed across the call and the second highlights how local variables are allocated using the stack. In the second case, note how easy it is to allocate -- just subtract a number from the stack pointer. You can subtract a much larger number, still with a single instruction, to allocate vast amounts of local variable storage. It's truly 'that easy.' (Note also that these local variables are not initialized!)

If you work through a few examples like this, you'll see many benefits to the concept of a stack.

I've not described something in the picture. There's an optional exception handler pointer. This is quite useful when handling exceptions. Normally, this can be NULL when the routine doesn't include a try-block handler. But if it does, then this value is set to the code that will handle the error. There is a process called a "stack unwind" (back in my day) where an exception that occurs in a routine without a handler will slowly unwind the stack backwards through prior callers until it finds one that has a handler installed. This action "blows away" the context of all called subroutines beneath the routine that carries a handler for errors. It's a convenient device. But it's beyond the context I wanted to add here.
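
Real unwinders today are table-driven, but a minimal C sketch of the unwinding idea, using setjmp/longjmp and hypothetical names, looks like this:

#include <setjmp.h>
#include <stdio.h>

static jmp_buf handler;          /* context of the installed exception handler */

void innermost( void ) {
    longjmp( handler, 1 );       /* "throw": unwinds every frame in one jump */
}

void middle( void ) {            /* has no handler of its own... */
    innermost();                 /* ...so its context simply gets blown away */
}

int main( void ) {
    if ( setjmp( handler ) == 0 ) {  /* install the handler (the "try") */
        middle();
    } else {
        puts( "caught: unwound past middle()" );   /* the "catch" */
    }
    return 0;
}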

A final note. The order and manner in which the preserved registers and the activation frame pointer are pushed onto the stack and restored from it isn't a vital issue. Different folks will choose different approaches here. It's just important to know what choice is actually made, because it impacts the offsets used relative to the activation frame pointer when accessing either parameter values or local variables. The activation frame pointer is used, though, for all those accesses, since the parameters and the local variables are all placed relative to it. The exact positions relative to it are not important. But it is important that their relative positions are fixed and that the compiler or assembly coder knows the offsets to use and doesn't need to perform run-time calculations to work them out.

Conclusion?

Yes, you want stacks in embedded programming. It's a good idea that has emerged from long experience.

(And you certainly do not want the HP 21xx instruction set implemented on your MCU if it doesn't implement FeRAM or magnetic core memory -- as it requires writable code space!)