Depending on the language, it may not be necessary to use a call stack. Call stacks are only necessary in languages that allow recursion or mutual recursion. If the language does not allow recursion, then only one invocation of any procedure may be active at any moment, and local variables for that procedure may be statically allocated. Such languages do have to make provision for context changes, such as interrupt handling, but this still does not require a stack.
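A minimal C sketch of why recursion forces per-call storage (illustrative, not from any of the references below):

```c
/* Each active call of fact() needs its own copy of n: during fact(3),
   three n's (3, 2, 1) are live at once, so they must live on a stack.
   In a language without recursion, one static slot per procedure would
   suffice, because at most one activation can exist at a time. */
int fact(int n) {
    if (n <= 1)
        return 1;
    return n * fact(n - 1);
}
```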
Refer to FORTRAN IV (and earlier) and early versions of COBOL for examples of languages that do not require call stacks.
Refer to the Control Data 6600 (and earlier Control Data machines) for an example of a highly successful early supercomputer that did not provide direct hardware support for call stacks. Refer to the PDP-8 for an example of a very successful early minicomputer that did not support call stacks.
As far as I know, the Burroughs B5000 stack machines were the first machines with hardware call stacks. The B5000 machines were designed from the ground up to run ALGOL, which required recursion. They also had one of the first descriptor-based architectures, which laid groundwork for capability architectures.
As far as I know, it was the PDP-6 (which grew into the DEC-10) that popularized call stack hardware, when the hacker community at MIT took delivery of one and discovered that the PUSHJ (Push Return Address and Jump) operation allowed the decimal print routine to be reduced from 50 instructions to 10.
The most basic function call semantics in a language that allows recursion require capabilities that match nicely with a stack. If that's all you need, then a basic stack is a good, simple match. If you need more than that, then your data structure has to do more.
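A rough sketch of what one activation record holds (layouts vary by ABI; the field names here are illustrative):

```c
/* One call pushes a frame like this; one return pops it. Plain call/return
   is strictly last-in-first-out, which is why a stack fits so well. */
struct frame {
    void         *return_address;  /* where to resume the caller */
    struct frame *saved_fp;        /* link back to the caller's frame */
    int           locals[4];       /* this activation's local variables */
};
```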
The best example of needing more that I have encountered is the "continuation", the ability to suspend a computation in the middle, save it as a frozen bubble of state, and fire it off again later, possibly many times. Continuations became popular in the Scheme dialect of LISP, as a way to implement, among other things, error exits. Continuations require the ability to snapshot the current execution environment, and reproduce it later, and a stack is somewhat inconvenient for that.
Abelson & Sussman's "Structure and Interpretation of Computer Programs" goes into some detail on continuations.
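C has no first-class continuations, but setjmp/longjmp can sketch the weaker, one-shot "escape upward" case that Scheme programs often use continuations for (error exits). A true continuation could also be resumed after its frames were popped, which is exactly what a plain stack cannot support:

```c
#include <setjmp.h>
#include <stdio.h>

static jmp_buf on_error;   /* snapshot of enough state to unwind back here */

static void deep_work(int fail) {
    if (fail)
        longjmp(on_error, 1);      /* abandon all frames below the setjmp */
    printf("worked\n");
}

int main(void) {
    if (setjmp(on_error) == 0)
        deep_work(1);              /* takes the error exit */
    else
        printf("error exit taken\n");
    return 0;
}
```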
> My question is this: when it all comes down to machine code, to 1s and 0s, to assembly instructions, should I be at all concerned that my class-separated code with a variety of small-to-tiny functions generates too much extra overhead?
My answer is yes, you should. Not because you have lots of little functions: once upon a time the overhead of calling functions was reasonably significant, and you could slow your program down by making a million little calls in loops, but today compilers will inline them for you, and what's left is taken care of by the CPU's fancy prediction algorithms, so don't worry about that. The real danger is that you will introduce too much layering into your programs, with each piece of functionality too small to sensibly grok in your head. If you have larger components you can be reasonably sure they are not performing the same work over and over, but you can make your program so minutely granular that you find yourself unable to really understand the call paths, and end up with something that barely works (and is barely maintainable).
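On the inlining point, a minimal sketch (illustrative names, not from the post); at -O2 a compiler folds the call away entirely:

```c
struct point { int x, y; };

/* Tiny accessor: after inlining, total += get_x(&pts[i]) compiles to the
   same code as total += pts[i].x, with no call instruction in the loop. */
static inline int get_x(const struct point *p) { return p->x; }

int sum_x(const struct point *pts, int n) {
    int total = 0;
    for (int i = 0; i < n; i++)
        total += get_x(&pts[i]);
    return total;
}
```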
For example, I worked at a place that showed me a reference project for a web service with 1 method. The project comprised 32 .cs files, for a single web service! I figured this was way too much complexity: even though each part was tiny and easily understood by itself, when it came to describing the overall system I quickly found myself having to trace through calls just to see what the hell it was doing (there were also too many abstractions involved, as you'd expect). My replacement web service was 4 .cs files.
I didn't measure performance, as I figure it would have been roughly the same all in all, but I can guarantee mine was significantly cheaper to maintain. When people say programmer time is more important than CPU time, and then create complex monsters that cost lots of programmer time in both development and maintenance, you have to wonder whether they are making excuses for bad behaviour.
> It was said somewhere (quote TBD) that up to 70% of all code is made up of ASM's MOV instruction - loading CPU registers with proper variables, not the actual computation being done.
That is what CPUs do though: they move bits from memory to registers, add or subtract them, and then put them back into memory. All computing boils down to pretty much that. Mind you, I once had a very multi-threaded program that spent more of its time context switching (i.e. saving and restoring the register state of threads) than it did working on the thread code. A simple lock in the wrong place truly screwed performance there, and it was such an innocuous bit of code too.
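A contrived sketch of that kind of locking mistake, assuming pthreads (the scenario and names are mine, not the original program's):

```c
#include <pthread.h>
#include <stdio.h>

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
long shared_total = 0;

/* Innocuous-looking: takes the shared lock a million times per thread,
   so the threads spend their time contending and context switching. */
void *worker_slow(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);
        shared_total += 1;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Same result: accumulate privately, take the lock once. */
void *worker_fast(void *arg) {
    long local = 0;
    for (int i = 0; i < 1000000; i++)
        local += 1;
    pthread_mutex_lock(&lock);
    shared_total += local;
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {                     /* build with: cc -O2 -pthread ... */
    pthread_t t[4];
    for (int i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, worker_fast, NULL);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("%ld\n", shared_total);
    return 0;
}
```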
So my advice is: find a sensible middle ground between the two extremes, one that makes your code look good to other humans, and test the system to see if it performs well. Use the OS features to make sure it's running as you'd expect with CPU, memory, disk and network IO.
Best Answer
The JVM and .NET both employ JITs (just-in-time compilers).
These JITs take their respective byte codes (both stack-based) and translate them into machine code for the processor they're running on. Thus, the same hardware that is available to C and C++ is at the disposal of the JIT, and mostly it is used pretty well. Some of these JITs lack the power of LLVM for generating code, but they have access to, and use, the full CPU register set, whether x86, x64 (more registers) or other. When JITing, the stack is analyzed and some stack locations are assigned only CPU registers, while others are assigned memory locations (on the thread stack).
This mapping is facilitated by a weakening of these VMs' stack model, such that the stack can actually be removed from the translated machine code. The way this works is that their stacks are restricted to providing arguments for operations and cannot be used like a real stack for dynamically pushing and popping. For example, you cannot write a loop that pushes items onto the stack, even if you write a later loop that pops them off; the pushes and pops must be (locally) balanced. This enables total elimination of the VM's stack at runtime in the translated machine code: because there can be a 1:1 mapping between the VM stack locations and the physical resources of CPU registers and (machine code) thread-stack locations, there is otherwise no need to manifest any notion of the VM stack. Put another way, the stack in these VMs is not a full-blown/traditional data structure like a real "stack" you might find in high-level-language programming, which can hold an arbitrary amount of data for an arbitrary lifetime.
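A hedged sketch of that balanced-push/pop property, using JVM-style mnemonics in the comments (the register assignments are illustrative, not what any particular JIT emits):

```c
/* Source-level view of the method being compiled: */
int add(int a, int b) { return a + b; }

/* Stack bytecode, conceptually:
       iload_0   ; push a                 -- stack depth 1
       iload_1   ; push b                 -- stack depth 2
       iadd      ; pop b, pop a, push a+b -- depth back to 1
       ireturn   ; pop result             -- depth 0
   The depth at every instruction is a compile-time constant, so the JIT
   can pin "the slot at depth 1" and "the slot at depth 2" to fixed CPU
   registers and emit roughly mov r1,a; add r1,b; ret, with no runtime
   VM stack at all. */
```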
A JIT for a register-based VM would convert register assignments into machine code that accesses the CPU registers directly, much as the stack-based JITs do. Similarly, some VM registers would be mapped directly, and only, to CPU registers, while others would be mapped to memory locations (on the thread stack).
There are some implementations of VMs that interpret rather than JIT. For them, a hand-optimized primary interpreter loop is probably the best they can do. With hand-optimized assembly you can make some improvements: jump tables, register usage that spans the large interpreter switch statement, and such.
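A minimal sketch of that fetch-decode-execute loop in C (the opcodes and encoding are invented for illustration):

```c
#include <stdio.h>

enum { OP_PUSH, OP_ADD, OP_PRINT, OP_HALT };

/* The switch on each opcode is the dispatch overhead that hand-optimized
   interpreters attack with jump tables and careful register allocation. */
void run(const int *code) {
    int stack[64], sp = 0;
    for (int pc = 0;;) {
        switch (code[pc++]) {
        case OP_PUSH:  stack[sp++] = code[pc++];         break;
        case OP_ADD:   sp--; stack[sp - 1] += stack[sp]; break;
        case OP_PRINT: printf("%d\n", stack[--sp]);      break;
        case OP_HALT:  return;
        }
    }
}

int main(void) {
    int prog[] = { OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PRINT, OP_HALT };
    run(prog);   /* prints 5 */
    return 0;
}
```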
Broadly speaking, it is difficult to use a large number of CPU registers as a stack, because unlike memory, the CPU registers are not addressable: they cannot be indexed like memory (or an array); they can only be named in instructions. You can cache the top few stack items in CPU registers, but while that reduces some overhead it adds other overhead, so there are diminishing returns as more CPU registers are devoted to the VM stack. (Under the covers of the hardware, the CPU's highly parallelized hardware interpreter loop can and does address the registers as an array, sometimes called the register file, using an index number, but this hardware capability is not exposed for machine code instructions to use.)
Given JIT technology, there's no reason for a register-based VM bytecode to be faster than a stack-based bytecode, as the objective with both is to remove the bytecode and replace it with native machine code for that processor.
An interpreter that runs register-based byte code would in some cases execute a smaller number of larger instructions, which could equate to faster interpreter execution, since the basic instruction fetch, decode, and execute loop is one of the main bottlenecks in software interpretation. For simple code sequences, register-based machines typically use larger instructions: instead of the sequence `Push a; Push b; Add; Pop c`, you might have a single `Add c, a, b` instruction. However, such an advantage is not consistent across different workloads, as stacks have advantages in complex expression evaluation and also in parameter passing (which is a big deal, as function calls are numerous). (Sophisticated hardware can see a sequence like `Push a; Push b; Add; Pop c` and execute it as a single fused instruction, so the apparent disadvantage can also dissipate at the hardware level.)
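For contrast with the stack interpreter sketched earlier, here is the same c = a + b as one wide register-style instruction (again with invented opcodes), so the loop dispatches once instead of four times:

```c
#include <stdio.h>

enum { OP_ADD3, OP_PRINT, OP_HALT };

void run_reg(const int *code, int *reg) {
    for (int pc = 0;;) {
        switch (code[pc++]) {
        case OP_ADD3:                   /* Add dst, src1, src2 */
            reg[code[pc]] = reg[code[pc + 1]] + reg[code[pc + 2]];
            pc += 3;
            break;
        case OP_PRINT:
            printf("%d\n", reg[code[pc++]]);
            break;
        case OP_HALT:
            return;
        }
    }
}

int main(void) {
    int reg[4] = { 0, 2, 3, 0 };        /* r1 = a = 2, r2 = b = 3 */
    int prog[] = { OP_ADD3, 3, 1, 2,    /* r3 = r1 + r2 */
                   OP_PRINT, 3, OP_HALT };
    run_reg(prog, reg);                  /* prints 5 */
    return 0;
}
```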