C Assembly – Real Benefit of Using CDECL Calling Convention

abi, assembly, c, calling-conventions, x86

So, I'm learning assembly and have come across ABIs. I got some basic tests working that use the cdecl calling convention to call the C standard library from NASM.
But I've seen other calling conventions (like TopSpeed/Clarion/JPI/Watcom/Borland register (Delphi), fastcall, etc.).
And I wonder: what are the REAL benefits of using cdecl instead of Clarion?
More specifically, of pushing parameters onto the stack instead of passing them in registers.

These are some of the benefits I imagined; please let me know which of them apply.

  1. I've read that cdecl is important because it allows variable-length parameter lists, but as I see it, I can do the same using registers.
    The problem is knowing the count, type and order of the params, but that problem exists in cdecl too. It can be inferred from the format string (in printf) or from the function signature. And if I run out of registers I can push the rest of the params.

  2. I imagine that performance shouldn't be a big deal, because CPU manufacturers have probably optimized everything they could for whichever calling convention is most widely used (and right now that is cdecl?).
    But if I think of raw operations (and disregard caches and write-back buffers), I think it should be faster to use registers instead of the stack (which should be located in RAM, right?). I mean, it's like "inc eax" vs "add eax, 1" (am I right?).

  3. By pushing I create a memory space that goes away when I ret (just like creating a local variable).
    That might seem handy (having a local var just for the duration of the call).
    But most params (in my experience) are only read; we very rarely need them as variables to store mutated values, and when we do, we are usually dealing with complex structures that are passed as pointers anyway.
    So I don't really see the value in creating a memory space BEFORE actually needing it.

    As I see it, if I do need a variable it's better to have the option to create it (like in Clarion), and if I don't, it would be nice to be able to not create it (which you can't do in cdecl).

  4. You preserve the registers.
    I can't even think of a positive side. AFAIK registers are used for intermediate calculations, and as such are volatile in nature. If I treat them as volatile, I can push/pop/mov them whenever I need to "hold" a value, and only in those cases.
    In that sense I see it as the most efficient use (I only access RAM when I need to).

But if I don't do that, and try to "preserve the registers":

  • callee: I have no certainty whatsoever about what the callee code will do with a register (unless it is documented or it adheres to the same calling convention). Hoping it will preserve them imposes an artificial constraint on the callee. As the callee has no idea which registers really need to be preserved, it will tend to over-preserve registers unnecessarily (like pusha/popa on x86?), which sounds really inefficient to me.

  • caller: the caller has no idea which registers the callee will use (taint?), so it'll just preserve all of them. It ends up with the same inefficient result as before.
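
To make point 1 concrete, here is a minimal C sketch of a variadic function. The name sum_ints and its count parameter are made up for illustration. The callee has no way to discover how many arguments were passed, so it trusts count, much as printf trusts its format string. Under cdecl this works because the caller, which is the only side that knows how much it pushed, also cleans up the stack.

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical helper: sums `count` int arguments.
 * The callee cannot see how many arguments were really passed;
 * it relies entirely on `count` being correct. */
int sum_ints(int count, ...)
{
    assert(count >= 0);

    va_list args;
    int total = 0;

    va_start(args, count);           /* start after the last named parameter */
    for (int i = 0; i < count; i++) {
        total += va_arg(args, int);  /* the type is assumed, not checked */
    }
    va_end(args);

    return total;
}
```

For example, sum_ints(3, 1, 2, 3) returns 6. Passing a count larger than the number of actual arguments is undefined behaviour, whichever calling convention is used.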

I noticed that Linux syscalls, as well as 8086 code (from my old classes), use registers instead of the stack to pass parameters. What happened there?

So those are my thoughts; thanks for any clarification you can offer.

Notes:

  • I am learning; most of these are assumptions based on what I've read and tried so far. I'll be glad if you correct me as needed (just be nice).
  • I do understand these are all x86 calling conventions, and that x64 has a different one (which I'm not familiar with).
  • I am not asking which is the "best"; I just want to understand more and clarify my assumptions.
  • I am looking for benefits of the calling convention by itself (in contrast with others), not for its side effects (like compilers and CPUs being optimized for it, or its ubiquity).
  • Basically, this is a theoretical question to better understand why that way was chosen.

Best Answer

Raymond Chen put together a history of calling conventions here. He doesn't touch on Clarion, but he does cover Fastcall, which, while not the same as Clarion, also takes a more register-based approach.

He has this to say:

Fastcall (__fastcall)

The Fastcall calling convention passes the first parameter in the DX register and the second in the CX register (I think). Whether this was actually faster depended on your call usage. It was generally faster since parameters passed in registers do not need to be spilled to the stack, then reloaded by the callee. On the other hand, if significant computation occurs between the computation of the first and second parameters, the caller has to spill it anyway. To add insult to injury, the called function often spilled the register into memory because it needed to spare the register for something else, which in the "significant computation between the first two parameters" case means that you get a double-spill. Ouch!

Consequently, __fastcall was typically faster only for short leaf functions, and even then it might not be.

I believe the criticism applied here is still relevant: Clarion is likely faster for certain types of calls, but not others.
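
As a rough illustration of the leaf versus non-leaf distinction, here is a C sketch. It assumes GCC or Clang targeting 32-bit x86, where the cdecl and fastcall function attributes select the convention; the CC_CDECL and CC_FASTCALL macros and all function names are hypothetical. On other targets the macros expand to nothing, so the code still compiles but shows no difference.

```c
#include <assert.h>

/* Assumption: GCC or Clang on 32-bit x86. Elsewhere both macros are
 * empty and every function uses the platform's default convention. */
#if defined(__GNUC__) && defined(__i386__)
#  define CC_CDECL    __attribute__((cdecl))
#  define CC_FASTCALL __attribute__((fastcall))
#else
#  define CC_CDECL
#  define CC_FASTCALL
#endif

/* Short leaf function. With the fastcall attribute on 32-bit x86, GCC
 * passes the first two integer arguments in ECX and EDX, so nothing
 * has to be pushed to or reloaded from the stack. */
CC_FASTCALL int add_fast(int a, int b)
{
    return a + b;
}

/* Same computation under cdecl: both arguments arrive on the stack. */
CC_CDECL int add_cdecl(int a, int b)
{
    return a + b;
}

/* Non-leaf function. `b` arrives in a register, but it has to survive
 * the inner call, and ECX/EDX are not preserved across calls. The
 * compiler must therefore save `b` somewhere else first, which is the
 * kind of spill the quoted text describes. */
CC_FASTCALL int scaled_sum_fast(int a, int b)
{
    int doubled = add_cdecl(a, a);   /* may clobber ECX and EDX */
    return doubled + b;
}
```

Compiling with something like gcc -m32 -O1 -S and reading the generated assembly should show where the arguments live; the exact output depends on the compiler and flags.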

That being said, your points about the register usage are quite valid. While you did not want to consider x64 in the scope of your question, the pattern discussed later in that series for Itanium might interest you!