High-Level Overview of How printf() Works in Windows OS

c, stl, visual studio 2010, windows

I asked this question on an IRC channel, but sadly I am still going around in circles.

I am aiming for a high level overview (but with some technical details if necessary) on how a function such as printf() from stdio.h "talks" to the Windows operating system.

I know a bit about MSVCRT.dll and bits of the Windows API, such as how it communicates with kernel32.dll, which goes to the Native API in ntdll.dll. I think Visual Studio 2010 uses msvcr100.dll for the C Standard Library…

Somebody recommended using an open source C standard library to understand what happens under the hood. Aside from not knowing how to use one in my Visual Studio project, I would still not know HOW this communicates with the OS.

My terrible understanding which misses several steps is as follows:

1) Syntax of printf() is checked against header stdio.h

2) When the program is run it uses msvcr100 for printf

3) msvcr100 then loads the necessary Windows library such as kernel32

4) Kernel32 then passes it onto ntdll.dll

This question focuses on C, but if C++ is the same feel free to post.

Windows only.

Best Answer

While the specific details vary between operating systems, you probably want to start with an understanding of

  • an object file format (.obj), created by the compiler/assembler, and
  • an executable file format (.exe), formed by the linker, and
  • a dll file format, also formed (and consumed) by the linker

These (disk-based) file formats contain sections for machine code and initial program data values, as well as relocations and a symbol table.

Before the code can be executed by the hardware, all symbolically described values (e.g. main, printf) have to be fully resolved to memory locations.

The compiler (and assembler) have incomplete information at compile time about the final memory locations of various pieces of code and data — so, they share this information in the object file symbolically rather than as final memory addresses — this is where the relocations and symbol table come into play.

The symbol table has both imports and exports. The exports give string names to offsets within sections (code/data) of the object file, whereas the imports are just string names (externals for later resolution).

Relocations tell the consumer of the object file how and where to fix up the machine code and data once the memory addresses of the symbols are known; thus, relocations typically refer to entries in the symbol table (they also refer to sections in the object module).
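To make the import/export distinction concrete, here is a minimal sketch of one C source file. The names base_value and add_parsed are made up for illustration. After compiling it to an object file, a tool such as dumpbin (with /symbols or /relocations) should list the two definitions with section offsets and strtol as an undefined external, possibly with decorated names such as a leading underscore.

```c
#include <stdlib.h>

/* Defined in this file: the symbol table gives the name "base_value"
   an offset within the data section (an export of this object file). */
int base_value = 40;

/* Also defined here: "add_parsed" names an offset in the code section. */
int add_parsed(const char *text)
{
    /* strtol is only declared by <stdlib.h>; its body lives elsewhere,
       so the object file records it as an external, by name only.
       The call site typically gets a relocation so that the final
       address of strtol can be patched in once it is known. */
    return base_value + (int)strtol(text, NULL, 10);
}
```

Calling add_parsed("2") yields 42; the point is less the arithmetic than which names the object file defines and which it merely refers to.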

A .exe file for a program has one special location that is marked as the entry point in the executable's headers. (With Visual Studio this is normally a C runtime startup routine that performs initialization and then calls main, rather than main itself.) A .exe also has all external references resolved, one way or another; however, the linker that produces the executable still does not have complete information about the memory addresses of all of the symbols, as that usually isn't settled until load time. So, the .exe file still has relocations and a symbol table.

The .exe combines all the .obj files into a single file: the code sections are concatenated, as are the data sections. Compared with the .obj files, some relocations can be resolved and thus removed from the .exe file; others can be simplified (referring to the code or data section rather than to particular symbols, which makes for a shorter relocation entry).

Typically the operating system loader ultimately determines the locations of the code and data sections. This completes the assignment of memory addresses to symbols, meaning that any outstanding relocations can now be performed.

A DLL (.dll) file is like a .exe file except that it has specific exports. The .exe file created by the linker may have references to .dll files. These references are read by a dynamic linking loader, so loading an .exe file may also require loading a .dll file, which may in turn require additional .dll files to be loaded. All of them are cross-linked to each other as per their relocation entries.
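The loader does this name lookup implicitly at process start, but the same mechanism can be exercised by hand. The following is a rough sketch using the Win32 calls LoadLibraryA, GetProcAddress and FreeLibrary; the helper name dll_exports is hypothetical, and the non-Windows branch exists only so the file still compiles elsewhere.

```c
#include <stddef.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Returns 1 if the named DLL can be loaded and exports the named symbol,
   0 otherwise. Outside Windows this sketch always returns 0. */
int dll_exports(const char *dll_name, const char *symbol)
{
#ifdef _WIN32
    HMODULE module;
    FARPROC address;

    module = LoadLibraryA(dll_name);          /* load (or re-reference) the DLL */
    if (module == NULL)
        return 0;

    /* Look the string name up in the DLL's export table. */
    address = GetProcAddress(module, symbol);

    FreeLibrary(module);                      /* drop the reference we took */
    return address != NULL;
#else
    (void)dll_name;
    (void)symbol;
    return 0;
#endif
}
```

On Windows, dll_exports("kernel32.dll", "WriteFile") should report 1, since kernel32.dll exports WriteFile by name. An import listed in an .exe is resolved in much the same spirit, except that the loader does it before the program starts running.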

The hardware then executes machine code instructions like call printf, where the reference to printf is specified by its memory location.  There are many approaches, some involving jump tables or code sequences of several instructions; however, suffice it to say that during execution of the machine code, the hardware only sees/knows about memory addresses and not symbol names.  All symbolic references are eventually resolved to memory addresses by the system of relocations and symbol table entries.
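As a small illustration that a call needs only an address, the sketch below stores the address of printf in a function pointer and calls through it. The names printf_like and print_via_address are invented for this example.

```c
#include <stdio.h>

/* A pointer type matching printf's signature. */
typedef int (*printf_like)(const char *format, ...);

/* Prints value followed by a newline by calling through an address
   rather than through the name printf. Returns what printf returns:
   the number of characters written. */
int print_via_address(int value)
{
    printf_like target = printf;    /* holds the resolved address of printf */
    return target("%d\n", value);   /* indirect call: only the address is used */
}
```

By the time this runs, the name printf has long since been replaced by whatever address the linker and loader settled on; the pointer simply makes that address visible in the source.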

printf, when appropriate (e.g. when its local buffer is full), will ultimately invoke a system call of some sort to perform I/O, such as flushing the local buffer to disk or to a device. A system call is a way for a user process to request an operation from the operating system. System calls are similar to regular calls except that the caller is in the user process and the callee is an operating system entry point. System calls provide a controlled method of raising the privilege level (from user to kernel) as needed to provide access to shared devices. System calls don't use memory addresses; instead, each system call is associated with a simple integer index. This allows the .exe and .dll files that call into the operating system to remain unaware of kernel memory addresses.
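On Windows, user code does not normally issue the system call itself. The C runtime's flush typically ends up in a Win32 function such as WriteFile in kernel32.dll, and the transition into the kernel happens further down, in ntdll.dll. The sketch below skips the stdio buffer and calls WriteFile directly. The helper name raw_write_stdout is made up, and the POSIX write branch is only a fallback so the file builds off Windows.

```c
#include <string.h>
#ifdef _WIN32
#include <windows.h>
#else
#include <unistd.h>   /* fallback so the sketch also builds off Windows */
#endif

/* Writes text to standard output without going through stdio's buffer.
   Returns the number of bytes written, or -1 on failure. */
long raw_write_stdout(const char *text)
{
#ifdef _WIN32
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD written = 0;

    if (out == INVALID_HANDLE_VALUE || out == NULL)
        return -1;
    /* WriteFile is the Win32 layer; the actual kernel transition
       happens below it and is not visible from here. */
    if (!WriteFile(out, text, (DWORD)strlen(text), &written, NULL))
        return -1;
    return (long)written;
#else
    return (long)write(STDOUT_FILENO, text, strlen(text));
#endif
}
```

If this is mixed with printf in the same program, output may appear out of order, because text handed to printf can still be sitting in its buffer; calling fflush(stdout) first avoids that.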
