CBLOCK needs a value; otherwise it will be zero, and the variable won't be in RAM. Set it to 0x20, which puts the variable at the start of the General Purpose RAM block.
I can't see anything wrong with your delay routine and it worked OK when I simulated it to make sure, with CBLOCK set to 0x20.
Remove the CLRWDT instruction and disable the WDT.
Put
d1 res 1
at the end of your program if you want to use Olin's preferred technique.
Ok, let's first define some terms and the differences between them.
Bootloader
This is a small program coded into the chip's Flash or PROM. Its purpose is to allow you to install a program from outside into the Flash or other internal storage. Apart from that it is usually completely passive and doesn't affect your running program in any way.
Loader
This is also a small program coded into the chip's Flash or PROM. It is usually installed using the bootloader, if one exists, or it can take the place of a bootloader and perform that role as well. Its main function is to load a program from some storage medium (flash, SD card, or whatever) and execute it. It often also provides some IO facilities, such as routines for accessing the console. Loaders also often provide a configuration environment with NVRAM (often just a block of Flash) for storing system settings.
As you can see a loader is far more complex than a bootloader.
So both are concerned only with getting your program into the right place, be that Flash or RAM, depending on a) how your program is written, and b) what your system is designed to do. That is the "loading" phase of the program. With the bootloader, the "loading" is done once by you when you burn your program into the chip. With the loader, it is done at every reset (if needed).
Then control is transferred to your program - be that in RAM (with the loader) or Flash (with either the loader or the bootloader). From then on everything else that happens is purely down to what your program does.
If you happened to write your program in C, then you will have certain C conventions and C library code in place. One of those conventions is the concept of sections: .data, .bss, etc. C library code manages those sections for you, copying data from .rodata into .data (or wherever for your system) if you are executing from Flash, blanking .bss, etc. That routine is called "crt0", or C Run-Time stage 0, and is responsible for initializing your program and passing control to the main() function.
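As a rough illustration of what such a startup routine does, here is a minimal host-runnable sketch in C. Everything named here (data_load_image, data_ram, bss_ram, app_main, startup) is hypothetical: on a real target the section boundaries come from the linker script and the routine is normally supplied by the toolchain, often in assembly. Plain arrays stand in for Flash and RAM so the idea can be tried on a PC.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-ins for memory regions that a linker script would
 * normally describe. Plain arrays are used so this runs on a host. */
const uint8_t data_load_image[4] = { 1, 2, 3, 4 };  /* initial values kept in Flash */
uint8_t data_ram[4];                                /* where .data lives at run time */
uint8_t bss_ram[8] = { 0xAA, 0xAA, 0xAA, 0xAA,      /* pretend power-on garbage */
                       0xAA, 0xAA, 0xAA, 0xAA };

/* Hypothetical application entry point, standing in for main(). */
int app_main(void)
{
    return data_ram[0] + data_ram[3] + bss_ram[0];
}

/* Sketch of a crt0-style startup routine (an assumption, not any
 * particular toolchain's implementation). */
int startup(void)
{
    /* The load image and its run-time home must be the same size. */
    assert(sizeof data_ram == sizeof data_load_image);

    /* 1. Copy the initial values of initialised variables into RAM. */
    memcpy(data_ram, data_load_image, sizeof data_ram);

    /* 2. Blank .bss so uninitialised globals start at zero. */
    memset(bss_ram, 0, sizeof bss_ram);

    /* 3. Hand control to the application. */
    return app_main();
}
```

Calling startup() leaves data_ram holding the values from the load image and bss_ram cleared before app_main() ever runs, which is the guarantee a C program relies on.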
If you didn't write your program in C - say you wrote it in Assembly, then what happens when control is passed to your program is entirely up to you. You might decide to have some block of RAM set aside for global variables. You may not. It's entirely up to you.
So in general, once control has been passed to your program, what happens then is entirely down to your program.
As for setting things like clocks and such, well, that depends entirely on the chip. Most chips have the basic clock settings stored in flash, and these are loaded at power-on before anything else happens. On some chips they form part of the bootloader, on some they are separate, etc. Some chips provide a way of adjusting the clock from software, some don't. If they do, then when that would happen is anyone's guess. The bootloader may set the clock, or maybe a loader, or even your own program might set a specific clock speed.
For the stack pointer, each environment in the boot sequence is a separate system in its own right. Typically the bootloader or loader would set the stack pointer for its own usage. Then, when control is passed to your program, the stack pointer will be re-initialized by your program to suit its own needs. Once your program executes, the bootloader or loader might as well not exist any more. Yes, there may be the ability to call functions based in the loader (like a PC's BIOS calls), but the loader is no longer running as such.
If you look at a typical 80(C)32 circuit below (from here):
You can see that the 8032 talks to external EPROM, RAM and EEPROM via a bus: 8-bit data and 16-bit address. The latter is latched and thus demultiplexed with the HCT573. There were some chips with the latch built in, designed to be used with the 8031/8032, but the above was the more common configuration, using low-cost standard memory chips. There's also a bit of "glue" logic (the HCT138 and the quad NAND) to decode the addresses and to generate the proper signals for the memories.
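To make the demultiplexing step concrete, here is a small C model of it. It assumes the standard MCS-51 external bus arrangement, which is not read off the schematic: Port 0 carries the low address byte while ALE is high and data afterwards, and Port 2 carries the high address byte. The type and function names (latch573_t, latch573_update, bus_address) are made up for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Model of an octal transparent latch such as the HCT573. */
typedef struct {
    uint8_t q;  /* value currently held on the outputs */
} latch573_t;

/* While the latch enable (driven by ALE) is high, the outputs follow
 * the inputs; when it goes low, the last value is held. */
void latch573_update(latch573_t *latch, bool le, uint8_t d)
{
    assert(latch != NULL);
    if (le) {
        latch->q = d;
    }
}

/* The memories see a full 16-bit address: the high byte straight from
 * Port 2 and the low byte from the latch. */
uint16_t bus_address(const latch573_t *latch, uint8_t port2)
{
    assert(latch != NULL);
    return (uint16_t)(((uint16_t)port2 << 8) | latch->q);
}
```

In this model, latching 0x34 while ALE is high and then putting data on Port 0 with ALE low leaves the address at 0x1234 when Port 2 holds 0x12. That is what frees Port 0 to carry data for the rest of the cycle.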
It is vital that the glue logic is designed such that the EPROM resides at address 0, because after a reset, execution always begins with the 8032 fetching the first byte of an instruction from that address. This is a function of hardware in the 8032 and cannot be changed. Typically that instruction is a 3-byte LJMP that jumps to the beginning of the program. We call that location the "RESET VECTOR". Other vectors occupy the bytes immediately above the reset vector, for example those for the external interrupt and timer interrupt service routines.
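As a sketch of what ends up in the first bytes of the EPROM, the C fragment below writes an LJMP into a buffer standing in for the EPROM image. The vector addresses and the LJMP encoding (opcode 0x02, then the high and low bytes of the target) are the standard MCS-51 ones. The function name write_ljmp, the buffer, and the target addresses used with it are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum {
    VEC_RESET   = 0x0000,  /* execution starts here after reset */
    VEC_EXT0    = 0x0003,  /* external interrupt 0 */
    VEC_TIMER0  = 0x000B,  /* timer 0 overflow */
    LJMP_OPCODE = 0x02     /* LJMP addr16: opcode, high byte, low byte */
};

/* Place "LJMP target" at the given vector inside an EPROM image. */
void write_ljmp(uint8_t *image, size_t image_size,
                uint16_t vector, uint16_t target)
{
    /* The 3-byte instruction must fit inside the image. */
    assert(image != NULL);
    assert((size_t)vector + 3 <= image_size);

    image[vector]     = LJMP_OPCODE;
    image[vector + 1] = (uint8_t)(target >> 8);
    image[vector + 2] = (uint8_t)(target & 0xFF);
}
```

Only three bytes separate the reset vector from the external interrupt 0 vector, which is why the reset entry is normally just a jump to wherever the program really starts.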
In those days, the EPROM would be programmed (written to) by a separate programmer outside of the circuit and then typically plugged into a socket; there was no in-circuit programming. The RAM and EEPROM could be written by the micro, but the program had to be loaded into the EPROM before any of that was possible.