This is only a partial answer, so it may be disallowed:
Currently I am simply saying every ARM instruction costs 4 cycles and
every THUMB instruction costs 2 cycles.
Actually, if I'm remembering ARM correctly, what's actually happening is that a single memory cycle can fetch either one 32-bit ARM instruction or two 16-bit THUMB instructions. Execution time is, of course, instruction-dependent.
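That fetch model can be sketched as follows; the function name is my own invention, and it counts only bus transfers for straight-line code, ignoring execution time entirely:

```c
/* Sketch of the fetch model above: one 32-bit bus transfer carries
 * either one ARM opcode or two 16-bit THUMB opcodes.
 * fetch_bus_transfers() is a hypothetical helper, not a real API. */
int fetch_bus_transfers(int n_instructions, int thumb)
{
    if (thumb)
        return (n_instructions + 1) / 2;  /* two THUMB opcodes per transfer */
    return n_instructions;                /* one ARM opcode per transfer */
}
```

So a straight-line run of four THUMB instructions needs only two bus transfers, half the fetch traffic of the same number of ARM instructions.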
Do instruction cycle counts vary depending on
which section of memory is currently being accessed?
http://nocash.emubase.de/gbatek.htm#cpuinstructioncycletimes According
to the above specification, different memory areas have
different waitstates, but I don't know exactly what that means.
Different regions of memory may have more data clients than just the CPU. This is particularly true in graphics/video memory where pixels need to be pulled out of RAM to send off to the display. Unless you use faster (read: more expensive) memory, the CPU and graphics hardware will have to take turns accessing RAM. Thus, depending on how many other RAM clients there are, it may take several cycles before the CPU's turn at RAM arrives.
It may not even be this noble -- the manufacturer may have simply decided that they could get away with slower (read: cheaper) RAM for most of program and data space, and used a smaller block of fast memory only for those things that absolutely needed it.
Furthermore, what are Non-sequential cycles, Sequential cycles, Internal
cycles, and Coprocessor cycles for?
Sequential versus non-sequential refers to reading memory locations in order versus out-of-order. RAM is arranged as a two-dimensional array of memory cells indexed by row and column address. If you access memory locations in ascending address order (e.g. 0, 1, 2, 3, 4, 5, etc.), then all you need to change (most of the time) is the column address, and each successive read can be satisfied more quickly than if you accessed addresses non-sequentially. Judging from the doc you linked to, it looks like you get a CAS increment for free, but a full RAS+CAS costs an extra cycle.
Internal Cycle means the CPU is going, "Wow, this is complicated; gimme some extra time to figure this out." (The MUL instructions are prime examples of this.)
Coprocessor Cycle refers to the time required to talk to an ARM coprocessor. I don't know what, if any, ARM coprocessors are in the GBA.
As for implementation: That doesn't lend itself to a single answer. If it were me, I'd model RAM access times independently from instruction execution times. When the CPU reads/writes memory, compare the memory address accessed with the previous memory address the CPU accessed. If they're adjacent (M(current) == M(previous) + 4), then it's free; otherwise add a one-cycle penalty. Then add the delays imposed by the memory region you're accessing.
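Following that suggestion, here is a minimal sketch in C. All the names, the static `prev_addr`, and the per-region waitstate values are illustrative assumptions on my part, not GBA-accurate numbers; a real emulator would fill the table from the actual memory map:

```c
#include <stdint.h>

/* Illustrative sketch of the scheme described above; the waitstate
 * values are made up, not taken from GBA documentation. */

static uint32_t prev_addr = 0xFFFFFFFFu;  /* last address the CPU touched */

/* Fictional per-region waitstates, indexed by address bits 24..27
 * (the GBA memory map selects regions that way). */
static const int region_waitstates[16] = {
    0, 0, 2, 0, 0, 0, 0, 0,   /* BIOS, -, EWRAM, IWRAM, I/O, ... */
    4, 4, 4, 4, 4, 4, 8, 8    /* cartridge ROM and SRAM regions */
};

/* Cycle cost of one 32-bit access at 'addr': a sequential access is
 * free of penalty; a non-sequential one pays one extra cycle plus the
 * region's waitstates, as suggested in the text. */
int bus_cycles(uint32_t addr)
{
    int cycles = 1;                    /* base cost of any access */
    if (addr != prev_addr + 4) {
        cycles += 1;                                     /* N-cycle penalty */
        cycles += region_waitstates[(addr >> 24) & 0xF]; /* region delay */
    }
    prev_addr = addr;
    return cycles;
}
```

With this model, walking linearly through IWRAM-like memory costs one cycle per access after the first, while jumping into a slow ROM-like region pays the non-sequential penalty plus that region's waitstates.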
Best Answer
BIOSes used to be written exclusively in assembly language, but the transition to writing the majority of the code in some higher-level language was made a long time ago, leaving as little as possible in assembly: preferably only the bootstrapper (the first few hundred instructions that the CPU jumps to after a start/reset) and whatever routines deal with specific quirks of the underlying architecture.
BIOSes were already being written primarily in C by the early nineties. (I wrote a BIOS in 90% C, 10% assembly back then.)
What has also helped greatly in this direction is:
C libraries that target a specific architecture and include functions for dealing with peculiarities of that architecture, for example, functions for reading/writing bytes to/from I/O ports of the x86 architecture. Microsoft C has always offered library functions for that kind of stuff.
C compilers that not only target a specific CPU architecture but even offer extensions to the C language that you can use to write code exploiting special CPU features. For example, the x86 architecture supports things known as interrupts, which invoke routines known as interrupt handlers, and it requires them to have special entry/exit instruction sequences. From the very early days, Microsoft C supported special keywords for marking a function as an interrupt handler, so it could be invoked directly by a CPU interrupt and you did not have to write any assembly for it.
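As a toy illustration of the first point: the port numbers, the function `select_and_write`, and the array-backed "ports" below are all made up so the sketch runs anywhere; real BIOS code would call the compiler's actual port-I/O functions, such as `inp`/`outp` in old Microsoft C:

```c
#include <stdint.h>

/* The real library routines talk to x86 I/O ports directly; this stub
 * backs the "ports" with an array so the sketch runs on any machine. */
static uint8_t fake_ports[65536];

static void    outp(uint16_t port, uint8_t value) { fake_ports[port] = value; }
static uint8_t inp(uint16_t port)                 { return fake_ports[port]; }

/* How a BIOS routine might program a (hypothetical) index/data
 * register pair in plain C, with no assembly at all. */
void select_and_write(uint8_t index, uint8_t value)
{
    outp(0x3D4, index);  /* index register (illustrative port number) */
    outp(0x3D5, value);  /* data register */
}
```

In DOS-era Microsoft C the real `inp`/`outp` calls could compile down to single `in`/`out` instructions, which is why no hand-written assembly glue was needed.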
Nowadays I would assume that most of the BIOS is written in C++, if not in an even higher-level language.
The vast majority of the code that makes up a BIOS is specific to the underlying hardware, so it does not really need to be portable: it is guaranteed that it will always run on the same type of CPU. The CPU may evolve, but as long as it maintains backwards compatibility with previous versions, it can still run the BIOS unmodified. Plus, you can always recompile the parts of the BIOS written in C to run natively on any new CPU that comes up, if the need arises.
The reason why we write BIOSes in languages of a higher level than assembly is because it is easier to write them this way, not because they really need to be portable.