This is only a partial answer, so it may be disallowed:
Currently I am simply assuming every ARM instruction costs 4 cycles and every THUMB instruction costs 2 cycles.
If I'm remembering ARM correctly, what's actually happening is that, in a single memory cycle, you can fetch either one 32-bit ARM instruction or two 16-bit THUMB instructions. Execution time is, of course, instruction-dependent.
Do instruction cycle counts vary depending on which section of memory is currently being accessed?
http://nocash.emubase.de/gbatek.htm#cpuinstructioncycletimes According to the above specification, different memory areas have different waitstates, but I don't know exactly what that means.
Different regions of memory may have more data clients than just the CPU. This is particularly true in graphics/video memory where pixels need to be pulled out of RAM to send off to the display. Unless you use faster (read: more expensive) memory, the CPU and graphics hardware will have to take turns accessing RAM. Thus, depending on how many other RAM clients there are, it may take several cycles before the CPU's turn at RAM arrives.
It may not even be this noble -- the manufacturer may have simply decided that they could get away with slower (read: cheaper) RAM for most of program and data space, and used a smaller block of fast memory only for those things that absolutely needed it.
Furthermore, what are Non-sequential cycles, Sequential cycles, Internal cycles, and Coprocessor cycles for?
Sequential versus non-sequential refers to reading memory locations in order versus out-of-order. RAM is arranged as a two-dimensional array of memory cells indexed by row and column address. If you access memory locations in ascending address order (e.g. 0, 1, 2, 3, 4, 5, etc.), then all you need to change (most of the time) is the column address, and each successive read can be satisfied more quickly than if you accessed addresses non-sequentially. Judging from the doc you linked to, it looks like you get a CAS increment for free, but a full RAS+CAS costs an extra cycle.
Internal Cycle means the CPU is going, "Wow, this is complicated; gimme some extra time to figure this out." (The MUL instructions are prime examples of this.)
Coprocessor Cycle refers to the time required to talk to an ARM coprocessor. I don't know what, if any, ARM coprocessors are in the GBA.
As for implementation: that doesn't lend itself to a single answer. If it were me, I'd model RAM access times independently from instruction execution times. When the CPU reads or writes memory, compare the address accessed with the previous address the CPU accessed. If they're adjacent (M(current) == M(previous) + 4), treat it as a sequential cycle with no extra penalty; otherwise add a one-cycle non-sequential penalty. Then add the waitstates imposed by the memory region you're accessing.
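A minimal sketch of that model in C, assuming 32-bit accesses. The region boundaries and waitstate values below are placeholders for illustration, not the GBA's actual figures; see the GBATEK page linked above for the real ones.

#include <stdint.h>

static uint32_t prev_addr;

/* Placeholder waitstates per memory region -- assumed values, not GBA-accurate. */
static int region_waitstates(uint32_t addr) {
    if (addr < 0x02000000) return 0;   /* fast on-chip RAM (assumed)  */
    if (addr < 0x03000000) return 2;   /* slower work RAM (assumed)   */
    return 4;                          /* cartridge space (assumed)   */
}

/* Returns the cycle cost of one 32-bit memory access at addr. */
static int access_cycles(uint32_t addr) {
    int cycles = (addr == prev_addr + 4) ? 1   /* sequential (S) cycle     */
                                         : 2;  /* non-sequential (N) cycle */
    cycles += region_waitstates(addr);
    prev_addr = addr;
    return cycles;
}

The instruction-execution model would then call access_cycles for every fetch, load, and store, and add any internal cycles on top.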
If there is no dynamic dispatch (polymorphism), "methods" are just sugary functions, perhaps with an implicit additional parameter. Accordingly, instances of classes with no polymorphic behavior are essentially C structs for the purpose of code generation.
For classical dynamic dispatch in a static type system, there is basically one predominant strategy: vtables. Every instance gets one additional pointer that refers to (a limited representation of) its type, most importantly the vtable: an array of function pointers, one per method. Since the full set of methods for every type (in the inheritance chain) is known at compile time, one can assign consecutive indices (0..N for N methods) to the methods and invoke the methods by looking up the function pointer in the vtable using this index (again passing the instance reference as an additional parameter).
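As an illustration, here is roughly what that machinery looks like when written out by hand in C; all the type and method names (Animal, speak, and so on) are hypothetical:

#include <stdio.h>

typedef struct Animal Animal;

typedef struct {
    void (*speak)(Animal *self);   /* method at index 0               */
    /* further methods get consecutive slots                          */
} AnimalVTable;

struct Animal {
    const AnimalVTable *vtable;    /* the one extra pointer per instance */
    const char *name;
};

static void dog_speak(Animal *self) { printf("%s: woof\n", self->name); }
static void cat_speak(Animal *self) { printf("%s: meow\n", self->name); }

static const AnimalVTable dog_vtable = { dog_speak };
static const AnimalVTable cat_vtable = { cat_speak };

int main(void) {
    Animal d = { &dog_vtable, "Rex" };
    Animal c = { &cat_vtable, "Tom" };
    Animal *animals[] = { &d, &c };
    for (int i = 0; i < 2; i++)
        animals[i]->vtable->speak(animals[i]);  /* dynamic dispatch */
    return 0;
}

A compiler generates essentially this, except that it fills in the vtable pointer automatically at construction time.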
For more dynamic class-based languages, typically classes themselves are first-class objects and each object instead has a reference to its class object. The class object, in turn, owns the methods in some language-dependent manner (in Ruby, methods are a core part of the object model, in Python they're just function objects with tiny wrappers around them). The classes typically store references to their superclass(es) as well, and delegate the search for inherited methods to those classes to aid metaprogramming which adds and alters methods.
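A sketch of that lookup scheme in C, assuming a name-based method table per class object; every identifier here is invented for illustration, and real runtimes add caching on top of this naive walk:

#include <stdio.h>
#include <string.h>

typedef struct Class Class;
typedef struct Object Object;
typedef void (*Method)(Object *self);

typedef struct { const char *name; Method fn; } MethodEntry;

struct Class {
    const char *name;
    Class *super;              /* superclass, or NULL            */
    MethodEntry *methods;      /* table terminated by a NULL name */
};

struct Object { Class *cls; };

static Method lookup(Class *cls, const char *name) {
    for (; cls != NULL; cls = cls->super)           /* delegate upward */
        for (MethodEntry *m = cls->methods; m->name; m++)
            if (strcmp(m->name, name) == 0)
                return m->fn;
    return NULL;                                    /* method missing  */
}

static void base_greet(Object *self) { printf("hello from %s\n", self->cls->name); }

static MethodEntry base_methods[]    = { { "greet", base_greet }, { NULL, NULL } };
static Class BaseClass               = { "Base", NULL, base_methods };
static MethodEntry derived_methods[] = { { NULL, NULL } };  /* inherits everything */
static Class DerivedClass            = { "Derived", &BaseClass, derived_methods };

int main(void) {
    Object o = { &DerivedClass };
    Method m = lookup(o.cls, "greet");   /* found on the superclass */
    if (m) m(&o);
    return 0;
}

Because the tables are ordinary data, metaprogramming is just mutating them at runtime.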
There are many other systems that aren't based on classes, but they differ significantly, so I'll only pick out one interesting design alternative: When you can add new (sets of) methods to all types at will anywhere in the program (e.g. type classes in Haskell and traits in Rust), the full set of methods isn't known while compiling. To resolve this, one creates a vtable per trait and passes them around when the trait implementation is required. That is, code like this:
void needs_a_trait(SomeTrait &x) { x.method2(1); }
ConcreteType x = ...;
needs_a_trait(x);
is compiled down to this:
typedef void (*functionpointer)(void *, int);  /* added so the sketch type-checks */
functionpointer SomeTrait_ConcreteType_vtable[] = { &method1, &method2, ... };
void needs_a_trait(void *x, functionpointer vtable[]) { vtable[1](x, 1); }
ConcreteType x = ...;
needs_a_trait(x, SomeTrait_ConcreteType_vtable);
This also means the vtable information isn't embedded in the object. If you want references to an "instance of a trait" that will behave correctly when, for example, stored in data structures that contain many different types, one can create a fat pointer (instance_pointer, trait_vtable). This is actually a generalization of the above strategy.
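A minimal sketch of such a fat pointer, reusing the hypothetical functionpointer typedef from the example above:

typedef void (*functionpointer)(void *, int);

typedef struct {
    void *instance;              /* points at the concrete object       */
    functionpointer *vtable;     /* points at that type's trait vtable  */
} SomeTraitRef;

/* Dispatch goes through the vtable carried alongside the instance. */
void needs_a_trait_ref(SomeTraitRef ref) { ref.vtable[1](ref.instance, 1); }

Since the vtable travels with the reference rather than living inside the object, a single array of SomeTraitRef can hold instances of many unrelated concrete types.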
Best Answer
In most cases, yes, the cycle time for each stage is fixed. There are some exceptions, depending on the processor. But the description you give is vastly over-simplified.

Modern processors are organised in pipelines, so that one stage of execution of one instruction can occur at the same time as stages of other instructions. While some processors use a 6-stage pipeline like the one you describe, they are a small minority. Most modern processors split the operation into many more stages, each of which takes one cycle. For example, Intel Core processors of the current generation have 19 stages, each taking a single cycle, and in some circumstances an instruction may skip one of them.

Usually, multiple instructions are executing in different stages simultaneously, but some instructions in some circumstances will prevent other operations from progressing (e.g. branch mispredictions, or instructions that must wait for data that has not been produced yet). Also, the processor core may have multiple pipelines, so multiple instructions run completely in parallel, and in some architectures not all pipelines are capable of executing all instruction types. Instruction fetch and decoding is shared between all pipelines and in many cases can handle several instructions per cycle.

In modern processors based on CISC instruction sets, like Intel x86, instructions are translated into RISC-like micro-instructions before execution, so one program instruction may translate to multiple instructions in the pipeline (or vice versa). Determining the actual performance in real-world situations is extremely difficult.
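As a back-of-the-envelope illustration of why the pipeline makes per-instruction cycle counting misleading, here is a sketch assuming an idealized pipeline that never stalls and retires one instruction per cycle once full; this is a deliberately simplified model, not any real CPU:

#include <stdio.h>

int main(void) {
    const long stages = 19;        /* assumed pipeline depth */
    const long instructions = 1000;

    /* One instruction at a time: every instruction pays the full depth. */
    long unpipelined = stages * instructions;

    /* Overlapped: after the pipeline fills, one instruction retires per cycle. */
    long pipelined = stages + (instructions - 1);

    printf("unpipelined: %ld cycles\n", unpipelined);  /* 19000 */
    printf("pipelined:   %ld cycles\n", pipelined);    /* 1018  */
    return 0;
}

Real stalls, mispredictions, and multiple issue ports push the true number somewhere between these two extremes, which is why measuring is the only reliable approach.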