The external data-bus width doesn't always agree with the processor's internal structure. A well-known example is the old Intel 8088 processor, which was identical to the 16-bit 8086 internally, but had an 8-bit external bus.
Data-bus width is not a real indicator of the processor's power, though a narrower bus may limit data throughput. The actual power of a processor is determined by the CPU's ALU (Arithmetic and Logic Unit). 8-bit microcontrollers have 8-bit ALUs, which can process data in the range 0..255. That's enough for text processing: the ASCII character table only needs 7 bits. The ALU can do some basic arithmetic, but for larger numbers you'll need software help. If you want to add 100500 + 120760, an 8-bit ALU can't do that directly, and neither can a 16-bit ALU, since both numbers exceed 65535. So the compiler will split the numbers, do separate calculations on the parts, and recombine the results afterwards.
Suppose you have a decimal ALU which can process numbers of up to 3 decimal digits. The compiler will split 100500 into 100 and 500, and 120760 into 120 and 760. The CPU calculates 500 + 760 = 1260, which it stores as 260 plus an overflow (carry) of 1. It takes the overflow digit and adds it to 100 + 120, so that the sum is 221. It then recombines the two parts so that you get the final result 221260. This way you can do anything. The three digits were no obstacle to processing 6-digit numbers, and you can write algorithms for processing numbers of 10 digits or more. Of course the calculation will take longer than with an ALU which can do 10-digit calculations natively, but it can be done.
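The split, add-with-carry, recombine procedure can be sketched in a few lines of Python. This is only an illustration: the function names `split_limbs`, `add_limbs` and `join_limbs` are made up for this example, and a real compiler emits add-with-carry machine instructions rather than anything like this. The `base` parameter stands in for the ALU width: 1000 for the 3-digit decimal ALU, 256 for an 8-bit ALU.

```python
def split_limbs(n, base, count):
    """Split a non-negative integer into `count` parts, least significant first."""
    limbs = []
    for _ in range(count):
        limbs.append(n % base)
        n //= base
    return limbs


def add_limbs(a, b, base):
    """Add two equal-length limb lists part by part, passing the carry along."""
    result = []
    carry = 0
    for x, y in zip(a, b):
        total = x + y + carry        # never larger than 2 * base - 1
        result.append(total % base)  # the part that fits in the ALU
        carry = total // base        # the overflow: 0 or 1
    result.append(carry)             # a final carry becomes an extra limb
    return result


def join_limbs(limbs, base):
    """Recombine limbs (least significant first) into one integer."""
    value = 0
    for limb in reversed(limbs):
        value = value * base + limb
    return value


# The 3-digit decimal ALU from the text
a = split_limbs(100500, 1000, 2)     # [500, 100]
b = split_limbs(120760, 1000, 2)     # [760, 120]
s = add_limbs(a, b, 1000)            # [260, 221, 0]
print(join_limbs(s, 1000))           # 221260

# The same addition on an 8-bit ALU: three bytes per number
s8 = add_limbs(split_limbs(100500, 256, 3), split_limbs(120760, 256, 3), 256)
print(join_limbs(s8, 256))           # 221260
```

The only thing that changes between the decimal and the 8-bit case is the base; the carry handling is identical.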
Any computer can simulate any other computer.
The humble 8-bit processor can do exactly what a supercomputer can, given the necessary resources, and the time. Lots of time :-).
Arbitrary-precision calculators are a concrete example. Most (software) calculators have something like 15 decimal digits of precision; if numbers have more significant digits they will round them and possibly switch to mantissa + exponent form to store and process them. But arbitrary-precision calculators expand on the example calculation I gave earlier, and they allow you to multiply
\$ 44402958666307977706468954613 \times 595247981199845571008922762709 \$
for example, two numbers (they're both prime) which are far wider than my PC's 64-bit data bus. An extreme example: Mathematica gives you \$\pi\$ to 100000 digits in a tenth of a second. Calculating \$e^{\pi \sqrt{163}}\$ \$^{(1)}\$ to 100000 digits takes about half a second. So, while you would expect working with data wider than the data bus to be taxing, it's often not really a problem. For a PC running at 3 GHz this may not be surprising, but microcontrollers are getting faster as well: an ARM Cortex-M3 may run at speeds greater than 100 MHz, and for the same money you get a 32-bit bus too.
\$^{(1)}\$ About 262537412640768743.99999999999925007259, and it's not a coincidence that it's nearly an integer!
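To illustrate how such a multiplication can be broken down into word-sized steps, here is a minimal schoolbook sketch in Python. It assumes 32-bit limbs (`BASE = 2**32`), and the helper names `to_limbs`, `mul_limbs` and `from_limbs` are invented for this example. It is not how Mathematica or any particular library does it; production bignum code uses faster algorithms for large operands.

```python
BASE = 2 ** 32  # assumption: one limb is one 32-bit word


def to_limbs(n):
    """Split a non-negative integer into 32-bit limbs, least significant first."""
    limbs = []
    while n:
        limbs.append(n % BASE)
        n //= BASE
    return limbs or [0]


def mul_limbs(a, b):
    """Schoolbook multiplication: every limb of a times every limb of b."""
    result = [0] * (len(a) + len(b))
    for i, x in enumerate(a):
        carry = 0
        for j, y in enumerate(b):
            # x * y is a double-width (64-bit) product; adding the old
            # digit and the carry still fits in 64 bits.
            total = result[i + j] + x * y + carry
            result[i + j] = total % BASE
            carry = total // BASE
        result[i + len(b)] += carry
    return result


def from_limbs(limbs):
    """Recombine limbs (least significant first) into one integer."""
    value = 0
    for limb in reversed(limbs):
        value = value * BASE + limb
    return value


p = 44402958666307977706468954613
q = 595247981199845571008922762709
product = from_limbs(mul_limbs(to_limbs(p), to_limbs(q)))
print(product == p * q)  # compare against Python's built-in big integers
```

Each inner step only needs a 32 × 32 → 64-bit multiply plus additions, which is why the width of the ALU limits speed rather than what can be computed.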
In the CPU it's all heat. It's the changing from 0 to 1 and back (which ultimately is what a computer does) which consumes the energy, because charge has to be moved from one place to another, and it's this current (moving charge) through resistance which causes heat. \$P = I^2 \times R\$
Ideally a computer which doesn't perform any tasks consumes no energy, but there are always tiny charge leaks, and in a 1-billion-transistor processor like a Pentium the sum of all those small leaks still causes a significant power loss.
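As a rough worked form of the switching argument, the usual first-order textbook approximation for CMOS logic (not something derived above) is

\$ P_{dynamic} \approx \alpha \times C \times V^2 \times f \$

where \$\alpha\$ is the fraction of nodes that toggle per clock cycle, \$C\$ the total switched capacitance, \$V\$ the supply voltage and \$f\$ the clock frequency. With no switching (\$\alpha = 0\$) this term is zero, and only the leakage remains.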
SMI is generated for hardware faults. As most communication between the CPU and other components is packet-based, faults can be signalled using error packets, and the SMI is then generated internally. A dedicated pin exists if there is an interface that is not packet-based that needs to signal hardware faults that may be recoverable -- typically that means a memory interface.
SCI is an implementation detail of ACPI. The ACPI virtual machine has read and write access to hardware, but no dedicated "enter system management mode" instruction, so there is a system controller that provides a register that will trigger an SCI when written to. The BIOS sets up an appropriate handler and provides wrapper code in ACPI AML for those functions that are handled outside the OS context (mostly, suspend/resume/reset/poweroff). From the OS point of view, suspending the computer is writing a value to a "suspend controller". If the controller is implemented inside the CPU, there is no need to have a physical pin here either.