It would be pretty hard to design a modern microcontroller at the transistor level. Transistors are used to make logic gates, the lowest-level building blocks. The simplest gate is the NOT gate (inverter), which inverts the input level: a logic 0 becomes a 1, and vice versa. The NOT gate is built with 2 transistors:
The top transistor conducts if its input (marked with the small dot) is low; the bottom transistor (no dot) conducts if its input is high. So if you apply a high input the bottom transistor conducts, acting as a closed switch, and pulls the output Q to Vss (that's your -), or low level. We've inverted the input. Other gates build on this. Take the NAND gate, for instance. It has two or more inputs, and the output is low only if all inputs are high. In all other cases the output is high.
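The pull-up/pull-down behavior described above can be sketched as a toy model in Python, assuming ideal switches (a real CMOS inverter has analog transition behavior this ignores):

```python
# Minimal sketch of a CMOS inverter (NOT gate) with ideal switches:
# the top (PMOS, "dot" input) transistor conducts when its input is low
# and pulls the output to Vdd; the bottom (NMOS) transistor conducts
# when its input is high and pulls the output to Vss.
def not_gate(a: int) -> int:
    pmos_on = (a == 0)   # top transistor: ON when input is low
    nmos_on = (a == 1)   # bottom transistor: ON when input is high
    assert pmos_on != nmos_on  # exactly one conducts at a time
    return 1 if pmos_on else 0  # pulled to Vdd (1) or Vss (0)

for a in (0, 1):
    print(a, "->", not_gate(a))  # 0 -> 1, 1 -> 0
```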
This is a 2-input NAND gate; you can still see some of the inverter in it. If both inputs are high the bottom transistors both conduct, and the output is pulled low through its connection to Vss. If either input is low, at least one of the top transistors conducts and the output is pulled high through its connection to Vdd (that's your +). We've got a working NAND gate. And so it goes on: we can build an XOR gate from 4 NAND gates, and more complex logic is built from combinations of these building blocks. You need a building block to create a register function: a part which can hold its state, like a memory.
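As a sketch of how larger gates are composed from smaller ones, here is the classic 4-NAND XOR construction mentioned above, modeled at the boolean level in Python:

```python
# A NAND gate: output is low only when all inputs are high.
def nand(a: int, b: int) -> int:
    return 0 if (a and b) else 1

# XOR built from exactly four NAND gates (the classic construction).
def xor(a: int, b: int) -> int:
    t = nand(a, b)                      # gate 1
    return nand(nand(a, t), nand(b, t)) # gates 2, 3, 4

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", xor(a, b))  # high only when inputs differ
```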
If an instruction wants to read from RAM, the instruction decoder comes first. Through a combination of gates it derives the set of signals needed to execute the instruction. One of those will pass the RAM address to be read onto the address bus. (Buses are bundles of related signals; the address bus of an 8-bit controller may for instance have 16 parallel wires. Buses are a way to keep things organized; without them the controller's design would become very inefficient.) Another signal will activate the RD line (for "read"), which tells the RAM to place the data on the data bus. Yet another signal will latch that data into a register.
An important thing is timing. You can't latch the data if you haven't given the RAM time to place it on the bus. All of this is done by the same kind of logic: building blocks which in turn are built from transistors.
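The decoded read sequence above can be sketched as a toy model in Python. The class and signal names are invented for illustration; in a real controller these are hardware signals, and the "wait for the RAM" step is enforced by the clock timing, not by code:

```python
# Illustrative model of a decoded RAM read cycle (names are made up):
# 1. put the address on the address bus, 2. assert RD, 3. the RAM drives
# the data bus, 4. only then latch the data into a register.
class RamReadCycle:
    def __init__(self, ram):
        self.ram = ram            # dict: address -> byte, stands in for RAM
        self.address_bus = None
        self.data_bus = None
        self.rd = 0
        self.register = None

    def execute(self, address):
        self.address_bus = address                    # drive the address bus
        self.rd = 1                                   # assert the RD line
        self.data_bus = self.ram[self.address_bus]    # RAM drives the data bus
        # ...only after the RAM has had time to drive the bus:
        self.register = self.data_bus                 # latch into a register
        self.rd = 0
        return self.register

cycle = RamReadCycle({0x1234: 0x5A})
print(hex(cycle.execute(0x1234)))  # 0x5a
```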
"transistors somehow increase computation speed"
There's nothing other than transistors and their connections. Transistors by themselves won't increase speed. What is true, however, is that technological improvements allow for faster transistors, and faster transistors mean faster computation. Some of these improvements are one-time technology steps, which can be taken only once. But scaling is another factor, and it has been repeated ever since the first ICs were designed. 40 years ago an IC would typically have a 5 to 8 \$\mu\$m feature size; today we can produce down to a 22 nm feature size. (DNA has a 15 nm diameter.) With a smaller feature size the physical properties of the transistor change, allowing it to switch faster. Moore recognized a trend in this continued scaling, which became known as Moore's Law. This can't go on forever without a leap in the underlying technology. In the 80s it was feared that the technology then in use would reach its limits at around 200 nm to 300 nm, because the lithographic process might not work at wavelengths below visible light. But lithography is still the technology used by today's steppers.
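A quick back-of-the-envelope check of the scaling figures quoted above (taking the 8 \$\mu\$m end of the range):

```python
# Linear shrink from an 8 um (8000 nm) feature size to 22 nm,
# and the corresponding area-density gain (the square of the linear factor).
old_nm, new_nm = 8000, 22
linear = old_nm / new_nm          # ~364x linear shrink
density = linear ** 2             # ~132,000x more transistors per area
print(f"linear shrink: {linear:.0f}x, density gain: {density:.0f}x")
```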
Your basic synchronous digital design is a sort of discrete-time feedback loop.
One or more registers, made of flip-flop-like storage elements, hold values. These flow onto buses, possibly through path selectors called multiplexers, and through combinational logic circuits which perform a logical or arithmetic operation on the value represented by some number of signals (typically each signal being a binary digit, or bit). Over a short period of time, the result flowing through the paths and logic circles back to the inputs of the registers and stabilizes. After an interval calculated to safely allow for the worst-case time this takes, another active clock edge occurs, which causes one or more of the registers to replace the values they have been holding with the updated values being fed back to their inputs.
In the case of something like a (synchronous) counter, the logic circuit sitting between the outputs of the registers and their inputs would add one, so each active clock edge would see the stored value increase by one.
More complicated operations, such as those performed by a CPU, might select two source numbers from a register array in which two locations can be accessed simultaneously through two separate read ports, add them together, and write the result back somewhere in that same register array through a third, write port.
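A minimal sketch of that dual-read, single-write register-file operation, with invented names (real register files do all three port accesses within one clock cycle in hardware):

```python
# Toy register file: two read ports and one write port used in a single
# "cycle" to compute dest = regs[src_a] + regs[src_b].
class RegisterFile:
    def __init__(self, n=16, width=8):
        self.regs = [0] * n
        self.mask = (1 << width) - 1     # wrap results to the register width

    def cycle_add(self, src_a, src_b, dest):
        a = self.regs[src_a]                    # read port 1
        b = self.regs[src_b]                    # read port 2
        self.regs[dest] = (a + b) & self.mask   # ALU add, then write port
        return self.regs[dest]

rf = RegisterFile()
rf.regs[1], rf.regs[2] = 40, 2
print(rf.cycle_add(1, 2, 3))  # 42
```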
A transistor (FET, in modern ICs) never switches instantly from full OFF to full ON. There is a period while it's turning on or off where the FET acts like a resistor (even when fully ON it still has a resistance).
As you know, passing a current through a resistor generates heat (\$P=I^2R\$ or \$P=\frac{V^2}{R}\$).
The more often the transistors switch, the more time they spend in that resistive state, and so the more heat they generate. So the amount of heat generated can be directly proportional to the number of transistors switching - but it also depends on which transistors are doing what and when, and that depends on what the chip is being instructed to do.
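To make the point concrete, here is a rough illustration with made-up numbers (the voltage, resistance, and transition time below are assumptions for the sake of the arithmetic, and this deliberately ignores the other CMOS dissipation terms such as charging the load capacitance):

```python
# While a FET transitions it behaves like a resistor, so each switching
# event burns a small amount of energy; more events per second means
# more average power, i.e. more heat.
V = 1.0        # voltage across the resistive channel (V) - assumed
R = 1000.0     # effective channel resistance mid-transition (ohms) - assumed
t_sw = 1e-10   # time spent in the resistive region per switch (s) - assumed

P_resistive = V**2 / R               # power while in the resistive region
E_per_switch = P_resistive * t_sw    # energy burned per switching event

f = 1e9                              # switching events per second
P_avg = E_per_switch * f             # average power: scales linearly with f
print(P_avg, "W")
```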
Yes, manufacturers may position specific blocks of their design (not individual transistors, but blocks that form a complete function) in certain areas depending on the heat that block could generate - either to place it in a location with better thermal contact, or to place it away from another block that may generate heat. They also have to take power distribution within the chip into account, so placing blocks arbitrarily may not always be possible, and they have to come to a compromise.