Write down Kirchhoff's voltage law for the combined loop. That is, disregard the current source and any elements in series with it, since they're not part of the loop.
The relation between the two partial currents is then:
$$I_d - I_u = I_2$$
where \$I_d\$ is the clockwise current in the lower mesh, and \$I_u\$ is the clockwise current in the upper mesh.
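To make the two-equation setup concrete, here is a minimal sketch with made-up values, since the actual circuit isn't reproduced here. It assumes a hypothetical combined loop containing one voltage source `v_source`, one resistor `r_upper` carrying \$I_u\$ and one resistor `r_lower` carrying \$I_d\$, so that KVL around the combined loop reads \$R_u I_u + R_d I_d = V\$. Together with the constraint \$I_d - I_u = I_2\$, that gives two equations in two unknowns:

```python
def solve_supermesh(v_source, r_upper, r_lower, i_2):
    """Solve a hypothetical two-mesh circuit with a shared current source.

    Assumed equations (illustrative, not the asker's actual circuit):
        KVL around the combined loop:  r_upper*I_u + r_lower*I_d = v_source
        Current-source constraint:     I_d - I_u = i_2
    """
    # Substitute I_d = I_u + i_2 into the KVL equation and solve for I_u.
    i_u = (v_source - r_lower * i_2) / (r_upper + r_lower)
    i_d = i_u + i_2
    return i_u, i_d


# Hypothetical values: 10 V source, 2 ohm and 3 ohm resistors, 1 A current source.
i_u, i_d = solve_supermesh(10.0, 2.0, 3.0, 1.0)
print(f"I_u = {i_u:.2f} A, I_d = {i_d:.2f} A")  # I_u = 1.40 A, I_d = 2.40 A
```

Note that the current source never appears in the KVL equation; it only shows up in the constraint.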
When you "turn off" a current source, you are essentially replacing it with a 0 A current source. That is an open circuit: it behaves like an infinitely large resistance, so no current flows through that branch. You can simply ignore those wires. They were probably left in the figure to make it visually easier to follow.
In figure (c), there are indeed 3 nodes: one in the middle (v1, common node of the two 4\$\Omega\$ and the 3\$\Omega\$ resistor), one on "top" of the current source (v2), and the one connected to the "bottom" of the source. They decided to make it the reference node, valued 0V for easier calculations. They could have chosen any other node for this, but generally you tend to choose the one on the bottom of your schematic. In most cases, this provides you the easiest solution.
Regarding your question on how they solved (c): from the top of the source, 3 A flows out. That current has two paths from there: through the 8\$\Omega\$ resistor or through the other branch. The sum of these two branch currents is therefore 3 A as well.
The solution can ignore the fact that the second branch also divides later on, because the current flowing through the rightmost 4\$\Omega\$ resistor equals the sum of the currents flowing through the leftmost 4\$\Omega\$ resistor and the 3\$\Omega\$ one. If you look carefully, the voltage drop v2-v1 is across the rightmost, not the leftmost, 4\$\Omega\$ resistor.
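As a rough check, the two node equations described above can be solved numerically. This sketch assumes the 8\$\Omega\$ resistor, the leftmost 4\$\Omega\$ resistor and the 3\$\Omega\$ resistor all return to the reference node; the figure isn't reproduced here, so treat that topology as an assumption. The helper `solve_2x2` is just an illustrative name.

```python
from fractions import Fraction as F


def solve_2x2(a11, a12, a21, a22, b1, b2):
    """Solve a11*x + a12*y = b1, a21*x + a22*y = b2 by Cramer's rule."""
    det = a11 * a22 - a12 * a21
    x = (b1 * a22 - a12 * b2) / det
    y = (a11 * b2 - a21 * b1) / det
    return x, y


# KCL at v1 (assumed topology): (v2 - v1)/4 = v1/4 + v1/3
#   ->  (1/4 + 1/4 + 1/3)*v1 - (1/4)*v2 = 0
# KCL at v2: 3 = v2/8 + (v2 - v1)/4
#   ->  -(1/4)*v1 + (1/8 + 1/4)*v2 = 3
v1, v2 = solve_2x2(
    F(1, 4) + F(1, 4) + F(1, 3), -F(1, 4),
    -F(1, 4), F(1, 8) + F(1, 4),
    F(0), F(3),
)

print(f"v1 = {v1} V, v2 = {v2} V")                    # v1 = 3 V, v2 = 10 V
print(f"current through 3-ohm resistor = {v1 / 3} A")  # 1 A
```

The two branch currents leaving the top of the source are then v2/8 = 1.25 A and (v2 - v1)/4 = 1.75 A, which add up to the 3 A of the source.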
And lastly, you can choose whichever type of analysis you want; with a little practice you'll get a feel for which one is easier for a given problem.
Best Answer
In theory, you could do it in constant time, but only until you run out of resources. Let's just consider hardware multipliers for now, since they will likely constrain the design. For example, the largest Xilinx Spartan-6 (the value line) contains 180 multipliers; the largest Virtex-6 contains 2128 multipliers (and will probably cost tens of thousands). These are 18-bit multipliers, but for the sake of argument treat them as abstractions. The number of multipliers then gives you the number of multiplications you can do at one time. If I understand the problem right, the square root of that number gives you the size of the dataset that can be processed in one clock cycle.
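As a back-of-the-envelope sketch of that estimate: assuming a dataset of size n needs n² multiplications, each mapped to its own hardware multiplier (an idealization that ignores precision and routing), the largest n that fits in one clock cycle is the integer square root of the multiplier count. The function name `max_dataset_size` is made up for illustration.

```python
import math


def max_dataset_size(num_multipliers: int) -> int:
    """Largest n such that n*n multiplications fit in num_multipliers.

    Assumes one hardware multiplier per product and n**2 products
    per dataset of size n (an idealized model, not a real design).
    """
    return math.isqrt(num_multipliers)


# Multiplier counts quoted in the answer above.
for name, mults in [("largest Spartan-6", 180), ("largest Virtex-6", 2128)]:
    print(f"{name}: n <= {max_dataset_size(mults)}")
# largest Spartan-6: n <= 13
# largest Virtex-6: n <= 46
```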
In practice there are additions and subtractions to worry about, the required precision will lower the number of multiplies you can do, and the connections you'll need to make across the FPGA fabric will be very dense. All of these factors lower the maximum speed at which you can clock the FPGA. Plus you have to get the data in and out of the FPGA. Thus my gut feeling is that this is not a 'killer app' for FPGAs. Conventional processors and GPUs are probably a better bet. (See R+GPU.)