An ideal digital signal has infinitely steep edges. We can compose this signal from sines, one fundamental and a number of harmonics.
None of those individual sines is infinitely steep. The only way to get an infinitely steep edge is to add an infinite number of harmonics.
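As a rough illustration of that point, here is a minimal sketch that builds a +/-1 square wave from its first few odd harmonics and reports the slope at the rising edge. The function names and the 1 Hz fundamental are assumptions made for this example only.

```python
import math

def square_partial_sum(t, f0, n_terms):
    """Sum of the first n_terms odd harmonics of a +/-1 square wave."""
    return (4 / math.pi) * sum(
        math.sin(2 * math.pi * (2 * k + 1) * f0 * t) / (2 * k + 1)
        for k in range(n_terms)
    )

def edge_slope(f0, n_terms):
    """Slope of the partial sum at the rising zero crossing (t = 0)."""
    # Differentiating each term at t = 0 gives (4/pi) * 2*pi*f0 = 8*f0,
    # independent of the harmonic number, so the slopes simply add up.
    return 8 * f0 * n_terms

for n in (1, 3, 10, 100):
    print(n, "harmonics -> edge slope", edge_slope(1.0, n))
```

The slope grows in proportion to the number of harmonics included, so it stays finite for any finite number of them.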
RF bandwidth and data rate are related by the modulation format: different modulation formats require different bandwidths for the same data rate. For FM, the bandwidth is approximately 2*(df + fm) (Carson's rule), where df is the maximum frequency deviation and fm is the highest frequency in the message. FSK is basically FM where the message signal is a square wave. The highest fundamental frequency of a serially transmitted binary sequence occurs when the sequence is 01010101, and it is one half of the bit rate. The peak deviation is half the separation between the two tones, so for FSK the bandwidth is approximately 2*(Δf/2 + r/2) = Δf + r, where Δf is the separation between the two frequencies and r is the bit rate.
This is larger than Δf because extra frequency components are generated whenever the frequency changes. Switching between frequencies more often (a higher data rate) puts more power into these extra components. They can be filtered out to some extent, but if you filter to a bandwidth narrower than Δf + r, the result will be too distorted to reliably extract the original bitstream.
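The arithmetic above can be sketched in a few lines. This is only an estimate based on the 2*(df + fm) approximation (commonly called Carson's rule); the function names and the 1000 Hz / 1200 bit/s figures are hypothetical values chosen for illustration.

```python
def carson_bandwidth(peak_deviation_hz, max_message_hz):
    """Approximate FM bandwidth: 2 * (df + fm)."""
    return 2 * (peak_deviation_hz + max_message_hz)

def fsk_bandwidth(tone_separation_hz, bit_rate):
    """Approximate FSK bandwidth, treating FSK as FM with a square-wave message."""
    # Peak deviation is half the tone separation; the fastest pattern
    # (0101...) has a fundamental at half the bit rate.
    return carson_bandwidth(tone_separation_hz / 2, bit_rate / 2)

# Hypothetical example: tones 1000 Hz apart, 1200 bit/s
print(fsk_bandwidth(1000, 1200))
```

With those assumed numbers the estimate is the tone separation plus the bit rate, matching the Δf + r expression.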
Think about it this way: a pure sinewave consumes zero bandwidth, but it also contains zero information. As soon as you start changing a characteristic of a pure sinewave (frequency, phase, amplitude, etc.), its bandwidth must increase accordingly. In the case of amplitude modulation, modulating the amplitude of a sinewave of frequency fc at frequency fm will result in a signal with components at fc, fc+fm, and fc-fm. If the message contains components all the way down to DC, then the resulting modulated signal will have twice the bandwidth of the message signal. FSK is basically transmitting two AM signals at the same time on different frequencies, so the bandwidth will naturally be increased by the separation of these two carrier frequencies.
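To see the fc, fc+fm and fc-fm components appear, here is a small sketch that amplitude-modulates a carrier and inspects its spectrum with a direct DFT (slow, but dependency-free). The sample rate, carrier, message frequency and modulation depth are all hypothetical values picked so that each DFT bin is exactly 1 Hz wide.

```python
import cmath
import math

def dft_amplitudes(samples):
    """Single-sided amplitude spectrum via a direct (slow) DFT."""
    n = len(samples)
    amps = []
    for k in range(n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(samples))
        scale = 1 if k == 0 else 2  # DC has no mirror-image bin
        amps.append(scale * abs(acc) / n)
    return amps

# Assumed values: 200 samples over 1 s, so bin k corresponds to k Hz.
FS, FC, FM, DEPTH = 200, 40, 5, 0.5
samples = [
    (1 + DEPTH * math.cos(2 * math.pi * FM * i / FS))
    * math.cos(2 * math.pi * FC * i / FS)
    for i in range(FS)
]

amps = dft_amplitudes(samples)
peaks = [k for k, a in enumerate(amps) if a > 0.01]
print(peaks)  # bins holding significant energy, in Hz
```

Only three bins carry energy: the carrier and one sideband on each side, spaced by the message frequency. The occupied bandwidth is therefore 2*fm for this single-tone message.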
For FSK, the bit rate and the symbol rate are the same. But for higher-order modulations like QPSK and QAM, each transmitted symbol can encode more than one bit, so the bit rate can be significantly higher than the symbol rate. This means that, for the same bit rate, the required transmit bandwidth is less than what would be required for AM or FSK; in other words, QPSK and QAM have higher spectral efficiency. However, QPSK and QAM are more susceptible to noise and distortion and therefore require a relatively higher SNR.
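The bit-rate versus symbol-rate relationship can be sketched as follows, assuming each symbol carries log2(M) bits for a constellation of M points. The function name and the 1200 bit/s figure are hypothetical.

```python
import math

def symbol_rate(bit_rate, constellation_points):
    """Symbols per second needed to carry bit_rate with M-point symbols."""
    bits_per_symbol = math.log2(constellation_points)
    return bit_rate / bits_per_symbol

# Hypothetical 1200 bit/s link with different constellation sizes
for name, m in (("binary FSK", 2), ("QPSK", 4), ("16-QAM", 16), ("64-QAM", 64)):
    print(name, symbol_rate(1200, m), "symbols/s")
```

Since the occupied bandwidth scales with the symbol rate rather than the bit rate, the larger constellations need less bandwidth for the same throughput, at the cost of the higher SNR mentioned above.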
Also, for FSK, you want the two frequencies to be integer multiples of the data rate. This will result in an integer number of cycles in each bit period so that the carrier always ends up at the same level on data bit transitions. This probably won't be done at RF, though. Generally the FSK signal would be generated at an intermediate frequency which would then be mixed up to the actual RF carrier frequency.
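A quick way to check the integer-multiple condition is sketched below. The helper names and the tone frequencies are hypothetical, and the carrier is assumed to start each bit at zero phase.

```python
import math

def cycles_per_bit(tone_hz, bit_rate):
    """Number of carrier cycles that fit in one bit period."""
    return tone_hz / bit_rate

def level_at_bit_end(tone_hz, bit_rate):
    """Carrier level at the end of a bit, for a sine starting at zero phase."""
    return math.sin(2 * math.pi * tone_hz / bit_rate)

# Hypothetical 1200 bit/s link: 2400 Hz is an integer multiple, 2200 Hz is not
for tone in (1200, 2400, 2200):
    print(tone, cycles_per_bit(tone, 1200), round(level_at_bit_end(tone, 1200), 3))
```

For the integer multiples the carrier returns to (numerically almost exactly) its starting level at every bit boundary, whereas the 2200 Hz tone ends mid-cycle and would produce a jump if the next bit restarted at zero phase.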
Best Answer
Theoretically a square wave has infinite bandwidth, but it still looks reasonably square even if the bandwidth is severely compromised. A square wave is "made from" a fundamental plus a series of odd harmonics of steadily decreasing amplitude. See the picture below to get an understanding: -
On the right is a sinewave then as you look down you'll see that it grows into a square wave. If all we had was the sine wave and the third harmonic we would be able to "decode" this adequately.
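As a sanity check on the "fundamental plus third harmonic" claim, here is a minimal sketch that builds that two-term waveform, slices it at zero, and compares the result with the ideal square wave. The sample count and function names are assumptions for this example, and the channel is taken to be noise-free.

```python
import math

def two_harmonic_wave(t, f0=1.0):
    """Fundamental plus third harmonic of a +/-1 square wave."""
    return (4 / math.pi) * (
        math.sin(2 * math.pi * f0 * t)
        + math.sin(2 * math.pi * 3 * f0 * t) / 3
    )

def slice_bits(samples):
    """Decide 1 or 0 by comparing each sample with a zero threshold."""
    return [1 if s > 0 else 0 for s in samples]

N = 64  # samples per period (assumed)
# Two periods, sampled between grid points to avoid the exact zero crossings
times = [(i + 0.5) / N for i in range(2 * N)]

decoded = slice_bits(two_harmonic_wave(t) for t in times)
ideal = [1 if (t % 1.0) < 0.5 else 0 for t in times]
print(decoded == ideal)
```

Every sample falls on the correct side of the threshold, so a simple comparator recovers the original pattern even though the waveform is visibly rounded.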
This means we can send fairly pure digital data (very good looking waveforms with fast rise and fall times) over a very limited-bandwidth channel and decode them successfully. I had to put this pretty moving picture in: -
http://upload.wikimedia.org/wikipedia/en/thumb/5/50/Square_wave_frequency_spectrum_animation.gif/600px-Square_wave_frequency_spectrum_animation.gif
It shows the gradual evolution of a square wave from a sine wave, along with its spectrum.
ADDED SECTION
It's also worth pointing out that data is hardly ever a perfect square wave; more likely it is a quickly changing pulse waveform, so I'm also showing below the spectrum shape for a generalized pulse: -
For non 50:50 waveforms (i.e. non-square) both odd and even harmonics are generated. I've also shown the triangle spectrum - it is of considerable interest when the digital data is slew rate limited in order to restrict the bandwidth. Compare the spectral content between this and the square wave directly above. Picture taken from here
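The harmonic behaviour described above can be checked numerically from the standard Fourier-series amplitudes. This is a sketch for ideal waveforms only: a 0-to-1 rectangular pulse train with an adjustable duty cycle, and +/-1 square and triangle waves. The function names are hypothetical.

```python
import math

def pulse_harmonic(n, duty):
    """Amplitude of harmonic n for a 0-to-1 rectangular pulse train."""
    return 2 / (n * math.pi) * abs(math.sin(n * math.pi * duty))

def square_harmonic(n):
    """Amplitude of harmonic n for a +/-1 square wave (odd n only)."""
    return 4 / (n * math.pi) if n % 2 else 0.0

def triangle_harmonic(n):
    """Amplitude of harmonic n for a +/-1 triangle wave (odd n only)."""
    return 8 / (n * math.pi) ** 2 if n % 2 else 0.0

for n in range(1, 8):
    print(n,
          round(pulse_harmonic(n, 0.5), 4),   # 50:50 duty: even harmonics vanish
          round(pulse_harmonic(n, 0.25), 4),  # 25% duty: even harmonics appear
          round(square_harmonic(n), 4),
          round(triangle_harmonic(n), 4))
```

The triangle's harmonics fall off as 1/n^2 rather than 1/n, which is why slew-rate limiting the edges reduces the high-frequency content so effectively.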