RF bandwidth and data rate are related by the modulation format: different modulation formats require different bandwidths for the same data rate. For FM, the bandwidth is approximately 2*(df + fm) (Carson's rule), where df is the maximum frequency deviation and fm is the highest frequency in the message. FSK is basically FM where the message signal is a square wave. The highest fundamental frequency of a binary bit sequence transmitted serially occurs when the sequence is 01010101, and it is one half of the bit rate. With two tones separated by Δf, the peak deviation is Δf/2, so for FSK the bandwidth is approximately 2*(Δf/2 + r/2) = Δf + r, where Δf is the separation between the two frequencies and r is the bit rate. This is larger than Δf because every change in frequency generates extra frequency components. Switching between frequencies more often (a higher data rate) puts more power into these extra components. They can be filtered out to some extent, but if you filter the signal to a bandwidth narrower than about Δf + r, the result will be too distorted to reliably extract the original bitstream.
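As a rough sketch of the arithmetic above, the following Python applies the 2*(df + fm) approximation to FSK. The function names and the example numbers (1000 Hz tone separation, 1200 bit/s) are illustrative assumptions, not values from the discussion.

```python
def carson_bandwidth(peak_deviation_hz, max_message_hz):
    """Approximate FM bandwidth: 2 * (df + fm)."""
    return 2.0 * (peak_deviation_hz + max_message_hz)


def fsk_bandwidth(tone_separation_hz, bit_rate):
    """Approximate binary FSK bandwidth: delta_f + r.

    The peak deviation is half the tone separation, and the highest
    fundamental in the bitstream (the 0101... pattern) is half the bit rate.
    """
    peak_deviation_hz = tone_separation_hz / 2.0
    max_message_hz = bit_rate / 2.0
    return carson_bandwidth(peak_deviation_hz, max_message_hz)


# Hypothetical example: tones 1000 Hz apart, 1200 bit/s.
print(fsk_bandwidth(1000.0, 1200.0))  # 2200.0
```

This is only the rule-of-thumb estimate; the actual occupied bandwidth depends on pulse shaping and filtering.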
Think about it this way: a pure sine wave consumes zero bandwidth, but it also carries zero information. As soon as you start changing a characteristic of a pure sine wave (frequency, phase, amplitude, etc.), its bandwidth must increase accordingly. In the case of amplitude modulation, modulating the amplitude of a sine wave of frequency fc at frequency fm results in a signal with components at fc, fc+fm, and fc-fm. If the message contains components all the way down to DC, the modulated signal will have twice the bandwidth of the message signal. FSK is basically transmitting two AM signals at the same time on different frequencies, so the bandwidth is naturally increased by the separation of these two carrier frequencies.
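The sideband claim can be checked numerically. The sketch below, using only the Python standard library, builds an AM signal and evaluates a plain DFT to find which 1 Hz bins carry energy. The sample rate, carrier, tone frequency, and modulation depth are arbitrary values chosen so that every component lands exactly on a bin.

```python
import cmath
import math


def am_signal(fc, fm, depth, fs, n):
    """Carrier at fc, amplitude-modulated by a tone at fm."""
    return [
        (1.0 + depth * math.cos(2 * math.pi * fm * i / fs))
        * math.cos(2 * math.pi * fc * i / fs)
        for i in range(n)
    ]


def dft_magnitude(samples, k):
    """Normalized magnitude of DFT bin k (direct sum, no FFT library)."""
    n = len(samples)
    acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
              for i, x in enumerate(samples))
    return abs(acc) / n


fs = 500   # sample rate in Hz (assumed)
n = 500    # number of samples, so each bin is 1 Hz wide
samples = am_signal(fc=100, fm=10, depth=0.5, fs=fs, n=n)

# Keep only the bins with non-negligible energy.
peaks = [k for k in range(n // 2) if dft_magnitude(samples, k) > 0.01]
print(peaks)  # [90, 100, 110]
```

The occupied span, 110 - 90 = 20 Hz, is twice the 10 Hz message frequency.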
For binary FSK, the bit rate and the symbol rate are the same. But for higher-order modulations like QPSK and QAM, each transmitted symbol can code for more than one bit, so the bit rate can be significantly higher than the symbol rate. This means that, for the same bit rate, the required transmit bandwidth is less than what would be required for AM or FSK. In other words, QPSK and QAM have higher spectral efficiency. However, QPSK and QAM are more susceptible to noise and distortion and therefore require a relatively higher SNR.
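To make the bit rate versus symbol rate relationship concrete, here is a small sketch. It assumes a constellation of M points carries log2(M) bits per symbol; the 1 Mbit/s figure and the list of schemes are illustrative.

```python
import math


def symbol_rate(bit_rate, constellation_size):
    """Symbols per second needed to carry bit_rate with M-point symbols."""
    bits_per_symbol = math.log2(constellation_size)
    return bit_rate / bits_per_symbol


bit_rate = 1e6  # 1 Mbit/s, hypothetical
for name, m in (("2-FSK", 2), ("QPSK", 4), ("16-QAM", 16), ("64-QAM", 64)):
    print(f"{name}: {symbol_rate(bit_rate, m):.0f} symbols/s")
```

Since the occupied bandwidth scales with the symbol rate, fewer symbols per second for the same bit rate means a narrower signal.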
Also, for FSK, you want the two frequencies to be integer multiples of the data rate. This gives an integer number of cycles in each bit period, so the carrier is always at the same point in its cycle at data bit transitions. This probably won't be done at RF, though. Generally the FSK signal is generated at an intermediate frequency, which is then mixed up to the actual RF carrier frequency.
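A quick way to check a candidate tone pair against this rule is to count cycles per bit period. The helper names and the two example tone pairs below are hypothetical.

```python
def cycles_per_bit(tone_hz, bit_rate):
    """Number of carrier cycles that fit in one bit period."""
    return tone_hz / bit_rate


def has_whole_cycles(tone_hz, bit_rate, tol=1e-9):
    """True if the tone completes an integer number of cycles per bit."""
    cycles = cycles_per_bit(tone_hz, bit_rate)
    return abs(cycles - round(cycles)) < tol


bit_rate = 1200.0  # bit/s, assumed
for tones in ((1200.0, 2400.0), (1200.0, 2200.0)):
    ok = all(has_whole_cycles(f, bit_rate) for f in tones)
    print(tones, "whole cycles per bit:", ok)
```

The first pair satisfies the rule (1 and 2 cycles per bit); the second does not, because 2200 Hz is not a multiple of 1200 bit/s.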
The two formulas do not calculate the same thing. The Nyquist bit rate formula is for a noiseless channel and gives the maximum bit rate for a given channel bandwidth and number of signaling levels. Note that the formula uses the number of signal levels, not their actual values. It simply combines the bandwidth, which determines how fast symbols can be sent, with the number of signal levels per symbol, which determines how many bits can be sent during one signaling interval, to give the maximum number of bits/second that can be sent. The Shannon formula is for a channel with noise and combines the channel bandwidth and the signal-to-noise ratio to determine the maximum number of bits/second that can be sent over that channel. It does use the signal level, in the form of the signal-to-noise ratio. The Nyquist formula, as already noted, does not use the signal level, which is immaterial because the formula assumes there is no noise.
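Side by side, the two formulas look like this. This sketch uses the usual textbook forms, 2B log2(M) for the Nyquist bit rate and B log2(1 + SNR) for Shannon; the 3 kHz channel, 4 levels, and 30 dB SNR are assumed example values.

```python
import math


def nyquist_bit_rate(bandwidth_hz, levels):
    """Noiseless channel: 2 * B * log2(M) bits per second."""
    return 2.0 * bandwidth_hz * math.log2(levels)


def shannon_capacity(bandwidth_hz, snr_linear):
    """Noisy channel: B * log2(1 + SNR) bits per second (SNR as a ratio)."""
    return bandwidth_hz * math.log2(1.0 + snr_linear)


bandwidth_hz = 3000.0  # hypothetical 3 kHz channel

# Nyquist depends only on how many levels there are, not on their size.
print(nyquist_bit_rate(bandwidth_hz, 4))  # 12000.0

# Shannon depends on the signal level relative to the noise.
snr_linear = 10 ** (30.0 / 10.0)  # 30 dB expressed as a power ratio
print(round(shannon_capacity(bandwidth_hz, snr_linear)))
```

Note the different inputs: the first function never sees a noise figure, and the second never sees a level count.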
Best Answer
The Shannon capacity limit holds irrespective of the modulation scheme used. It is the theoretical limit given an ideal choice of modulation and channel coding. The Shannon limit is as fundamental a rule in communications engineering as the first law of thermodynamics is in mechanical engineering.
For example, let's assume you have a 20 MHz wide AWGN channel with a 20 dB signal-to-noise ratio. The Shannon-Hartley theorem gives:
$$C = B \log_2(1 + \mathrm{SNR}) = 20 \times 10^{6} \times \log_2\left(1 + 10^{20/10}\right) \approx 133.2\ \text{Mbit/s}$$
This is the upper limit. Provided you pick the optimum modulation scheme and forward error correction code, you can get 133.2 Mbit/s out of the channel, but no more. To get a higher error-free data rate, you need to either improve the SNR or increase your bandwidth.
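The worked example is easy to reproduce, and the same function shows the two ways to raise the limit. This is a minimal sketch; `capacity_bps` is a name chosen here, and the 30 dB and 40 MHz variations are illustrative.

```python
import math


def capacity_bps(bandwidth_hz, snr_db):
    """Shannon-Hartley capacity for an AWGN channel, SNR given in dB."""
    snr_linear = 10 ** (snr_db / 10.0)
    return bandwidth_hz * math.log2(1.0 + snr_linear)


# The example from the text: 20 MHz channel, 20 dB SNR.
base = capacity_bps(20e6, 20.0)
print(f"{base / 1e6:.1f} Mbit/s")  # 133.2 Mbit/s

# Two ways to raise the limit: a better SNR or a wider channel.
print(f"{capacity_bps(20e6, 30.0) / 1e6:.1f} Mbit/s at 30 dB")
print(f"{capacity_bps(40e6, 20.0) / 1e6:.1f} Mbit/s at 40 MHz")
```

At a fixed SNR, capacity grows linearly with bandwidth but only logarithmically with signal power.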
Until the early 1990s, it was assumed that getting close to the Shannon limit was not feasible in practice. That changed with the introduction of turbo codes, which underpin most 3G and LTE telecom networks. With a turbo code, you can get very close to the Shannon limit, provided that you can accept the processing required.