Dithering is one way, as in "rawb"'s answer. In audio, the usual accepted standard for plain dithering was a triangular PDF (TPDF) dither with a peak-to-peak amplitude of 1 LSB, added to the high-resolution (e.g. analog) signal before quantisation (e.g. the ADC). The same applied not just to ADCs but to any other truncation process, such as going from studio equipment down to 16 bits for CD mastering.
This triangular PDF signal was easily generated as the sum of two uniform PDF dither signals, each 0.5 LSB pk-pk amplitude, from independent (or at least uncorrelated) random or pseudorandom generators.
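As a rough sketch of that construction (not taken from any particular product), the dither can be generated in Python from two independent uniform draws. The function names, the `pk_pk_lsb` parameter and the seed are my own assumptions. The default amplitude simply follows the figures quoted above, so change it if your reference specifies a different level.

```python
import math
import random

def tpdf_dither(rng, pk_pk_lsb=1.0):
    """One triangular-PDF dither sample, in LSBs.

    Built as the sum of two independent uniform samples, each with half
    the requested peak-to-peak amplitude (hypothetical helper).
    """
    half_span = pk_pk_lsb / 2.0          # pk-pk amplitude of each uniform source
    u1 = rng.uniform(-half_span / 2.0, half_span / 2.0)
    u2 = rng.uniform(-half_span / 2.0, half_span / 2.0)
    return u1 + u2

def dithered_quantise(x_lsb, rng, pk_pk_lsb=1.0):
    """Add dither to a high-resolution value (in LSBs), then round to an integer code."""
    return math.floor(x_lsb + tpdf_dither(rng, pk_pk_lsb) + 0.5)

rng = random.Random(1)                   # arbitrary seed, for repeatability only
dither = [tpdf_dither(rng) for _ in range(100_000)]
dither_mean = sum(dither) / len(dither)
dither_var = sum((d - dither_mean) ** 2 for d in dither) / len(dither)
```

The sample variance should come out close to twice that of one uniform source, i.e. 2 × 0.5²/12 ≈ 0.042 LSB² for the default setting.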
A lot of work was done on this in the 1980s, among others by Decca in London, who built their own studio equipment. They showed that with TPDF dither, signals (pure tones) could be detected about 20 dB below the (broadband) noise floor, with no observable harmonic distortion (i.e. nothing distinguishable from noise).
Another way is applicable if the bandwidth of interest is less than the Nyquist bandwidth, as is usually the case in oversampling converters.
Then you can improve massively on the plain dithered results. This approach, noise shaping, generally involves embedding the dithered quantiser in a closed loop with a filter in the feedback path. With a simple filter you can get one extra bit of resolution per halving in frequency as Jon Watte says in a comment, but with a third order filter you can do considerably better than this.
Consider that a 256x oversampling converter ought to give 8 bits of additional resolution according to the above equation; however, 1-bit converters operating this way routinely give 16 to 20 bit resolution.
You end up with very low noise in the bandwidth of interest (thanks to high loop gain at those frequencies), and very high out-of-band noise somewhere else, easy to filter out in a later stage (e.g. in a decimation filter). The exact result depends on the loop gain as a function of frequency.
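To make the loop idea concrete, here is a minimal sketch of a first-order, 1-bit noise-shaping loop followed by a crude averaging "decimator". It is only an illustration under my own assumptions: the function name is hypothetical, the loop filter is a plain integrator, the input is normalised to the range -1..+1, and the dither mentioned above is left out for brevity.

```python
def first_order_modulator(samples):
    """First-order 1-bit noise shaper (illustrative sketch).

    The integrator accumulates the difference between the input and the
    previous 1-bit output, so the loop forces the long-term average of
    the output to track the input.
    """
    integrator = 0.0
    previous_output = 0.0
    bits = []
    for x in samples:
        integrator += x - previous_output            # loop filter: a simple integrator
        previous_output = 1.0 if integrator >= 0.0 else -1.0   # 1-bit quantiser
        bits.append(previous_output)
    return bits

# A DC input of 0.3 (full scale = 1.0), heavily oversampled.
bits = first_order_modulator([0.3] * 4096)
recovered = sum(bits) / len(bits)        # averaging stands in for the decimation filter
```

Every output sample is only +1 or -1, yet the average over the block lands very close to 0.3, because the quantisation error has been pushed to high frequencies that the averaging removes. A real converter would use a higher-order loop and a proper decimation filter.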
Third and higher order filters make it increasingly difficult to stabilise the loop, especially if it starts generating incorrect results during overload (clipping or overflow) conditions. If you're careless or unlucky you can get rail-to-rail noise...
There are lots of papers from circa 1990 onwards by Bob Adams of dBX, Malcolm Hawksford of Essex University and many others about noise shaping converters, in the JAES (Journal of the Audio Engineering Society) and elsewhere.
Interesting historical note: when CD was first being standardised, the Philips 14-bit CD proposal went head to head with Sony's 16-bit LP-sized disc. They compromised on the slightly larger CD we still have today with 16 bits and, allegedly at Morita-san's insistence, enough recording time for Beethoven's Ninth Symphony.
Which left Philips with a pile of very nice but now useless 14-bit DACs...
So Philips' first CD players drove these DACs at 4x the sampling rate, with a simple noise shaping filter (may have been 2nd order but probably first order), and achieved performance closer to 16 bits than contemporary 16-bit DACs could. For 1983, ... Genius.
A general rule of thumb is that if you want something not to contribute to your noise budget, its SNR must be at least a factor of 10 higher than that of the dominant noise source in your signal chain. As an example, if you have a signal source that is at 300:1 SNR, run your ADC at 3000:1 and for all intents and purposes you can ignore the ADC.
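A quick way to see why the factor of 10 works is to add the two noise contributions in quadrature. The sketch below assumes the SNRs are amplitude ratios and that the noise sources are uncorrelated. The function name is just for illustration.

```python
import math

def combined_snr(*snrs):
    """Overall amplitude SNR of a chain of uncorrelated noise sources.

    With the signal normalised to 1, each source contributes an RMS noise
    of 1/snr, and uncorrelated noise powers add.
    """
    total_noise_power = sum((1.0 / s) ** 2 for s in snrs)
    return 1.0 / math.sqrt(total_noise_power)

source_snr = 300.0
adc_snr = 3000.0
total = combined_snr(source_snr, adc_snr)
loss_db = 20.0 * math.log10(source_snr / total)
```

With these numbers the combined SNR is about 298.5:1, a loss of roughly 0.04 dB, which is why the ADC can be ignored in that case.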
The only way to do this properly is to do a noise analysis.
Post-processing (in DSP, for example) has the potential to pull salient features out of the noise, but you have to be careful. You have to have sufficient bit depth so you don't introduce rounding/truncation errors. You have to ensure that you are conserving the nature of the noise (Gaussian/Poisson PDF), or else the noise floor may rise in an unpredictable way and may not be amenable to DSP techniques. These sorts of steps (matched filters etc.) can typically improve the SNR by a factor of \$ \sqrt{N} \$ at best, while the processing cost (number of operations) often scales as \$ N^2 \$, so they rapidly become very expensive. But again, a proper analysis will show this.
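The \$ \sqrt{N} \$ behaviour is easy to check numerically for the simplest such step, plain averaging of N samples. This sketch assumes zero-mean, independent Gaussian noise, which is exactly the well-behaved case. The function name, trial count and seed are arbitrary choices of mine.

```python
import random
import statistics

def residual_noise_rms(n_avg, trials, rng, sigma=1.0):
    """RMS of the noise left after averaging n_avg independent Gaussian samples."""
    means = [
        statistics.fmean(rng.gauss(0.0, sigma) for _ in range(n_avg))
        for _ in range(trials)
    ]
    return statistics.pstdev(means)

rng = random.Random(42)                  # arbitrary seed
rms_1 = residual_noise_rms(1, 2000, rng)
rms_16 = residual_noise_rms(16, 2000, rng)
rms_64 = residual_noise_rms(64, 2000, rng)
```

Averaging 16 samples should reduce the noise by about 4x, and 64 samples by about 8x. If the noise is correlated or non-stationary, the improvement will be smaller than this.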
I would caution you against assuming that a DSP technique will automatically reduce your noise. It is very important that you look at your noise sources via histogram analysis to ensure that the PDF (Probability Density Function) is amenable to processing, i.e. it appears well behaved (Gaussian or Poisson), is not multivariate, and is stationary.
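A first-pass check along those lines could look like the sketch below. It is only a starting point under my own assumptions: it bins the samples into a histogram, compares skewness and excess kurtosis against the Gaussian values (both zero), and compares the variance of the first and second halves of the record as a crude stationarity test. The helper names are hypothetical, and synthetic Gaussian data stands in for captured noise.

```python
import random
import statistics

def histogram(samples, bins=15):
    """Counts per equal-width bin between the minimum and maximum sample."""
    lo, hi = min(samples), max(samples)
    width = (hi - lo) / bins
    counts = [0] * bins
    for s in samples:
        index = min(int((s - lo) / width), bins - 1)   # clamp the maximum into the last bin
        counts[index] += 1
    return counts

def noise_summary(samples):
    """Return (skewness, excess kurtosis, first-half/second-half variance ratio)."""
    mean = statistics.fmean(samples)
    sd = statistics.pstdev(samples)
    z = [(s - mean) / sd for s in samples]
    skew = statistics.fmean(v ** 3 for v in z)
    excess_kurtosis = statistics.fmean(v ** 4 for v in z) - 3.0
    half = len(samples) // 2
    ratio = statistics.pvariance(samples[:half]) / statistics.pvariance(samples[half:])
    return skew, excess_kurtosis, ratio

rng = random.Random(3)                   # arbitrary seed; replace with real captured noise
noise = [rng.gauss(0.0, 1.0) for _ in range(20_000)]
counts = histogram(noise)
skew, excess_kurtosis, variance_ratio = noise_summary(noise)
```

For well-behaved Gaussian noise the skewness and excess kurtosis sit near zero and the variance ratio near one. Large deviations, or a histogram with several humps, are a sign that simple averaging will not behave as predicted. Poisson noise needs a different reference shape than the one tested here.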
Best Answer
The statement:
is incorrect. The analog bandwidth is going to be no more than half the sampling rate. This calculation is not necessary anyway, since you already have the RMS value for this noise.
What you need to do is compute the corresponding RMS value for the analog noise at the ADC input, which is \$5\times10^{-4}\frac{V}{\sqrt{Hz}}\times\sqrt{5000 Hz} = 3.5\times10^{-2}V\$. It will be less if you can band-limit the input signal to something less than the Nyquist bandwidth.
But this gives you a worst-case scenario. It basically says that you have roughly a 100:1 (40 dB) SNR (relative to a full-scale signal) at the ADC input, which would suggest that anything over about 7 bits will be enough.
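The arithmetic can be reproduced in a few lines. The noise density and bandwidth are the figures used above. The 3 V full-scale value is taken from the step-size calculation elsewhere in this answer, and comparing it directly with the RMS noise is the same rough ratio used in the text, not a rigorous SNR definition.

```python
import math

noise_density = 5e-4      # V/sqrt(Hz), from the question
bandwidth = 5000.0        # Hz, the Nyquist bandwidth assumed above
full_scale = 3.0          # V, assumed ADC input range

noise_rms = noise_density * math.sqrt(bandwidth)   # about 35 mV
snr_ratio = full_scale / noise_rms                 # roughly 85:1
snr_db = 20.0 * math.log10(snr_ratio)              # a little under 40 dB
useful_bits = math.log2(snr_ratio)                 # between 6 and 7 bits
```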
To address the broader issues you raise: The real question is what is the probability distribution that each source of noise introduces into the stream of samples. The quantization noise is uniformly distributed, and has a peak-to-peak amplitude that's exactly equal to the step size of the ADC: 3V/4096 = 0.732 mV.
In comparison, the AWGN over a 5000 Hz bandwidth has an RMS value of 35 mV, which means that the peak-to-peak value is going to be less than 140 mV 95% of the time and less than about 210 mV 99.7% of the time. In other words, your digital sample words will have a distribution of ±70 mV/0.732 mV = ±95 counts around the correct value, 95% of the time.
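For reference, the same comparison in code. It uses the 12-bit, 3 V converter and the 35 mV RMS noise figure from above. The q/√12 expression is the standard RMS value of a uniform error spanning one step, and the ±2σ and ±3σ limits assume Gaussian noise.

```python
import math

full_scale = 3.0                          # V
levels = 4096                             # 12-bit converter
analog_noise_rms = 0.035                  # V RMS over the 5000 Hz bandwidth

lsb = full_scale / levels                 # step size, about 0.732 mV
quantisation_rms = lsb / math.sqrt(12.0)  # uniform error over one step
spread_95_counts = 2.0 * analog_noise_rms / lsb    # +/- 2 sigma, in ADC counts
spread_997_counts = 3.0 * analog_noise_rms / lsb   # +/- 3 sigma, in ADC counts
noise_ratio = analog_noise_rms / quantisation_rms
```

The analog noise comes out more than 100 times larger than the quantisation noise in RMS terms, which is the sense in which the latter is swamped.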
EDIT:
Be careful — you're comparing a peak-to-peak signal value to an RMS noise value. Your actual peak-to-peak noise value is going to be about 4× the RMS value (95% of the time), so you're really getting about 14 bits of SNR.
The 12-bit resolution is quantization noise. And yes, its effects are reduced by subsequent narrow-bandwidth filtering.
Yes. Narrow-bandwidth filtering is a kind of long-term averaging. And the wide-bandwidth sampling is oversampled with respect to the filter output. Since the signal contains a significant amount of noise prior to quantization, this noise serves to "dither" (randomize) the signal, which, when combined with narrowband filtering in the digital domain, effectively "hides" the effects of quantization.
It might be a little more obvious if you think about it in terms of a DC signal and a 0.01-Hz lowpass (averaging) filter in the digital domain. The mean output of the filter will be the signal value plus the mean value of the noise. Since the latter is zero, the result will be the signal value. The quantization noise is "swamped out" by the analog noise. In the general case, this applies to any narrowband filter, not just a low-pass filter.
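Here is a small simulation of that DC thought experiment, as a sketch only. The DC level is deliberately placed 0.3 LSB above an ADC code, and a plain mean stands in for the very narrow low-pass filter. The 2 LSB RMS noise level is an assumption chosen to keep the run short. It is much smaller than the 35 mV discussed above, which would simply need a longer average to settle to the same accuracy.

```python
import math
import random
import statistics

def quantise(volts, lsb):
    """Ideal rounding quantiser: returns the nearest code's voltage."""
    return lsb * math.floor(volts / lsb + 0.5)

rng = random.Random(7)                    # arbitrary seed
lsb = 3.0 / 4096                          # 12-bit, 3 V converter
dc_level = 1000.3 * lsb                   # true value lies between two codes
noise_rms = 2.0 * lsb                     # assumed analog noise ahead of the ADC
n_samples = 100_000

# Without noise the quantiser returns the same code every time.
error_without_noise = (quantise(dc_level, lsb) - dc_level) / lsb

# With noise, the codes scatter and their mean converges on the true level.
mean_code = statistics.fmean(
    quantise(dc_level + rng.gauss(0.0, noise_rms), lsb) for _ in range(n_samples)
)
error_with_noise = (mean_code - dc_level) / lsb
```

Without noise the result is stuck 0.3 LSB low no matter how long you average. With noise, the averaged result ends up within a few hundredths of an LSB of the true value.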