Your real confusion seems to be a fundamental misunderstanding of what DSPs do. DSPs are optimized to perform convolutions. Since a coefficient has to be stored and a multiply-accumulate performed for each point of the convolution, the number of points is limited by memory and available processor time. The convolution therefore must by necessity have some finite width, so these types of filters are often referred to as finite impulse response, or FIR, filters.
Other than the restriction on the width of the convolution, nothing in the DSP hardware says what you can do with that convolution, or more specifically, what coefficients you can use. All the coefficients together form the function you are convolving an input signal with. They are sometimes collectively called the filter kernel.
There are many possible uses for this basic capability provided by DSPs. Sometimes the desire is to eliminate all content past some frequency while not altering content below that frequency, but that is only one of many useful things a wide digital convolution can do.
However, even when a DSP is used in this way, it is not done with a "window of rectangular shape". There will always be a window of some finite size (that's the basis of a FIR filter), but the shape of that window is rarely rectangular. Using DSP hardware to implement a rectangular filter is rather a waste. Since all coefficients are equal, you can implement this specific case of convolution with a circular buffer, two multiplies, and two adds per sample, regardless of how wide the buffer is. This is sometimes called a "moving average" filter, or "box" filter. For most purposes these don't have very good characteristics. They seem to be used a lot for two reasons: they are the knee-jerk reaction of those who didn't pay attention in signal processing class, and they are conceptually easy to implement.
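To illustrate the point, here is a minimal sketch in Python/NumPy (the window length of 8 is an arbitrary illustrative choice) of the running-sum form of a moving-average filter, which does constant work per sample no matter how wide the window is:

```python
import numpy as np

def moving_average(x, n):
    """Recursive moving-average (box) filter: O(1) work per sample
    regardless of the window width n. A circular buffer holds the last
    n samples; each step adds the incoming sample to a running sum and
    subtracts the sample that falls out of the window."""
    buf = np.zeros(n)                       # circular buffer of last n samples
    acc = 0.0                               # running sum of buffer contents
    out = np.empty(len(x))
    for i, s in enumerate(x):
        j = i % n                           # circular-buffer index
        acc += s - buf[j]                   # add newest, subtract oldest
        buf[j] = s
        out[i] = acc / n                    # scale by 1/n
    return out

# Equivalent to convolving with a length-n rectangular kernel,
# but without n multiply-accumulates per sample:
x = np.random.default_rng(0).standard_normal(1000)
ref = np.convolve(x, np.ones(8) / 8)[:1000]     # direct convolution
y = moving_average(x, 8)
```

The per-sample cost is what makes the box filter cheap; its poor frequency response (slow roll-off, large sidelobes) is what makes it a poor choice for most filtering tasks.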
The specific case of a sharp cutoff low pass filter requires the filter kernel to be a sinc function. A sinc in the time domain maps to a rectangle in the frequency domain, and vice versa.
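To make that concrete, here is a hedged sketch (Python/NumPy; the cutoff, tap count, and choice of a Hamming window are my illustrative assumptions, not prescribed above) of a windowed-sinc low-pass kernel:

```python
import numpy as np

fc = 0.1          # cutoff as a fraction of the sample rate (illustrative)
n_taps = 101      # kernel length; more taps -> sharper cutoff

m = np.arange(n_taps) - (n_taps - 1) / 2    # sample indices centered on 0
kernel = 2 * fc * np.sinc(2 * fc * m)       # ideal sinc, truncated to n_taps
kernel *= np.hamming(n_taps)                # window tames the truncation ripple
kernel /= kernel.sum()                      # unity gain at DC

# The frequency response approximates a rectangle:
# ~1 in the passband, ~0 in the stopband.
H = np.abs(np.fft.rfft(kernel, 4096))
freqs = np.fft.rfftfreq(4096)               # in fractions of the sample rate
passband = H[freqs < 0.05].min()
stopband = H[freqs > 0.15].max()
```

The truncation (and the window that softens it) is why a real FIR filter only approximates the ideal brick wall: the transition band narrows as you spend more taps.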
You also seem to be confused in thinking that an FFT is somehow involved. A Fourier transform or lots of other analysis tools may be used to determine what the filter kernel should be, but once the kernel coefficients are determined it's all just a convolution at run time. If you start out knowing what you want to do to a signal in terms of a frequency domain multiplication, then it takes a Fourier transform to find the filter kernel that will realize that operation in the time domain as a convolution. However, there are many possible criteria for manipulating a signal, and not all of those may be expressed in the frequency domain. Some may come at you directly in the time domain, in which case no Fourier analysis may be needed to determine the filter kernel.
The main reason that frequency-domain processing isn't done directly is the latency involved. In order to do, say, an FFT on a signal, you have to first record the entire time-domain signal, beginning to end, before you can convert it to frequency domain. Then you can do your processing, convert it back to time domain and play the result. Even if the two conversions and the signal processing in the middle are effectively instantaneous, you don't get the first result sample until the last input sample has been recorded. But you can get "ideal" frequency-domain results if you're willing to put up with this. For example, a 3-minute song recorded at 44100 samples/second would require you to do 8 million point transforms, but that's not a big deal on a modern CPU.
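For example (a toy sketch in Python/NumPy, with a short 2-second "recording" and made-up tone frequencies standing in for the 3-minute song):

```python
import numpy as np

# "Ideal" whole-signal frequency-domain filtering: record everything
# first, transform once, zero the unwanted bins, transform back.
fs = 1000                                # assumed sample rate, Hz
t = np.arange(fs * 2) / fs               # 2 s of fully recorded signal
x = np.sin(2 * np.pi * 50 * t) + np.sin(2 * np.pi * 300 * t)

X = np.fft.rfft(x)                       # one transform over the whole record
freqs = np.fft.rfftfreq(len(x), 1 / fs)
X[freqs > 100] = 0                       # brick-wall low-pass at 100 Hz
y = np.fft.irfft(X, len(x))              # back to the time domain

# The 300 Hz tone is gone; the 50 Hz tone survives essentially intact.
residual = np.max(np.abs(y - np.sin(2 * np.pi * 50 * t)))
```

Note that not a single output sample exists until the entire input has been captured: that is exactly the latency cost described above.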
You might be tempted to break the time-domain signal into smaller, fixed-size blocks of data and process them individually, reducing the latency to the length of a block. However, this doesn't work because of "edge effects" — the samples at either end of a given block won't line up properly with the corresponding samples of the adjacent blocks, creating objectionable artifacts in the results.
This happens because of assumptions that are implicit in the process that converts between time domain and frequency domain (and vice-versa). For example, the FFT and IFFT "assume" that the data is cyclic; in other words, that blocks of identical time-domain data come before and after the block being processed. Since this is in general not true, you get the artifacts.
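A tiny numeric demo of that cyclic assumption (Python/NumPy): multiplying two FFTs and inverse-transforming gives *circular* convolution, so the last samples of the block wrap around and contaminate the first ones:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
h = np.array([1.0, 1.0, 0.0, 0.0])      # 2-tap averaging kernel (unscaled)

# FFT-multiply-IFFT is circular convolution:
circular = np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)))
# What we actually wanted (linear convolution, trimmed to block length):
linear = np.convolve(x, h)[:4]

# circular[0] mixes x[0] with x[3] wrapped in from the "previous" block:
# circular = [5, 3, 5, 7], but linear = [1, 3, 5, 7]
```

The mismatch at the block edge is exactly the artifact that makes naive block-by-block FFT processing sound wrong.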
Time-domain processing may have its issues, but the fact that you can control the latency, and that it doesn't produce periodic artifacts, makes it a clear winner in most real-time signal-processing applications.
(This is an expanded version of my previous answer.)
Well, it's stored digitally now, right? So are you planning on putting your microphone next to the speaker, after an analog filter, to re-record it?
Enough messing around, I'll be serious.
In order to make a filter attenuate more over a narrower range of frequencies, i.e. to make the frequency response curve roll off more steeply, you just need to increase the order of the filter.
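For instance (a sketch using the standard Butterworth magnitude formula; the cutoff frequency and the orders compared are my illustrative choices):

```python
import numpy as np

# Butterworth magnitude response: |H(f)| = 1 / sqrt(1 + (f/fc)^(2N)).
# Higher order N -> steeper roll-off, roughly 6 dB/octave per order.
fc = 1000.0                     # illustrative cutoff frequency, Hz
f = 2 * fc                      # evaluate one octave above the cutoff

def butterworth_mag_db(f, fc, order):
    return 20 * np.log10(1.0 / np.sqrt(1.0 + (f / fc) ** (2 * order)))

db2 = butterworth_mag_db(f, fc, 2)   # about -12 dB one octave up
db8 = butterworth_mag_db(f, fc, 8)   # about -48 dB one octave up
```

In software, going from order 2 to order 8 is a one-line change; in analog hardware it means several more op-amp stages, each with its own component tolerances.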
That is something that is reasonably easy to do in Matlab. It's also something that is feasible to do in post-processing. It's also about repeatability: if you apply the filter on a sunny day today, you expect it to work identically tomorrow when it's raining. You expect it to work exactly the same, right?
In analog circuits you have all these "5% resistors", "1% capacitors", and so on. So if you want something exact, you will definitely need to trim the circuit afterwards so it matches your desired filter perfectly. And if you want to increase the order of the filter... then, sadly, it makes the filter much larger physically. Instead of taking up the size of a credit card, it will take up, well, some size that depends on the filter order and what you're okay with.
Regarding the repeatability: today it's warm, tomorrow it's colder, and the resistances change ever so slightly, so the frequency response shifts, a couple of Hz here, some there. The more components you have in your circuit, the more likely it is that some of them will drift. And then you have humidity, oxidation...
And here's the punchline I should have led with: you can't really post-process an analog recording, unless you've got cassette tapes. I'm not 100% sure which analog music media can be recorded and erased easily; LP discs would be a nightmare...
And let's not forget the price. One is software, essentially free if you write it yourself; the other requires components, physical parts.
But don't think analog filters are bad; they have their uses, such as removing nasty harmonics in large DC motors, or making ultra-silent stepper motors for 3D printers by smoothing out the current. And tons of other uses. Also, if you did solve this with an analog filter, no one would think it was a bad solution.
I believe I'm indirectly answering why the FFT is the better way to go about it, post-processing-wise. The bottom line is that it's much cheaper to do. You could also just apply a notch filter if you know what frequency the noise is at, or a wider one, i.e. a band-stop filter.
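As a sketch of that notch-filter idea (Python/NumPy, using the well-known RBJ audio-EQ-cookbook biquad; the sample rate, notch frequency, Q, and tone frequencies are my illustrative choices):

```python
import numpy as np

# Digital notch to kill a single known noise frequency.
fs, f0, Q = 8000.0, 1000.0, 30.0        # sample rate, notch freq, sharpness
w0 = 2 * np.pi * f0 / fs
alpha = np.sin(w0) / (2 * Q)

# RBJ cookbook notch coefficients, normalized so a[0] == 1:
b = np.array([1.0, -2 * np.cos(w0), 1.0]) / (1 + alpha)
a = np.array([1.0, -2 * np.cos(w0) / (1 + alpha), (1 - alpha) / (1 + alpha)])

def biquad(x, b, a):
    """Direct-form I difference equation:
    y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]."""
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = (b[0] * x[n]
                + (b[1] * x[n - 1] if n >= 1 else 0.0)
                + (b[2] * x[n - 2] if n >= 2 else 0.0)
                - (a[1] * y[n - 1] if n >= 1 else 0.0)
                - (a[2] * y[n - 2] if n >= 2 else 0.0))
    return y

t = np.arange(int(fs)) / fs                 # 1 s of audio
noise = np.sin(2 * np.pi * f0 * t)          # the tone we want gone
signal = np.sin(2 * np.pi * 200 * t)        # the content we want to keep
y = biquad(signal + noise, b, a)
```

Changing the notch frequency is literally editing `f0`; the analog equivalent means swapping physical components.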
And the last thing I want to add... wow, this answer is long, I'm sorry. Suppose you use an analog filter, mess up your calculations, think it's all fine and dandy, and use it at some serious event, like interviewing the king of Sweden (Knugen). You messed up the sizing of a capacitor, so instead of filtering out 16 kHz noise, you're filtering out 4 kHz "noise". If you instead deal with it digitally, it's just a matter of changing some variables; you don't need to desolder and solder another component. Also, the interview is ruined.