Your real confusion seems to be a fundamental misunderstanding of what DSPs do. DSPs are optimized to perform convolutions. Since a coefficient has to be stored and a multiply-accumulate performed for each point of the convolution, the number of points is limited by memory and available processor time. The convolutions therefore must have some finite width, so these types of filters are often referred to as finite impulse response, or FIR.
Other than the restriction on the width of the convolution, nothing in the DSP hardware says what you can do with that convolution, or more specifically, what coefficients you can use. All the coefficients together form the function you are convolving an input signal with. They are sometimes collectively called the filter kernel.
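To make the multiply-accumulate idea concrete, here is a minimal Python sketch of a FIR filter. The function name and the three-point kernel are made up for illustration, and a real DSP would do the inner loop in dedicated hardware rather than in a software loop like this.

```python
def fir_filter(samples, kernel):
    """Convolve samples with kernel, one multiply-accumulate per kernel point."""
    history = [0.0] * len(kernel)  # most recent input sample first
    output = []
    for x in samples:
        history = [x] + history[:-1]  # shift the new sample in, drop the oldest
        acc = 0.0
        for coeff, past in zip(kernel, history):
            acc += coeff * past  # the multiply-accumulate step
        output.append(acc)
    return output


# Feeding in a single impulse plays back the kernel itself,
# which is where the name "impulse response" comes from.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0]
example_kernel = [0.25, 0.5, 0.25]  # illustrative coefficients only
response = fir_filter(impulse, example_kernel)
print(response)
```

The output dies out after as many samples as there are coefficients, which is the "finite" in finite impulse response.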
There are many possible uses for this basic capability provided by DSPs. Sometimes the desire is to eliminate all content past some frequency while not altering content below that frequency, but that is only one of many useful things a wide digital convolution can do.
However, even when a DSP is used in this way, it is not done with a "window of rectangular shape". There will always be a window of some finite size (that's the basis of a FIR filter), but the shape of that window is rarely rectangular. Using DSP hardware to implement a rectangular filter is rather a waste. Since all coefficients are equal, you can implement this specific case of convolution with a circular buffer, two multiplies, and two adds per sample, regardless of how wide the buffer is. This is sometimes called a "moving average" filter, or "box" filter. For most purposes these don't have very good characteristics. They seem to be used a lot for two reasons: they are the knee-jerk reaction of those who didn't pay attention in signal processing class, and they are conceptually easy to implement.
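A rough sketch of that shortcut in Python, assuming a running total that is updated by adding the scaled newest sample and subtracting the scaled oldest one. The class name is hypothetical, and a deque stands in for the circular buffer.

```python
from collections import deque


class BoxFilter:
    """Moving average with constant work per sample, whatever the width."""

    def __init__(self, width):
        self.coeff = 1.0 / width  # every kernel coefficient is the same
        self.buffer = deque([0.0] * width, maxlen=width)  # circular buffer
        self.total = 0.0

    def update(self, x):
        oldest = self.buffer[0]
        self.buffer.append(x)  # overwrites the oldest sample
        self.total += self.coeff * x       # multiply and add the newest
        self.total -= self.coeff * oldest  # multiply and subtract the oldest
        return self.total


box = BoxFilter(4)
outputs = [box.update(x) for x in [4.0, 4.0, 4.0, 4.0, 0.0]]
print(outputs)
```

In floating point the running total can slowly accumulate rounding error over a long run, so a fixed-point or periodically recomputed total may be preferable in practice.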
The specific case of a sharp cutoff low pass filter requires the filter kernel to be a sinc function. A sinc in the time domain maps to a rectangle in the frequency domain, and vice versa.
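As an illustration only, a low pass kernel can be sketched by sampling a sinc and cutting it off at a finite width. The truncation is an assumption forced by the finite kernel (an ideal sinc is infinitely wide), so the resulting cutoff is not perfectly sharp. The function name and parameters are made up for this example.

```python
import math


def sinc_lowpass_kernel(cutoff, num_taps):
    """Truncated-sinc low pass kernel.

    cutoff is a fraction of the sample rate (0 < cutoff < 0.5);
    num_taps should be odd so the kernel has a center point.
    """
    mid = (num_taps - 1) / 2
    kernel = []
    for n in range(num_taps):
        t = n - mid
        if t == 0:
            kernel.append(2 * cutoff)  # limit of sin(2*pi*fc*t)/(pi*t) at t = 0
        else:
            kernel.append(math.sin(2 * math.pi * cutoff * t) / (math.pi * t))
    total = sum(kernel)
    return [k / total for k in kernel]  # normalize for unity gain at DC


lowpass = sinc_lowpass_kernel(cutoff=0.1, num_taps=21)
print(lowpass)
```

In practice the truncated sinc is usually multiplied by a tapering window to reduce ripple, which is a design choice beyond this sketch.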
You also seem to be confused in thinking that an FFT is somehow involved. A Fourier transform or lots of other analysis tools may be used to determine what the filter kernel should be, but once the kernel coefficients are determined it's all just a convolution at run time. If you start out knowing what you want to do to a signal in terms of a frequency domain multiplication, then it takes a Fourier transform to find the filter kernel that will realize that operation in the time domain as a convolution. However, there are many possible criteria for manipulating a signal, and not all of those may be expressed in the frequency domain. Some may come at you directly in the time domain, in which case no Fourier analysis may be needed to determine the filter kernel.
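A small design-time sketch of that idea, assuming the desired frequency response is given as a list of evenly spaced samples and using a plain inverse DFT. The function name is hypothetical, and this is only one of several ways to derive a kernel.

```python
import cmath


def kernel_from_response(response):
    """Inverse DFT of a sampled frequency response, keeping the real part."""
    n = len(response)
    kernel = []
    for k in range(n):
        acc = 0j
        for m, h in enumerate(response):
            acc += h * cmath.exp(2j * cmath.pi * m * k / n)
        kernel.append((acc / n).real)
    return kernel


# Passing every frequency unchanged gives a kernel that is a single impulse.
pass_all = kernel_from_response([1.0] * 8)
# Keeping only the zero-frequency sample gives equal coefficients (a box filter).
dc_only = kernel_from_response([1.0] + [0.0] * 7)
print(pass_all)
print(dc_only)
```

This runs once, offline. The run-time filter is still just the convolution with the resulting coefficients.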
There is a direct, and actually quite simple, relationship between all the figures.
Let's start with the sample size. The number of bins (or "buckets") is equal to half the number of samples in your set. For instance, if you have 1024 samples, then you get 512 bins. As simple as that.
Now for the sample rate. The maximum frequency is half the sample rate (see Nyquist-Shannon sampling theorem). So if you have a sample rate of 2.67ksps then your frequency range is 0-1.335kHz.
Again pretty straightforward - just another divide by two.
Now the bins are spread evenly over the frequency range - so your 512 bins, over 1335Hz, is 2.607421875 Hz per bin.
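The arithmetic so far can be checked with a few lines of Python. The function name is made up for illustration, and the figures are the ones used in the example (1024 samples at 2.67ksps).

```python
def fft_bins(num_samples, sample_rate_hz):
    """Return (number of bins, maximum frequency in Hz, Hz per bin)."""
    num_bins = num_samples // 2       # half the samples
    max_freq_hz = sample_rate_hz / 2  # half the sample rate (Nyquist)
    return num_bins, max_freq_hz, max_freq_hz / num_bins


bins, max_freq, hz_per_bin = fft_bins(1024, 2670)
print(bins, max_freq, hz_per_bin)  # 512 1335.0 2.607421875
```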
For 0.5Hz per bin up to 300Hz you want 600 bins. Your target sample rate would be 600Hz. How you down-sample to that would be up to you.
Most FFT code I have seen works on 2^n sample sizes, so 600 bins isn't a nice number. That would be 1200 samples, which is not a power of two. So you'd probably want to round it up to 2^11, or 2048 samples. That would give you 1024 bins, and you'd want a target sample rate of 1024sps for a 0-512Hz range.
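The same planning steps, going from a wanted bin width and frequency range to a power-of-two FFT size, might be sketched like this. The function name is hypothetical, and it assumes you keep the bin width fixed and let the frequency range grow when rounding up.

```python
import math


def plan_fft(bin_width_hz, max_freq_hz):
    """Return (FFT size, number of bins, sample rate in sps, frequency range in Hz)."""
    bins_needed = math.ceil(max_freq_hz / bin_width_hz)
    samples_needed = 2 * bins_needed
    # Round the sample count up to the next power of two.
    fft_size = 1 << (samples_needed - 1).bit_length()
    num_bins = fft_size // 2
    sample_rate = fft_size * bin_width_hz  # keeps the requested Hz per bin
    return fft_size, num_bins, sample_rate, sample_rate / 2


plan = plan_fft(bin_width_hz=0.5, max_freq_hz=300)
print(plan)  # (2048, 1024, 1024.0, 512.0)
```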
Since you are working with a fixed sample rate, increasing your FFT length (which requires your window to be the same width) will increase your frequency resolution. The benefit of a finer frequency resolution is twofold. The obvious one is that you might be able to distinguish two signals that are very close in frequency. The second one is that, with a higher frequency resolution, your FFT noise floor will be lower. The noise in your system has a fixed power, unrelated to the number of points of your FFT, and that power is distributed evenly (if we're talking white noise) over all your frequency components. Thus, having more frequency components means that the noise contribution in each frequency bin is lower, while the total integrated noise stays the same, which results in a lower noise floor. This will allow you to resolve a higher dynamic range.
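A back-of-the-envelope sketch of the noise floor argument, under an idealized model: perfectly white noise of fixed total power spread evenly over N/2 bins, ignoring any window effects. The function name and the power value are illustrative.

```python
import math


def noise_floor_db(total_noise_power, fft_size):
    """Per-bin noise power in dB for white noise spread evenly over the bins."""
    num_bins = fft_size // 2
    return 10 * math.log10(total_noise_power / num_bins)


floor_1k = noise_floor_db(1.0, 1024)
floor_2k = noise_floor_db(1.0, 2048)
print(floor_1k, floor_2k, floor_2k - floor_1k)
```

Under this model each doubling of the FFT length lowers the per-bin floor by about 3 dB, while the total noise power stays the same.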
However, there are drawbacks to using a longer FFT. The first is that you'll need more processing power. The FFT is an O(N log N) algorithm, where N is the number of points. While the growth is not as dramatic as with the naive DFT, increasing N will start to tax your processor, especially if you're working within the confines of an embedded system. Secondly, when you increase N, you gain frequency resolution but lose time resolution. With a bigger N, you need more samples to arrive at your frequency domain result, which means that you need to sample for a longer time. You will be able to detect a higher dynamic range and finer frequency resolution, but if you're looking for spurs, you'll have a less clear idea of exactly WHEN a spur occurred.
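The time versus frequency trade-off is easy to put in numbers. In this sketch the 1024sps sample rate is an assumed figure chosen only to make the arithmetic clean, and the function name is made up.

```python
def resolution(fft_size, sample_rate_hz):
    """Return (capture time in seconds, bin width in Hz) for one FFT frame."""
    capture_time_s = fft_size / sample_rate_hz
    bin_width_hz = sample_rate_hz / fft_size
    return capture_time_s, bin_width_hz


short_frame = resolution(1024, 1024)  # (1.0, 1.0)
long_frame = resolution(4096, 1024)   # (4.0, 0.25)
print(short_frame, long_frame)
```

Four times the points gives bins four times narrower, but each result now covers four seconds of signal instead of one, so an event can only be placed somewhere within that longer frame.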
The type of window you should use is a whole other subject, and I'm not informed enough to tell you WHICH one is better. However, different windows have different output characteristics, most (if not all) of which can be compensated for when post-processing the FFT result. Some windows make your frequency components bleed into adjacent bins (if I'm not mistaken, the Hann window makes a component appear in three bins), while others give you better frequency accuracy but introduce some gain error in your components. This depends entirely on the nature of the result you're trying to achieve, so I'd do some research (or some simulations) to find out which one is best for your specific application.
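A quick simulation along those lines, checking the three-bin behavior of the Hann window. It assumes the periodic form of the window and a tone that falls exactly on a bin center, and it uses a slow naive DFT to stay free of dependencies. The sizes N and K0 are arbitrary.

```python
import cmath
import math


def dft_magnitudes(x):
    """Magnitudes of the first half of a naive DFT (fine for small inputs)."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n // 2)
    ]


N, K0 = 64, 10  # frame length and the bin the test tone sits on
hann = [0.5 - 0.5 * math.cos(2 * math.pi * t / N) for t in range(N)]  # periodic Hann
tone = [math.cos(2 * math.pi * K0 * t / N) for t in range(N)]
mags = dft_magnitudes([w * s for w, s in zip(hann, tone)])

occupied = [k for k, m in enumerate(mags) if m > 1e-6]
print(occupied)  # [9, 10, 11]
```

A tone that falls between bin centers, or a different window definition, will spread differently, so it is worth repeating this kind of test with signals that resemble your real ones.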