I've done this a few times myself.
Generally, the design tools will choose between a fabric implementation and a DSP slice based on the synthesis settings.
For instance, for Xilinx ISE, in the synthesis process settings, HDL Options, there is a setting "-use_dsp48" with the options:
Auto, AutoMax, Yes, No.
As you can imagine, this controls how hard the tools try to use DSP slices. I once had a problem where I multiplied an integer by 3, which inferred a DSP slice - except I was already manually instantiating every DSP slice in the chip, so synthesis failed! I changed the setting to No so the tools would stop inferring extra slices.
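For a constant as small as 3, a fabric implementation is cheap, because the product is just one shift and one add. Here is a minimal Python sketch of that arithmetic. The function name is made up for illustration, and this models the math only, not what any particular synthesis tool emits:

```python
def times3_shift_add(x: int) -> int:
    """Multiply by 3 without a multiplier: 3*x = (x << 1) + x."""
    return (x << 1) + x


# Check the identity against the ordinary * operator, including a negative value.
for x in (-7, 0, 5, 1000):
    assert times3_shift_add(x) == 3 * x
```

In hardware this is a single adder whose second input is the same bus shifted left by one bit, which is why it fits comfortably in LUTs and carry logic.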
This is probably a good rule of thumb (which I just made up):
If your design is clocked at less than 50 MHz, and you're probably going to use less than 50% of the DSP slices in the chip, then just use the *, +, and - operators. This will infer DSP slices with no pipeline registers, which really limits the top speed. (I have no idea what happens when you use division.)
However, if it looks like you're going to run the slices closer to the max speed of the DSP slice (333 MHz for Spartan 6 normal speed grade), or you're going to use all of the slices, you should instantiate them manually.
In this case, you have two options.
Option 1: manually use the raw DSP instantiation template.
Option 2: use an IP block from Xilinx Core Generator. (I would use this option. Along the way you will learn all about Core Generator, which will help in the future.)
Before you do either of these, read the first couple of pages of the DSP slice user guide. For the Spartan 6 (DSP48A1), that is Xilinx doc UG389:
http://www.xilinx.com/support/documentation/user_guides/ug389.pdf
Consider the Core Generator option first. I usually create a testing project in Core Generator for the part I'm working with, where I create any number of IP blocks just to learn the system. Then, when I'm ready to add one to my design in ISE, I right-click in the Design Hierarchy, click New Source, and select "IP (CORE Generator & Architecture Wizard)" so that I can edit and regenerate the block directly from my project.
In Core Generator, take a look at the different IP blocks you can choose from - there are a few dozen, most of which are pretty cool.
The Multiplier Core is what you should look at first. Check out every page, and click the datasheet button. The important parts are the integer bit widths, the pipeline stages (latency) and any control signals. Setting these to exactly what you need produces the simplest possible block, because the core leaves out all the ports you don't use.
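To get a feel for the two settings that matter most, bit widths and pipeline latency, here is a small Python behavioural model. It is only a sketch: the class and function names are made up, and it is not generated from or checked against the actual core. It relies on the fact that a signed N-bit by M-bit product always fits in N+M bits (the DSP48A1 multiplier is 18x18, giving a 36-bit product), and it models each pipeline stage as one register that delays the result by one clock.

```python
from collections import deque


def signed_range(bits: int) -> tuple[int, int]:
    """Smallest and largest two's complement values for a given width."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


def product_width(a_bits: int, b_bits: int) -> int:
    """Width that always holds a signed a_bits x b_bits product."""
    return a_bits + b_bits


class PipelinedMultiplier:
    """Hypothetical model: the product appears `latency` clocks after its inputs."""

    def __init__(self, latency: int) -> None:
        self.pipe = deque([0] * latency)  # pipeline registers, reset to 0

    def clock(self, a: int, b: int) -> int:
        self.pipe.append(a * b)      # new product enters the pipeline
        return self.pipe.popleft()   # oldest product leaves it


# Worst-case 18x18 product (most negative times most negative) fits in 36 bits.
lo18, hi18 = signed_range(18)
lo36, hi36 = signed_range(product_width(18, 18))
assert lo36 <= lo18 * lo18 <= hi36

# With 3 pipeline stages, the first three outputs are the reset value.
mult = PipelinedMultiplier(latency=3)
outputs = [mult.clock(a, 2) for a in (1, 2, 3, 4, 5, 6)]
assert outputs == [0, 0, 0, 2, 4, 6]
```

The point of the latency check is that everything downstream of the multiplier has to be delayed by the same number of clocks, which is easy to forget when you change the pipeline setting in the GUI.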
When I was building a 5 by 3 order IIR filter last year, I had to use the manual instantiation template since I was building a very custom implementation, with 2 DSP slices clocked 4x faster than the sample rate. It was a total pain.
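The general idea behind clocking a slice faster than the sample rate is time-multiplexing: one multiplier does several multiply-accumulate steps per sample. The Python sketch below shows only that idea, assuming one multiplier shared across 4 fast clock cycles. It is not the filter described above, and the function and argument names are hypothetical.

```python
def shared_mac_sample(coeffs: list[int], taps: list[int]) -> int:
    """One sample period: a single multiplier reused on 4 fast clock cycles."""
    assert len(coeffs) == len(taps) == 4
    acc = 0  # accumulator, cleared at the start of each sample period
    for cycle in range(4):
        # Each fast clock cycle performs exactly one multiply and one add.
        acc += coeffs[cycle] * taps[cycle]
    return acc


# 1*10 + 2*20 + 3*30 + 4*40
assert shared_mac_sample([1, 2, 3, 4], [10, 20, 30, 40]) == 300
```

The pain in the real design comes from what this model hides: the multiplexers and control logic that feed the right operands to the slice on each fast cycle.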
I generally take a top-down design approach, and I start by drawing a block diagram that shows the interfaces among the top-level blocks. I then draw additional diagrams that represent the implementations of the top-level blocks in terms of lower-level blocks.
This hierarchy of block diagrams translates pretty much directly to the hierarchy of the HDL modules. Once I get to a low enough level of detail on the block diagrams, I start coding and stop drawing diagrams.
The block diagrams also function as dataflow diagrams, since they show at every stage how the data flows from one module to another.
When it comes to specific interfaces between modules, I also draw timing diagrams that show the details of the interface protocol. I also use timing diagrams to keep track of the flow of data through the pipeline stages inside a module. In both cases, these diagrams serve as a reference when looking at waveforms in the simulator during verification.
Best Answer
Why do you want to use an FPGA? The requirements you've mentioned above don't hint at anything with enough processing or parallelism demand to suit an FPGA. In fact, Bluetooth is a massive endeavour in its own right, even in software. I wouldn't fancy trying to build a Bluetooth stack in VHDL, even as an educational exercise!
If you want a cheaper Arduino, redesign the PCB to use a cheaper, but compatible, device, leave off the bits you don't need, integrate any new bits you'd like. Even that is a lot of work (especially if you want to create a fully-validated product which you can CE-mark, get FCC approval for, etc.).
And the next step might be to use an even cheaper micro, with all the extra software effort that that entails.
But I wouldn't consider an FPGA for this job.