I've done this a few times myself.
Generally, the design tools will choose between a fabric implementation and a DSP slice based on the synthesis settings.
For instance, for Xilinx ISE, in the synthesis process settings, HDL Options, there is a setting "-use_dsp48" with the options:
Auto, AutoMax, Yes, No.
As you can imagine, this controls how hard the tools try to infer DSP slices. I once had a problem where I multiplied an integer by 3, which inferred a DSP slice - except I was already manually instantiating every DSP slice in the chip, so synthesis failed! I changed the setting to No to stop the tools from inferring any more.
This is probably a good rule of thumb (I just made it up):
If your design is clocked at less than 50 MHz, and you're probably going to use less than 50% of the DSP slices in the chip, then just use the *, +, and - operators. This will infer DSP slices with no pipeline registers, which really limits the top speed. (I have no idea what happens when you use division.)
However, if it looks like you're going to run the slices closer to the max speed of the DSP slice (333 MHz for Spartan 6 normal speed grade), or you're going to use all of the slices, you should instantiate them manually.
In this case, you have two options.
Option 1: manually use the raw DSP instantiation template.
Option 2: use an IP block from Xilinx Core Generator. (I would use this option. At the same time, you will learn all about Core Generator, which will help in the future.)
Before you do either of these, read the first couple of pages of the DSP slice user guide. In the case of the Spartan 6 (DSP48A1), that would be Xilinx doc UG389:
http://www.xilinx.com/support/documentation/user_guides/ug389.pdf
Consider the Core Generator option first. I usually create a testing project in Core Generator for the part I'm working with, where I create any number of IP blocks just to learn the system. Then, when I'm ready to add one to my design in ISE, I right click in the Design Hierarchy, click new source, and select "IP (CORE Generator & Architecture Wizard)" so that I can edit and regenerate the block directly from my project.
In Core Generator, take a look at the different IP blocks you can choose from - there are a few dozen, most of which are pretty cool.
The Multiplier core is what you should look at first. Check out every page of its configuration dialog, and click the datasheet button. The important parts are the integer bit widths, the pipeline stages (latency) and any control signals. Configuring only what you need produces the simplest possible block, because the ports you don't need are left out.
When I was building a 5 by 3 order IIR filter last year, I had to use the manual instantiation template since I was building a very custom implementation, with 2 DSP slices clocked 4x faster than the sample rate. It was a total pain.
Best Answer
Here's a flexible pattern I've used a lot for this and similar purposes.
I prefer to use the actual clock period and delay values to generate the count values, rather than calculating magic numbers.
It will generate a counter - but a 16 bit counter is invisibly small in any FPGA you're likely to find today. Beyond about 24 bits it may start to impact speed; at that point you can break it into two smaller counters, using the first as a prescaler that generates a clock enable for the second.
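To make the arithmetic concrete, here is a rough Python sketch (a calculation aid, not HDL and not the pattern referred to in this answer) of deriving count values and counter widths from a clock period and a delay. The 20 ns (50 MHz) clock, the 1 ms and 1 s delays, and all function names are assumptions chosen for illustration.

```python
def cycles_for_delay(delay_ns: int, clock_period_ns: int) -> int:
    """Clock cycles needed to cover at least delay_ns (rounds up)."""
    return -(-delay_ns // clock_period_ns)  # integer ceiling division


def counter_width(cycles: int) -> int:
    """Bits needed for a counter that counts down from cycles - 1 to 0."""
    return max(1, (cycles - 1).bit_length())


def split_with_prescaler(total_cycles: int, prescale_cycles: int) -> tuple[int, int]:
    """Split one long count into (prescaler count, second-stage count).

    The prescaler produces one clock enable every prescale_cycles clocks;
    the second stage counts those enables. The result rounds up.
    """
    return prescale_cycles, -(-total_cycles // prescale_cycles)


CLOCK_PERIOD_NS = 20  # assumed 50 MHz system clock

# A 1 ms delay fits in a 16 bit counter.
MS_CYCLES = cycles_for_delay(1_000_000, CLOCK_PERIOD_NS)
MS_WIDTH = counter_width(MS_CYCLES)

# A 1 s delay needs a wider counter if done in one stage.
SEC_CYCLES = cycles_for_delay(1_000_000_000, CLOCK_PERIOD_NS)
SEC_WIDTH = counter_width(SEC_CYCLES)

# Splitting it: a 1 ms prescaler feeding a second, much smaller counter.
PRESCALE, SECOND_STAGE = split_with_prescaler(SEC_CYCLES, MS_CYCLES)
SECOND_WIDTH = counter_width(SECOND_STAGE)
```

With these assumed numbers, the single-stage 1 s counter is wider than the roughly 24 bits mentioned above, while the prescaled version is two short counters. In HDL the same calculation would normally be done with constants evaluated at elaboration time, so no magic numbers appear in the source.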
And the pattern shown re-uses the same counter, no matter how many different delay values you need - unless you need more than one delay simultaneously.
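As a behavioural illustration of that re-use, here is a small Python model of one down-counter serving several delays in turn. It is only a sketch under assumptions: the class and method names are hypothetical, the 20 ns clock period is made up, and it is not a translation of any particular HDL code.

```python
class SharedDelayCounter:
    """Software model of a single down-counter reused for many delays."""

    def __init__(self, clock_period_ns: int) -> None:
        self.clock_period_ns = clock_period_ns
        self.count = 0

    def start(self, delay_ns: int) -> None:
        # Load the number of clock cycles covering the delay (rounded up).
        self.count = -(-delay_ns // self.clock_period_ns)

    @property
    def done(self) -> bool:
        return self.count == 0

    def tick(self) -> None:
        # One rising clock edge: count down until zero, then hold.
        if self.count > 0:
            self.count -= 1


def ticks_until_done(counter: SharedDelayCounter, delay_ns: int) -> int:
    """Start a delay and return how many clock ticks it takes to expire."""
    counter.start(delay_ns)
    ticks = 0
    while not counter.done:
        counter.tick()
        ticks += 1
    return ticks


counter = SharedDelayCounter(clock_period_ns=20)  # assumed 50 MHz clock
short_wait = ticks_until_done(counter, 100)    # first delay
long_wait = ticks_until_done(counter, 1_000)   # same counter, second delay
```

Only one delay can be running at a time, because loading a new value overwrites the old one. That is the limitation noted above: two delays that overlap in time need two counters.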