Typically ASIC design is a team endeavor due to the complexity and quantity of work. I'll give a rough order of steps, though some steps can be completed in parallel or out of order. I will list tools that I have used for each task, but it will not be encyclopedic.
Build a cell library. (Alternatively, most processes have gate libraries that are commercially available. I would recommend this unless you know you need something that is not available.) This involves designing multiple drive strength gates for as many logic functions as needed, designing pad drivers/receivers, and any macros such as an array multiplier or memory. Once the schematic for each cell is designed and verified, the physical layout must be designed. I have used Cadence Virtuoso for this process, along with analog circuit simulators such as Spectre and HSPICE.
Characterize the cell library. (If you have a third party gate library, this is usually done for you.) Each cell in your library must be simulated to generate timing tables for Static Timing Analysis (STA). This involves taking the finished cell, extracting the layout parasitics using Assura, Diva, or Calibre, and simulating the circuit under varying input conditions and output loads. This builds a timing model for each gate that is compatible with your STA package. The timing models are usually in the Liberty file format. I have used SiliconSmart and Liberty-NCX to simulate all needed conditions. Keep in mind that you will probably need timing models at the "worst case", "nominal", and "best case" corners for most software to work properly.
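To make the idea of a timing table concrete, here is a minimal sketch of how a delay could be looked up from a table indexed by input slew and output load, using bilinear interpolation between the characterized points. The function name, the axis values, and the delays are made up for illustration; they do not come from any real library, and a real STA tool handles many more details.

```python
from bisect import bisect_right

def interpolate_delay(slews, loads, table, slew, load):
    """Bilinear interpolation in a 2-D delay table.

    table[i][j] is the delay at input slew slews[i] and output load loads[j].
    Both axes must be sorted ascending and have at least two points.
    """
    def bracket(axis, x):
        # Index of the lower grid point of the segment used for x.
        # Clamping means points outside the table are extrapolated
        # from the nearest edge segment.
        i = bisect_right(axis, x) - 1
        i = max(0, min(i, len(axis) - 2))
        t = (x - axis[i]) / (axis[i + 1] - axis[i])
        return i, t

    i, ts = bracket(slews, slew)
    j, tl = bracket(loads, load)
    d00, d01 = table[i][j], table[i][j + 1]
    d10, d11 = table[i + 1][j], table[i + 1][j + 1]
    return (d00 * (1 - ts) * (1 - tl) + d01 * (1 - ts) * tl
            + d10 * ts * (1 - tl) + d11 * ts * tl)

# Illustrative (made-up) table for one timing arc of one cell.
SLEWS_NS = [0.1, 0.5]        # input transition times
LOADS_PF = [0.01, 0.05]      # output capacitive loads
DELAY_NS = [[0.10, 0.30],    # delays at slew 0.1 ns
            [0.20, 0.40]]    # delays at slew 0.5 ns

mid_delay = interpolate_delay(SLEWS_NS, LOADS_PF, DELAY_NS, 0.3, 0.03)
print(mid_delay)
```

You would have one such set of tables per corner, which is why the characterization runs have to be repeated for each of them.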
Synthesize your design. I don't have experience with high level compilers, but at the end of the day the compiler or compiler chain must take your high level design and generate a gate-level netlist. The synthesis result is the first peek you get at theoretical system performance, and where drive strength issues are first addressed. I have used Design Compiler for RTL code.
Place and Route your design. This takes the gate-level netlist from the synthesizer and turns it into a physical design. Ideally this generates a pad-to-pad layout that is ready for fabrication. It is really easy to set up your P&R software so that it automatically makes thousands of DRC errors, so this step is not all fun and games either. Most software will manage drive strength issues and generate clock trees as directed. Some software packages include Astro, IC Compiler, SoC Encounter, and Silicon Ensemble. The end result from place and route is the final netlist, the final layout, and the extracted layout parasitics.
Post-Layout Static Timing Analysis. The goal here is to verify that your design meets your timing specification, and doesn't have any setup, hold, or gating issues. If your design requirements are tight, you may end up spending a lot of time here fixing errors and updating the fixes in your P&R tool. The final STA tool we used was PrimeTime.
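As a rough illustration of what the setup and hold checks compute, here is a simplified sketch for a single flop-to-flop path on one clock. The class, field names, and numbers are hypothetical, and this ignores things a real STA tool accounts for (clock uncertainty, derating, multi-cycle paths, and so on).

```python
from dataclasses import dataclass

@dataclass
class FlopPath:
    launch_clk: float   # clock arrival at the launching flop (ns)
    capture_clk: float  # clock arrival at the capturing flop (ns)
    clk_to_q: float     # launching flop clock-to-output delay (ns)
    data_max: float     # slowest combinational delay on the path (ns)
    data_min: float     # fastest combinational delay on the path (ns)
    setup: float        # capturing flop setup time (ns)
    hold: float         # capturing flop hold time (ns)

def setup_slack(p: FlopPath, period: float) -> float:
    # Data must arrive before the NEXT capture edge minus the setup time.
    required = period + p.capture_clk - p.setup
    arrival = p.launch_clk + p.clk_to_q + p.data_max
    return required - arrival

def hold_slack(p: FlopPath) -> float:
    # Data must stay stable until the hold time after the SAME capture edge.
    arrival = p.launch_clk + p.clk_to_q + p.data_min
    required = p.capture_clk + p.hold
    return arrival - required

# Made-up numbers: this path meets setup at a 4 ns period but fails hold.
path = FlopPath(launch_clk=0.2, capture_clk=0.3, clk_to_q=0.1,
                data_max=3.5, data_min=0.05, setup=0.15, hold=0.1)
for name, slack in (("setup", setup_slack(path, period=4.0)),
                    ("hold", hold_slack(path))):
    print(name, "MET" if slack >= 0 else "VIOLATED", round(slack, 3))
```

Negative slack is a violation. Setup is normally checked with the slowest (worst case) delays and hold with the fastest (best case) delays, which is one reason the timing models are needed at more than one corner.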
Physical verification of the Layout. Once a layout has been generated by the P&R tool, you need to verify that the design meets the process design rules (Design Rule Check / DRC) and that the layout matches the schematic (Layout versus Schematic / LVS). These steps should be followed to ensure that the layout is wired correctly and is manufacturable. Again, some physical verification tools are Assura, Diva, or Calibre.
Simulation of the final design. Depending on complexity, you may be able to do a transistor-level simulation using Spectre or HSPICE, a "fast spice" simulation using HSIM, or a completely digital simulation using ModelSim or VCS. You should be able to generate a simulation with realistic delays with the help of your STA or P&R tool.
Starting with an existing gate library is a huge time saver, as is using any macros that benefit your design, such as memory, a microcontroller, or alternative processing blocks. Managing design complexity is a big part as well: a single-clock design will be easier to verify than a circuit with multiple clock domains.
FPGA manufacturers don't use equivalent gate counts much any more, even in the hand-waviest marketing materials. Like lines of code or megahertz of processor speed, it's a highly inaccurate metric for measuring device capability, and in the FPGA market the customers wised up enough to suppress its use.
To estimate the size of device you need, you'll need to look at the summary on p. 2 of the datasheet you linked. Usually you can get a decent idea early on in your design process of how many flip-flops, how many I/Os and how much RAM your design needs. One or the other of those will typically be the critical resource that determines the size of part you need.
If you aren't tightly cost-constrained, use a device at least twice as large as you think you need. It will give you room for feature creep in your design and also speeds up development because the design tools won't need to work so hard to fit your design into the available resources.
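The sizing rule above can be sketched in a few lines. The device names and capacities below are placeholders, not figures from any datasheet; substitute the numbers from the summary table of the family you are looking at.

```python
def pick_device(needs, devices, margin=2.0):
    """Return (device_name, critical_resource) for the first device whose
    capacity covers needs * margin for every resource, or None if none fits.

    devices must be ordered from smallest to largest.
    """
    for name, caps in devices:
        if all(caps[r] >= needs[r] * margin for r in needs):
            # The critical resource is the one closest to its limit.
            critical = max(needs, key=lambda r: needs[r] * margin / caps[r])
            return name, critical
    return None

# Hypothetical device family (placeholder capacities).
DEVICES = [
    ("small",  {"flip_flops": 10_000, "ios": 100, "ram_kbits": 200}),
    ("medium", {"flip_flops": 30_000, "ios": 200, "ram_kbits": 900}),
    ("large",  {"flip_flops": 60_000, "ios": 350, "ram_kbits": 2_000}),
]

# Early estimate of what the design needs (also made up).
NEEDS = {"flip_flops": 4_000, "ios": 90, "ram_kbits": 100}

choice = pick_device(NEEDS, DEVICES)
print(choice)  # ('medium', 'ios')
```

Here the I/O count, not the logic, ends up deciding the part. Applying the same 2x margin to I/Os is a simplification; pin counts are usually known more precisely than logic usage, so you may want a smaller margin for them.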
Edit, pulling in things from comments:
You mentioned that your design is mostly unclocked.
The issue with this is that FPGA design tools depend on clocking and the resulting timing constraints to drive optimization of the synthesized design. If you want to do unclocked design in an FPGA it's possible in principle, but you're not going to get much help from the tools (or vendors) and you'll probably need to find a specialized community who do that kind of thing to get any support.
In any case, you can look at the Spartan 6 Configurable Logic Block User's Guide to see what resources are available in each block. Then mentally map your design to those resources to see how many blocks you need. That should be enough to let you pick the right size device.
For example, you can see in that document that the LX45 part contains about 27,000 6-input LUTs. Each LUT can be used to implement an arbitrary combinational logic function of up to 6 inputs. If you can express your logic in terms of this primitive, you can estimate whether your design fits into the device.
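As a back-of-the-envelope sketch of that estimate: a wide AND, OR, or XOR of n signals can be built as a tree of 6-input LUTs, and since each LUT turns up to 6 signals into 1, the tree needs ceil((n - 1) / 5) LUTs. The block list below is invented for illustration, and the 27,000 budget is the approximate figure quoted above. Arbitrary functions of more than 6 inputs can need more LUTs than this, and the sketch ignores carry chains and other dedicated resources.

```python
import math

def luts_for_reduction(n_inputs, k=6):
    """k-input LUTs needed for a tree that reduces n_inputs signals with an
    associative operator (AND, OR, XOR). Each LUT removes up to k - 1 signals."""
    if n_inputs <= 1:
        return 0  # a single signal is just a wire
    return math.ceil((n_inputs - 1) / (k - 1))

def estimate_fit(blocks, lut_budget):
    """blocks is a list of (input_width, instance_count) pairs."""
    total = sum(count * luts_for_reduction(width) for width, count in blocks)
    return total, total <= lut_budget

# Hypothetical design: (input width of each reduction, number of instances).
BLOCKS = [(6, 1000), (16, 500), (32, 100)]

total_luts, fits = estimate_fit(BLOCKS, lut_budget=27_000)
print(total_luts, fits)  # 3200 True
```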
There really isn't one answer, as there are so many different ways to instantiate a synthesizable design.
One way would be to use your synthesizable RTL and resynthesize it with another tool and target a different library.
For hand counting, you just need to look at your Mults, FFs and LUTs. The slices are organizational (hierarchical) blocks and the taps are routing resources; neither should affect the bulk of your gate count.
Mults and FFs are easy enough to account for; the main issue is going to be your LUTs, as there is often far more capability embedded in a LUT than is used, or the LUTs are cascaded in a particular fashion to avoid a speed penalty. It will be more accurate to look at the logic cloud in your RTL and map it to equivalent 2-input gates.
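A minimal sketch of that hand count, under stated assumptions: the logic cloud is written as a nested expression, an n-input AND/OR/XOR is counted as n - 1 two-input gates, and inverters are not counted. The gates-per-flip-flop figure is a parameter you have to supply for your target library; the value used in the example is only a placeholder, and multipliers are left out entirely.

```python
def two_input_gates(node):
    """Count equivalent 2-input gates in a logic expression.

    node is either a signal name (str) or a tuple (op, operand, ...),
    where op is "and", "or", "xor", or "not".
    """
    if isinstance(node, str):
        return 0  # a bare signal costs nothing
    op, *args = node
    inner = sum(two_input_gates(a) for a in args)
    if op == "not":
        return inner  # assumption: inverters are not counted in this sketch
    return inner + len(args) - 1  # n-input gate -> n - 1 two-input gates

def estimate_gates(logic_clouds, n_ffs, gates_per_ff):
    """Total gate estimate: combinational logic plus flip-flops."""
    logic = sum(two_input_gates(c) for c in logic_clouds)
    return logic + n_ffs * gates_per_ff

# Hypothetical logic cloud: (a & b & c) | (d & ~e) | f
CLOUD = ("or", ("and", "a", "b", "c"), ("and", "d", ("not", "e")), "f")

cloud_gates = two_input_gates(CLOUD)
total_gates = estimate_gates([CLOUD], n_ffs=2, gates_per_ff=6)  # 6 is a placeholder
print(cloud_gates, total_gates)  # 5 17
```

Counting from the RTL expressions this way avoids the problem of a mostly empty LUT being counted as if it were full.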