In principle this is good candidate for FPGA based design. Regarding your requirements:
ad 1. The FPGA most likely will be more expensive, by how much that depends on the device you choose. At first glance smallest Spartan 3 from Xilinx (XC3S50AN) will be more then enough for this task (~10£ from Farnell). I think you can assume this is upper boundary for the cost (it has 56kB RAM inside, so it is more then you need). You may find cheaper device either from Xilinx offering or their competitors Altera and Lattice.
ad 2. The package is the tough issue, I did not saw FPGA with smaller footprint either. However maybe you can use CPLD device (for sake of argument the CPLDs are small FPGAs) which may be in smaller package (PLCC or QFN). On plus side they will be cheaper (even single $) on negative side most likely will not have RAM inside. With CPLD probably you would need external SRAM chip.
ad 3. FPGAs and CPLD current consumption is highly dependent on the programmed design. However there is good chance that FPGA and especially CPLD design would consume less than your current solution.
ad 4. FPGA do have that kind of memory inside, CPLD most certainly not. This may be solved by external sram chip (or two). For example:
|SRAM 1| <--> |CPLD| <--> |uC|
|SRAM 2| <-->
In such arrangement while the uC is writing to SRAM 1, the CPLD will be displaying data from SRAM 2. The CPLD should be able to handle both task simultaneously.
Of course you can solve this in other ways too:
1) use faster uController (ARM for example)
2) use device with some programmable fabric and uC inside (for example FPSLIC from Atmel, however I have never used such devices and I know very little about those)
Standard disclaimer -> as designs are open problems, with many constrains and possible solutions whatever I wrote above may not be true for your case. I believe it is worth checking those option, though.
Best Answer
CPU's are sequential processing devices. They break an algorithm up into a sequence of operations and execute them one at a time.
FPGA's are (or, can be configured as) parallel processing devices. An entire algorithm might be executed in a single tick of the clock, or, worst case, far fewer clock ticks than it takes a sequential processor. One of the costs to the increased logic complexity is typically a lower limit at which the device can be clocked.
Bearing the above in mind, FPGA's can outperform CPU's doing certain tasks because they can do the same task in less clock ticks, albeit at a lower overall clock rate. The gains that can be achieved are highly dependent on the algorithm, but at least an order of magnitude is not atypical for something like an FFT.
Further, because you can build multiple parallel execution units into an FPGA, if you have a large volume of data that you want to pass through the same algorithm, you can distribute the data across the parallel execution units and obtain further orders of magnitude higher throughput than can be achieved with even a multi-core CPU.
The price you pay for the advantages is power consumption and $$$'s.