A couple of approaches that may be useful for some styles of display are to divide the display panel into tiles, and
- restrict each tile to using a small set of colors, allowing the use of fewer than 8 bits per pixel, or
- use a byte or two from each tile to select a location from which to read bitmap data.
The first approach could reduce the rate at which data has to be read from display memory. For example, if one used 16x16 tiles that could each have four colors chosen from a set of 256, then without using any extra RAM in the FPGA one could reduce the number of memory reads per 16 pixels to eight (four color values, plus four bytes for the two-bit-per-pixel bitmap). If one added 160 bytes' worth of buffering/RAM(*) to the FPGA, one could reduce the number of memory reads per 16 pixels to four, using an extra 160 reads every 16 scan lines to read the next set of tile colors. If one wanted 16 colors per tile, the buffered variant would require 640 bytes of RAM rather than 160, unless one placed some restrictions on the number of different palettes that could exist on a line.
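The arithmetic above can be checked with a short sketch. It assumes a 640-pixel-wide display (not stated here, but consistent with the 160-byte figure) and one byte per palette entry; the function names are illustrative only.

```python
DISPLAY_WIDTH = 640  # assumption: 40 tiles of 16 pixels per line


def reads_per_tile_line(tile_width, colors_per_tile, palette_buffered):
    """Memory reads needed to show one scan line of one tile."""
    bits_per_pixel = (colors_per_tile - 1).bit_length()
    bitmap_bytes = tile_width * bits_per_pixel // 8
    # Without on-chip buffering, the palette is re-read along with the bitmap.
    palette_bytes = 0 if palette_buffered else colors_per_tile
    return bitmap_bytes + palette_bytes


def palette_buffer_bytes(display_width, tile_width, colors_per_tile):
    """On-chip RAM needed to hold the palettes for one row of tiles."""
    return (display_width // tile_width) * colors_per_tile


print(reads_per_tile_line(16, 4, palette_buffered=False))  # 8
print(reads_per_tile_line(16, 4, palette_buffered=True))   # 4
print(palette_buffer_bytes(DISPLAY_WIDTH, 16, 4))          # 160
print(palette_buffer_bytes(DISPLAY_WIDTH, 16, 16))         # 640
```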
The second approach would probably increase rather than reduce the total memory bandwidth required to produce a display, but would reduce the amount of memory that has to be updated to change the display: one could change a byte or two to update an 8x8 or 16x16 area of the screen. Depending upon what you're trying to display, it may be helpful with this approach to use one memory device to hold the tile shapes and another to hold the tile selection. One might, for example, use a fast 32Kx8 RAM to hold a couple of 80x60 tile maps with two bytes per tile. If the FPGA didn't have any buffering, it would have to read one byte every four pixels; even with a 40ns static RAM, that would leave plenty of time for the CPU to update the display (an entire screen would only be 9600 bytes). The memory bandwidth for reading out the tile shapes would be no better than it is now, but that part of memory wouldn't have to be updated.
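As a rough software model of the tile-map addressing described above, the sketch below assumes 8x8 tiles, a linear row-major map layout, 8-bit-per-pixel tile shapes, and a 25.175 MHz pixel clock (the usual 640x480 rate). None of those details are fixed by the text, and all names are hypothetical.

```python
TILE_W = TILE_H = 8           # assumption: 8x8 tiles
MAP_COLS, MAP_ROWS = 80, 60   # 80x60 tile map
BYTES_PER_TILE = 2            # two-byte tile selector
PIXEL_CLOCK_HZ = 25_175_000   # assumption: standard 640x480 pixel clock


def tile_map_address(x, y, map_base=0):
    """Byte address in the tile-map RAM of the entry covering pixel (x, y)."""
    col, row = x // TILE_W, y // TILE_H
    return map_base + (row * MAP_COLS + col) * BYTES_PER_TILE


def shape_address(tile_index, y, bytes_per_tile_line=8):
    """Byte address in shape memory of the tile's scan line (8 bpp assumed)."""
    return (tile_index * TILE_H + y % TILE_H) * bytes_per_tile_line


map_bytes = MAP_COLS * MAP_ROWS * BYTES_PER_TILE   # 9600 bytes per screen
pixels_per_map_byte = TILE_W // BYTES_PER_TILE     # one map read every 4 pixels
ns_per_map_byte = pixels_per_map_byte * 1e9 / PIXEL_CLOCK_HZ

print(map_bytes, pixels_per_map_byte, round(ns_per_map_byte))
```

Under these assumptions the display needs one tile-map byte roughly every 159 ns, so a 40ns RAM would be idle for most of each interval and available to the CPU.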
Incidentally, if one didn't want to add a 32Kx8 RAM but could add 320 bytes of buffering/RAM(**) to the FPGA, one could use a tile-map approach but have the CPU or DMA feed 160 bytes to the display every 8 scan lines. That would burden the controller somewhat even when nothing on the display was changing, but could simplify the circuitry.
(*) The buffer could be implemented as RAM, or as a sequence of 32 40-bit-long shift registers plus a little control logic.
(**) The buffer could be implemented as two 160-byte RAMs, or as two groups of sixteen 80-bit shift registers.
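The 320-byte scheme amounts to a ping-pong buffer: the display reads one 160-byte bank while the CPU or DMA fills the other. A minimal software model, assuming the two-RAM variant and with hypothetical class and method names, might look like this:

```python
ROW_BYTES = 160  # 80 tiles x 2 bytes per tile


class PingPongTileRow:
    """Two 160-byte banks: one being displayed, one being refilled."""

    def __init__(self):
        self.banks = [bytearray(ROW_BYTES), bytearray(ROW_BYTES)]
        self.display_bank = 0

    def fill_next(self, data):
        """CPU/DMA side: load the tile row that will be shown next."""
        if len(data) != ROW_BYTES:
            raise ValueError("expected exactly one tile row")
        self.banks[1 - self.display_bank][:] = data

    def swap(self):
        """Display side: call once per tile row, i.e. every 8 scan lines."""
        self.display_bank = 1 - self.display_bank

    def read(self, tile_col):
        """Display side: fetch the two-byte entry for one tile column."""
        bank = self.banks[self.display_bank]
        return bytes(bank[2 * tile_col:2 * tile_col + 2])


buf = PingPongTileRow()
buf.fill_next(bytes(range(ROW_BYTES)))  # CPU writes the upcoming row
buf.swap()                              # row boundary reached
print(buf.read(1))                      # entry for tile column 1
```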
No fundamental reason why not. Synchronous SRAM is truly random access, fairly inexpensive, and easy to interface to.
Its downside is that it occupies a fairly narrow niche between the on-chip BlockRam (not much smaller, free until it forces you to select a larger chip, massively parallel and more flexible) and external DRAM (massive storage capacity at a price SSRAM can't match).
So up to 0.5 or 1MB, external SSRAM is unnecessary, and above 8MB or 16MB (numbers may vary according to your budget and current prices!), SSRAM becomes expensive enough that DRAM takes over despite its limits.
Then - if you need random access - you have to massively reorganise the computation to read chunks (bursts or pages) from DRAM into BlockRam where you can process it fast before writing back bursts etc....
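The reorganisation described here can be pictured as the loop below: copy a burst into a small local buffer, do the random-access work there, then write the burst back. This is only a software analogy with hypothetical names; `work` is assumed to return a block of the same length it was given.

```python
def process_in_bursts(dram, burst_len, work):
    """Apply `work` to `dram` one burst at a time, in place."""
    for start in range(0, len(dram), burst_len):
        # Burst read: sequential access, which is what DRAM is good at.
        block = list(dram[start:start + burst_len])
        # Random access happens only on the small local copy ("BlockRam").
        block = work(block)
        # Burst write-back of the processed chunk.
        dram[start:start + len(block)] = block


memory = list(range(8))
process_in_bursts(memory, 4, lambda block: block[::-1])
print(memory)  # [3, 2, 1, 0, 7, 6, 5, 4]
```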
But if you have a role for SSRAM within that window, go for it. I have added simple home-made SSRAM boards to augment commercial FPGA platforms where necessary.
Best Answer
Here's how I'd do it...
I would start with a Xilinx Spartan-6 FPGA. I'd go with this family because it has "hard cores" for a DDR-SDRAM interface. By hard core, I mean that the circuitry for this memory interface is a dedicated chunk of logic and not in the normal "user programmed fabric of logic". This means that you're going to meet timing, and you don't have to write this logic on your own.
Next, I'd hook up some DDR2 SDRAM to the part. DDR2 SDRAM is fairly inexpensive, easy to get, and certainly fast and large enough for what you want to do. I'd start with a 16-bit wide data bus, and increase that if you need more speed. You can use the Xilinx CoreGen or Memory Interface Generator to get your DDR2 interface core.
The rest is "relatively easy", in that it's just moving data around and generating the proper sync pulses.
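To give a feel for the sync-pulse part, here is a small behavioural model of the horizontal and vertical counters. It assumes standard 640x480 at 60 Hz timing, which is my choice for illustration and not something specified above; in the FPGA this would be two counters and a few comparators.

```python
# Assumed timing: 640x480 @ 60 Hz, counted in pixel clocks and scan lines.
H_VISIBLE, H_FRONT, H_SYNC, H_BACK = 640, 16, 96, 48
V_VISIBLE, V_FRONT, V_SYNC, V_BACK = 480, 10, 2, 33
H_TOTAL = H_VISIBLE + H_FRONT + H_SYNC + H_BACK  # 800
V_TOTAL = V_VISIBLE + V_FRONT + V_SYNC + V_BACK  # 525


def video_signals(h_count, v_count):
    """Return (hsync_active, vsync_active, visible) for a counter position.

    Polarity is left to the output stage; True just means "pulse active".
    """
    hsync = H_VISIBLE + H_FRONT <= h_count < H_VISIBLE + H_FRONT + H_SYNC
    vsync = V_VISIBLE + V_FRONT <= v_count < V_VISIBLE + V_FRONT + V_SYNC
    visible = h_count < H_VISIBLE and v_count < V_VISIBLE
    return hsync, vsync, visible


def advance(h_count, v_count):
    """Step the counters by one pixel clock, wrapping at line and frame end."""
    h_count += 1
    if h_count == H_TOTAL:
        h_count = 0
        v_count = (v_count + 1) % V_TOTAL
    return h_count, v_count


print(H_TOTAL, V_TOTAL)  # 800 525
```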
One major downside to this approach is that you're basically limited to using BGAs for both the memory and the FPGA. One plus is that there are FPGA development boards that already have this circuitry on them.