Here's how I'd do it...
I would start with a Xilinx Spartan-6 FPGA. I'd go with this family because it has "hard cores" for a DDR SDRAM interface. By hard core, I mean that the circuitry for the memory interface is a dedicated chunk of logic rather than part of the normal user-programmed logic fabric. This means that you're going to meet timing, and you don't have to write this logic on your own.
Next, I'd hook up some DDR2 SDRAM to the part. DDR2 SDRAM is fairly inexpensive, easy to get, and certainly fast and large enough for what you want to do. I'd start with a 16-bit wide data bus, and increase that if you need more speed. You can use the Xilinx CoreGen or Memory Interface Generator to get your DDR2 interface core.
The rest is "relatively easy", in that it's just moving data around and generating the proper sync pulses.
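To make "generating the proper sync pulses" a bit more concrete, here is a rough behavioral model in Python (not HDL) of the two counters a display controller usually runs. The timing numbers are an assumption on my part: they are the commonly quoted 640x480 @ 60 Hz VGA values, and your panel may need different ones. The function names are made up for illustration.

```python
# Behavioral sketch of sync generation, assuming standard 640x480 @ 60 Hz
# VGA timing (an assumption; check the datasheet of the actual display).
H_VISIBLE, H_FRONT, H_SYNC, H_BACK = 640, 16, 96, 48
V_VISIBLE, V_FRONT, V_SYNC, V_BACK = 480, 10, 2, 33
H_TOTAL = H_VISIBLE + H_FRONT + H_SYNC + H_BACK   # pixel clocks per line
V_TOTAL = V_VISIBLE + V_FRONT + V_SYNC + V_BACK   # lines per frame
PIXEL_CLOCK_HZ = 25_175_000


def sync_signals(x, y):
    """Return (hsync_active, vsync_active, visible) for counter values x, y."""
    hsync = H_VISIBLE + H_FRONT <= x < H_VISIBLE + H_FRONT + H_SYNC
    vsync = V_VISIBLE + V_FRONT <= y < V_VISIBLE + V_FRONT + V_SYNC
    visible = x < H_VISIBLE and y < V_VISIBLE
    return hsync, vsync, visible


def step(x, y):
    """Advance the horizontal/vertical counters by one pixel clock."""
    x += 1
    if x == H_TOTAL:
        x = 0
        y = (y + 1) % V_TOTAL
    return x, y


# Resulting frame rate for this timing (a little under 60 Hz).
refresh_hz = PIXEL_CLOCK_HZ / (H_TOTAL * V_TOTAL)
```

In an FPGA this would be two counters and a few comparators; the "moving data around" part is fetching a pixel from memory whenever `visible` is true.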
One major downside to this approach is that you're basically limited to using BGAs for both the memory and the FPGA. One plus side is that there are FPGA development boards that already have this circuitry on them.
A couple of approaches that may be useful for some styles of display are to divide the display panel into tiles, and
- restrict each tile to using a small set of colors, allowing the use of fewer than 8 bits per pixel, or
- use a byte or two from each tile to select a location from which to read bitmap data.
The first approach could reduce the rate at which data has to be read from display memory. For example, if one used 16x16 tiles that could each have four colors chosen from a set of 256, then without using any extra RAM in the FPGA one could reduce the number of memory reads per 16 pixels to eight (four color values, plus four bytes for the bitmap at two bits per pixel), compared with sixteen at 8 bits per pixel. If one added 160 bytes' worth of buffering/RAM(*) to the FPGA, one could reduce the number of memory reads per 16 pixels to four, using an extra 160 reads every 16 scan lines to read the next set of tile colors. If one wanted 16 colors per tile, this buffered variant would require 640 bytes of RAM rather than 160, unless one placed some restrictions on the number of different palettes that could exist on a line.
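The arithmetic for the restricted-palette tiles can be checked with a few lines of Python. This is only a sketch: it assumes one byte per memory read, one byte per palette entry (colors chosen from 256), and a 640-pixel-wide display, which is what the 160-byte figure implies. The function names are hypothetical.

```python
def reads_per_tile_row(tile_width, colors_per_tile, palette_buffered):
    """Byte reads needed to display one scan line of one tile.

    Assumes byte-wide reads and one-byte palette entries.
    """
    bits_per_pixel = (colors_per_tile - 1).bit_length()  # 4 colors -> 2 bits
    bitmap_bytes = tile_width * bits_per_pixel // 8
    palette_bytes = 0 if palette_buffered else colors_per_tile
    return bitmap_bytes + palette_bytes


def palette_buffer_bytes(screen_width, tile_width, colors_per_tile):
    """On-chip RAM needed to hold the palettes for one row of tiles."""
    return (screen_width // tile_width) * colors_per_tile


unbuffered = reads_per_tile_row(16, 4, palette_buffered=False)  # 8
buffered = reads_per_tile_row(16, 4, palette_buffered=True)     # 4
buffer_4_colors = palette_buffer_bytes(640, 16, 4)              # 160
buffer_16_colors = palette_buffer_bytes(640, 16, 16)            # 640
```

The buffer is refilled once per row of tiles, which is where the extra 160 reads every 16 scan lines come from.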
The second approach would probably increase rather than reduce the total memory bandwidth required to produce a display, but would reduce the amount of memory that would have to be updated to change the display; one could change a byte or two to update an 8x8 or 16x16 area of the screen. Depending upon what you're trying to display, it may be helpful when using this style of approach to use one memory device to hold the tile shapes, and another to hold the tile selection. One might, for example, use a fast 32Kx8 RAM to hold a couple of 80x60 tile maps with two bytes per tile. If the FPGA didn't have any buffering, it would have to read one byte every four pixels; even with a 40ns static RAM, that would leave plenty of time for the CPU to update the display (an entire screen would only be 9600 bytes). The memory bandwidth for reading out the tile shapes would be no better than it is now, but that part of memory wouldn't have to be updated.
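As a rough sanity check of the tile-map numbers, the following Python works out the map size and how much of the RAM's time the display would use. It assumes a 640x480 display with 8x8 tiles (which is what an 80x60 map implies) and the usual 25.175 MHz pixel clock; both are my assumptions rather than something the setup above fixes.

```python
PIXEL_CLOCK_HZ = 25_175_000      # assumed 640x480 VGA pixel clock
SCREEN_W, SCREEN_H = 640, 480    # assumed resolution
TILE = 8                         # 8x8 tiles give an 80x60 map
BYTES_PER_TILE_ENTRY = 2
RAM_BYTES = 32 * 1024            # the 32Kx8 RAM
RAM_ACCESS_NS = 40

cols, rows = SCREEN_W // TILE, SCREEN_H // TILE
map_bytes = cols * rows * BYTES_PER_TILE_ENTRY       # bytes per tile map
maps_that_fit = RAM_BYTES // map_bytes               # whole maps in the RAM

# Without buffering, each tile entry is re-read on every scan line:
pixels_per_read = TILE / BYTES_PER_TILE_ENTRY        # one byte per 4 pixels
read_period_ns = pixels_per_read / PIXEL_CLOCK_HZ * 1e9

# Fraction of RAM time the display needs; the rest is free for the CPU.
display_share = RAM_ACCESS_NS / read_period_ns
```

Under these assumptions the display needs one 40 ns access roughly every 159 ns, so about three quarters of the RAM cycles remain available for CPU updates.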
Incidentally, if one didn't want to add a 32Kx8 RAM but could add 320 bytes of buffering/RAM(**) to the FPGA, one could use a tile-map approach but have the CPU or DMA feed 160 bytes to the display every 8 scan lines. That would burden the controller somewhat even when nothing on the display was changing, but could simplify the circuitry.
(*) The buffer could be implemented as RAM, or as a sequence of 32 40-bit-long shift registers plus a little control logic.
(**) The buffer could be implemented as two 160-byte RAMs, or as two groups of sixteen 80-bit shift registers.
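To get a feel for how much the CPU-fed variant would burden the controller, here is a small Python estimate of the feed rate, plus a check that the shift-register layouts in the footnotes match the byte counts. The 800-clock line length and 25.175 MHz pixel clock are assumed standard VGA values, not something stated above.

```python
PIXEL_CLOCK_HZ = 25_175_000   # assumed VGA pixel clock
H_TOTAL = 800                 # assumed pixel clocks per scan line, incl. blanking
TILE_ROW_BYTES = 160          # 80 tiles x 2 bytes per row of tiles
LINES_PER_TILE_ROW = 8

line_time_s = H_TOTAL / PIXEL_CLOCK_HZ
refill_window_s = LINES_PER_TILE_ROW * line_time_s    # time to deliver 160 bytes
feed_rate_bytes_per_s = TILE_ROW_BYTES / refill_window_s

# Footnote (*): 32 shift registers of 40 bits each.
palette_buffer_bits = 32 * 40
# Footnote (**): two groups of sixteen 80-bit shift registers.
tile_buffer_bits = 2 * 16 * 80
```

With these numbers the CPU or DMA has about 254 microseconds to deliver each 160-byte row, which works out to roughly 630 kB/s while the visible lines are being scanned.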
Best Answer
The external SRAM on the MicroNova Mercury board is a fast one (10 ns nominal speed), so there should be no problem accessing it at 50 MB/s or more. You could easily read out a full VGA signal at a raw rate of 25.175 MB/s and still have roughly half of the memory bandwidth available for writes. Of course, the SRAM does not have enough room for two 640×480×8-bit buffers, so you'll have to cut back somewhere. If you do 320×240 @ 60 Hz, you'll only need to read out 4.608 MB/s on average, leaving more than 90% of the memory bandwidth for pixel writing and other purposes.
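The bandwidth figures above are easy to reproduce. The Python below is just that arithmetic, taking the 50 MB/s access rate as a given and assuming one byte per pixel; the helper name is made up for illustration.

```python
SRAM_BANDWIDTH = 50_000_000   # bytes/s, the conservative figure used above


def readout_rate(width, height, refresh_hz, bytes_per_pixel=1):
    """Average bytes per second needed to scan out one frame buffer."""
    return width * height * refresh_hz * bytes_per_pixel


qvga_rate = readout_rate(320, 240, 60)               # 4_608_000 bytes/s
qvga_remaining = 1 - qvga_rate / SRAM_BANDWIDTH      # share left for writes

vga_raw_rate = 25_175_000                            # one byte per pixel clock
vga_share = vga_raw_rate / SRAM_BANDWIDTH            # share used by raw VGA

bytes_per_vga_frame = 640 * 480                      # one 8-bit frame buffer
```

So 320×240 uses a little over 9% of the assumed bandwidth, and a raw 640×480 readout uses just over half of it.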
I see no need to use internal BRAM for pixel buffering; it's a resource that you will find much more useful for other things, such as implementing the controller(s) that will be running the game logic and drawing the pixels.