In effect, you're trying to recreate a color CRT controller with memory interface. This is perfectly possible, but it's much more involved than you realize. The physical implementation can be either an FPGA, as alex forencich suggests, or discrete chips. The discrete section will need something like the 74FCT series for horizontal timing, and can easily get by with 74HC for vertical timing.
First, as you realize, you'll have to generate display timing at 25 MHz - except that you won't. A 25 MHz VGA pixel clock implies that you're trying for 640 x 480 pixels, and this cannot be stored in a 64 kB RAM - it would require a 512 kB RAM. Instead, a 64 kB RAM will only support a 256 x 256 display, and this will only require a pixel clock of about 12.5 MHz. This is straightforward using a 9-bit binary synchronous counter, and can be realized with 3 74FCT161 counters. The vertical timing also uses a 9-bit counter, but 74HC161s can be used, since vertical timing is much slower than horizontal. The outputs of the two counters feed at least one static RAM, and there are at least 3 different approaches you can use for the interface.
1) FIFO - This is your first thought, but it's more complicated than you think. First, it only makes sense to transfer one byte (or 6 bits) of intensity data at a time, but you also have to store the address as well as the data. If you're going with a 64kB RAM, this means 16 bits of address along with 6 to 8 bits of intensity, and you'll need more than one FIFO. This in turn means that you'll need to ensure that the FIFOs remain synchronized. You'll also need to provide a mechanism to monitor the FIFO empty line and generate a write pulse to video RAM whenever the FIFO is not empty: that is, whenever there is data in the FIFO waiting to be written. Furthermore, you'll also need to provide a mechanism to keep memory writes from interfering with display reads. You can do this either by running the video RAM at 25 MHz, but alternating read and write cycles, or by permitting writes to RAM only during the non-display portions of the scan. This will occur during front porch, back porch, sync, or vertical blanking intervals.
2) Dual port RAM - Here's another device technology to look at. In this case you use the DPRAM as the video buffer, and feed one side from the video controller and the other from the Arduino. Be forewarned, a 64k x 8 DPRAM requires a package with a lot of pins.
3) Bank Switching - In this technique, you provide 2 video RAMs, and at any time one is being written to while the other is being read from. The state of the RAM is controlled by a flip-flop which can be triggered by the Arduino. So first (let's say) you read from bank A while writing to bank B. When you've completed writing a complete frame, you toggle the bank selector and the video is now read from B while A is being written to. This is in some ways more straightforward than the other two, but it does not permit local overwriting of areas of the image in the same way the other two approaches do.
Best Answer
A couple of approaches which may be useful for some styles of display is to divide the display panel into tiles, and
- restrict each tile to using a small set of colors, allowing the use of fewer than 8 bits per pixel, or
- use a byte or two from each tile to select a location from which to read bitmap data.
The first approach could reduce the rate at which data had to be read from display memory. For example, if one used tiles that were 16x16 and could each have four colors chosen from a set of 256, then without using any extra RAM in the FPGA one could reduce the number of memory reads per 16 pixels to eight (four color values, plus four bytes for the bitmap). If one added 160 bytes' worth of buffering/RAM(*) to the FPGA, one could reduce the number of memory reads per 16 pixels to four, using an extra 160 reads every 16 scan lines to read the next set of tile colors. If one wanted 16 colors per tile, the second approach would require an extra 640 bytes of RAM unless one placed some restrictions on the number of different palettes that could exist on a line.The second approach would probably increase rather than reduce the total memory bandwidth required to produce a display, but would reduce the amount of memory that would have to be updated to change the display--one could change a byte or two to update an 8x8 or 16x16 area of the screen. Depending upon what you're trying to display, it may be helpful when using this style of approach to use one memory device to hold the tile shapes, and another to hold the tile selection. One might, for example, use a fast 32Kx8 RAM to hold a couple 80x60 tile maps with two bytes per tile. If the FPGA didn't have any buffering, it would have to read one byte every four pixels; even with a 40ns static RAM, that would leave plenty of time for the CPU to update the display (an entire screen would only be 9600 bytes). The memory bandwidth for reading out the tile shapes would be no better than it is now, but that part of memory wouldn't have to be updated.
Incidentally, if one didn't want to add a 32Kx8 RAM but could add add 320 bytes of buffering/RAM(**) to the FPGA, one could use a tile-map approach but have the CPU or DMA feed 160 bytes to the display every 8 scan lines. That would burden the controller somewhat even when nothing on the display was changing, but could simplify the circuitry.
(*) The buffer could be implemented as RAM, or as a sequence of 32 40-bit-long shift registers plus a little control logic.
(**) The buffer could be implemented as two 160-byte RAMs, or as two groups of sixteen 80-bit shift registers.