VGA monitors have a small serial EEPROM on their circuit boards. The chip holds the monitor's "EDID" (Extended Display Identification Data) and connects to two pins on the HD15 connector. These two pins operate as an I2C bus (the DDC, or Display Data Channel) that lets the driver software query the monitor and find out which video resolutions it supports.
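As a rough illustration of what the host reads back over that I2C bus: the EDID base block is 128 bytes long, starts with a fixed 8-byte header, and ends with a checksum byte chosen so that all 128 bytes sum to 0 modulo 256. The sketch below only validates those two properties. The function names are made up for this example, and the dummy block is a placeholder, not data from a real monitor.

```python
# Fixed 8-byte pattern that opens every EDID base block.
EDID_HEADER = bytes([0x00, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00])


def edid_block_is_valid(block: bytes) -> bool:
    """Check length, fixed header, and checksum of an EDID base block."""
    if len(block) != 128:
        return False
    if block[:8] != EDID_HEADER:
        return False
    # All 128 bytes, including the final checksum byte, must sum to 0 mod 256.
    return sum(block) % 256 == 0


def make_dummy_block() -> bytes:
    """Build a placeholder block (header, zero padding, checksum) for testing.

    Hypothetical helper: a real block carries vendor, timing, and size data
    where this one has zeros.
    """
    body = bytearray(EDID_HEADER) + bytearray(119)  # 127 bytes so far
    body.append((-sum(body)) % 256)                 # checksum makes the sum 0 mod 256
    return bytes(body)


if __name__ == "__main__":
    block = make_dummy_block()
    print("valid:", edid_block_is_valid(block))
```

On real hardware the 128 bytes would come from an I2C read of the monitor's EEPROM rather than from `make_dummy_block()`.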
The intended scheme is that the driver software on the host computer only offers video resolutions that the monitor supports. Once the monitor gets a video signal, it can infer the operating resolution by measuring the time between HSync pulses (the equivalent of counting dot clocks per line) and counting the number of HSyncs per VSync. Once the operating resolution is inferred, the monitor switches itself to whatever settings it knows work best for displaying that video mode.
Older monitors, back in the days of "MultiSync" CRT style monitors, may have only supported a couple of video resolutions. In some cases the video mode detect may have even been a simple R/C filter that could detect changes in the HSYNC frequency.
Newer monitors, including the plethora of LCD screens in use now, all have digital controllers with built-in circuits to detect the video resolution. This is almost certainly done by counting the pixels per line and/or lines per frame.
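A minimal sketch of that kind of detection, assuming the controller can measure the HSync and VSync frequencies: divide one by the other to get the number of lines per frame, then look for a match in a table of modes it knows. The three table entries use the commonly published 60 Hz timings. The table, the tolerances, and the function name are illustrative choices, not how any particular monitor does it.

```python
# Illustrative table: (name, HSync frequency in Hz, total lines per frame
# including blanking). Values are the commonly published 60 Hz timings.
KNOWN_MODES = [
    ("640x480@60", 31_469, 525),
    ("800x600@60", 37_879, 628),
    ("1024x768@60", 48_363, 806),
]


def infer_mode(hsync_hz, vsync_hz, freq_tolerance=0.02, line_tolerance=2):
    """Guess the video mode from measured sync frequencies.

    Returns the mode name, or None when nothing in the table matches.
    """
    # HSyncs per VSync is the total line count of one frame.
    lines_per_frame = round(hsync_hz / vsync_hz)
    for name, mode_hsync_hz, mode_lines in KNOWN_MODES:
        freq_ok = abs(hsync_hz - mode_hsync_hz) <= freq_tolerance * mode_hsync_hz
        lines_ok = abs(lines_per_frame - mode_lines) <= line_tolerance
        if freq_ok and lines_ok:
            return name
    return None


if __name__ == "__main__":
    print(infer_mode(31_469, 59.94))
    print(infer_mode(15_734, 59.94))
```

A signal that matches nothing in the table returns `None`, which corresponds to a monitor showing an "out of range" message.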
In effect, you're trying to recreate a color CRT controller with memory interface. This is perfectly possible, but it's much more involved than you realize. The physical implementation can be either an FPGA, as alex forencich suggests, or discrete chips. The discrete section will need something like the 74FCT series for horizontal timing, and can easily get by with 74HC for vertical timing.
First, as you realize, you'll have to generate display timing at 25 MHz - except that you won't. A 25 MHz VGA pixel clock implies that you're trying for 640 x 480 pixels, and this cannot be stored in a 64 kB RAM: at one byte per pixel it comes to 307,200 bytes, so it would require a 512 kB RAM. Instead, a 64 kB RAM will only support a 256 x 256 display, and this will only require a pixel clock of about 12.5 MHz, half the standard rate. This is straightforward using a 9-bit binary synchronous counter, and can be realized with three 74FCT161 counters. The vertical timing also uses a 9-bit counter, but 74HC161s can be used, since vertical timing is much slower than horizontal. The outputs of the two counters feed at least one static RAM, and there are at least 3 different approaches you can use for the interface.
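The sizing arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes one byte per pixel, RAM sizes that are powers of two, and the standard 640 x 480 line length of 800 dot clocks at 25 MHz (so 400 clocks at 12.5 MHz). The helper names are invented for the example.

```python
def framebuffer_bytes(width, height, bytes_per_pixel=1):
    """Bytes needed to hold one full frame."""
    return width * height * bytes_per_pixel


def smallest_pow2_ram(n_bytes):
    """Smallest power-of-two RAM size that holds n_bytes."""
    size = 1
    while size < n_bytes:
        size *= 2
    return size


def counter_bits(n_states):
    """Bits a binary counter needs to count through n_states (0..n_states-1)."""
    return (n_states - 1).bit_length()


if __name__ == "__main__":
    full = framebuffer_bytes(640, 480)
    print(full, "bytes ->", smallest_pow2_ram(full) // 1024, "kB RAM")

    small = framebuffer_bytes(256, 256)
    print(small, "bytes ->", smallest_pow2_ram(small) // 1024, "kB RAM")

    # Assumption: 800 dot clocks per line at 25 MHz, so 400 at 12.5 MHz.
    print("horizontal counter bits:", counter_bits(800 // 2))
```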
1) FIFO - This is your first thought, but it's more complicated than you think. First, it only makes sense to transfer one byte (or 6 bits) of intensity data at a time, but you also have to store the address as well as the data. If you're going with a 64 kB RAM, this means 16 bits of address along with 6 to 8 bits of intensity, and you'll need more than one FIFO. This in turn means that you'll need to ensure that the FIFOs remain synchronized. You'll also need to provide a mechanism to monitor the FIFO empty line and generate a write pulse to video RAM whenever the FIFO is not empty: that is, whenever there is data in the FIFO waiting to be written. Furthermore, you'll need to provide a mechanism to keep memory writes from interfering with display reads. You can do this either by running the video RAM at 25 MHz and alternating read and write cycles, or by permitting writes to RAM only during the non-display portions of the scan, that is, during the front porch, back porch, sync, or vertical blanking intervals.
2) Dual port RAM - Here's another device technology to look at. In this case you use the DPRAM as the video buffer, and feed one side from the video controller and the other from the Arduino. Be forewarned, a 64k x 8 DPRAM requires a package with a lot of pins.
3) Bank Switching - In this technique, you provide 2 video RAMs, and at any time one is being written to while the other is being read from. The state of the RAM is controlled by a flip-flop which can be triggered by the Arduino. So first (let's say) you read from bank A while writing to bank B. When you've completed writing a complete frame, you toggle the bank selector and the video is now read from B while A is being written to. This is in some ways more straightforward than the other two, but it does not permit local overwriting of areas of the image in the same way the other two approaches do.
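To make the bank-switching idea in 3) concrete, here is a small software model of it. It is only a behavioural sketch, not a hardware design: the class and method names are invented, `display_bank` stands in for the flip-flop, and the two `bytearray` objects stand in for the two 64 kB RAM chips.

```python
class BankSwitchedVideoRam:
    """Behavioural model of two video RAMs selected by one flip-flop."""

    def __init__(self, size=64 * 1024):
        self.banks = [bytearray(size), bytearray(size)]
        self.display_bank = 0  # flip-flop state: which bank the display reads

    def write(self, address, value):
        """Microcontroller side: always writes the bank NOT being displayed."""
        self.banks[self.display_bank ^ 1][address] = value

    def read_for_display(self, address):
        """Video side: always reads the bank currently selected for display."""
        return self.banks[self.display_bank][address]

    def toggle(self):
        """Swap the roles of the two banks once a full frame has been written."""
        self.display_bank ^= 1


if __name__ == "__main__":
    vram = BankSwitchedVideoRam()
    vram.write(0, 42)                 # goes to the hidden bank
    print(vram.read_for_display(0))   # display still shows the old frame
    vram.toggle()
    print(vram.read_for_display(0))   # new frame is now visible
```

After `toggle()`, the bank now open for writing still holds the frame before last, which is why a small local change cannot simply be patched in: the whole frame has to be rewritten, or the displayed bank copied over first.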
Best Answer
640x480 at 60 Hz uses a 25 MHz pixel clock, so you'd need 1 byte every 40 ns (for the 73 % of the time that is active display rather than blanking).
You also want to be able to write display data to your RAM so let's assume we want two RAM byte accesses in that 40 ns: one to read and display a pixel, one to R/W display data.
So you can either buy faster RAM or wider RAM. A 32-bit wide RAM using 55 ns chips could serve up four bytes in one 60 ns read, and those four pixels would last 160 ns at 40 ns each.
This illustrates the arithmetic; you can explore the permutations further. If you're using an FPGA, you can do some clever prefetching to burst-read multiple dwords and get more R/W slots in between them.
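For exploring those permutations, the same arithmetic can be put in a short script. Treat it as a sketch: it assumes one byte per pixel and the standard 640x480 frame of 800 x 525 dot clocks including blanking, and the function name is made up for the example.

```python
PIXEL_CLOCK_HZ = 25_000_000
PIXEL_PERIOD_NS = 1e9 / PIXEL_CLOCK_HZ  # 40 ns per pixel

# Assumption: standard 640x480 timing, 800 x 525 dot clocks per frame in total.
ACTIVE_FRACTION = (640 * 480) / (800 * 525)


def free_time_per_fetch_ns(bus_bytes, read_cycle_ns):
    """Time left for other RAM accesses after each display fetch.

    One fetch of bus_bytes pixels keeps the display fed for
    bus_bytes * PIXEL_PERIOD_NS. A negative result means the RAM
    cannot keep up with the display at all.
    """
    return bus_bytes * PIXEL_PERIOD_NS - read_cycle_ns


if __name__ == "__main__":
    print(f"active display: {ACTIVE_FRACTION:.0%} of the time")
    for width_bytes in (1, 2, 4):
        spare = free_time_per_fetch_ns(width_bytes, read_cycle_ns=60)
        print(f"{width_bytes * 8:2d}-bit bus, 60 ns read: {spare:+.0f} ns spare")
```

With a 32-bit bus and a 60 ns read cycle, each fetch leaves 100 ns before the next one is due, which is room for at least one more access of the same length.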