How to solve speed issues in a double buffered + video overlay system


I've decided to start a small project over this break: a very simple GPU. For now, all the GPU does is receive pixels and write them to SDRAM. In addition, another part of the circuit fetches pixel data from the SDRAM and outputs it to an LCD (I've already built the timing controller for the LCD; it just needs an interface to the SDRAM). The last part somehow does a video overlay with a stream from some sort of CCD/CMOS sensor. As for the SDRAM itself, I've made controllers before and understand VHDL well enough, so that's a non-issue. My main problem is the timing.
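Roughly, the structure I have in mind looks like this (a VHDL sketch of the top level; the port names, 16-bit pixel width, and clock split are placeholders, not a finished interface):

```vhdl
library ieee;
use ieee.std_logic_1164.all;

-- Top-level skeleton: three clients share one SDRAM.
entity gpu_top is
  port (
    clk_100   : in  std_logic;                      -- SDRAM-domain clock, 100 MHz
    clk_9     : in  std_logic;                      -- LCD pixel clock, 9 MHz
    rst       : in  std_logic;
    -- GPU write stream into the back buffer
    gpu_pixel : in  std_logic_vector(15 downto 0);
    gpu_valid : in  std_logic;
    -- CCD/CMOS stream for the overlay
    cam_pixel : in  std_logic_vector(15 downto 0);
    cam_valid : in  std_logic;
    -- pixels out to the LCD timing controller (already written)
    lcd_pixel : out std_logic_vector(15 downto 0);
    lcd_de    : out std_logic
    -- SDRAM pins omitted for brevity
  );
end entity gpu_top;
```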

First off, the LCD runs at 9 MHz, so it takes ~111 ns per pixel, with a 25 ns setup time before the falling edge. Suppose I'm running the SDRAM at 100 MHz. Once a burst is set up, I get one pixel every 10 ns, so as long as I stay in a burst there's more than enough time to fetch the data. But that speed only holds inside a burst; if I had to open the row for every single access, I would be cutting it close. This means that while I'm actively reading data for the LCD, the SDRAM is completely tied up. No writes.
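As a sanity check on those numbers (assuming one pixel per SDRAM word):

$$
t_{\text{LCD}} = \frac{1}{9\,\text{MHz}} \approx 111\,\text{ns/pixel},
\qquad
t_{\text{burst}} = \frac{1}{100\,\text{MHz}} = 10\,\text{ns/pixel}
$$

so a burst read runs roughly 11× faster than the LCD consumes pixels.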

The only time I can do the writes is during the ~4100 ns of Hsync in each line (the display is ~480 columns by ~280 lines) and the ~525,000 ns of Vsync. Suppose that in the worst case my GPU needs to write every pixel of the screen. At 10 ns per pixel I can write ~410 pixels per Hsync, which leaves ~70 pixels per line that have to be written during the Vsync. These writes go to a second frame area in the memory (hence, double buffering).
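Working that out explicitly (10 ns per pixel written inside a burst):

$$
\frac{4100\,\text{ns}}{10\,\text{ns/pixel}} = 410\ \text{pixels per Hsync},
\qquad
(480 - 410) \times 280 = 19{,}600\ \text{pixels left over}
$$

and those 19,600 leftover pixels take ~196,000 ns, which fits inside the ~525,000 ns of Vsync.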

That timing works out, … assuming my GPU has its own (large) pixel cache to hold data while it waits, and so on and so forth. That's an issue I'm going to deal with later. My main issue right now is the video overlay. On top of everything above, I also have the pixel stream from the CCD/CMOS. After my worst-case GPU usage, I have less than ~525,000 ns left to write 480×280 pixels (~1,344,000 ns), which is impossible by a factor of a bit less than 3.
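The arithmetic for the overlay stream, again at 10 ns per pixel:

$$
480 \times 280 = 134{,}400\ \text{pixels}
\;\Rightarrow\;
134{,}400 \times 10\,\text{ns} = 1{,}344{,}000\,\text{ns}
\approx 2.6 \times 525{,}000\,\text{ns}
$$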

Am I doing something terribly wrong? How does the computer I have in front of me right now do this? Its display is SO much larger, and I doubt that SDRAMs with a MUCH shorter latency exist (maybe DDR). 🙁

Best Answer

I see two questions here, each with its own answer.

First, how to multiplex the SDRAM. My suggestion would be to grab a bunch of output pixels from the SDRAM in a burst, stuff them into a small internal memory or FIFO (easy if your logic implementation platform is an FPGA), and then feed them to the LCD at its slower rate; during that time the GPU can use the SDRAM. When the queue of pixels runs low, claim the SDRAM back from the GPU and fetch another bunch.
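A minimal sketch of that refill logic, assuming a dual-clock FIFO primitive with a fill-level output (every FPGA vendor's toolchain ships one); the signal names and the REFILL_LEVEL threshold are illustrative, not canonical:

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity lcd_prefetch is
  generic (
    REFILL_LEVEL : natural := 128  -- refill when the FIFO holds fewer pixels than this
  );
  port (
    clk        : in  std_logic;             -- 100 MHz SDRAM-domain clock
    rst        : in  std_logic;
    fifo_level : in  unsigned(9 downto 0);  -- occupancy, from the FIFO primitive
    lcd_grant  : in  std_logic;             -- SDRAM controller has issued our burst
    lcd_req    : out std_logic;             -- request an LCD refill burst
    gpu_ok     : out std_logic              -- GPU may use the SDRAM meanwhile
  );
end entity lcd_prefetch;

architecture rtl of lcd_prefetch is
  signal req_i : std_logic := '0';
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if rst = '1' then
        req_i <= '0';
      elsif fifo_level < REFILL_LEVEL then
        req_i <= '1';   -- LCD queue is running low: claim the SDRAM
      elsif lcd_grant = '1' then
        req_i <= '0';   -- burst issued: hand the SDRAM back to the GPU
      end if;
    end if;
  end process;

  lcd_req <= req_i;
  gpu_ok  <= not req_i;  -- the GPU gets every cycle the LCD isn't asking for
end architecture rtl;
```

The threshold just has to be deep enough that the FIFO never underflows during the worst-case time it takes to pry the SDRAM away from the GPU and finish a burst.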

Second, how to do the overlay. Quite simply, for best performance don't commit the overlay data to the SDRAM at all; instead, put it in its own memory and multiplex it in front of (or transparently mix it with) the output of the SDRAM, under the control of a state machine with programmable size parameters.
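Here is what that output mux could look like, assuming the overlay buffer is read out in lockstep with the LCD scan and the window registers come from whatever control interface drives the state machine; all names are hypothetical:

```vhdl
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity overlay_mux is
  port (
    clk       : in  std_logic;                      -- LCD pixel clock
    x, y      : in  unsigned(9 downto 0);           -- current scan position
    x0, y0    : in  unsigned(9 downto 0);           -- programmable overlay window
    x1, y1    : in  unsigned(9 downto 0);
    frame_pix : in  std_logic_vector(15 downto 0);  -- from the SDRAM front buffer
    ovl_pix   : in  std_logic_vector(15 downto 0);  -- from the overlay's own memory
    out_pix   : out std_logic_vector(15 downto 0)
  );
end entity overlay_mux;

architecture rtl of overlay_mux is
begin
  process (clk)
  begin
    if rising_edge(clk) then
      if x >= x0 and x < x1 and y >= y0 and y < y1 then
        out_pix <= ovl_pix;    -- inside the window: show the camera stream
      else
        out_pix <= frame_pix;  -- everywhere else: show the frame buffer
      end if;
    end if;
  end process;
end architecture rtl;
```

Transparent mixing would replace the hard select with a blend of the two pixels, but the structure is the same.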