The hsync and vsync signals are just relatively short, negative pulses that reset the CRT's electron beam to the left and top of the screen respectively.
Since CGA was basically just NTSC (AKA RS-170) with separate sync and components, the timing of the pulses should be the same. The hsync would occur about every 63.5 microseconds, and the vsync about every 16.7 milliseconds. The vertical timing should be well within the capabilities of an Arduino, but the horizontal timing might be more challenging.
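To put those two periods in perspective, here is a rough back-of-the-envelope sketch in Python. It only uses the approximate figures quoted above (63.5 us and 16.7 ms); the function names are made up for illustration, and the exact values for a real CGA card should come from a proper timing reference.

```python
# Approximate sync periods quoted in the answer above.
HSYNC_PERIOD_S = 63.5e-6   # time between horizontal sync pulses
VSYNC_PERIOD_S = 16.7e-3   # time between vertical sync pulses

def rate_hz(period_s):
    """Convert a repetition period in seconds to a frequency in hertz."""
    return 1.0 / period_s

def lines_per_field(hsync_period_s, vsync_period_s):
    """How many horizontal lines fit between two vertical sync pulses."""
    return vsync_period_s / hsync_period_s

if __name__ == "__main__":
    print(f"line rate:  {rate_hz(HSYNC_PERIOD_S) / 1e3:.2f} kHz")
    print(f"field rate: {rate_hz(VSYNC_PERIOD_S):.1f} Hz")
    print(f"lines per field: {lines_per_field(HSYNC_PERIOD_S, VSYNC_PERIOD_S):.0f}")
```

With these rounded inputs you get a line rate a little under 16 kHz and roughly 260-odd lines per vertical period, which is why the vertical timing is easy and the horizontal timing is where the effort goes.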
During the active horizontal scan you'd need to update the luminance according to the horizontal resolution you're designing to. To get 640 pixels, assuming you use about 53 us of the horizontal trace time (to allow for the HS duration and for margins that keep your output from running off the edges of the CRT), you need to output a new pixel about every 82 ns. Now, 82 ns is (probably) way too fast to get directly from an Arduino, but if you employ an external 8-bit shift register, you only have to load it about every 660 ns, i.e., roughly two-thirds of a microsecond. Of course you could opt for 320 pixels and ease the timing further.
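The arithmetic behind those numbers can be checked with a short Python sketch. The 53 us active time and the 8-bit shift register come from the paragraph above; the 16 MHz CPU clock is my assumption (it is a common Arduino clock, but check your board), and the function names are just illustrative.

```python
ACTIVE_LINE_S = 53e-6   # usable part of each line, from the text above
CPU_CLOCK_HZ = 16e6     # ASSUMPTION: 16 MHz board clock, adjust for your hardware

def pixel_period_s(pixels_per_line, active_line_s=ACTIVE_LINE_S):
    """Time available for one pixel during the active part of the line."""
    return active_line_s / pixels_per_line

def reload_interval_s(pixels_per_line, register_bits=8,
                      active_line_s=ACTIVE_LINE_S):
    """Time between loads of a shift register holding `register_bits` pixels."""
    return register_bits * pixel_period_s(pixels_per_line, active_line_s)

def cpu_cycles(interval_s, clock_hz=CPU_CLOCK_HZ):
    """Number of CPU clock cycles that fit in the given interval."""
    return interval_s * clock_hz

if __name__ == "__main__":
    for width in (640, 320):
        pixel = pixel_period_s(width)
        reload = reload_interval_s(width)
        print(f"{width} px: pixel every {pixel * 1e9:.1f} ns, "
              f"reload every {reload * 1e9:.1f} ns "
              f"(~{cpu_cycles(reload):.1f} CPU cycles)")
```

Under the 16 MHz assumption, the 640-pixel case leaves only about ten CPU cycles between shift-register loads, so it would be tight even with the external register; 320 pixels doubles that budget.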
If meeting that kind of timing sounds reasonable to you, the exact numbers could easily be had via some light googling. For example, this looks like a pretty good example.
If you're just trying to overpower a video signal without gating it, your effective source impedance is going to vary every time your video is switched on and off. Not good. You need to ensure that both sides of your video connection see a uniform 75-ohm impedance whether you are passing through the signal or supplying your own. The easiest way to do that would be to use a pair of video amplifiers and put your circuit in the middle. The input amplifier should probably have 1:1 gain and a low-impedance output; use something like a 4066 to switch that and a white signal. The output amplifier should have a 2:1 gain and a 75 ohm resistor in series. BTW, are you doing anything for DC restoration? That could be an issue as well.
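In case the 2:1 gain looks odd: the 75 ohm series resistor and the 75 ohm termination at the far end form a voltage divider that halves the signal, so the gain of two just brings it back to the original level. A minimal Python sketch of that arithmetic, assuming an ideal amplifier with zero output impedance (the function name is made up for illustration):

```python
def voltage_at_load(v_in, gain, r_series_ohm, r_load_ohm):
    """Voltage across the terminating load for an ideal amplifier
    driving it through a series resistor (simple voltage divider)."""
    return v_in * gain * r_load_ohm / (r_series_ohm + r_load_ohm)

if __name__ == "__main__":
    # 1 V of video in, gain of 2, 75 ohm series resistor, 75 ohm termination.
    print(voltage_at_load(1.0, 2.0, 75.0, 75.0))
    # Same amplifier with unity gain: only half the signal reaches the load.
    print(voltage_at_load(1.0, 1.0, 75.0, 75.0))
```

With an ideal amplifier the cable also sees the series resistor as the source impedance, which is what keeps it at a constant 75 ohms no matter which signal the switch has selected.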
Best Answer
I used to work at Panasonic on their In-Flight Entertainment systems, so I know a bit about this kind of stuff. This description won't be 100% technically accurate (some naming might be a bit off) but I am trying to write it so anyone can understand it. Hopefully this explanation helps...
The "magic" behind it can be a combination of the following things: signal amplitude, frequency, and modulation. Different types of TVs and signals work differently. This is why old TVs had to have a converter box to accept the new digital signals if they only had an analog tuner. But that really just describes how the data is presented in the signal. Basically, the color data for each pixel is sent to the TV line by line, pixel by pixel and the TV refreshes the screen so many times per second with the new data. Even though the video is really just a lot of still images being updated on the screen, they change fast enough for us to perceive them as moving, hence the old term "moving picture."
Take a look at a typical "color bar" test pattern (from Wikipedia), which is used to test video systems.
The picture itself is divided into "lines" of pixels. Every screen has so many columns and so many lines, making up the total screen resolution. Each color in this picture is spread across numerous pixels of the same line. The accompanying oscilloscope waveform helps to describe what is going on here (This image is from Tektronix):
This image shows the data for two lines of pixels. Each line starts with a "sync pulse" to align the screen and the signal. This pulse (the negative part of the waveform) is followed by data for each pixel of the line. This is actually an analog video signal: the pixel data is represented by the amplitude and phase of the signal. You can see the various colors as an analog voltage with differing maximum and minimum voltages. When one line is finished, another sync pulse signals the start of the next line. The video signal and the screen need to have matching resolution (number of pixels per line). If there is extra data, it is dropped. If there is not enough data, the pixels share the data (which makes the picture blocky).
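As a rough illustration of that line structure, here is a small Python sketch that builds one brightness-only line as a list of voltages: a sync pulse, a short blank gap, the pixel data, and another blank gap. It ignores color entirely (no phase information). The voltage levels are nominal values I am assuming for a 1 V peak-to-peak signal, and the sample counts and function names are made up for the example.

```python
# ASSUMED nominal levels for a 1 V peak-to-peak video signal.
SYNC_V = -0.3    # sync pulse (the negative part of the waveform)
BLANK_V = 0.0    # blanking / black reference
WHITE_V = 0.7    # full brightness

def build_scanline(pixels, sync_samples=4, porch_samples=2):
    """Return one line as voltages. `pixels` holds brightness values 0.0..1.0."""
    line = [SYNC_V] * sync_samples + [BLANK_V] * porch_samples
    line += [BLANK_V + p * (WHITE_V - BLANK_V) for p in pixels]
    line += [BLANK_V] * porch_samples
    return line

def fit_to_screen(pixels, screen_width):
    """Nearest-neighbour fit: extra pixels are dropped, and when there are
    too few, neighbouring screen pixels share the same data (blocky)."""
    n = len(pixels)
    return [pixels[i * n // screen_width] for i in range(screen_width)]

if __name__ == "__main__":
    print(build_scanline([0.0, 0.5, 1.0]))
    print(fit_to_screen([0.0, 1.0], 4))               # too little data: shared
    print(fit_to_screen([0.0, 0.25, 0.5, 0.75], 2))   # too much data: dropped
```

The `fit_to_screen` helper is only one simple way to model the "dropped" and "shared" cases; a real display handles mismatched resolution in its own way.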
Thanks to Pete B for mentioning this:
Digital signals are a bit different in that the signal is either HI or LO. The value of HI can vary between systems. There are different ways that this works. Sometimes, a known number of bits of data constitute a packet carrying all of the pixel data (similar to network communication). Another way is to time how long the signal is HI vs. how long it is LO to represent a different pixel value. This is kind of how IR TV remote controls work, although they are sending "control codes" instead of pixel information.
As you can imagine, this all takes place very, very quickly. A common TV in the United States is updated (screen refresh) about 60 times per second (60 Hz). With interlaced video, each refresh draws only half of the lines, so a complete frame arrives about 30 times per second (30 Hz). Modern and HD TVs will typically refresh even more often (upwards of 240 Hz). What this refresh rate means is that every pixel in the entire screen is updated so many times per second. The more often it refreshes, the more motion detail is available in the picture, especially when there are a lot of fast-moving images in the video (like a chase sequence).
Different TV channels (over-the-air or cable) are delivered to the TV in the same way, just with different base frequencies. The TV tuner will select one of these base frequencies to display (selecting a channel) and will update the pixels based on the modulated frequencies within the base carrier. The frequencies representing the pixel color data are much, much faster than the actual refresh rate of the screen because each pixel's data has to be updated so many times per second, and there are, as you said, thousands of pixels.
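To get a feel for how much faster the pixel data is than the refresh rate, here is a quick Python calculation. The 640 x 480 resolution and 60 Hz refresh are hypothetical numbers picked for the example, and the estimate ignores sync pulses and blanking time, so a real signal runs somewhat faster still.

```python
def pixel_rate_hz(columns, lines, refresh_hz):
    """Pixels that must be delivered per second (blanking time ignored)."""
    return columns * lines * refresh_hz

if __name__ == "__main__":
    # HYPOTHETICAL example display: 640 columns, 480 lines, 60 refreshes/s.
    rate = pixel_rate_hz(640, 480, 60)
    print(f"{rate / 1e6:.1f} million pixels per second")
    print(f"{rate // 60:,} pixel updates for every single screen refresh")
```

So even a modest screen needs pixel data arriving millions of times per second, while the screen as a whole only refreshes tens of times per second.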
Since humans only hear sound on a spectrum of 20Hz to 20kHz, the sound data can easily be added to the signal on top of the video and filtered out by the TV, although, for "high definition sound", the sound signal is sent through a separate wire to the TV to fit in all of the data.
To really understand what is going on, you have to comprehend signal frequencies, amplitude, time division, modulation, and spectrum analysis. But I hope this kind of explains some of it...