What you're thinking of doing could be accomplished with 74xx-series medium-scale integration (MSI) chips, such as were available in the late 1970s, and more easily with some of the newer 74xx parts that were added in the early 1980s. If you can find a databook for this series, especially a later, mid-1980s edition with pin-outs and logic diagrams, it will be invaluable for such an endeavor.
If you can look at the specs for the 74LS374 and understand how it might serve as an 8-bit register, it shouldn't be too much of a stretch to see how you could connect these on buses so as to move data around between them. Investigate the 74LS181 (a 4-bit ALU slice; cascade two for 8 bits) and you basically get all the math capabilities of the venerable 8-bit 6502 CPU in one swoop. The trickiest part is the control circuitry that executes instructions by sequencing all the registers, buffers, and ALU in the right order. If you only took a couple of EE classes, this will probably be your major stumbling block.
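To make the sequencing idea concrete, here is a minimal sketch in C of what a microcoded control unit has to do on each clock: pick one source to drive the bus, set the ALU function, and tell one destination register to latch the result. It is purely illustrative; the control-word bits, register names, and the "add memory to A" microprogram are hypothetical, not taken from any real design.

```c
/* Hypothetical control-word bits for a bus-based datapath. */
#include <stdio.h>
#include <stdint.h>

#define SRC_A    0x01   /* register A drives the bus (output enable)   */
#define SRC_MEM  0x02   /* memory data drives the bus                  */
#define DST_A    0x04   /* register A latches on the clock edge        */
#define DST_IR   0x08   /* instruction register latches on the edge    */
#define ALU_ADD  0x10   /* ALU (e.g. a pair of 74LS181s) set to "add"  */

int main(void) {
    /* microprogram for a made-up "add memory to A" instruction */
    uint8_t microprogram[] = {
        SRC_MEM | DST_IR,          /* fetch: memory -> instruction reg */
        SRC_MEM | ALU_ADD | DST_A, /* execute: A + memory -> A         */
    };

    uint8_t a = 5, mem = 7, ir = 0;

    for (unsigned step = 0; step < sizeof microprogram; step++) {
        uint8_t ctl = microprogram[step];
        uint8_t bus = 0;
        if (ctl & SRC_A)   bus = a;    /* only one source enabled per step */
        if (ctl & SRC_MEM) bus = mem;
        uint8_t alu = (ctl & ALU_ADD) ? (uint8_t)(a + bus) : bus;
        if (ctl & DST_IR) ir = bus;    /* "latches" happen on the clock edge */
        if (ctl & DST_A)  a  = alu;
        printf("step %u: ctl=0x%02X bus=%d A=%d IR=%d\n",
               step, (unsigned)ctl, bus, a, ir);
    }
    return 0;
}
```

In hardware those control bits come out of a counter plus some decode logic (or a microcode ROM) instead of an array, but the job is the same: assert the right enables in the right order, one clock at a time.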
To get oriented with this level of design, you might try something like "Microprogramming and Computer Architecture" by Bruce Segee and John Field. Used copies can be had for cheap at a certain on-line book seller.
It mostly has to do with the lengths of interconnections and propagation delays through the gates. If we reduce a CPU to its essence, it is a feedback machine. A bunch of combinatorial logic circuits compute some Boolean functions over the current state of the machine, and those functions determine the new state, which is latched in by sequential circuitry when a new clock edge arrives. The combinatorial circuits all have delays. The clock period cannot be shorter than the time it takes for the slowest path through these gates to produce a stable result, because a single incorrect bit stops the show.
Furthermore, the sequential logic has timing requirements of its own. The inputs to a register have to be stable for some minimum setup time before the clock edge arrives, and must remain stable for some hold time afterward. If either requirement is violated, the latched state becomes garbage.
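To put rough numbers on this, here is a tiny C calculation of the minimum clock period. The delay figures are made-up, LS-family-ish values chosen only for illustration, not datasheet numbers for any particular part.

```c
#include <stdio.h>

int main(void) {
    double t_clk_to_q = 28.0;  /* ns: register outputs valid after the clock edge   */
    double t_logic    = 90.0;  /* ns: slowest combinational path (ALU, carry chain) */
    double t_setup    = 20.0;  /* ns: inputs must be stable before the next edge    */

    double t_min = t_clk_to_q + t_logic + t_setup;   /* minimum clock period */
    printf("min period %.0f ns -> max clock %.1f MHz\n", t_min, 1000.0 / t_min);

    /* Hold is a separate constraint: the *fastest* path from clock edge to a
     * register input must still exceed the hold time, or the state is
     * corrupted no matter how slowly the clock runs. */
    return 0;
}
```

With those illustrative numbers the chain adds up to 138 ns, i.e. a ceiling of roughly 7 MHz, which is about where well-built homebrew machines end up.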
The propagation delays are caused by things like how fast parasitic capacitances can charge, how fast current can build in the face of an inductance, and how fast the silicon devices themselves can switch. For example, a bipolar transistor with a smaller base can switch faster than one with a larger base, so a tiny transistor on a chip will be faster than a discrete one.
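For a feel of the scale involved, here is an RC time-constant comparison; the resistance and capacitance values are assumed round numbers, not measurements of any real circuit.

```c
#include <stdio.h>

int main(void) {
    double r_drive = 100.0;     /* ohms: rough output resistance of a driver       */
    double c_chip  = 0.05e-12;  /* farads: an on-chip node, tens of femtofarads    */
    double c_board = 15e-12;    /* farads: a few cm of PCB trace plus an IC pin    */

    /* time constant tau = R*C; a node needs a few tau to swing fully */
    printf("on-chip  tau ~ %.3f ns\n", r_drive * c_chip  * 1e9);
    printf("on-board tau ~ %.3f ns\n", r_drive * c_board * 1e9);
    return 0;
}
```

The point is the two or three orders of magnitude between an on-chip node and a board-level net, which is where much of the speed difference comes from.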
In an earlier answer that I deleted, I wrote about transmission line effects. But I hadn't considered that those effects don't even come into the picture at the speeds we are talking about: at, say, 10 MHz the wavelength is still about 30 meters. So on the scale of an ordinary-sized circuit board, signals at a few megahertz effectively reach all parts of a copper network simultaneously.
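The arithmetic, for the record (the PCB velocity factor is an assumed typical value):

```c
#include <stdio.h>

int main(void) {
    double c  = 3.0e8;   /* m/s, free-space speed of light                      */
    double vf = 0.6;     /* assumed velocity factor for a signal on a PCB trace */
    double f  = 10e6;    /* 10 MHz */

    printf("free-space wavelength: %.1f m\n", c / f);       /* ~30 m */
    printf("on a PCB trace:        %.1f m\n", c * vf / f);  /* ~18 m */
    /* Either way, far longer than any hobby-sized board, so transmission-line
     * effects can be ignored at these clock rates. */
    return 0;
}
```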
So, if you make a CPU out of discrete components, you simply don't get the small devices with fast switching times, nor the close proximity that minimizes stray capacitance and inductance.
Nevertheless, ancient discrete-component machines of the 1960s did run quite a bit faster than these homebrew machines. It took some time and cunning to get there. For instance, the IBM 360 Model 44 (1964) ran at 4 MHz. That may still be "homebrew speed", but the CDC 7600, released just a few years later in 1969, surpassed 36 MHz. The Wikipedia article http://en.wikipedia.org/wiki/CDC_7600 gives a hint at some of the tricks that were pulled, for instance:
"As always, Cray's design also focussed on packaging to reduce size, shorten signal paths, and thereby increase operating frequency. ... [E]ach circuit module actually consisted of up to six PC boards, each one stuffed with subminiature resistors, diodes, and transistors. The six boards were stacked up and then interconnected along their edges, making for a very compact, but basically unrepairable module."
So the homebrew CPUs are not necessarily built to their true potential, due to confounding effects of build quality and layout. Still, anyone who builds a CPU out of individual integrated circuits and discrete components that runs at several megahertz should be applauded.
Best Answer
The "simplest possible I/O" depends on what your I/O requirements are. It would be very simple, if you're in control of the CPU design, to implement a couple of special instructions to input and output a few bits of parallel I/O.
But since you mention keyboard and video, it seems that your requirement is to be able to interact with your system that way - and the dead simplest way to do that is to incorporate a fixed baud-rate serial port and use a PC and a terminal program to supply the keyboard and video. 9600 bps is fine for human console purposes; if you expect to transfer binary, 115200 bps can handle 64K bytes in around 6 seconds.
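That 6-second figure works out as follows, assuming the usual 10 bits on the wire per byte (start + 8 data + stop):

```c
#include <stdio.h>

int main(void) {
    double baud          = 115200.0;
    double bits_per_byte = 10.0;      /* 1 start + 8 data + 1 stop (8N1) */
    double bytes         = 65536.0;   /* "64K" */

    double seconds = bytes * bits_per_byte / baud;
    printf("%.0f bytes at %.0f bps: %.1f s\n", bytes, baud, seconds);  /* ~5.7 s */
    return 0;
}
```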
An off-the-shelf UART chip like the 6850 or 8250 (well, out of the junk-box anyway) will need only 2 to 8 registers for a complete stdin/stdout solution. This is a tiny footprint compared to trying to implement a keyboard and video interface directly. You can probably drive a keyboard matrix with just a single 8-bit output port and a single 8-bit input port, but video will need 2K addresses just to supply the frame buffer for an 80x25 character display. If you consider so-called LSI solutions like the 8250 to be cheating, I would counter that constructing a UART from 74xx would probably be easier than constructing a video generator from the same technology.
That said, the down-side of the UART is that you are likely to lose input unless you either (a) poll it frequently enough, or (b) implement interrupts. For a simple proof-of-concept for your CPU, just polling should suffice - unless you're also trying to prove out interrupt capabilities of your CPU (if/when it has them), in which case, keeping up with serial input by using UART interrupts would be a fine test case.
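For what it's worth, here is what the polled approach boils down to, written in C for readability (on a real homebrew CPU this would be hand-written machine code). The register layout is loosely modeled on a 6850-style ACIA with one status and one data register; the base address and bit assignments are placeholders, not taken from any particular schematic.

```c
#include <stdint.h>

#define UART_BASE   0xC000u                            /* hypothetical address decode */
#define UART_STATUS (*(volatile uint8_t *)(UART_BASE + 0))
#define UART_DATA   (*(volatile uint8_t *)(UART_BASE + 1))
#define RX_FULL     0x01u   /* receive data register full   */
#define TX_EMPTY    0x02u   /* transmit data register empty */

/* Block until a byte arrives; data is lost if we don't poll often enough. */
static uint8_t uart_getc(void) {
    while (!(UART_STATUS & RX_FULL))
        ;                            /* spin: no interrupts, just polling */
    return UART_DATA;
}

/* Block until the transmitter can take another byte, then send it. */
static void uart_putc(uint8_t c) {
    while (!(UART_STATUS & TX_EMPTY))
        ;
    UART_DATA = c;
}

int main(void) {
    for (;;)
        uart_putc(uart_getc());      /* trivial echo loop as a proof of concept */
}
```

The interrupt-driven version replaces the receive busy-wait with a small ring buffer filled by the UART's receive interrupt, which is exactly the kind of exercise that would prove out your CPU's interrupt handling.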