I suggest using a shift-register chip with decent current-sinking capability for the cathodes, and a second shift-register chip driving discrete transistors for the anodes. For example, wire the matrix as 7x26, use two TLC5925 chips for the columns, and use a 74HC164 or equivalent to drive seven nice beefy transistors for the rows.
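To make the 7x26 arrangement a bit more concrete, here is a minimal Python sketch of how one refresh could be organized as data. It assumes the two column drivers are chained into a single 32-bit word (26 bits used) and that the row register gets a one-hot byte. The function names and the bit ordering are my own assumptions for illustration, and the real drive polarity depends on how the row transistors are wired.

```python
ROWS, COLS = 7, 26
COLUMN_BITS = 32  # assumed: two 16-bit sink drivers chained together

def column_word(row_pixels):
    """Pack one row of 26 on/off pixels into a 32-bit word (bit i = column i)."""
    if len(row_pixels) != COLS:
        raise ValueError("expected %d pixels" % COLS)
    word = 0
    for i, lit in enumerate(row_pixels):
        if lit:
            word |= 1 << i
    return word

def row_select(row):
    """One-hot byte for the row shift register (bit r selects row r)."""
    if not 0 <= row < ROWS:
        raise ValueError("row out of range")
    return 1 << row

def scan_sequence(frame):
    """Yield (row_byte, column_word) pairs, one per row, for one full refresh."""
    for r, pixels in enumerate(frame):
        yield row_select(r), column_word(pixels)

# Example frame: a checkerboard pattern.
frame = [[(r + c) % 2 == 0 for c in range(COLS)] for r in range(ROWS)]
for row_byte, cols in scan_sequence(frame):
    print(f"{row_byte:08b} {cols:0{COLUMN_BITS}b}")
```

Each pair would be shifted out and latched in turn, with the LEDs blanked while the registers are being updated.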
Actually, it may be a good idea to drive the rows from a counter chip and a 555 timer wired so that they scan automatically, while letting the main processor 'nudge' the timer when it's almost ready for its next count. Such a circuit could ensure that, no matter what the processor did, no row could be energized much more than 1/5 of the time. The processor could strobe six rows quickly, then linger on the seventh, then repeat, but the hardware would limit the fraction of time spent on any one row even in a worst-case situation.
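A rough way to put numbers on that worst case, under a simplified model that is my own assumption rather than a description of any particular circuit: the timer advances by itself after t_max, and a nudge can only shorten a step down to t_min. The worst case is then one row held for t_max while the other rows each get t_min.

```python
def worst_case_duty(rows, t_min, t_max):
    """Largest fraction of time a single row can be energized.

    Assumed model: the hardware advances on its own after t_max, and a
    processor nudge can shorten a step to no less than t_min.
    """
    if not 0 < t_min <= t_max:
        raise ValueError("need 0 < t_min <= t_max")
    return t_max / (t_max + (rows - 1) * t_min)

# Seven rows scanned evenly (no nudging at all).
print(worst_case_duty(7, 1.0, 1.0))

# Seven rows, nudge only accepted during the last third of each step.
print(worst_case_duty(7, 2.0, 3.0))
```

In this model, keeping the worst case near 1/5 with seven rows means the nudge must not be able to shorten a step by more than about a third.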
The best example I can think of is "Peggy," a light-emitting pegboard display. It is a 25x25 LED matrix display driven by an ATmega168 (which is pin-compatible with the ATmega328).
The wiki page has a lot of good information, including a detailed schematic.
There are a few things to notice in their layout.
For one, they use a row common-anode setup. That is, the current source is on the row and the sink is on the column. You have yours wired as row common-cathode. There is nothing inherently right or wrong with either layout; it is just something to keep in mind when designing your circuit. If you are using discrete LEDs, it just means flipping the LED connections. If you are using a prebuilt LED matrix, it is something important to know. (I'll assume you can easily swap the order to match the Peggy schematic. If not, just swap column for row in your head.)
They use 74HC154 4-to-16 decoder/demux chips for row select. Since you only need 10 rows (or columns), you can get away with just one. Of course, there is the issue of current: in your case, 10 x 30 mA = 300 mA minimum. To solve that problem they used 2STX2220 PNP transistors, which can source up to 1.5 A per row. That is a bit overkill in your case. Since you will just use these as row-select switches, just about any other PNP transistor that can source your maximum current should work just as well. Take a look at Transistor Circuits to figure out what resistor values you'll need for full on/off operation.
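As a quick sanity check on those numbers, here is a small Python sketch of the row-current sum and the usual saturated-switch estimate for a PNP base resistor. The 5 V supply, 0.7 V base-emitter drop, 0.3 V low-level output voltage and forced gain of 20 are all assumed values for illustration; take the real ones from your supply and the datasheets.

```python
def row_current_ma(leds_per_row, led_ma):
    """Total current the row switch must source when every LED in the row is lit."""
    return leds_per_row * led_ma

def base_resistor_ohms(v_supply, v_be, v_drive_low, i_collector_ma, forced_beta):
    """Base resistor that pushes a PNP high-side switch well into saturation.

    forced_beta is the deliberately low gain (collector current divided by
    base current) chosen so the transistor is fully on.
    """
    i_base_a = (i_collector_ma / 1000.0) / forced_beta
    return (v_supply - v_be - v_drive_low) / i_base_a

i_row = row_current_ma(10, 30)                        # 10 LEDs at 30 mA each
r_base = base_resistor_ohms(5.0, 0.7, 0.3, i_row, 20)  # assumed values
print(i_row, "mA per row")
print(round(r_base), "ohm base resistor, about", i_row / 20, "mA of base current")
```

Whatever base current you end up with has to be sunk by the decoder output, so check that against the decoder's datasheet before settling on a resistor value.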
On the Peggy board, they use an STP16DP05 as the column sink driver, but I have found these difficult to find and expensive. There are many alternatives, such as the TLC5916. These use a serial input and can be easily cascaded. If that doesn't suit you, a Digi-Key or Mouser search for "LED sink driver" will yield many results.
Alternatively, since you already have ULN2803 arrays, you could use two of these with a single current-limiting resistor per column. That's a lot of pins, so you'll have to get creative, but it could work for the column sink as well.
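If you go the ULN2803 route, the per-column resistor can be estimated from the voltage left over after the LED and both switches. This is only a sketch: the 5 V supply, 2.0 V LED forward voltage, 0.3 V drop across the row transistor and roughly 1 V across the Darlington sink are assumed figures, so substitute the values from your own parts.

```python
def column_resistor_ohms(v_supply, v_led, v_source_drop, v_sink_drop, led_ma):
    """Series resistor for one column, from the voltage headroom that remains."""
    headroom = v_supply - v_led - v_source_drop - v_sink_drop
    if headroom <= 0:
        raise ValueError("no voltage headroom left for a resistor")
    return headroom / (led_ma / 1000.0)

# Assumed: 5 V supply, 2.0 V LED, 0.3 V row switch, 1.0 V sink, 30 mA per LED.
print(round(column_resistor_ohms(5.0, 2.0, 0.3, 1.0, 30), 1), "ohms")
```

The Darlington drop eats a noticeable part of a 5 V supply, which is why it is included in the sum rather than ignored.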
Avago published a nice application note titled "Introduction to Driving LED Matrices". It covers this and a few other things.
Best Answer
You can easily do this in a small FPGA, provided that you use shift registers like the 74xx595 to reduce the number of FPGA I/O pins. Even the smallest Xilinx Spartan-3 could do it. That said, what you want to do with a 70x70 LED display might place additional demands on the FPGA size.
If you don't want to use external shift registers, then you need a lot of pins on the FPGA: a minimum of 140 I/O pins (70 rows plus 70 columns), but possibly 210 or more. That is still not terrible, but it will push you into a BGA package, which you might not want to deal with at this point.
With external shift registers, you might be able to use a single MCU. The trick here is to connect the shift registers to an SPI port and then use DMA to feed the SPI interface. You don't want to bit-bang the serial interface for this. A typical 8-bit micro is probably not going to be enough. There are lots of ARM Cortex-M0, M3, and M4 parts that are super cheap and would be a good option.
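To give a feel for the data involved, here is a small Python sketch that packs one 70-pixel row into the bytes a DMA channel would hand to the SPI peripheral, and estimates the SPI bit rate needed. The MSB-first ordering, the padding to whole bytes and the 100 Hz refresh rate are my assumptions; on a real MCU the same packing would be done in firmware.

```python
COLS = 70
ROWS = 70

def pack_row(pixels):
    """Pack on/off pixels into bytes, MSB first, zero-padded to whole bytes."""
    n_bytes = (len(pixels) + 7) // 8
    out = bytearray(n_bytes)
    for i, lit in enumerate(pixels):
        if lit:
            out[i // 8] |= 0x80 >> (i % 8)
    return bytes(out)

def spi_bit_rate(rows, bits_per_row, refresh_hz):
    """Bits per second needed to shift out every row once per refresh."""
    return rows * bits_per_row * refresh_hz

row = [c % 2 == 0 for c in range(COLS)]   # alternating on/off test pattern
buf = pack_row(row)
print(len(buf), "bytes per row:", buf.hex())
print(spi_bit_rate(ROWS, len(buf) * 8, 100), "bit/s at an assumed 100 Hz refresh")
```

The required clock is modest for a hardware SPI port, but shifting that much data by bit-banging would take a large share of a small processor's time, which is the argument for letting DMA do it.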
You could also use external latches (74xx374) instead of shift registers with an MCU. You could probably bit-bang an interface to latches, but using shift registers and SPI with DMA would be superior.
I have done a lot of multi-CPU systems and I can safely say that you do NOT want to do this if you can avoid it. There are a lot of synchronization issues that crop up when using several CPUs, and if you don't know what to pay attention to, you will likely end up with a sub-par result and a lot of frustration.