I think you just answered your own question. If you want to power 25 LEDs @ 20mA that's 500mA total. So no you can't drive or sink 500mA with a single pin when that pin can only supply 40mA. Try using a transistor on the output that can handle the current. I think you're on the right track there.
Maybe try something like this. Disclaimer though I only thought about that circuit for a few minutes, I didn't try it or sim it. But that's half the fun :)
First, let's say we work with these two functions:
\$L(n)\$ is the maximum amount of LEDs that can be driven from \$\mathsf{n}\$ pins.
\$p(n)\$ is the minimum amount of pins needed to drive \$\mathsf{n}\$ LEDs.
1:1-method
This one is easy:
$$L(n)=n$$
$$p(n)=n$$
A diode matrix
At first, we need to determine the most efficient diode matrix. For example, you could divide 4 pins into two sets of 2, or one set of 1 and one of 3. Obviously, the amount of LEDs is given by \$\mathsf{length\cdot{}width}\$. We can say \$\mathsf{width=n-length}\$, so the amount of LEDs is: \$\mathsf{length\cdot{}(n-length)=-length^2+n\cdot{}length}\$. Given an \$\mathsf{n}\$, this is a parabola, which has a maximum when \$\mathsf{length=\frac{n}{2}}\$. You can also do this on gut feeling. So, the maximum amount of LEDs is reached when the two sets have an equal amount of pins, or differ only 1, in case of an odd number of pins. We can now say:
$$L(n)=\lfloor{}\frac{n}{2}\rfloor{}\cdot\lceil{}\frac{n}{2}\rceil{}$$
Also, we can now easily understand the function \$\mathsf{p(n)}\$:
$$p(n)=
\begin{cases}
1&\text{ for }n=1\\
\lceil{}\sqrt{n}\rceil&\text{ for }n\gt1
\end{cases}$$
I just included the cases for 1, as this is a special case. Normally, you can just use the second function.
Charlieplexing
In this method, we have two LEDs between every set of two pins. We can calculate the amount of sets of two pins with:
$$(n-1)+(n-2)+\dots+1 = \frac{n\cdot(n-1)}{2}=\frac{n^2-n}{2}$$
Now we can say that:
$$L(n)=2\cdot\frac{n^2-n}{2}=n^2-n$$
We saw that the amount of pairs of pins equals \$\mathsf{n\cdot(n-1)}\$. With some reverse thinking, this leads to:
$$p(n)=
\begin{cases}
1 &\text{ for } n=1\\
2\cdot\lfloor\sqrt{n}\rfloor-1 &\text{ for } n\gt1
\end{cases}$$
I just included the cases for 1, as this is a special case. Normally, you can just use the second function.
Other methods
I'm not aware of any other methods, as of Tuesday march 12, 2013.
Best Answer
Generally speaking these led cubes use persistence of vision and multiplexing to minimise the number of wires and digital IO pins for a given size of cube, but in your case you can also use it to reduce the overall current drawn by the cube. Instead of just turning a LED on or off, you would flash the led many times per second to provide an apparently continuous brightness but at a much lower current draw.
For instance, if you can continuously power 16 LEDs but need to power 64 LEDs then you can do so if you have a duty cycle of 25%. Each LED will be powered on for a quarter of the time and powered off for three quarters of the time, so it will be 1/4 of the brightness of the continuously powered LED but also consume only 1/4 of the power.
If you can cycle between the 4 planes every every 5ms (50Hz) then depending on how bright the cube LEDs are people looking directly at the cube probably won't notice the flickering, but anyone looking elsewhere will probably notice the flickering out of the corner of their eye. At 2ms (125Hz) few people would notice the flickering, even out of the corner of their eye.
For more information, take a look at the wikipedia page on Flicker fusion threshold.