A rule of thumb says you have to take transmission line effects into account if your connection length is longer than 1/10th of the signal's wavelength.
Transmission lines will cause reflections where they show a sudden change in impedance. The reflected signal adds to the original, may reflect again at the transmitter side, and that way go back and forth. The result is shown in the graph: the overshoot you're talking about, and some ringing.
A 12" wire length (30cm) is 1/10 of a 3m wavelength, which corresponds to 100MHz. 1MHz, and even 4MHz, shouldn't cause many problems here. Terminating the line, like Wouter says, could be a first measure, though. The input is probably high impedance, and that's never a good match. Since you have three different pieces in your connection (PCB, wire, PCB), finding the characteristic impedance will be difficult. (It will also be different for the PCB and the wire, so at very high frequencies you'd also get reflections at the cable's connections.) Given the short length and the low frequency, the value of your terminating resistor is not critical. 100\$\Omega\$ may be too low for the microcontroller: most of them can't supply the 33mA that takes at 3.3V (let alone 50mA at 5V). I would try a 1k\$\Omega\$ resistor to start with, and see how far that gets us.
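To put numbers on the rule of thumb and on the terminator current, here is a minimal sketch. It assumes free-space propagation unless you pass a velocity factor (a real cable or PCB trace is slower, which lowers the critical frequency), and the function names are mine, purely for illustration.

```python
C_VACUUM = 299_792_458.0  # speed of light in vacuum, m/s


def critical_frequency_hz(length_m, velocity_factor=1.0, fraction=0.1):
    """Frequency at which length_m equals `fraction` of a wavelength.

    velocity_factor is an assumption: 1.0 means free space, typical
    cables are noticeably slower.
    """
    wavelength_m = length_m / fraction
    return velocity_factor * C_VACUUM / wavelength_m


def termination_current_ma(supply_v, resistor_ohm):
    """DC current the driver must source into a resistor to ground, in mA."""
    return 1000.0 * supply_v / resistor_ohm


if __name__ == "__main__":
    print(f"30 cm wire: {critical_frequency_hz(0.30) / 1e6:.1f} MHz")
    for volts, ohms in [(3.3, 100), (5.0, 100), (5.0, 1000)]:
        print(f"{volts} V into {ohms} ohm: {termination_current_ma(volts, ohms):.1f} mA")
```

With a 1k\$\Omega\$ terminator the DC load drops to a few mA, which most microcontroller pins can handle.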
edit (re question update)
Rise time seems to be 7ns. That's fast. Like Kortuk says, it means you have a spectrum extending to at least 400MHz, and those harmonics will indeed suffer transmission line effects, even if your clock is only 1MHz. Try to filter them out: a 20MHz bandwidth (80MHz for the 4MHz clock) still leaves your edges more than fast enough. This is a 1MHz square wave filtered with a brick wall LPF at 20MHz:
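If you want to reproduce that waveform yourself, one rough way is to sum the Fourier series of an ideal square wave and simply stop at the cutoff. This is only a sketch: the function name is hypothetical, the square wave is idealised (±1, 50% duty cycle), and a real filter is never a perfect brick wall.

```python
import math


def band_limited_square(t_s, f0_hz=1e6, cutoff_hz=20e6):
    """Ideal +/-1 square wave at f0_hz with every harmonic above cutoff_hz removed."""
    total = 0.0
    n = 1
    while n * f0_hz <= cutoff_hz:
        # A square wave only contains odd harmonics, with amplitude 1/n.
        total += math.sin(2.0 * math.pi * n * f0_hz * t_s) / n
        n += 2
    return 4.0 / math.pi * total


if __name__ == "__main__":
    period_s = 1e-6
    # Print a coarse table of one period: time in ns, amplitude.
    for i in range(0, 1001, 50):
        t = i * period_s / 1000
        print(f"{t * 1e9:7.1f} ns  {band_limited_square(t):+.3f}")
```

Even with only the odd harmonics up to 19MHz left, the result is still clearly a square wave, with some ripple near the edges.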
Placing a series resistor will form a first order LPF with the line's capacitance. If we estimate that at 50pF then
\$ R = \dfrac{1}{2 \pi \cdot 100 MHz \cdot 50 pF} =32 \Omega \$
gives you a cutoff frequency of 100MHz. So a 33\$\Omega\$ series resistor should slow down your edges (that is, increase the rise time), but leave them more than fast enough for a good signal at 4MHz.
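The same arithmetic in code, as a quick sketch you can use to try other values. The 50pF is the estimate from above, not a measurement, and the rise time figure assumes an ideal first order RC driven by a perfect step (it ignores the driver's own output impedance and edge rate).

```python
import math


def series_resistor_for_cutoff(cutoff_hz, capacitance_f):
    """R that puts the -3 dB point of a first order RC low-pass at cutoff_hz."""
    return 1.0 / (2.0 * math.pi * cutoff_hz * capacitance_f)


def cutoff_for_resistor(resistor_ohm, capacitance_f):
    """-3 dB frequency of a first order RC low-pass."""
    return 1.0 / (2.0 * math.pi * resistor_ohm * capacitance_f)


def rc_rise_time_s(resistor_ohm, capacitance_f):
    """10%-90% rise time of a first order RC: ln(9) * R * C, roughly 2.2 RC."""
    return math.log(9.0) * resistor_ohm * capacitance_f


if __name__ == "__main__":
    c_line = 50e-12  # estimated line capacitance (assumption)
    print(f"R for 100 MHz: {series_resistor_for_cutoff(100e6, c_line):.1f} ohm")
    print(f"cutoff with 33 ohm: {cutoff_for_resistor(33.0, c_line) / 1e6:.1f} MHz")
    print(f"RC rise time with 33 ohm: {rc_rise_time_s(33.0, c_line) * 1e9:.2f} ns")
```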
I don't agree with Madmanguruman's statement: "For both 'normal' I2C and 'normal' SPI, each slave needs a direct connection to the master, so all the slaves are in parallel regardless of the physical placement of the devices." Shift registers are very often wired in series, which makes a big loop that requires several shifts to circulate the data. That's just one example of a valid series connection and there are many more. Parallel is likely more common, as I'll explain below, whenever the slave devices are more complex than plain shift registers. How you do it depends entirely on the requirements of your design as well as the functionality of your slave devices.
More details:
Using the word series does actually make sense. What you're saying is the MOSI (master out / slave in) pin on the master is connected to the serial data input of slave 1. The serial data output of slave 1 is connected to the serial data input of slave 2 and the serial data output of slave 2 is connected to the serial data input of slave 3. Finally, to complete the loop (assuming you want bidirectional communication), you would connect the serial data output of slave 3 to the serial data input of the master. To make this work, the clock generated by the master has to be connected to every slave. Additionally, every slave will likely need to be on the same chip select signal to make sure the SPI hardware will respond. To read from all three slaves, you'd have to send three transmissions from the master.
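To make the series wiring concrete, here is a small simulation sketch. It assumes every slave is a plain 8-bit shift register that shifts MSB first on a shared clock; the class and function names are made up for illustration and don't correspond to any real part or library.

```python
class ShiftRegisterSlave:
    """Hypothetical 8-bit slave that behaves as a plain shift register."""

    def __init__(self, contents):
        # Bits are stored MSB first; index 0 is the bit on the data output.
        self.bits = [(contents >> i) & 1 for i in range(7, -1, -1)]

    def output(self):
        return self.bits[0]

    def shift_in(self, bit):
        self.bits = self.bits[1:] + [bit]

    def value(self):
        result = 0
        for bit in self.bits:
            result = (result << 1) | bit
        return result


def transfer(slaves, tx_bytes):
    """Clock tx_bytes (MSB first) through the chain; return the bytes seen on MISO."""
    rx_bytes = []
    for byte in tx_bytes:
        received = 0
        for i in range(7, -1, -1):
            mosi = (byte >> i) & 1
            # All slaves share one clock, so sample every output before shifting.
            outputs = [slave.output() for slave in slaves]
            inputs = [mosi] + outputs[:-1]
            for slave, bit in zip(slaves, inputs):
                slave.shift_in(bit)
            # The last slave's output is wired back to the master's input.
            received = (received << 1) | outputs[-1]
        rx_bytes.append(received)
    return rx_bytes


if __name__ == "__main__":
    chain = [ShiftRegisterSlave(0x11), ShiftRegisterSlave(0x22), ShiftRegisterSlave(0x33)]
    rx = transfer(chain, [0xA1, 0xB2, 0xC3])
    print("read back:", [hex(b) for b in rx])
    print("now in slaves 1..3:", [hex(s.value()) for s in chain])
```

Note the ordering: the first byte the master receives comes from the last slave in the chain, and the first byte it sends ends up in the last slave.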
If you connect everything in parallel, now you need a discrete chip select for each slave. The capacitance of the data signals should be trivially higher in this configuration and the clock signal capacitance is unlikely to be drastically different.
As far as routing goes, it probably won't matter as long as you're staying below 100 MHz or so. Higher frequencies start to introduce other issues. Regardless, there are plenty of reasons to go either way. It can certainly depend on your firmware constraints. Edit: As pointed out by MartinThompson, if your route is long with a strong (fast edge rate) driver, it's possible that reflections can cause issues. If you're making a multi-drop route and you expect fast edge rates, terminating the signal should significantly help. This could be another conversation, but suffice it to say that you can add series or ground referenced terminators to absorb the energy and prevent reflections.
For example, many microcontrollers will allow you to interact with SPI devices using DMA. If all of the slaves are on the same chip select, it's less code to communicate with all of them, but the latency is higher as you have to shift more data to complete the transaction. Slaves can also signal that they want to transmit, but since the master generates the clock the series connection can cause some headaches here. If you don't care about that use case, though, then wiring in series will simplify the DMA configuration. It's possible to use DMA to communicate with multiple peripherals by embedding the chip select in the data, which causes the DMA controller to handshake with the appropriate slave. You may not even have DMA, but I thought I should explain this to give you an example of why a series configuration may be advantageous in some circumstances. DMA trivializes parallel connections as well, to be fair, and quite frequently parallel connections are much more desirable in this type of situation.
Wiring in parallel is probably the more traditional way to do it based on my experience if the slave device functions at a higher level than simple shifting. I wire shift registers in series all the time, but I frequently attach multiple slave microcontrollers on the SPI bus in parallel so they each get their own chip select and slave select (the slave select is allocated for slave -> master communication, which basically says "hey I need to send data, give me a clock").
Lastly, you can't wire I2C in series unless you want to use different I2C buses for downstream devices. I2C requires that all devices be connected in parallel. The only thing you have to care about electrically is that you picked the right value for the external pull-up resistor. Because there can be many devices (7 bits' worth of addresses) attached to the I2C bus, the bus capacitance adds up, and a weak pull-up could then let the bus signals rise so slowly that they look metastable from the perspective of everything that's watching the bus for incoming data. You want to make sure the SDA line is firmly high before the next clock cycle, and you do that by properly sizing the pull-up resistor.
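As a rough sketch of what "properly sizing" means: the usual approach bounds the resistor from both sides. The lower bound comes from how much current a device can sink while still pulling the line low, the upper bound from the RC rise time between 30% and 70% of the supply. The default numbers below (0.4V at 3mA, 1000ns rise time for standard mode) are what I recall from the I2C-bus specification; treat them as assumptions and check them against the spec and your devices' datasheets.

```python
import math


def pullup_min_ohm(vdd_v, vol_max_v=0.4, iol_a=3e-3):
    """Smallest pull-up a device can still pull low: (VDD - VOL) / IOL."""
    return (vdd_v - vol_max_v) / iol_a


def pullup_max_ohm(rise_time_max_s, bus_capacitance_f):
    """Largest pull-up that still meets the 30%-70% rise time limit.

    An RC charging from 0.3*VDD to 0.7*VDD takes R * C * ln(0.7 / 0.3),
    which is about 0.8473 * R * C.
    """
    return rise_time_max_s / (math.log(0.7 / 0.3) * bus_capacitance_f)


if __name__ == "__main__":
    vdd = 3.3          # supply voltage (example value)
    c_bus = 200e-12    # total bus capacitance (example value, measure or estimate yours)
    print(f"Rp min: {pullup_min_ohm(vdd):.0f} ohm")
    print(f"Rp max (1000 ns rise time): {pullup_max_ohm(1000e-9, c_bus):.0f} ohm")
    print(f"Rp max (300 ns rise time): {pullup_max_ohm(300e-9, c_bus):.0f} ohm")
```

Any value between the two bounds works; the more devices (and capacitance) you hang on the bus, the lower the upper bound gets.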
Best Answer
To answer your main question: No, SPI TTL or CMOS signals cannot reliably drive 128 loads via off-board cables. There is no easy calculation you can do to show this. It has to do with loading, reflections and ground references. You can simulate the properties of such a system, but it is a complicated waste of time.
What might work reasonably well is to use RS422 drivers and receivers specified for as many loads as you need. You also need to terminate each differential pair at the furthest end. If you have multiple SPI slaves per board, then they can share RS422 receivers. You also probably want to at least partially daisy chain your SPI slaves. For example, all slaves on a single board could form a chain. It will greatly reduce your cabling.
This way you will have differential signaling in your cables, which can handle more loads and is a far more reliable signaling scheme over cables. It does, however, double your number of wires.
But a disclaimer is in order: No one in their right mind would design a system like this with 128+3 signaling lanes.