Anti-saturation diodes are connected in parallel with the collector-base diode of the transistor that is to be kept out of saturation. You are doing this correctly at the npn (anode at base, cathode at collector). It should be done exactly the same way at the pnp, except that the diode points the other way round for this transistor: cathode at base, anode at collector.
I am not really sure how you chose your base resistors. I assume you have a supply voltage of 5 V and a rectangular base drive signal (0 V, 5 V). I would suggest you use identical values for both base resistors. With 5 k\$\Omega\$, it is likely that the high value of the base resistor does more harm than an anti-sat-diode would do good. Something in the range of 200...500 \$\Omega\$ for each resistor seems better to me.
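To put rough numbers on why a 5 k\$\Omega\$ base resistor hurts, here is a minimal sketch that compares the static base drive current and the RC time constant formed with the base capacitance. The base-emitter drop (0.7 V) and the effective base capacitance (20 pF) are my assumptions for illustration, not measured values from your circuit.

```python
# Rough comparison of base resistor values.
# V_BE and C_BASE are assumptions for illustration, not values from the question.
V_DRIVE = 5.0     # V, drive amplitude assumed in the answer above
V_BE = 0.7        # V, typical silicon base-emitter drop (assumed)
C_BASE = 20e-12   # F, hypothetical effective capacitance at the base


def base_drive(r_base):
    """Return (static base current in A, R*C time constant in s)."""
    i_base = (V_DRIVE - V_BE) / r_base
    tau = r_base * C_BASE
    return i_base, tau


for r in (5000.0, 470.0, 200.0):
    i_b, tau = base_drive(r)
    print(f"R = {r:6.0f} ohm: I_B = {i_b * 1e3:5.2f} mA, tau = {tau * 1e9:6.1f} ns")
```

With these assumed values, the 5 k\$\Omega\$ resistor alone gives a time constant of about 100 ns, which is far above the delays an anti-saturation diode can save.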
If you want to push the speed even further, you can try paralleling the base resistors with small (approx. 22 pF) capacitors. The trick in finding the right value for the capacitor is to make it roughly equal to the effective capacitance at the base, thus forming a 1:1 voltage divider for the high-frequency part of the rising or falling voltage edge.
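The divider idea can be sketched as a simple capacitive divider: for the fast part of an edge, the fraction of the input step that reaches the base is set by the ratio of the two capacitances. The base capacitance below is an assumed figure; in practice it depends on the transistor and its operating point.

```python
def hf_divider_ratio(c_speedup, c_base):
    """Fraction of a fast input edge that appears at the base node,
    treating speed-up capacitor and base capacitance as a capacitive divider."""
    return c_speedup / (c_speedup + c_base)


C_BASE_ASSUMED = 22e-12  # F, hypothetical effective base capacitance

for c_speedup in (10e-12, 22e-12, 47e-12):
    ratio = hf_divider_ratio(c_speedup, C_BASE_ASSUMED)
    print(f"C = {c_speedup * 1e12:4.0f} pF -> divider ratio {ratio:.2f}")
```

Equal capacitances give a ratio of 0.5, which is the 1:1 divider mentioned above. A much larger capacitor couples more of the edge into the base, a much smaller one has little effect.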
Edit #1:
Here is the schematic I used to check with LTspice. The input signal (rectangular, 0 V and 5 V) is fed into three similar BJT inverters, each using a complementary BC847 and BC857 pair. The one on the left has no special tricks to speed it up, the one in the middle uses Schottky diodes for anti-saturation, and the one on the right also features a high-speed bypass across each base resistor (22 pF). The output of each stage has an identical load of 20 pF, which is a typical value for some trace capacitance and a subsequent input.
The traces show the input signal (yellow), the slow response of the circuit on the left (blue), the response with anti-saturation diodes (red) and the response of the circuit that also uses capacitors (green).
You can clearly see how the propagation delay shrinks from one circuit to the next. The cursors are set at 50 % of the input signal and at 50 % of the fastest circuit's output and indicate a very small difference of only 3 ns. If I find the time, I might also hack the circuit and add real scope pictures. Careful layout will definitely be necessary to achieve sub-10 ns delay times in reality.
Edit #2:
The breadboard works nicely and shows a delay of < 10 ns on my 150 MHz scope. Pictures will follow later this week. Had to use my good probes, because the cheapo ones showed not much more than ringing...
Edit #3:
Ok, here's the breadboard:
A 1 MHz square wave with 5 V (pkpk) enters the board from the left through the BNC connector and gets terminated into 50 \$\Omega\$ (two paralleled 100 \$\Omega\$ resistors, upper one hidden by probe). Base resistors are 470 \$\Omega\$, capacitors are 30 pF, Schottky diodes are BAT85, transistors are BC548/BC558. The supply is bypassed with 100 nF (ceramic) and a small electrolytic capacitor (10 \$\mu\$F).
The first screenshot shows the input and output waveforms at 100 ns/div and with 2 V/div for both traces. (Scope is a Tektronix 454A.)
The second and third screenshots show the transitions from low to high and from high to low at the input with 2 ns/div (20 ns time base with additional 10x horizontal magnification). The traces are now centered vertically on the screen, at 1 V/div, so that the propagation delay is easier to read. The symmetry is very good and shows a difference of < 4 ns between input and output.
I would argue that we can actually trust the simulated results.
The rise and fall times are very likely faster in reality and just limited by the scope's rise time, but I can think of no reason why the delay between the two signals should not be displayed correctly.
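As a rough check on the scope's own limit, here is a small sketch using the common rule of thumb that rise time is about 0.35 divided by bandwidth, and that signal and scope rise times add in quadrature. Both rules assume a roughly single-pole (Gaussian-like) response, so treat the numbers as estimates only; the signal rise times in the loop are hypothetical.

```python
import math


def scope_rise_time(bandwidth_hz):
    """Estimate 10-90 % rise time from bandwidth (0.35 / BW rule of thumb)."""
    return 0.35 / bandwidth_hz


def displayed_rise_time(t_signal, t_scope):
    """Rise time seen on screen when signal and scope rise times add in quadrature."""
    return math.sqrt(t_signal ** 2 + t_scope ** 2)


t_scope = scope_rise_time(150e6)  # 150 MHz scope, as used above
print(f"scope rise time: {t_scope * 1e9:.2f} ns")

# Hypothetical true signal rise times, for illustration only
for t_sig in (1e-9, 2e-9, 5e-9):
    shown = displayed_rise_time(t_sig, t_scope)
    print(f"signal {t_sig * 1e9:.0f} ns -> displayed {shown * 1e9:.2f} ns")
```

The scope adds roughly the same slowing to both traces, which is why the delay between them can still be read correctly even though the individual edges look slower than they are.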
There is one thing to pay attention to: With every low-to-high and high-to-low transition, the two transistors tend to cross-conduct very briefly. At higher frequencies of the input signal (approx. > 2 MHz), the inverter circuit starts to take a lot of current and does weird things...
Trying to evaluate your fall time with the load disconnected is not a useful measurement. As the PNP transistor turns OFF, its collector terminal goes to a high impedance. That leaves the scope probe impedance as the only load, which is likely to be 1 Meg or 10 Meg ohms. It is no wonder that it takes a long time for a load of that size to discharge the small capacitance of the collector circuit and the scope probe capacitance.
Instead you should be looking at your switching time with the actual load hooked up. If the actual load is not available then at least add a test load resistor from the PNP collector to the GND rail. Size the resistor to dissipate power equivalent to the normal load that the circuit is meant to drive. Doing this you will see the real fall time and rise time of the circuit.
Update:
To get my point across, take a look at the LTspice circuit simulation below. I've drawn the circuit with a load consisting of a 10 Meg ohm scope probe in parallel with 22 pF of capacitance, which models your actual scope probe capacitance plus some allowance for the PNP collector terminal and the wiring. These pictures clearly show that the long fall time you are seeing is caused by having the load disconnected, as I said above. It has next to nothing to do with the LED characteristics or the PNP saturation, as another answer suggests.
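The same point can be checked with a quick back-of-the-envelope calculation. This is only a sketch: it treats the collector node as a plain RC discharge, uses the 10 Meg ohm and 22 pF values from above, and adds a 1 k ohm test load that is purely a hypothetical example value.

```python
import math


def parallel(r1, r2):
    """Resistance of two resistors in parallel."""
    return r1 * r2 / (r1 + r2)


def fall_time_10_90(r_load, c_load):
    """Return (time constant, 10-90 % fall time) of a simple RC discharge."""
    tau = r_load * c_load
    return tau, math.log(9.0) * tau


R_PROBE = 10e6   # ohm, scope probe resistance
C_NODE = 22e-12  # F, probe plus collector and wiring capacitance
R_TEST = 1e3     # ohm, hypothetical test load resistor to GND

tau_open, tf_open = fall_time_10_90(R_PROBE, C_NODE)
tau_loaded, tf_loaded = fall_time_10_90(parallel(R_PROBE, R_TEST), C_NODE)

print(f"probe only:     tau = {tau_open * 1e6:.0f} us, fall time = {tf_open * 1e6:.0f} us")
print(f"with test load: tau = {tau_loaded * 1e9:.0f} ns, fall time = {tf_loaded * 1e9:.0f} ns")
```

With only the probe connected the time constant is 220 us, so the slow edge comes from the measurement setup rather than from the transistor.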
Best Answer
The solution to DC accuracy revolves around the differential input amplifier:
In a nutshell the forward volt drops across each base-emitter junction cancel out and what is amplified is the difference between Vin+ and Vin- with no offset.
The bottom line is that the discrete version will be used only in exceptional cases; almost every application of this type is served by the ubiquitous op-amp.