At a system design level, there are four basic ways to do audio:
The first is to use the onboard ADCs and/or DACs included with your microcontroller. Your LPC1313 doesn't have a DAC; you'd have to upgrade to an LPC17xx to get one. You could also scrap the LPC1313 and choose a different controller that has the required onboard peripherals. This is a good choice if audio quality isn't a big deal for your project, space is a major constraint, and your processor has the required peripherals. It isn't a good choice if audio quality is extremely important or if you can't change microcontrollers. I'm not sure what your target application is, but if you're doing anything other than reproducing music, the 10-bit ADC on the LPC1313 should be fine.
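As a rough guide to what converter resolution buys you, the ideal signal-to-noise ratio of an N-bit converter driven by a full-scale sine wave is about 6.02*N + 1.76 dB. Here is a minimal sketch of that rule of thumb; the function name is only for illustration, and real converters fall short of these numbers because of noise, nonlinearity and board layout.

```python
def ideal_snr_db(bits):
    """Quantization-limited SNR of an ideal N-bit converter (full-scale sine input)."""
    return 6.02 * bits + 1.76

# Compare a few common resolutions.
for bits in (8, 10, 12, 16, 24):
    print(f"{bits:2d}-bit: {ideal_snr_db(bits):6.1f} dB")
```

CD audio uses 16-bit samples, which is where the familiar figure of roughly 98 dB comes from; each bit you give up costs about 6 dB.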
The second would be to use a microprocessor which has the required peripherals as a slave processor, and communicate with the master over SPI or another protocol. This is a good idea if you need to do some preprocessing of your audio and your host doesn't have the bandwidth to do so: even cheap DSPs can efficiently and transparently do basic filtering before your host sees the data. This isn't a good idea if you're space or cost constrained, because you'll likely waste a lot of silicon and board space on unused components on the slave chip. Parts like the Analog Devices ADAU17xx line blur the distinction between DSPs and codecs; an ADAU1781 would be a good choice for an audio frontend controller.
The third method is to build your own from discrete ADCs, DACs, and op-amps. This gives you the most control over the results, and if you want to spend the money you can build a 'perfect' system, but it's going to be difficult and expensive. You may be able to find an ADC specifically designed for audio; TI has 18 dedicated to this purpose.
The fourth and IMO best solution is to use a dedicated audio encoder/decoder. This type of chip is known as a "codec". The chip will integrate ADCs and DACs which are ideally suited for stereo audio into a single package, and you can address it over a serial link from your microcontroller. You may or may not need an amplifier, depending on your output. Examples would be parts like the NXP UDA1344 or the TI AIC3104. These are almost always a good choice, because they're simple to design for, conserve board space over discrete components, and offer very high quality. There's probably a codec chip in most of the audio devices you use. They can be expensive (though not when compared to an equivalent-quality system), and they don't do much without a dedicated host processor, but they're the standard choice.
It's hard to be specific without seeing your code, but I'll take a stab at it.
Your guess #1 is definitely true; any one-wire bus protocol has to have a lot of tolerance for timing variations, since it partly depends on the resistive pullup of a wire with unpredictable capacitance.
Your guesses #2 - #4 are unlikely; the crystals on both the microcontroller and the oscilloscope are fine.
What's probably happening is that the pulses you're generating with your firmware are not exactly the widths that you intended. An offset of a few instruction cycles could easily amount to 3 µs. And this error would not scale with the timing interval, so longer intervals will seem more accurate, as you observed.
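To see why a fixed overhead matters less for longer intervals, here is a small sketch. The 3 µs offset is the figure mentioned above; the pulse widths are made up for illustration, and it assumes the overhead always lengthens the pulse by the same amount.

```python
OFFSET_US = 3.0  # fixed firmware overhead per pulse (assumed constant)

def relative_error_percent(intended_us, offset_us=OFFSET_US):
    """Error as a percentage of the intended pulse width."""
    return 100.0 * offset_us / intended_us

# Hypothetical intended pulse widths, in microseconds.
for intended in (6, 15, 60, 480):
    actual = intended + OFFSET_US
    print(f"intended {intended:4d} us -> actual {actual:6.1f} us "
          f"({relative_error_percent(intended):5.1f}% long)")
```

The absolute error is the same in every row, but as a fraction of the interval it shrinks quickly, which matches what you saw on the scope.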
If you post the code you're using to generate the pulses, we could give more specific advice.
Sounds interesting.
1 / 10ns = 100MHz. Pretty quick... Depending on what exactly "a few low latency calculations" entails, this may be a job for an FPGA (or an ASIC..)
Most modern FPGAs have max clock speeds in excess of 300MHz and come with things like dedicated DSP blocks to make life easier. You can parallelise (and pipeline if possible, though probably not in this situation I guess) as necessary to achieve as fast a speed as possible. To figure out whether you can meet the timing or not, you could download one of the vendors' IDEs (e.g. Xilinx, Altera, etc), write your HDL and run a simulation.
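For a feel of how tight that budget is, here is a quick back-of-the-envelope sketch. The 10 ns budget and the 100 MHz and 300 MHz clocks come from the numbers above; the 250 MHz entry and the function names are just illustrative.

```python
def clock_period_ns(freq_mhz):
    """Length of one clock cycle in nanoseconds."""
    return 1000.0 / freq_mhz

def cycles_available(budget_ns, freq_mhz):
    """Whole clock cycles that fit inside the time budget (integer inputs)."""
    return (budget_ns * freq_mhz) // 1000

BUDGET_NS = 10
for freq_mhz in (100, 250, 300):
    print(f"{freq_mhz} MHz: period {clock_period_ns(freq_mhz):.2f} ns, "
          f"{cycles_available(BUDGET_NS, freq_mhz)} cycle(s) in {BUDGET_NS} ns")
```

Even at 300 MHz you only get about three clock edges, so anything that can't be done in a few cycles of logic has to be done in parallel or started earlier.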
You may need to look into some analogue trickery in combination with the FPGA, or attempt to start measurement earlier than impact (if possible) to give yourself more time.
More information about exactly what you are trying to do would probably help with the details (e.g. how exactly do you intend to affect the impact?)
EDIT - to answer the edited question, an FPGA (Field Programmable Gate Array) is exactly the thing you are looking for if you want custom/specific calculations at high speeds.
Google (and here) has tons of info on them so I won't go into too much detail, but basically they are a large array of logic gates that you can connect up in any way you choose, effectively designing your own custom digital IC.
For example if you want a really fast FFT, you could literally devote the entire chip to optimising just for that one function. You can implement adders, counters, RAM controllers, SPI, UART, "soft core" processors, etc, etc, basically anything you want.
They are usually reprogrammable as many times as you want (depending on the technology used to hold the configuration - you can get RAM based, Flash based, antifuse, etc - most are RAM based) and have onboard RAM to use with your design.
I would suggest reading up a bit about them (The Design Warrior's Guide to FPGAs is a pretty good book IIRC) and then grabbing a book on HDL, downloading an IDE and trying some stuff out in simulation. Then grab a starter board from Digilent (e.g. something like the Nexys2) and away you go.
FWIW, Xilinx and Altera are the two big players in the market, so it's probably best to start with one of them. The Spartan series from Xilinx are popular and versatile FPGAs.
When you are ready to design your own boards, you will need a download cable to configure the FPGA via JTAG. These can be purchased from Xilinx/Altera, or you can get much cheaper versions on eBay that do the same job.