Yes, you need a time reference. A timer interrupt that triggers the ADC conversion is usually the best bet. It's also possible to let the ADC free-run and use every n-th reading. That gives OK timing data, if not great.
As for the sampling frequency, what you need to figure out is the bandwidth you need. Suppose the vibrations that ensue have a frequency of x Hz: you would need to sample at more than 2x Hz to reconstruct the vibration, assuming it's purely sinusoidal. This is a theoretical limit (the Nyquist rate). Since a pothole isn't going to produce a sinusoidal vibration, and theory doesn't always translate directly to real life, you'd probably need to sample at at least 10x Hz.
For something like a car, I would guess the vibrations to be on the order of, say, 10 Hz to 1 kHz. (For reference, buildings generally vibrate below 1 Hz. Small metallic objects designed to withstand vibration, and things like rockets, would vibrate in the 50-200 Hz range. Tuning forks usually vibrate in the kilohertz range, like the 32.768 kHz tuning forks commonly used in electronics for timekeeping, and quartz crystals vibrate in the MHz range.) So a sampling rate above 10 kHz is probably more than enough. You could try with less and see if it works, but I really haven't done any calculations for cars to be sure. I'm fairly confident 1 kHz is as far as the vibrations should go, and it'll probably be less than that in reality.
EDIT: If you're only really interested in the large-scale vibration of the car (you say 10-13 Hz is what you care about), 100 Hz should be enough. I'd sample at 1 kHz to be safe.
You are not going to be able to distinguish potholes clearly from other short peak events, apart from being able to tell a rising bump in the road from a hole (the initial direction will be opposite), but you can certainly capture them quite easily.
Determine an initial direction (e.g. negative/positive XYZ depending on how your device is mounted), a threshold level, and a maximum time the reading should be over this level (determined by the width of the pothole). Then time the peak height/width and see if it fits your pothole characteristic.
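The direction/threshold/width check described above can be sketched roughly as follows. This is only an illustration: the function name `find_pothole_events`, its parameters, and the default values are assumptions made for the example, and the threshold and maximum width would need to be tuned against real data.

```python
def find_pothole_events(samples, sample_rate_hz, threshold,
                        direction=-1, max_width_s=0.05):
    """Return (start_index, width_s, peak) for short excursions past threshold.

    direction is +1 or -1 and selects which sign counts as the expected
    initial direction (depends on how the device is mounted).
    All names and defaults here are hypothetical.
    """
    events = []
    start = None
    peak = 0.0
    for i, value in enumerate(samples):
        signed = value * direction  # positive in the expected direction
        if signed > threshold:
            if start is None:
                start, peak = i, signed   # excursion begins
            else:
                peak = max(peak, signed)  # track peak height
        elif start is not None:
            # Excursion ended: keep it only if it was short enough.
            width_s = (i - start) / sample_rate_hz
            if width_s <= max_width_s:
                events.append((start, width_s, peak))
            start = None
    # An excursion still in progress at the end of the data is ignored.
    return events
```

For example, `find_pothole_events(z_axis, 1000.0, threshold=1.0)` would scan a list of z-axis readings sampled at 1 kHz for short negative-going spikes; passing `direction=1` would look for bumps instead.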
The device already contains an internal 1 kHz LPF, so you could add an HPF of, say, 50-200 Hz for the potholes, since they will have a fast rise time. I'm not an expert on car vibration frequencies, but you will probably get some noise from vibration however you filter. That's not an issue as long as the pothole event is large compared with the noise. It looks like the data is okay as it is; I would just sample a bit faster to prevent aliasing (e.g. >2 kHz), or add an LPF to the existing internal one as described in the datasheet. Since you are trying to capture fast rise-time events, I'd go with the former (faster sampling, possibly with an HPF).
To compensate for a change in inclination, you can have a running average value which can be used to zero the axis out (one for each axis). Also, note that a HPF will ignore the DC level, so (as long as it doesn't go off the end of the scale) a slow gradient will make no difference.
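As a rough illustration of the running-average idea, here is a minimal sketch that tracks a slowly moving baseline for one axis and subtracts it from each reading. It uses an exponential moving average rather than a fixed-window mean, which is a choice made for this example; the function name `remove_baseline` and the smoothing factor `alpha` are assumptions, not anything from the datasheet.

```python
def remove_baseline(samples, alpha=0.01):
    """Subtract an exponential running average from each sample (one axis).

    alpha (hypothetical default) sets how quickly the baseline follows the
    input: smaller values track only slow changes such as road inclination.
    """
    baseline = samples[0] if samples else 0.0
    out = []
    for value in samples:
        baseline += alpha * (value - baseline)  # update running average
        out.append(value - baseline)            # zeroed reading
    return out
```

Subtracting a slow running average like this behaves as a simple first-order high-pass filter, so a constant offset or slow gradient settles to zero while a fast spike passes through almost unchanged. You would run one instance per axis.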
According to the datasheet (bottom of page 7 in the link above), the formula for the external capacitance is:
\$ C2 = C3 = C4 = \dfrac{4.97 \times 10^{-6}}{f_{BW}} \$
so your calculation of:
\$ \dfrac{4.97 \times 10^{-6}}{10\,\text{Hz}} = 497\,\text{nF} \$ is correct.
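If you want to try a few bandwidths, the same formula is easy to wrap in a helper. This is just the datasheet expression quoted above; the function name `filter_cap_farads` is made up for the example.

```python
def filter_cap_farads(bandwidth_hz):
    """External filter capacitance in farads for a given bandwidth in Hz,
    using C = 4.97e-6 / f_BW as quoted from the datasheet above."""
    return 4.97e-6 / bandwidth_hz


# 10 Hz bandwidth, expressed in nanofarads
cap_nf = filter_cap_farads(10.0) * 1e9
```

In practice you would then pick the nearest standard capacitor value, which shifts the actual bandwidth slightly.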
Best Answer
That's a very bimodal distribution - not something you should see from vibration. My guess is you are either not reading the right data from the chip or you are interpreting it incorrectly. This part uses two-byte values for the outputs - it looks like you may have the MSB and LSB swapped.
Edit: I'm actually going to say that you almost certainly have the MSB and LSB swapped. The sensitivity on the 2g scale is 256 LSB/g. This means that you should be seeing a raw reading of + or - 256 or so, assuming the chip is mounted with the axis vertical. The LSB will be mostly 'noise' while the MSB will be mostly zero. If the bytes are swapped, then the sign bit becomes what should have been bit 7, and it will flip more or less randomly. If the original value is 255, byte swapping results in -256. If the original value is 256, byte swapping results in a value of 1. If the original value is 127, byte swapping results in 32,512. If you are dividing by the sensitivity, you should see a highly bimodal distribution with a lot of points at small negative values and a lot of points at large positive values.
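You can check the byte-swap arithmetic above with a few lines of Python. This is only a sketch that reinterprets a signed 16-bit value with its two bytes exchanged; the helper name `swap_bytes_int16` is made up for the example and has nothing to do with any particular driver.

```python
import struct


def swap_bytes_int16(value):
    """Return the signed 16-bit value obtained by exchanging the two bytes."""
    lo, hi = struct.pack("<h", value)        # little-endian: low byte first
    return struct.unpack("<h", bytes([hi, lo]))[0]


for raw in (255, 256, 127):
    swapped = swap_bytes_int16(raw)
    # Divide by the 256 LSB/g sensitivity to see the apparent reading in g.
    print(raw, swapped, swapped / 256)
```

The three cases come out as -256, 1 and 32512, i.e. apparent readings of -1 g, about 0.004 g and 127 g, which is the kind of split you would expect to show up as two clusters in a histogram.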