In USB, guaranteed bandwidth also implies bounded latency, but not the other way around.
USB is organized into 1 ms time slices (frames). For interrupt transfers, the host is guaranteed to send an IN or OUT token to the device's endpoint at a bounded interval. That interval is configurable: the endpoint descriptor's bInterval field asks for polling once every N frames, from 1 to 255 ms at full speed. Interrupt packets are short (at most 64 bytes at full speed), so you don't get to transfer much data, but you know the host will come around and ask periodically.
For isochronous, a fixed part of each 1 ms time slice is allocated to the device. Not only will the host send the OUT packet each slice, but you can configure it to contain a specific amount of payload, or the IN reply packet to have a specific payload. Obviously this allocates some of the fixed bandwidth to a particular device, so not every device can have this with arbitrarily large payloads. Since the available resource is finite, the host can refuse your device altogether. This is one of the drawbacks of insisting on a fixed bandwidth.
There are also rules about how much of each 1 ms slice the host can allocate to interrupt and isochronous transfers. At full speed, these periodic transfers can take at most 90% of a frame (at high speed, at most 80% of each 125 µs microframe). The remainder is reserved for control transfers, and bulk transfers get whatever time is left over. There can be any number of bulk devices, so there is no guarantee how often the host will get around to polling any one device.
In most cases, the interrupt and isochronous transfers don't add up to much, so in practice most of the time is left over for bulk devices. Usually the host will poll all bulk devices in a loop during any leftover unallocated time. If you're the only device on the bus, then bulk transfer will give you access to most of the bandwidth, whereas interrupt and isochronous still get the small dedicated bandwidth they are configured for.
Unless you really need some minimum bandwidth or latency, just use bulk transfers.
Nyquist showed you have to sample at a rate at least twice the highest frequency you care about. This captures the information in your signal, but also causes artifacts from the frequencies above half the sample rate to show up in your sampled signal. These are called aliases. You therefore need to first eliminate the frequencies that will cause aliases, then sample.
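Since no filter has an infinitely sharp cutoff, there will be a transition band: a frequency range above the highest frequency you care about but below the frequency at which the anti-aliasing filter attenuates enough to give you the signal to noise ratio you need. Half the sample rate has to sit at or above the top of that band, not just at your highest frequency of interest.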
Analog filters are usually fairly gentle in their falloff. One approach is to apply a slow-falloff analog filter, sample at a high rate, then digitally filter that with a sharp filter to allow re-sampling at a lower rate. That last step is often called decimation.
For example, let's say you are after good quality voice and your highest frequency of interest is 8 kHz. You might put a two-pole R-C filter on the signal with each pole at 12 kHz. You might sample the result at 100 kHz, which means anything past 50 kHz had better be attenuated below your noise floor. The analog filter will reduce 50 kHz by 25 dB, which you decide is good enough in this case since you know there will be very little content above 50 kHz to start with.
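The 25 dB figure can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and it assumes the two R-C sections are buffered from each other so that each behaves as an ideal first-order low-pass. A passive cascade where the second section loads the first will come out somewhat different.

```python
import math

def rc_cascade_attenuation_db(freq_hz, pole_hz, poles=2):
    """Attenuation in dB (as a positive number) of `poles` identical,
    non-interacting first-order R-C low-pass sections at freq_hz.
    Assumption: the sections are buffered, so their responses multiply."""
    # Magnitude of one first-order section: 1 / sqrt(1 + (f / fc)^2)
    per_pole = 1.0 / math.sqrt(1.0 + (freq_hz / pole_hz) ** 2)
    # Convert to dB and add up the identical sections
    return -20.0 * math.log10(per_pole) * poles

for f in (8e3, 50e3):
    att = rc_cascade_attenuation_db(f, 12e3)
    print(f"{f / 1e3:.0f} kHz: {att:.1f} dB")
```

Under that assumption the filter takes about 25.3 dB off at 50 kHz, and also about 3.2 dB off at the 8 kHz band edge, which you may want to make up for in the digital filter.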
Theoretically you could take this 100 kHz sample stream and decimate it to 16 kHz, since that's twice the highest frequency you care about. In practice, even a sharp filter, like convolving with a 1000 point sinc, needs some room to work with. Let's say 1/2 octave (that's really sharp), so the absolute minimum sample frequency after decimation would be 23 kHz (8 kHz plus 1/2 octave is 11.3 kHz, times 2 is 22.6 kHz).
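The same arithmetic as a small sketch, in case you want to plug in your own numbers. The function names are invented for illustration. The second function goes one step beyond the text: it assumes you want to decimate by a whole number, which is the simplest case but not a requirement.

```python
def min_decimated_rate_hz(f_max_hz, transition_octaves):
    """Lowest output sample rate if the digital filter needs
    `transition_octaves` above f_max_hz to reach its stop band."""
    stopband_start_hz = f_max_hz * 2 ** transition_octaves
    return 2 * stopband_start_hz

def largest_integer_decimation(input_rate_hz, min_rate_hz):
    """Largest whole-number decimation factor that keeps the output
    rate at or above min_rate_hz (assumes integer decimation)."""
    return int(input_rate_hz // min_rate_hz)

min_rate = min_decimated_rate_hz(8e3, 0.5)
factor = largest_integer_decimation(100e3, min_rate)
print(f"minimum rate: {min_rate / 1e3:.1f} kHz")
print(f"decimate by {factor} -> {100e3 / factor / 1e3:.1f} kHz")
```

With the numbers above this gives a 22.6 kHz minimum, and decimating the 100 kHz stream by 4 lands at 25 kHz, comfortably above it.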
You gave no spec on what kind of sound you want to sample, so you'll have to extrapolate to your requirements on your own.
Best Answer
USB2.0 full speed is the same as USB1.1 full speed.
USB1.1 frames have a length of 1 ms, and isochronous transfers may put up to 1023 bytes into a frame. A stereo 24-bit sample takes 6 bytes, so this gives you roughly 170 stereo 24-bit samples per frame, which is a 170 ksample/s rate.
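For other sample formats the same calculation is easy to redo. A minimal sketch, with a made-up function name, assuming one isochronous packet of the maximum size per 1 ms frame and no per-packet header of your own:

```python
def iso_samples_per_frame(max_payload_bytes=1023, channels=2, bits_per_sample=24):
    """Whole multi-channel samples that fit in one isochronous packet.
    Assumes samples are packed with no padding or extra framing."""
    bytes_per_sample = channels * bits_per_sample // 8
    return max_payload_bytes // bytes_per_sample

FRAMES_PER_SECOND = 1000  # one frame per 1 ms at full speed

for channels, bits in ((2, 24), (2, 16), (1, 16)):
    n = iso_samples_per_frame(channels=channels, bits_per_sample=bits)
    print(f"{channels} ch, {bits} bit: {n} samples/frame, "
          f"{n * FRAMES_PER_SECOND / 1000:.0f} ksample/s")
```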
Even the simplest compression techniques, such as transferring the difference from the previous value instead of the actual value, or joint stereo, followed by a special encoding for small difference values, may nearly double that rate.
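To make the difference-encoding idea concrete, here is a minimal sketch. The zigzag mapping and the 7-bit variable-length byte code are my choice of "special encoding for small difference values", not something the answer prescribes, and all function names are hypothetical. How much you actually save depends entirely on how slowly your signal changes from sample to sample.

```python
def delta_encode(samples):
    """Replace each sample with its difference from the previous one."""
    prev, out = 0, []
    for s in samples:
        out.append(s - prev)
        prev = s
    return out

def delta_decode(deltas):
    """Inverse of delta_encode: running sum of the differences."""
    total, out = 0, []
    for d in deltas:
        total += d
        out.append(total)
    return out

def zigzag(n):
    """Map signed to unsigned so small magnitudes get small codes:
    0, -1, 1, -2, 2 -> 0, 1, 2, 3, 4."""
    return 2 * n if n >= 0 else -2 * n - 1

def varint(u):
    """7 data bits per byte; the high bit marks 'more bytes follow'."""
    out = bytearray()
    while True:
        byte = u & 0x7F
        u >>= 7
        if u:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def pack(samples):
    """Delta-encode, then store each difference as a zigzag varint."""
    return b"".join(varint(zigzag(d)) for d in delta_encode(samples))

samples = [1000, 1010, 1005, 1005, 990]
print(delta_encode(samples))
print(len(pack(samples)), "bytes packed vs", 3 * len(samples), "bytes as raw 24-bit")
```

The receiver would undo the varint and zigzag steps and then call delta_decode. Note that a single lost packet corrupts everything after it with pure difference coding, so in an isochronous stream (which has no retries) you would want each packet to start from an absolute value.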