Java – What does interleaved stereo PCM linear Int16 big endian audio look like


I know that there are a lot of resources online explaining how to deinterleave PCM data. In the course of my current project I have looked at most of them…but I have no background in audio processing and I have had a very hard time finding a detailed explanation of how exactly this common form of audio is stored.

I do understand that my audio will have two channels and thus the samples will be stored in the format [left][right][left][right]…
What I don't understand is what exactly this means. I have also read that each sample is stored in the format [left MSB][left LSB][right MSB][right LSB]. Does this mean that each 16-bit integer actually encodes two 8-bit frames, or is each 16-bit integer its own sample destined for either the left or right channel?

Thank you everyone. Any help is appreciated.

Edit: If you choose to give examples please refer to the following.

Method Context

Specifically what I have to do is convert an interleaved short[] to two float[]'s each representing the left or right channel. I will be implementing this in Java.

public static float[][] deinterleaveAudioData(short[] interleavedData) {
    //initialize the channel arrays
    float[] left = new float[interleavedData.length / 2];
    float[] right = new float[interleavedData.length / 2];
    //iterate through the buffer
    for (int i = 0; i < interleavedData.length; i++) {
        //THIS IS WHERE I DON'T KNOW WHAT TO DO
    }
    //return the separated left and right channels
    return new float[][]{left, right};
}

My Current Implementation

I have tried playing the audio that results from this. It's very close, close enough that you could understand the words of a song, but is still clearly not the correct method.

public static float[][] deinterleaveAudioData(short[] interleavedData) {
    //initialize the channel arrays
    float[] left = new float[interleavedData.length / 2];
    float[] right = new float[interleavedData.length / 2];
    //iterate through the buffer
    for (int i = 0; i < left.length; i++) {
        left[i] = (float) interleavedData[2 * i];
        right[i] = (float) interleavedData[2 * i + 1];
    }
    //return the separated left and right channels
    return new float[][]{left, right};
}

Format

If anyone would like more information about the format of the audio the following is everything I have.

Format is PCM 2 channel interleaved big endian linear int16
Sample rate is 44100
Number of shorts per short[] buffer is 2048
Number of frames per short[] buffer is 1024
Frames per packet is 1

Best Answer

I do understand that my audio will have two channels and thus the samples will be stored in the format [left][right][left][right]... What I don't understand is what exactly this means.

Interleaved PCM data is stored one sample per channel, in channel order, before going on to the next point in time. A PCM frame is made up of one sample for each channel. If you have stereo audio with left and right channels, then one left sample and one right sample together make a frame.

Frame 0: [left sample][right sample]
Frame 1: [left sample][right sample]
Frame 2: [left sample][right sample]
Frame 3: [left sample][right sample]
etc...
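The frame layout above implies a simple index rule for an interleaved buffer. The following is a minimal sketch of that rule; the class and method names are hypothetical and only illustrate the arithmetic.

```java
// Hypothetical helper illustrating how frames map to positions in an interleaved buffer.
final class InterleavedIndex {
    private InterleavedIndex() {}

    /** Index of the sample for a given frame and channel (for stereo: 0 = left, 1 = right). */
    static int sampleIndex(int frame, int channel, int channelCount) {
        return frame * channelCount + channel;
    }

    /** Number of complete frames held in a buffer of the given length. */
    static int frameCount(int bufferLength, int channelCount) {
        return bufferLength / channelCount;
    }
}
```

With the numbers from the question, a 2048-element short[] holding stereo audio contains 1024 frames, and the right sample of frame 3 sits at index 7.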

Each sample is a measurement and digital quantization of pressure at an instantaneous point in time. With 8 bits per sample, there are 256 possible levels at which the pressure can be quantized. Knowing that sound waves are... waves... with peaks and valleys, we are going to want to be able to measure distance from the center. So, we can define the center at 128 and add and subtract from there (0 to 255, unsigned), or we can treat those 8 bits as signed (the same 256 levels, just a different interpretation of them) and go from -128 to 127.
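To make the two 8-bit conventions concrete, here is a small sketch. It assumes the unsigned convention puts silence at 128, which is the usual midpoint for unsigned 8-bit PCM; the class and method names are made up for illustration.

```java
// Hypothetical helper showing the unsigned and signed views of an 8-bit sample.
final class EightBitSamples {
    private EightBitSamples() {}

    /** Reads a Java byte (always signed in Java) as an unsigned level in 0..255. */
    static int asUnsigned(byte raw) {
        return raw & 0xFF;
    }

    /** Re-centers an unsigned level (silence at 128) to a signed level in -128..127. */
    static int unsignedToSigned(int unsignedLevel) {
        return unsignedLevel - 128;
    }
}
```

Under this assumption an unsigned level of 128 maps to 0, 0 maps to -128, and 255 maps to 127.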

Using 8 bits per sample with single channel (mono) audio, we use one byte per sample meaning one second of audio sampled at 44.1kHz uses exactly 44,100 bytes of storage.

Now, let's assume 8 bits per sample, but in stereo at 44.1 kHz. The bytes alternate: one for the left channel, then one for the right, and so on.

LRLRLRLRLRLRLRLRLRLRLR...

Scale it up to 16 bits, and you have two bytes per sample (brackets [ and ] mark samples, spaces mark frame boundaries):

[LL][RR] [LL][RR] [LL][RR] [LL][RR] [LL][RR] [LL][RR]...
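The storage cost of these layouts follows from multiplying sample rate, channel count, and bytes per sample. A minimal sketch of that arithmetic, using hypothetical helper names:

```java
// Hypothetical helper for the storage arithmetic of uncompressed PCM.
final class PcmSize {
    private PcmSize() {}

    /** Bytes in one frame: one sample per channel. Assumes bitsPerSample is a multiple of 8. */
    static int bytesPerFrame(int channels, int bitsPerSample) {
        return channels * (bitsPerSample / 8);
    }

    /** Bytes needed to store one second of audio. */
    static long bytesPerSecond(int sampleRate, int channels, int bitsPerSample) {
        return (long) sampleRate * bytesPerFrame(channels, bitsPerSample);
    }
}
```

For 8-bit mono at 44.1 kHz this gives 44,100 bytes per second, and for 16-bit stereo at the same rate it gives 176,400 bytes per second with 4 bytes per frame.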

I have also read that each sample is stored in the format [left MSB][left LSB][right MSB][right LSB].

Not necessarily. The audio can be stored in any endianness. Little endian is the most common, but that isn't a magic rule. I do think though that all channels go in order always, and front left would be channel 0 in most cases.
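Endianness only decides which of a sample's two bytes comes first. As a rough illustration, the sketch below reads the same two bytes under both byte orders with java.nio.ByteBuffer; the class and method names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical helper: assemble one 16-bit sample from raw bytes in a chosen byte order.
final class SampleEndianness {
    private SampleEndianness() {}

    /** Reads the 16-bit sample that starts at the given byte offset. */
    static short readSample(byte[] bytes, int offset, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getShort(offset);
    }
}
```

Given the bytes 0x12 0x34, a big endian read yields 0x1234 (MSB first) and a little endian read yields 0x3412.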

Does this mean that each 16-bit integer actually encodes two 8-bit frames, or is each 16-bit integer its own sample destined for either the left or right channel?

Each value (16-bit integer in this case) is destined for a single channel. Never would you have two multi-byte values smashed into each other.

I hope that's helpful. I can't run your code, but given your description, I suspect you have an endian problem and that your samples aren't actually big endian.
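If the endian suspicion is right, one way to test it is to swap the bytes of each short before converting. The sketch below is a variation on the method from the question, not a confirmed fix: the swapBytes flag is hypothetical, and scaling by 32768 to land in the range -1.0 to just under 1.0 is an assumption about what the float consumer expects.

```java
// Hypothetical variant of the question's method with an optional byte swap.
final class Deinterleaver {
    private Deinterleaver() {}

    /** Splits interleaved stereo shorts into {left, right} float channels. */
    static float[][] deinterleave(short[] interleaved, boolean swapBytes) {
        int frames = interleaved.length / 2;
        float[] left = new float[frames];
        float[] right = new float[frames];
        for (int i = 0; i < frames; i++) {
            short l = interleaved[2 * i];      // even indices: left channel
            short r = interleaved[2 * i + 1];  // odd indices: right channel
            if (swapBytes) {
                // Use when the bytes of each short were loaded in the wrong order.
                l = Short.reverseBytes(l);
                r = Short.reverseBytes(r);
            }
            // Assumption: the consumer wants normalized floats, so divide by 2^15.
            left[i] = l / 32768f;
            right[i] = r / 32768f;
        }
        return new float[][] {left, right};
    }
}
```

Trying the call once with swapBytes set to false and once with true, then listening to both results, should show quickly whether byte order is the cause of the distortion.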

AudioStreamBasicDescription

Apple's documentation for the ASBD is here. To clarify:

A frame of audio is a time-coincident set of audio samples. In other words, one sample per channel. For stereo, a frame therefore contains 2 samples.
For PCM formats, there is no packetisation. Supposedly, mBytesPerPacket = mBytesPerFrame, mFramesPerPacket=1 but I'm not sure whether this is actually ever used.
mReserved isn't used and must be 0
Refer to the documentation for mFormatID and mFormatFlags. There is a handy helper function, CalculateLPCMFlags, in CoreAudioTypes.h for computing the latter.
Multi-channel audio is generally interleaved (you can set a bit in mFormatFlags if you really don't want it to be).
There's another helper function that can fill out the entire ASBD - FillOutASBDForLPCM() for the common cases of linear PCM.
Lots of combinations of mFormatID and mFormatFlags are not supported by remoteIO units - I found experimentation to be necessary on iOS.

Here's some working code from one of my projects:

AudioStreamBasicDescription inputASBL = {0}; 

inputASBL.mSampleRate =          static_cast<Float64>(sampleRate);
inputASBL.mFormatID =            kAudioFormatLinearPCM;
inputASBL.mFormatFlags =         kAudioFormatFlagIsPacked | kAudioFormatFlagIsSignedInteger;
inputASBL.mFramesPerPacket =     1;
inputASBL.mChannelsPerFrame =    2;
inputASBL.mBitsPerChannel =      sizeof(short) * 8;
inputASBL.mBytesPerPacket =      sizeof(short) * 2;
inputASBL.mBytesPerFrame =       sizeof(short) * 2;
inputASBL.mReserved =            0;

Render Callbacks

CoreAudio operates what Apple describe as a pull model. That is to say, the render callback is called from a real-time thread when CoreAudio needs the buffer filled. From your question it appears you are expecting the opposite: pushing the data to the audio output.

There are essentially two implementation choices:

Perform non-blocking reads from the UDP socket in the render callback (as a general rule, anything you do in here should be fast and non-blocking).
Maintain an audio FIFO into which samples are inserted when received and from which the render callback consumes them.

The second is probably the better choice, but you are going to need to manage buffer over- and under-runs yourself.
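The FIFO idea can be sketched as a fixed-size ring buffer. The example below is in Java only to match the rest of this page and every name in it is hypothetical; it is single-threaded, whereas a real render callback runs on a real-time thread and would need a thread-safe, non-blocking structure in C or C++. It shows one possible policy: drop samples that do not fit on overrun, and pad with silence on underrun.

```java
// Hypothetical single-threaded ring buffer illustrating overrun and underrun handling.
final class SampleFifo {
    private final short[] buffer;
    private int readPos; // index of the oldest stored sample
    private int size;    // number of samples currently stored

    SampleFifo(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        buffer = new short[capacity];
    }

    /** Stores up to count samples from src; a smaller return value signals an overrun. */
    int write(short[] src, int count) {
        int accepted = Math.min(count, buffer.length - size);
        int writePos = (readPos + size) % buffer.length;
        for (int i = 0; i < accepted; i++) {
            buffer[(writePos + i) % buffer.length] = src[i];
        }
        size += accepted;
        return accepted;
    }

    /** Fills dst[0..count); pads with zeros (silence) on underrun and returns the real sample count. */
    int read(short[] dst, int count) {
        int available = Math.min(count, size);
        for (int i = 0; i < available; i++) {
            dst[i] = buffer[(readPos + i) % buffer.length];
        }
        for (int i = available; i < count; i++) {
            dst[i] = 0;
        }
        readPos = (readPos + available) % buffer.length;
        size -= available;
        return available;
    }
}
```

In this arrangement the network side would call write as packets arrive and the render side would call read for the number of samples requested, so the audio output always gets a full buffer even when the network falls behind.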

The ioData argument points to a scatter-gather control structure. In the simplest case, it points to one buffer containing all of the frames, but it could contain several that between them have sufficient frames to satisfy inNumberFrames. Normally, one pre-allocates a buffer big enough for inNumberFrames, copies samples into it, and then modifies the AudioBufferList object pointed to by ioData to point to it.

In your application you could potentially use a scatter-gather approach on your decoded audio packets, allocating buffers as they are decoded. However, you don't always get the latency you wanted, and you might not be able to arrange for inNumberFrames to be the same as the frame count of your decoded UDP packets.
