iOS – Encoding PCM (CMSampleBufferRef) to AAC on iOS – How to set frequency and bitrate

aac, audio, audiotoolbox, core-audio, ios

I want to encode PCM (CMSampleBufferRefs coming in live from AVCaptureAudioDataOutputSampleBufferDelegate) into AAC.

When the first CMSampleBufferRef arrives, I set up both the input and output AudioStreamBasicDescriptions, the output one according to the documentation:

AudioStreamBasicDescription inAudioStreamBasicDescription = *CMAudioFormatDescriptionGetStreamBasicDescription((CMAudioFormatDescriptionRef)CMSampleBufferGetFormatDescription(sampleBuffer));

AudioStreamBasicDescription outAudioStreamBasicDescription = {0}; // Always initialize the fields of a new ASBD to zero.
outAudioStreamBasicDescription.mSampleRate = 44100; // Frames per second of the equivalent decompressed data; must be nonzero.
outAudioStreamBasicDescription.mFormatID = kAudioFormatMPEG4AAC; // kAudioFormatMPEG4AAC_HE does not work: no AudioClassDescription is found, even with mFormatFlags set to 0.
outAudioStreamBasicDescription.mFormatFlags = kMPEG4Object_AAC_SSR; // Format-specific flags; 0 would mean no flags.
outAudioStreamBasicDescription.mBytesPerPacket = 0; // 0 indicates variable packet size.
outAudioStreamBasicDescription.mFramesPerPacket = 1024; // AAC uses 1024 frames per packet.
outAudioStreamBasicDescription.mBytesPerFrame = 0; // 0 for compressed formats.
outAudioStreamBasicDescription.mChannelsPerFrame = 1; // Must be nonzero.
outAudioStreamBasicDescription.mBitsPerChannel = 0; // 0 for compressed formats.
outAudioStreamBasicDescription.mReserved = 0; // Pads the structure to 8-byte alignment; must be 0.

and AudioConverterRef.

AudioClassDescription audioClassDescription;
memset(&audioClassDescription, 0, sizeof(audioClassDescription));
UInt32 size;
NSAssert(AudioFormatGetPropertyInfo(kAudioFormatProperty_Encoders, sizeof(outAudioStreamBasicDescription.mFormatID), &outAudioStreamBasicDescription.mFormatID, &size) == noErr, nil);
uint32_t count = size / sizeof(AudioClassDescription);
AudioClassDescription descriptions[count];
NSAssert(AudioFormatGetProperty(kAudioFormatProperty_Encoders, sizeof(outAudioStreamBasicDescription.mFormatID), &outAudioStreamBasicDescription.mFormatID, &size, descriptions) == noErr, nil);
for (uint32_t i = 0; i < count; i++) {

    if ((outAudioStreamBasicDescription.mFormatID == descriptions[i].mSubType) && (kAppleSoftwareAudioCodecManufacturer == descriptions[i].mManufacturer)) {

        memcpy(&audioClassDescription, &descriptions[i], sizeof(audioClassDescription));

    }
}
NSAssert(audioClassDescription.mSubType == outAudioStreamBasicDescription.mFormatID && audioClassDescription.mManufacturer == kAppleSoftwareAudioCodecManufacturer, nil);
AudioConverterRef audioConverter = NULL;
NSAssert(AudioConverterNewSpecific(&inAudioStreamBasicDescription, &outAudioStreamBasicDescription, 1, &audioClassDescription, &audioConverter) == noErr, nil);

And then, I convert every CMSampleBufferRef into raw AAC data.

AudioBufferList inAudioBufferList;
CMBlockBufferRef blockBuffer;
CMSampleBufferGetAudioBufferListWithRetainedBlockBuffer(sampleBuffer, NULL, &inAudioBufferList, sizeof(inAudioBufferList), NULL, NULL, 0, &blockBuffer);
NSAssert(inAudioBufferList.mNumberBuffers == 1, nil);

uint32_t bufferSize = inAudioBufferList.mBuffers[0].mDataByteSize;
uint8_t *buffer = (uint8_t *)malloc(bufferSize);
memset(buffer, 0, bufferSize);
AudioBufferList outAudioBufferList;
outAudioBufferList.mNumberBuffers = 1;
outAudioBufferList.mBuffers[0].mNumberChannels = inAudioBufferList.mBuffers[0].mNumberChannels;
outAudioBufferList.mBuffers[0].mDataByteSize = bufferSize;
outAudioBufferList.mBuffers[0].mData = buffer;

UInt32 ioOutputDataPacketSize = 1;

NSAssert(AudioConverterFillComplexBuffer(audioConverter, inInputDataProc, &inAudioBufferList, &ioOutputDataPacketSize, &outAudioBufferList, NULL) == noErr, nil);

NSData *data = [NSData dataWithBytes:outAudioBufferList.mBuffers[0].mData length:outAudioBufferList.mBuffers[0].mDataByteSize];

free(buffer);
CFRelease(blockBuffer);

inInputDataProc() implementation:

OSStatus inInputDataProc(AudioConverterRef inAudioConverter, UInt32 *ioNumberDataPackets, AudioBufferList *ioData, AudioStreamPacketDescription **outDataPacketDescription, void *inUserData)
{
    AudioBufferList audioBufferList = *(AudioBufferList *)inUserData;

    ioData->mBuffers[0].mData = audioBufferList.mBuffers[0].mData;
    ioData->mBuffers[0].mDataByteSize = audioBufferList.mBuffers[0].mDataByteSize;

    return noErr;
}

Now data holds my raw AAC, which I wrap into an ADTS frame with a proper ADTS header; a sequence of these ADTS frames is a playable AAC document.
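
For illustration, an ADTS header for this configuration (AAC-LC, 44.1 kHz, mono, no CRC) could be built roughly like the sketch below; fillADTSHeader is just an illustrative name, not an API, and the frequency index and channel configuration are values matching the outAudioStreamBasicDescription above.

#include <stddef.h>
#include <stdint.h>

static void fillADTSHeader(uint8_t header[7], size_t aacPacketLength)
{
    const int profile = 1;       // AAC LC (audio object type 2, stored as object type - 1)
    const int freqIndex = 4;     // 44100 Hz; must match outAudioStreamBasicDescription.mSampleRate
    const int channelConfig = 1; // mono; must match mChannelsPerFrame
    const size_t frameLength = aacPacketLength + 7; // raw AAC packet + 7-byte header, no CRC

    header[0] = 0xFF;                                         // syncword (high 8 bits)
    header[1] = 0xF1;                                         // syncword (low 4 bits), MPEG-4, layer 0, no CRC
    header[2] = (uint8_t)((profile << 6) | (freqIndex << 2) | (channelConfig >> 2));
    header[3] = (uint8_t)(((channelConfig & 0x3) << 6) | ((frameLength >> 11) & 0x3));
    header[4] = (uint8_t)((frameLength >> 3) & 0xFF);
    header[5] = (uint8_t)(((frameLength & 0x7) << 5) | 0x1F); // frame length (low 3 bits) + buffer fullness (high 5 bits)
    header[6] = 0xFC;                                         // buffer fullness (low 6 bits), 1 raw data block per frame
}

Each ADTS frame is the 7-byte header followed by the bytes in data, and the frames are simply concatenated into the output stream.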

But I don't understand this code as well as I would like to. Generally, I don't understand audio very well… I just wrote it somehow, following blogs, forums, and docs, over quite a lot of time, and now it works, but I don't know why, or how to change some of its parameters. So here are my questions:

  1. I need to use this converter while the HW encoder is occupied (by AVAssetWriter). This is why I create a SW converter via AudioConverterNewSpecific() and not AudioConverterNew(). But now setting outAudioStreamBasicDescription.mFormatID = kAudioFormatMPEG4AAC_HE; does not work: no AudioClassDescription is found, even if mFormatFlags is set to 0. What am I losing by using kAudioFormatMPEG4AAC (kMPEG4Object_AAC_SSR) instead of kAudioFormatMPEG4AAC_HE? What should I use for a live stream, kMPEG4Object_AAC_SSR or kMPEG4Object_AAC_Main?

  2. How do I change the sample rate properly? If I set outAudioStreamBasicDescription.mSampleRate to 22050 or 8000, for example, the audio playback sounds slowed down. I set the sampling frequency index in the ADTS header to the same frequency as outAudioStreamBasicDescription.mSampleRate.

  3. How do I change the bitrate? ffmpeg -i shows this info for the produced AAC:
    Stream #0:0: Audio: aac, 44100 Hz, mono, fltp, 64 kb/s
    How do I change it to 16 kbps, for example? The bitrate decreases as I decrease the frequency, but I believe that is not the only way? And decreasing the frequency damages playback anyway, as I mention in 2.

  4. How do I calculate the size of the buffer? For now I set it to uint32_t bufferSize = inAudioBufferList.mBuffers[0].mDataByteSize; as I believe the compressed data won't be larger than the uncompressed data… But isn't that unnecessarily large?

  5. How do I set ioOutputDataPacketSize properly? If I'm reading the documentation right, I should set it as UInt32 ioOutputDataPacketSize = bufferSize / outAudioStreamBasicDescription.mBytesPerPacket; but mBytesPerPacket is 0. If I set it to 0, AudioConverterFillComplexBuffer() returns an error. If I set it to 1, it works, but I don't know why…

  6. In inInputDataProc() there are 3 "out" parameters. I set just ioData. Should I also set ioNumberDataPackets and outDataPacketDescription? Why and how?

Best Answer

You may need to change the sample rate of the raw audio data by using a resampling audio unit before feeding the audio to the AAC converter. Otherwise there will be a mismatch between the AAC header and the audio data.
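
A rough, untested sketch of that idea, using a plain PCM-to-PCM AudioConverter instead of a resampling audio unit (the 8000 Hz target is just an example value):

#import <AudioToolbox/AudioToolbox.h>

// Source format: the PCM coming from the capture session (e.g. 44100 Hz).
AudioStreamBasicDescription pcmIn = inAudioStreamBasicDescription;

// Destination format: the same PCM layout at the rate you actually want to encode.
AudioStreamBasicDescription pcmOut = pcmIn;
pcmOut.mSampleRate = 8000;

AudioConverterRef resampler = NULL;
NSAssert(AudioConverterNew(&pcmIn, &pcmOut, &resampler) == noErr, nil);

// Drive `resampler` with AudioConverterFillComplexBuffer (the same callback pattern
// as inInputDataProc in the question), then feed its 8000 Hz output to the AAC
// converter whose outAudioStreamBasicDescription.mSampleRate is also 8000, and use
// the matching sampling frequency index (11 for 8000 Hz) in the ADTS header.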
