AAC Channel Model Revisited

Recently I wrote about the AAC channel model. Since then I have tested a variety of real-world broken files (synthetically recreated) with the reference decoder, faad, WinAMP, iTunes, and the Microsoft Windows 7 decoder. I've placed the results in the multimedia wiki. I can't seem to get the CSS right to display it here.

After looking at the results from the other decoders, I took a new approach in FFmpeg that fixes bad_concat while removing the special hacks for streams like elem_id0. FFmpeg now treats streams with PCEs according to the letter of the channel configuration in the PCE. For streams lacking a PCE, element instance tags are now ignored and channels are treated purely positionally.

One other thing worth noting is that I couldn't make iTunes produce Parametric Stereo effects on any of the seven signalling streams from CT. Supposedly iTunes has supported PS since version 9.2. Perhaps it is due to the use of pure upsampling SBR.

AAC Bitstream Flaws Part 2: AAC-960, Zero Sized Sections, and ADIF

AAC-960

AAC has a variant with 1024 MDCT samples per frame and a variant with 960 MDCT samples per frame. You are probably using the 1024 samples per frame variant. Most applications, including FFmpeg, WinAMP, and iTunes, don't support AAC-960. However, until very recently the spec required the AAC/HE-AAC/HE-AACv2 profiles to support both the 960 and 1024 variants.

Hilariously, nobody seemed to notice this issue until April 2008, when MPEG working document M15437: "proposed clarification for ISO/IEC 14496-4" was published.

Dolby explains the problem much better than I can in MPEG working document M15641: "further considerations regarding the 960 transform length in the AAC profile, the HE AAC profile and the HE AAC v2 profile":

The MPEG-4 AAC-LC AOT can operate in different flavours, either with a frame length of 1024 samples or with a frame length of 960 samples. The AAC profile, the High Efficiency AAC profile, and the High Efficiency AAC v2 profile do not impose any restriction on the frame length that is used.

Typically application standards which build on one of these profiles use only one frame length (either 1024 or 960) and do not require support for both. An incomplete list of application standards that build on these profiles can be found in the Annex.

Implementers of decoders that comply with these profiles were not able to test their implementations for conformance until early this year when conformance sequences for the AAC AOT with a frame length of 960 samples have been made available. Such test sequences are still missing for the SBR AOT and the PS AOT.

There are popular players around that do not support the 960 frame length. Examples of these would be winamp, iTunes or the iPod.

The code from 3GPP is a very popular reference implementation that has been used many times as a starting point for a port to embedded devices. This code does not support the 960 frame length. As a result embedded players based on this code lack support for the 960 frame length.

One important part of the MPEG-4 reference software (mp4mcdec) does not support the 960 frame length. Instead it disregards the frameLengthFlag and consequently plays back at wrong speed.

A profile defines a subset of features that gives content producers the certainty that their content will play on all devices that support the profile. On the other hand there is a high amount of devices that do not comply with the profile by not supporting the 960 frame length, which makes using the 960 frame length a bad choice for creating content.

The solution chosen was to partition the AAC/HE-AAC/HE-AACv2 profiles into regular (1024) and 960 variants. This was clearly the best solution. Still, the AAC-960 profile doesn't strike me as particularly useful. It makes short windows only 16 samples shorter (256 vs. 240, or 2 ms at 8 kHz). It makes long windows 128 samples shorter (2048 vs. 1920, or 16 ms at 8 kHz); however, there are Low Delay AAC variants that attack this problem in a more comprehensive manner.
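The arithmetic behind those figures can be sketched in a few lines of C. This is only an illustration: the helper names `aac_frame_length` and `samples_to_us` are made up for this post, and the mapping assumes the AAC-LC reading of frameLengthFlag (0 selects 1024 samples per frame, 1 selects 960). A long window spans two frames' worth of samples (2048 or 1920), and a short window spans a quarter of a frame (256 or 240).

```c
#include <assert.h>
#include <stdint.h>

/* Samples per frame for AAC-LC as selected by frameLengthFlag
 * (assumed mapping: 0 -> 1024, 1 -> 960). Hypothetical helper. */
static int aac_frame_length(int frame_length_flag)
{
    return frame_length_flag ? 960 : 1024;
}

/* Duration of n samples in microseconds at the given sample rate,
 * rounded down. Hypothetical helper. */
static int64_t samples_to_us(int n, int sample_rate)
{
    assert(n >= 0 && sample_rate > 0);
    return (int64_t)n * 1000000 / sample_rate;
}
```

For example, `samples_to_us(2 * (aac_frame_length(0) - aac_frame_length(1)), 8000)` gives the 16 ms long-window difference at 8 kHz, and `samples_to_us(16, 8000)` gives the 2 ms short-window difference.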

Zero Sized Sections

AAC scalefactor bands are sectioned into groups of adjacent bands that use the same Huffman codebook. Each section is written into the bitstream as a 4-bit field giving the literal codebook number, followed by the section length (3 bits for short windows and 5 bits for long windows, with an all-bits-set escape mechanism for longer sections). This is repeated until every scalefactor band up to the maximum scalefactor band has a codebook assigned. Sections with a size of zero are allowed. In theory you can have as many zero sized sections as you want (until you hit the cap on frame size in bits). Now combine that with an all-zero buffer (like a get_bits() implementation that returns zeros when the input buffer is exhausted) and you have a recipe for an infinite loop (thanks to the Google Chrome team for finding this).

Now this isn't the end of the world: a buffer exhaustion check can be added to the sectioning loop to fix it. But these zero sized sections serve no constructive purpose. Zero sized sections could be forbidden; better yet, the size minus one could be coded, or the zero value could be used for the escape mechanism.
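To make the failure mode concrete, here is a minimal sketch of a sectioning loop with the buffer exhaustion check in place. It is not FFmpeg's code: `BitReader`, `read_bits`, and `parse_sections` are hypothetical names, the reader deliberately returns zeros past the end of the buffer to mimic the behaviour described above, and the loop handles a single window group only, with no validation of the codebook number.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const uint8_t *buf;  /* input bytes */
    size_t size_bits;    /* number of valid bits in buf */
    size_t pos;          /* bits consumed so far, may run past size_bits */
} BitReader;

/* Read n bits MSB first. Reads past the end of the buffer return zeros. */
static unsigned read_bits(BitReader *br, int n)
{
    unsigned v = 0;
    assert(n >= 0 && n <= 16);
    while (n-- > 0) {
        unsigned bit = 0;
        if (br->pos < br->size_bits)
            bit = (br->buf[br->pos >> 3] >> (7 - (br->pos & 7))) & 1u;
        br->pos++;
        v = (v << 1) | bit;
    }
    return v;
}

/* Assign a codebook to each of max_sfb bands (one window group).
 * Returns 0 on success, -1 if the data is truncated or a section
 * runs past max_sfb. */
static int parse_sections(BitReader *br, int is_short, int max_sfb,
                          uint8_t *band_cb)
{
    const int len_bits = is_short ? 3 : 5;
    const unsigned esc = (1u << len_bits) - 1;  /* all bits set */
    int k = 0;

    while (k < max_sfb) {
        unsigned cb = read_bits(br, 4);
        unsigned len = 0, incr;

        do {
            incr = read_bits(br, len_bits);
            len += incr;
        } while (incr == esc && br->pos <= br->size_bits);

        /* The exhaustion check: without it, the zeros returned past
         * the end decode as zero sized sections forever. */
        if (br->pos > br->size_bits)
            return -1;
        if (len > (unsigned)(max_sfb - k))
            return -1;
        while (len-- > 0)
            band_cb[k++] = (uint8_t)cb;
    }
    return 0;
}
```

With the check removed, an empty or exhausted buffer yields codebook 0 with length 0 on every iteration, so `k` never advances. With it, the loop is bounded by the size of the input even when the zero sized sections are real data inside the buffer.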


ADIF

While not technically AAC, this encapsulation format does appear in the AAC specification as an annex and is AAC specific. It is formatted as a single header up front followed by raw concatenated variable length AAC frames with no additional framing information. People seem to like to use ADIF when it is not an appropriate choice. It has no error resilience and no seekability due to the way it is framed. Worse yet, the ADIF global header is not compatible with the MPEG-4 global header (Audio Specific Config), when the MPEG-4 global header could have been used here. That would have been a great way to test Audio Specific Config parsing without a full MPEG-4 demuxer. (I invented my own ADIF-like format for this very purpose.) At least the reason for the different header format is probably a legacy MPEG-2 issue.
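The framing difference is easy to see in a probe function. The sketch below is an illustration with made-up names (`AacFraming`, `probe_aac_framing`), not code from any particular demuxer. It relies on two facts: an ADIF stream starts with the four ASCII bytes "ADIF" and carries no further markers, while every ADTS frame starts with a 12-bit 0xFFF syncword followed by an ID bit and two layer bits that are always zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef enum {
    AAC_FRAMING_UNKNOWN,
    AAC_FRAMING_ADIF,
    AAC_FRAMING_ADTS
} AacFraming;

/* Guess the framing from the first bytes of a buffer. Hypothetical helper. */
static AacFraming probe_aac_framing(const uint8_t *buf, size_t size)
{
    assert(buf != NULL || size == 0);

    /* ADIF: one magic at the very start of the stream, nothing after. */
    if (size >= 4 && memcmp(buf, "ADIF", 4) == 0)
        return AAC_FRAMING_ADIF;

    /* ADTS: 0xFFF syncword, then ID (1 bit, ignored), layer (2 bits, 00),
     * protection_absent (1 bit, ignored). */
    if (size >= 2 && buf[0] == 0xFF && (buf[1] & 0xF6) == 0xF0)
        return AAC_FRAMING_ADTS;

    return AAC_FRAMING_UNKNOWN;
}
```

A player that lands in the middle of an ADTS stream can scan forward for the next syncword and resume. In the middle of an ADIF stream there is nothing to scan for, which is why seeking and error recovery are off the table.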