2010-08-25

AAC Bitstream Flaws Part 1: The Channel Model

This post is the first in a series about things that I consider to be flaws in the AAC bitstream format itself.

The biggest problem with the AAC bitstream format is the channel model. The whole AAC channel model is fucked.


AAC has 8 types of channel elements (0 = single channel element (SCE), 1 = channel pair element (CPE), 2 = channel coupling element (CCE), 3 = low frequency element (LFE), 4 = data stream element (DSE), 5 = program config element (PCE), 6 = fill element (FIL), 7 = END). Every channel element instance except END carries an associated element instance tag (elem_id). This instance tag is usually, but not always, used to group instances that are considered to belong together; e.g. an SCE that represents the same channel would have the same instance tag in each frame in which it appears.
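A decoder's top-level loop over these elements can be sketched roughly as follows. This is a header-only sketch (a real decoder must parse each element's payload before reading the next id_syn_ele), and `BitReader` is a hypothetical helper, not any real decoder's API:

```python
# Sketch of the raw_data_block() element loop: 3-bit element type,
# then a 4-bit instance tag for every type except END.

ID_SCE, ID_CPE, ID_CCE, ID_LFE, ID_DSE, ID_PCE, ID_FIL, ID_END = range(8)

class BitReader:
    """Hypothetical MSB-first bit reader, for illustration only."""
    def __init__(self, data):
        self.data, self.pos = data, 0
    def read(self, n):
        v = 0
        for _ in range(n):
            v = (v << 1) | ((self.data[self.pos >> 3] >> (7 - (self.pos & 7))) & 1)
            self.pos += 1
        return v

def element_headers(br):
    """Collect (element_type, instance_tag) pairs until an END element."""
    out = []
    while True:
        elem_type = br.read(3)      # id_syn_ele
        if elem_type == ID_END:     # END carries no instance tag
            return out
        out.append((elem_type, br.read(4)))  # elem_id / instance tag
        # ...payload parsing elided in this sketch...
```

For example, a dual-mono frame {SCE.0}{SCE.15}{END} is just the bit string 000 0000 000 1111 111.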
There are two sorts of AAC channel configurations, indexed and PCE-based. Every AAC stream has a 4-bit parameter called channelConfiguration. The indexed configurations are defined in the "Channel Configuration" table in subpart 1 of ISO/IEC 14496-3:

value 0: channels defined in AOTSpecificConfig
value 1 (1 channel): SCE
    center front speaker
value 2 (2 channels): CPE
    left, right front speakers
value 3 (3 channels): SCE, CPE
    center front speaker; left, right front speakers
value 4 (4 channels): SCE, CPE, SCE
    center front speaker; left, right front speakers; rear surround speaker
value 5 (5 channels): SCE, CPE, CPE
    center front speaker; left, right front speakers; left surround, right surround rear speakers
value 6 (5.1 channels): SCE, CPE, CPE, LFE
    center front speaker; left, right front speakers; left surround, right surround rear speakers; front low frequency effects speaker
value 7 (7.1 channels): SCE, CPE, CPE, CPE, LFE
    center front speaker; left, right center front speakers; left, right outside front speakers; left surround, right surround rear speakers; front low frequency effects speaker
values 8-15: reserved

(Syntax elements are listed in the order received.)

It seems pretty sane at first, though it is a little tricky that the channel count is identical to the index until you get to configuration 7. But what if you need something not on the list, like 7 channels, true dual mono, or more than 8 channels? Then channelConfiguration gets set to 0 and an AOTSpecificConfig is used. For non-ER AAC variants (like AAC-LC/HE-AAC/HE-AACv2) this probably means a PCE is to follow.
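The indexed configurations are small enough to capture in a lookup table. Here is a sketch of one (the element-type constants match the numbering given earlier; the helper names are mine, not from any spec or decoder):

```python
# Expected syntax-element sequence for indexed channelConfiguration
# values 1-7 (0 means "see AOTSpecificConfig", 8-15 are reserved).
ID_SCE, ID_CPE, ID_LFE = 0, 1, 3

CHANNEL_CONFIGS = {
    1: [ID_SCE],                                  # 1.0: C
    2: [ID_CPE],                                  # 2.0: L R
    3: [ID_SCE, ID_CPE],                          # 3.0: C + L R
    4: [ID_SCE, ID_CPE, ID_SCE],                  # 4.0: C + L R + rear
    5: [ID_SCE, ID_CPE, ID_CPE],                  # 5.0: C + L R + Ls Rs
    6: [ID_SCE, ID_CPE, ID_CPE, ID_LFE],          # 5.1
    7: [ID_SCE, ID_CPE, ID_CPE, ID_CPE, ID_LFE],  # 7.1
}

def channel_count(config):
    """Total output channels implied by an indexed configuration."""
    per_elem = {ID_SCE: 1, ID_CPE: 2, ID_LFE: 1}
    return sum(per_elem[e] for e in CHANNEL_CONFIGS[config])
```

Note how the channel count stops tracking the index at configuration 7, which is exactly the wrinkle mentioned above.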

I say probably because if it is MPEG-2 AAC, the decoder can implicitly figure out a channel mapping from the syntax elements present and needs no PCE. An MPEG-4 AAC decoder is forbidden from doing so. If you remux such a file from an MPEG-2 ADTS stream to a .mp4 file you have probably screwed everything up. I say probably because MPEG-2 objectTypeIndication 0x67 can be used to indicate MPEG-2 AAC in MPEG-4. However, in practice MPEG-2 AAC usually gets remuxed to objectTypeIndication 0x40 (MPEG-4 AAC), since MPEG-4 AAC is widely considered a superset of MPEG-2 AAC, even though we have just demonstrated that it is not a strict superset.

The first field in a PCE is an element instance tag. Yes, you may have up to 16 independent PCEs. What does that mean? The spec doesn't say how to interpret such a setup, just that it's legal, and it doesn't place prohibitions on such a thing in any of the useful profiles. But for the sake of argument let's assume we only have one PCE.

The next field is a two-bit object_type that is equal to the AOT minus 1, for MPEG-2 compatibility. Now let's not forget that the very existence of a PCE is AOT specific, so we are just sending duplicate data at this point. What if this object_type doesn't match the outer AOT? The spec doesn't actually address that, but the experts say such a thing is forbidden. Despite it being forbidden, the official systems conformance streams don't have PCE object_types that match the outer AOTs. So assuming the object_types line up nicely, we get to the sampling_frequency_index. The same problems apply.

Finally the PCE coarsely groups the logical output stream into front, side, back, and LFE channels. For each non-LFE it tells you whether the element is an SCE or a CPE, and gives its element instance tag. All LFEs are represented with the LFE syntax element, so they just have their instance tags enumerated. You are then on your own for mapping this mess onto a speaker configuration. If you have 22.2 channels or fewer you can use the "informative" ISO 13818-7 Annex H. This annex is not reprinted in the MPEG-4 edition.

In addition to listing these output channels, the PCE also enumerates DSEs, which hold ancillary data streams; CCEs, which hold coupling elements used in the decoding of the output channels; and mixdowns. Here is what the spec says about mixdowns: "The matrix-mixdown provision enables a mode of operation which may be beneficial in some circumstances. However, it is advised that this method should not be used."

At the end of the PCE is a comment field, a Pascal string that describes the PCE. Before the comment a byte align is required.
This makes it much more difficult to move the PCE around. In an MP4 file the PCE lives in the global header, and is byte aligned in relation to the start of the global header. In an ADTS file the PCE is in the actual frame payload and must be byte aligned in relation to the start of the frame.
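Put together, parsing a PCE looks roughly like this. This is a sketch following the field layout described above, not a conformance-grade parser; `BitReader` is a hypothetical helper, and I've omitted any validation of the duplicated object_type/sampling_frequency_index fields:

```python
# Sketch of program_config_element() parsing per the field layout in
# ISO/IEC 14496-3. BitReader is a hypothetical MSB-first helper.

class BitReader:
    def __init__(self, data):
        self.data, self.pos = data, 0
    def read(self, n):
        v = 0
        for _ in range(n):
            v = (v << 1) | ((self.data[self.pos >> 3] >> (7 - (self.pos & 7))) & 1)
            self.pos += 1
        return v
    def align(self):
        self.pos = (self.pos + 7) & ~7  # skip to next byte boundary

def parse_pce(br):
    pce = {}
    pce['element_instance_tag'] = br.read(4)
    pce['object_type'] = br.read(2)               # duplicates AOT - 1
    pce['sampling_frequency_index'] = br.read(4)  # duplicates the outer value
    n_front, n_side, n_back = br.read(4), br.read(4), br.read(4)
    n_lfe, n_assoc_data, n_valid_cc = br.read(2), br.read(3), br.read(4)
    if br.read(1):                                # mono_mixdown_present
        pce['mono_mixdown'] = br.read(4)
    if br.read(1):                                # stereo_mixdown_present
        pce['stereo_mixdown'] = br.read(4)
    if br.read(1):                                # matrix_mixdown_idx_present
        pce['matrix_mixdown_idx'] = br.read(2)
        pce['pseudo_surround_enable'] = br.read(1)
    # Front/side/back entries: 1-bit SCE-vs-CPE flag plus a 4-bit tag.
    for group, count in (('front', n_front), ('side', n_side), ('back', n_back)):
        pce[group] = [(br.read(1), br.read(4)) for _ in range(count)]
    pce['lfe'] = [br.read(4) for _ in range(n_lfe)]                # LFE tags
    pce['assoc_data'] = [br.read(4) for _ in range(n_assoc_data)]  # DSE tags
    pce['cc'] = [(br.read(1), br.read(4)) for _ in range(n_valid_cc)]
    br.align()                                    # byte align before the comment
    n_comment = br.read(8)                        # Pascal-string length
    pce['comment'] = bytes(br.read(8) for _ in range(n_comment))
    return pce
```

The `br.align()` near the end is the byte alignment discussed above, and it is the reason the PCE can't simply be moved between containers with different framing.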

So now that we have our duplicate codec parameters, a nest of output channels, a list of coupling channels, ancillary data streams, mixdowns that we shouldn't use, and a byte-aligned comment, we are done with the PCE. What happens if we get another PCE? Well... "An MPEG-4 decoder is always required to parse any program_config_element() inside the AAC payload. However, the decoder is only required to evaluate it, if no channel configuration is given outside the AAC payload." The decoder has to parse it but may or may not evaluate it. Wonderful.

The observant reader will notice that nowhere in the mess of syntax elements that we pulled out of the PCE was their order mentioned. The elements can arrive in any arbitrary order. The order does not even have to be consistent from frame to frame. One of the official test vectors (al17_*) is a dual-mono stream that alternates between {SCE.0}{SCE.15}{END} and {SCE.15}{SCE.0}{END}. This is very flexible, but I have been unable to figure out how this flexibility can ever be used in a beneficial manner. CCEs sometimes come before the channels they modify; they sometimes come afterward. In low-memory situations it is useful for them to come before, but the decoder can't depend on them being there due to this flexibility.
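The upshot is that a decoder wanting stable output ordering has to key its channel map on (element type, instance tag) rather than on arrival order. A minimal sketch of that idea (the function name is mine, not from any real decoder):

```python
# Sketch: assign each (element_type, instance_tag) pair a stable output
# slot the first time it is seen, so that frames like {SCE.0}{SCE.15}
# and {SCE.15}{SCE.0} map to the same output channels regardless of
# arrival order.

def stable_slots(frames):
    """frames: list of frames, each a list of (type, tag) pairs.
    Returns, per frame, the output slot index of each element."""
    slots = {}
    result = []
    for frame in frames:
        for elem in frame:
            if elem not in slots:
                slots[elem] = len(slots)   # first appearance fixes the slot
        result.append([slots[e] for e in frame])
    return result
```

Applied to the al17-style alternation above, the second frame's elements land in swapped slots rather than swapped channels.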

To me it seems likely that at one time people wanted to do AAC-domain mixing. The spec mentions: "programs could share a channel_pair_element() and use distinct coupling channels for voice over in different languages" [ISO/IEC 13818-7:2004(E) 8.5.2.2, Explicit channel mapping using a program_config_element()]. This justifies things like multiple PCEs, but it still doesn't justify this completely arbitrary and dynamic channel ordering.

Just when you think this crazy ordering is a pain in the ass but at least it is super flexible, along comes SBR. SBR data is contained in FIL elements, and these must directly follow the elements they extend. On a FIL element the element instance tag is actually a size, so that the element can be skipped by decoders that don't support such extensions.

Because syntax element order implies speaker order in the non-PCE channel configurations 1-7, decoders that don't support PCEs sometimes completely ignore element instance tags. Then simple encoders started assigning zeros to all element instances, even instances of the same syntax element in the same frame; e.g. there are 5.1 streams floating around with a frame structure of {SCE.0}{CPE.0}{CPE.0}{LFE.0}{END}. In addition, ADTS files are widely considered to be concatenatable if they contain the same channel count. This creates problems when the streams have the same channel order but different element instance tags on the channels; e.g. one stereo stream may use 0 as the element instance tag for its CPE while the next stream uses 15. Concatenating them causes the element instance tags to change midway through the stream. This could be solved by requiring the instances of each syntax element to count up from zero, e.g. requiring 5.1 to use {SCE.0}{CPE.0}{CPE.1}{LFE.0}. The early authors of the FFmpeg AAC decoder thought this was the case; sadly it is not.
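That hypothetical count-up-from-zero rule is trivial to express in code, which makes it all the more frustrating that the spec doesn't require it. A sketch (my own helper, not anything the spec or FFmpeg defines):

```python
# Sketch of the tag normalization the spec could have required: renumber
# instance tags so instances of each syntax element count up from zero,
# e.g. both {SCE.0}{CPE.0}{CPE.0}{LFE.0} and {SCE.3}{CPE.7}{CPE.2}{LFE.1}
# become {SCE.0}{CPE.0}{CPE.1}{LFE.0}. AAC itself does NOT require this.

def normalize_tags(frame):
    """frame: list of (element_type, instance_tag) pairs."""
    counts = {}
    out = []
    for elem_type, _tag in frame:
        new_tag = counts.get(elem_type, 0)
        counts[elem_type] = new_tag + 1
        out.append((elem_type, new_tag))
    return out
```

Under such a rule, two same-layout ADTS streams would always agree on tags and concatenation would be safe.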

The good news is that for complex multichannel files MPEG-D MPEG Surround might be a better choice. MPEG Surround can use AAC as its core coder. The problem is that MPEG Surround doesn't have anywhere near the AAC install base.

Next time: zero sized sections, AAC-960, and ADIF.

2 comments:

  1. This seems like a lot of code to get something simple like channel number and order in a bitstream. Do particular profiles of MP4 (e.g. the iPhone subset) define simpler subsets that the decoder has to support, or do they always support the whole entangled mess?

    ReplyDelete
  2. I don't think they publicly define a subset that they support except for maybe no CCE (that's fairly common) or a maximum channel count.

    Looking at the channel swapping test stream:
    Quick Time for Windows locks up attempting to play it.
    iTunes for Windows plays it wrong.
    WMP plays it wrong.
    FAAD2 plays it wrong.

    Try it yourself: http://streams.videolan.org/Mpeg_Conformance/ftp.iis.fhg.de/mpeg4audio-conformance/compressedMp4/al17_44.mp4

    ReplyDelete