Scalability

(Bitstream) scalability is the ability of an audio codec to support an ordered set of bitstreams which can produce a reconstructed sequence. Moreover, the codec can output useful audio when only certain subsets of the bitstream are decoded. The minimum subset that can be decoded is called the base layer. The remaining bitstreams in the set are called enhancement or extension layers. Depending on the size of the extension layers, we talk about large step or small step (granularity) scalability. Small step scalability denotes enhancement layers of around 1 kbit/s (or smaller). Typical data rates for the extension layers in a large step scalable system are 16 kbit/s or more. Scalability in MPEG-4 natural audio largely relies on difference encoding, either in the time domain or, as in the case of AAC layers, of the spectral lines (frequency domain).
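The idea of layered difference encoding can be illustrated with a toy sketch. This is not the MPEG-4 algorithm: it assumes plain uniform quantizers, and the function names and step sizes are hypothetical. Each layer quantizes whatever error the layers below it left behind, so decoding any leading subset of layers gives a usable, progressively more accurate reconstruction.

```python
def encode_layers(samples, step_sizes):
    """Quantize `samples` into one layer per step size, coarsest first.

    Each layer encodes the residual (difference) left by the layers below it.
    """
    layers = []
    residual = list(samples)
    for step in step_sizes:
        indices = [round(r / step) for r in residual]
        layers.append(indices)
        # What this layer could not represent is passed on to the next layer.
        residual = [r - i * step for r, i in zip(residual, indices)]
    return layers


def decode_layers(layers, step_sizes):
    """Reconstruct from the base layer plus however many layers were received."""
    out = [0.0] * len(layers[0])
    for indices, step in zip(layers, step_sizes):
        out = [o + i * step for o, i in zip(out, indices)]
    return out


samples = [0.9, -0.35, 0.12, 0.61]
steps = [0.5, 0.1, 0.02]           # hypothetical: base layer, then two enhancement layers
layers = encode_layers(samples, steps)
base_only = decode_layers(layers[:1], steps)   # coarse but usable
all_layers = decode_layers(layers, steps)      # closest to the input
```

Dropping the trailing entries of `layers` corresponds to receiving only the lower layers of the bitstream.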

Comparison to simulcast

A trivial way to implement bitstream scalability is the simulcast of several bitstreams at different bitrates. Especially in the case of just two layers of scalability, this solution has to be compared against a more complex, truly scalable system. Depending on the size of the enhancement layers, a scalable system has to take a hit in compression efficiency compared to a similar non-scalable system. Depending on the algorithm, this cost (in terms of bitrate for equivalent quality) can vary widely. For the scalable systems defined in MPEG-4 natural audio, the cost has been estimated in several verification tests. In each of the cases, the scalable system performed better than the equivalent simulcast system. In the optimum case, the scalable system may even outperform the equivalent non-scalable system at the same bitrate. This is expected to happen only for certain combinations and signal classes. An example of this effect is the combination of a speech core coder based on CELP (building on a model of the human vocal tract to enhance the speech quality) and enhancement layers based on AAC (to get higher quality especially for non-speech signals and at higher bitrates). For speech signals, this combination may perform better than AAC alone. While the effect has been demonstrated during the core experiment process, it did not show up in the verification test results. Scalability is at the heart of the new MPEG-4 audio functionalities. Some sort of scalability has been built into all of the MPEG-4 natural audio coding algorithms.

Types of scalability in MPEG-4 natural audio

MPEG-4 natural audio allows for a large number of codec combinations for scalability. The combinations for the speech coders are described in the paragraphs explaining MPEG-4 CELP and HVXC. The following list contains the main combinations for MPEG-4 General Audio (GA):

AAC layers only
Narrow-band CELP base layer plus AAC
TwinVQ base layer plus AAC

Depending on the application, any of these possibilities can provide optimum performance. Where good speech quality at low bitrates is required when only the core layer is received (for example in a digital broadcasting system using hierarchical channel coding), the speech codec base layer is preferred. If, on the other hand, music should be of reasonable quality for a very low bitrate core layer (for example for Internet streaming of music using scalability), the TwinVQ base layer provides the best quality. If the base layer is allowed to work at somewhat higher bitrates (16 kbit/s or more), a system built from AAC layers only can deliver the best overall performance.

Block length considerations

When speech coders are combined with General Audio coding, special consideration has to be given to the frame length of the underlying coding algorithms. This is trivial in the case of different AAC layers at the same sampling frequency. For the speech coders in MPEG-4 natural audio, the frame length is a multiple of 10 ms, which does not match the frame lengths normally used in MPEG-4 GA. To accommodate these different frame lengths, two modifications have been made to the scalable system:

AAC modified block length: A modified AAC works at a basic block length of 960 samples (instead of the usual 1024). This translates to a block length of 20 ms at 48 kHz sampling frequency. At the other main sampling frequencies for scalable MPEG-4 AAC, the basic block length of the AAC enhancement layers is again a multiple of 10 ms.
Super frame structure: To keep a frame a single decodable instance of audio data, several data blocks may be combined into one super-frame. For example, at a sampling frequency of 16 kHz and a core block length for a CELP core of 20 ms, three CELP blocks and one block of AAC enhancement layers are combined into one super-frame.
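The block length arithmetic behind both modifications can be checked with a few lines. This is a minimal sketch, with hypothetical function names, that assumes a super-frame holds exactly one AAC block, as in the 16 kHz example above.

```python
from fractions import Fraction


def block_duration_ms(block_length, sampling_rate_hz):
    """Duration of one block in milliseconds, kept exact as a fraction."""
    return Fraction(block_length * 1000, sampling_rate_hz)


def core_blocks_per_super_frame(aac_block_length, sampling_rate_hz, core_block_ms):
    """Number of core (speech coder) blocks that fit one AAC block.

    Assumes one AAC block per super-frame; raises ValueError when the AAC
    block duration is not a whole multiple of the core block duration.
    """
    ratio = block_duration_ms(aac_block_length, sampling_rate_hz) / core_block_ms
    if ratio.denominator != 1:
        raise ValueError("AAC block is not a whole number of core blocks")
    return int(ratio)


print(block_duration_ms(960, 48000))                  # 20 ms with the modified block length
print(float(block_duration_ms(1024, 48000)))          # about 21.33 ms with the usual 1024
print(core_blocks_per_super_frame(960, 16000, 20))    # 3 CELP blocks per AAC block
```

With the usual 1024-sample block at 16 kHz the AAC block lasts 64 ms, which is not a multiple of 20 ms, so `core_blocks_per_super_frame(1024, 16000, 20)` raises `ValueError`.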

Mono - stereo scalability

At low bitrates, mono transmission is often preferred to stereo at the same total bitrate. Most listeners find the degradation caused by the overhead of stereo transmission more annoying than the loss of stereo. For higher bitrates, stereo transmission is virtually a requirement today. Therefore, stereo enhancement layers can be added on top of both mono and stereo lower layers.

Overview of scalability modes in MPEG-4 natural audio

The following table lists the possibilities for scalability layers within MPEG-4 natural audio. All of Narrow band CELP (mono), TwinVQ (mono), TwinVQ (stereo), AAC (mono) and AAC (stereo) can be used as core layers. Enhancement layers can be of the types NB CELP mono (on top of CELP only), TwinVQ mono (on top of TwinVQ mono only), TwinVQ stereo (on top of TwinVQ stereo only), AAC mono (on top of NB CELP, TwinVQ mono or AAC mono) or AAC stereo (on top of any of the other codecs).

Table IV: Overview of Scalability Modes
Layer N                 NB CELP mono   TwinVQ mono   TwinVQ stereo   AAC mono   AAC stereo
Narrow Band CELP mono   X              -             -               X          X
TwinVQ mono             -              X             -               -          X
TwinVQ stereo           -              -             X               -          X
AAC mono                -              -             -               X          X
AAC stereo              -              -             -               -          X
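The table can also be read as a lookup structure. The sketch below transcribes the X entries of Table IV into a dictionary and checks whether a proposed stack of layers is allowed. It assumes that rows give the lower layer and columns the layer placed on top of it; the names `ALLOWED_ON_TOP` and `is_valid_stack` are hypothetical.

```python
# Row of Table IV (lower layer) -> set of layer types marked X (allowed on top).
ALLOWED_ON_TOP = {
    "NB CELP mono": {"NB CELP mono", "AAC mono", "AAC stereo"},
    "TwinVQ mono": {"TwinVQ mono", "AAC stereo"},
    "TwinVQ stereo": {"TwinVQ stereo", "AAC stereo"},
    "AAC mono": {"AAC mono", "AAC stereo"},
    "AAC stereo": {"AAC stereo"},
}


def is_valid_stack(layers):
    """Return True if every layer may sit on top of the one below it.

    `layers` is ordered from the core layer upwards.
    """
    if not layers or layers[0] not in ALLOWED_ON_TOP:
        return False
    return all(
        upper in ALLOWED_ON_TOP[lower]
        for lower, upper in zip(layers, layers[1:])
    )


print(is_valid_stack(["NB CELP mono", "AAC mono", "AAC stereo"]))  # True
print(is_valid_stack(["AAC stereo", "AAC mono"]))                  # False
```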

Frequency selective switch (FSS) module

The difference signal between the output of a lower layer and the original (frequency domain) signal is not always the best input for coding an enhancement layer. If, for instance, a scalable coder using a CELP core coder is used to encode musical material, the output of the CELP coder may not help the enhancement layers by giving them an easier signal to encode. To enable more flexible coding of enhancement layers, a Frequency Selective Switch (FSS) module has been introduced. It basically consists of a bank of switches operating independently on a scalefactor band basis. For each scalefactor band, one of two inputs into the system can be selected.
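On the decoder side, the per-band switching might look like the following sketch. It follows the description in the scalability example of this section, where each switch either combines the core coder spectrum with the AAC spectral data or uses the AAC spectral data alone. The function name, the `band_offsets` layout and the flag list are assumptions made for illustration, not the syntax of the standard.

```python
def apply_fss(core_spectrum, aac_spectrum, band_offsets, use_core):
    """Frequency selective switch applied per scalefactor band (decoder view).

    band_offsets[b] .. band_offsets[b + 1] delimits band b (hypothetical layout).
    use_core[b] is True when the core coder output is added in band b,
    False when the AAC spectral data is used on its own.
    """
    out = list(aac_spectrum)
    for band, flag in enumerate(use_core):
        if flag:
            for k in range(band_offsets[band], band_offsets[band + 1]):
                out[k] += core_spectrum[k]
    return out


core = [1.0] * 6
aac = [0.5] * 6
# Two bands: lines 0-1 use core + AAC, lines 2-5 use AAC only.
print(apply_fss(core, aac, [0, 2, 6], [True, False]))
# [1.5, 1.5, 0.5, 0.5, 0.5, 0.5]
```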

Upsampling filter tool

For scalability spanning a wider range of bitrates (from speech quality to CD quality), it is not recommended to run the core coders at the same sampling frequency as the enhancement layer coders. To accommodate this requirement, an upsampling filter tool has been defined. It uses the MDCT algorithm (very similar to the IMDCT already present in the AAC decoder) to perform the filtering. A number of zeroes are inserted into the time domain waveform, and the result is used as the input to the MDCT. The output values can then be directly combined with MDCT values from a higher sampling frequency filter bank. The prototype filter in this case is the MDCT window function and is the same as used in the AAC IMDCT.
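The two steps described above, zero insertion followed by a windowed MDCT, can be sketched as follows. This is a direct-form illustration only, not the procedure specified in the standard: the sine window, the block sizes and all function names are assumptions, and a real implementation would use a fast transform.

```python
import math


def insert_zeros(samples, factor):
    """Raise the sampling rate by `factor` by inserting zeros between samples."""
    out = []
    for s in samples:
        out.append(s)
        out.extend([0.0] * (factor - 1))
    return out


def mdct(block):
    """Direct-form MDCT of an even-length block, using a sine window.

    A block of 2N samples yields N coefficients.
    """
    two_n = len(block)
    n_half = two_n // 2
    coeffs = []
    for k in range(n_half):
        acc = 0.0
        for n, x in enumerate(block):
            window = math.sin(math.pi * (n + 0.5) / two_n)
            phase = math.pi / n_half * (n + 0.5 + n_half / 2) * (k + 0.5)
            acc += window * x * math.cos(phase)
        coeffs.append(acc)
    return coeffs


def upsample_to_mdct(core_block, factor):
    """Zero-stuff a core coder block, then transform it at the higher rate."""
    return mdct(insert_zeros(core_block, factor))


# 8 core samples, upsampled by 2 -> 16 samples -> 8 MDCT coefficients.
coefficients = upsample_to_mdct([0.1 * i for i in range(8)], 2)
```

The resulting coefficients live on the frequency grid of the higher sampling rate, which is what allows them to be combined with the enhancement layer spectrum.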

A scalability example

The following example illustrates the combined use of a number of the tools implementing scalability. Fig. 17 shows the decoding of a non-GA (i.e. CELP) mono plus AAC stereo combination and can be found in ISO/IEC International Standard 14496-3 (MPEG-4 audio). A mono CELP is combined with a single stereo AAC layer. Temporal Noise Shaping (TNS) is applied to the MDCT coefficients calculated from the upsampled CELP decoder output. Three FSS (Frequency Selective Switch) modules either combine the upsampled and TNS processed output signal from the core coder with the AAC decoded spectral data or use just the AAC spectral data. Full M/S processing is possible in this combination to enhance the stereo coding efficiency. A core only decoder just uses the CELP data and applies normal CELP postfiltering. The CELP output for higher quality decoding is not postfiltered. Depending on the number of scalability layers, much more complex structures need to be built for MPEG-4 audio scalable decoding.

Figure 17: Scalability Example
