Standardization of Codec for Immersive Voice and Audio Services (IVAS) at 3GPP

Abstract

The 3rd Generation Partnership Project (3GPP) Service and System Aspects Working Group 4 (SA4) established the main specifications of the speech/audio coding standard: Immersive Voice and Audio Services (IVAS), in 2024 through a fully open collaboration by 11 organizations. IVAS is an extension of Enhanced Voice Services (EVS), which was established in 2014 and is widely used for smartphones worldwide. IVAS has signal-processing tools that support various multi-channel signal formats. During the development of the standard, NTT and Orange proposed an active downmix tool for stereo signals to enhance the quality of the compatible communications with EVS.

Keywords: IVAS, EVS, downmixing

1. Introduction

The 3rd Generation Partnership Project (3GPP) Service and System Aspects Working Group 4 (SA4) has established coding technologies for video and audio and related protocols. The audio subgroup has defined a new high-compression codec of speech and audio signals, suitable for real-time interactive communication with low delay constraints. Enhanced Voice Services (EVS) was established in 2014 through a global collaboration of 13 organizations, resulting in super-wideband, high-compression, low-delay, and high-quality speech/audio coding. Due to its efficient performance, EVS has been successfully deployed to the fourth-generation/fifth-generation (4G/5G) mobile telephone networks and used on almost all smartphones worldwide [1].

Although EVS greatly enhances the quality of monaural telecommunication, there is a potential need for various extended services that make use of multi-channel and multi-stream processing. A new standardization plan for Immersive Voice and Audio Services (IVAS) was initiated in 2018. A completely open collaboration scheme among 11 organizations (in alphabetical order: Dolby, Ericsson, FhG, Huawei, Nokia, NTT, Panasonic, Orange, Philips, Qualcomm, VoiceAge) has been carried out over 3GPP GitLab. In 2024, the main specifications and associated floating-point software were established. IVAS would be the first standard for real-time interactive communication, enabling immersive experiences. Note that there are more activities such as fixed-point implementations and subjective listening tests for characterization.

2. IVAS

2.1 Application areas of IVAS

We expect IVAS to have various application areas listed in Fig. 1. IVAS’s multi-channel and multi-stream functions allow for immersive representation of sound for real-time interactive communication. Some user devices, far beyond current smartphones, may support these application areas. Eyeglasses or goggle devices may be such immersive-communication tools with which users can share acoustic environments and experiences over networks.

Fig. 1. Potential application areas of IVAS.

2.2 Features of IVAS

IVAS is designed under the constraints shown in Fig. 2 and suitable for various applications of immersive communication. IVAS contains coding/decoding/rendering processing for various formats of multi-channel signals, such as stereo, binaural, channel-based, scene-based, object-based, metadata-assisted audio, and their combinations. IVAS has additional functions necessary for network transmission, including discontinuous transmission, packet-loss concealment, jitter-buffer management, and a set of associated protocols.

Fig. 2. Fundamental specifications of IVAS.

Although the International Organization for Standardization/International Electrotechnical Commission (ISO/IEC) Moving Picture Experts Group (MPEG) has established a standard codec for immersive communication, it is suitable only for one-way communication, such as streaming and broadcasting, due to a longer (around 100 ms) algorithmic delay. In contrast to MPEG, which defines only the syntax and decoder, IVAS has defined all specifications and ensures the quality between independent user devices.

2.3 Coding technologies of IVAS

Fundamental compression-encoder processing consists of a spatial analysis part, which extracts spatial parameters of an audio signal, a downmix part, which reduces the number of channels of waveforms, and core coding of the waveforms for compression. The decoder retrieves waveforms and special parameters. The renderer creates the final waveform of the specified format, combining waveforms and special parameters. Detailed algorithm descriptions are highly complicated and fully described in the standard specification documents and associated software [2, 3].

2.4 Compatible communication with EVS

Among the many new features of IVAS for multi-channel and multi-stream processing, retaining compatibility with EVS is important for communicating well with a huge number of legacy terminals. NTT and Orange have proposed an active downmix tool for converting signals from stereo to monaural to enhance the quality of the EVS codec without additional delay.

An arbitrary downmix can generally be employed before the encoder, but this could cause additional delays and no guarantee of quality at the end of legacy terminals. On top of direct connections to legacy terminals, downmixing in an encoder can be helpful in a multi-party teleconference system, as shown in Fig. 3, where two IVAS terminals and one legacy terminal are connected by a select and forward unit (SFU). It is assumed that the IVAS encoder simultaneously produces stereo streams and a downmixed EVS-compatible stream. If we use the above teleconference system, we can maintain the quality of all terminals by making the SFU provide the appropriate stream to each terminal. These streams offer rich stereo sounds for IVAS-equipped terminals and clear monaural sound for a legacy smartphone without extra delay for all terminals.

Fig. 3. Multi-party teleconference by IVAS.

2.5 Active downmix

The simple downmix from stereo to monaural is a passive downmix, which produces only average values from two channels. This process may not provide the best quality for a monaural coder/decoder, depending on various conditions of sound sources and microphones. The active-downmix process, proposed by NTT and Orange, and finally specified in IVAS, has two modes that are adaptively switched according to the characteristics of input signals. One mode is simply an adaptively weighted sum of two channels, and the other has phase compensation before the weighted addition. For both cases, a bigger weight is applied for the preceding channel.

3. Quality evaluation

3.1 Subjective quality of IVAS

Subjective evaluations by independent organizations are essential for deploying standard technologies to actual applications. For floating-point implementation, evaluation was carried out for various conditions (many languages, recording conditions, and formats), and the results have been analyzed and reported [4]. Only one example of these results for stereo signals of music or mixed signals is shown in Fig. 4, comparing the IVAS stereo coding with independent monaural EVS coding. The vertical axis shows the average and 95% confidential interval of scores in a five-point-scale differential mean opinion score (DMOS) (5: No impairment, 4: Small impairment, 3: Moderate impairment, 2: Large impairment, 1: Very large impairment) against the reference, considering both differences in the sound quality and in spatial representation. The horizontal axis shows the total bit rates. We can see that the IVAS stereo coding provided significantly better quality than the independent coding. Note that the lowest bit-rate coding consumed only 1/77 that of the original waveform in linear pulse code modulation.

Fig. 4. An example of subjective listening test for IVAS stereo coding. Extracted from 3GPP TS26.997.

3.2 Subjective quality of downmix

Quality-evaluation experiments were carried out at NTT, which involved comparing the standardized active downmix and passive downmix, as shown in Fig. 5. The vertical axis is also a five-point-scale DMOS, but we see the maximum scores saturated around 4 because the reference is always stereo, while the evaluated signals were always monaural. We used 90 input signals, which are categorized into clean speech, music, and noisy speech. The distance of stereo microphones varied from 5 to 200 cm, and the number of listeners was 24.

Fig. 5. Subjective quality evaluation for downmixing. c: clean speech, m: music, n: noisy speech.

The results indicate that the active downmix provides significantly better quality than the passive downmix at 24.4 kbit/s for all categories and is significantly better for music signals at 13.2 kbit/s [5].

4. Future perspective

As EVS has been widely used worldwide for telephone networks, IVAS is assumed to be used for IP (Internet protocol) Multimedia Subsystem (IMS), an international standard specifying IP telephone networks provided by telecommunication operators. However, it can also be helpful for various devices in combination with the best-effort network. The best-effort network is extremely flexible and easily available, but the connection quality may sometimes be poorer than that of IMS. It is not feasible to standardize all functionalities of applications in IMS. It would therefore be best to combine IMS for speech communication and the Internet for flexible and prompt extensions of various immersive services. To this end, activities are essential for defining interoperable protocols and application programming interfaces between WebRTC and IMS [6, 7].

5. Conclusion

We introduced 3GPP IVAS standardization, focusing on downmixing to ensure a compatible connection to legacy smartphones on the basis of EVS. We expect IVAS to be widely used for various immersive or multi-stream communication services.

References

[1]	M. Dietz, M. Multrus, V. Eksler, V. Malenovsky, E. Norvell, and H. Pobloth, “Overview of the EVS Codec Architecture,” Proc. of the 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2015), South Brisbane, QLD, Australia, 2015, pp. 5698–5702. https://doi.org/10.1109/ICASSP.2015.7179063
[2]	3GPP TS 26.253: Codec for Immersive Voice and Audio Services (IVAS); Detailed Algorithmic Description including RTP payload format and SDP parameter definitions
[3]	S. Bruhn, T. Toftgård, S. Döhla, H-Y. Su, L. Laaksonen, T. Moriya, S. Ragot, H. Ehara, M. Szczerba, I. Varga, A. Schevciw, and M. Jelinek, “3GPP IVAS Codec – Perspectives on Development, Testing and Standardization,” Proc. of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2025), Hyderabad, India, Apr. 2025. https://doi.org/10.1109/ICASSP49660.2025.10889939
[4]	3GPP TR 26.997: Codec for Immersive Voice and Audio Services (IVAS); Performance characterization
[5]	T. Moriya, S. Ragot, A. Lefort, A. Guérin, N. Harada, and R. Sugiura, “Stereo Downmix in 3GPP IVAS for EVS Compatibility,” Proc. of ICASSP 2025, Hyderabad, India, Apr. 2025. https://doi.org/10.1109/ICASSP49660.2025.10888010
[6]	H. Eitoku, “Standardization Trends of Northbound APIs in 3GPP,” NTT Technical Review, Vol. 22, No. 4, pp. 54–57, Apr. 2024. https://doi.org/10.53829/ntr202404gls
[7]	Y. Inoue and R. Suzuki, “Standardization Trends in Real-time Communications at 3GPP,” NTT Technical Review, Vol. 21, No. 1, pp. 34–38, Jan. 2023. https://doi.org/10.53829/ntr202301gls

Takehiro Moriya: NTT Fellow, NTT Communication Science Laboratories.
He received a B.S., M.S., and Ph.D. in mathematical engineering and instrumentation physics from the University of Tokyo in 1978, 1980, and 1989. Since joining NTT laboratories in 1980, he has been engaged in research on speech and audio coding. In 1989, he joined AT&T Bell Labs as a visiting researcher. Since 1990, he has contributed to the standardization of coding schemes for the Japanese Public Digital Cellular system, ITU-T (International Telecommunication Union - Telecommunication Standardization Sector), ISO/IEC MPEG, and 3GPP. He received prestigious awards, including the Medal with Purple Ribbon and the IEEE J. Flanagan Speech and Audio Processing Award. He is an IEEE Life Fellow, fellow and honorary member of the Institute of Electronics, Information and Communication Engineers (IEICE), and honorary member of the Acoustical Society of Japan.

Noboru Harada: Senior Distinguished Researcher, NTT Communication Science Laboratories.
He received a B.S. and M.S. in computer science and systems engineering from the Kyushu Institute of Technology, Fukuoka, in 1995 and 1997, and Ph.D. in computer science from the University of Tsukuba, Ibaraki, in 2017. Since joining NTT in 1997, he has been involved in research on speech and audio signal processing, such as high-efficiency coding, lossless compression, and acoustic event detection, including anomaly sound detection. He is a senior member of IEEE, a member of IEICE, Information Processing Society of Japan, and Acoustical Society of Japan.

↑ TOP