Audio over IP and the Future of Television

By Ken Tankel on Oct 29, 2015 12:00:00 AM


Ken Tankel, Platform Manager, TV Solutions Group

Video and audio were handled separately for years before SDI video with embedded audio came into use. Embedded audio appeared to be a step forward. The reality is that embedded audio has not reduced lip sync issues, and metadata is still easily separated from the audio that it describes. This is particularly unfortunate because metadata will become more essential with new audio services. Audio embedding and de-embedding were never perfect and they remain limited. SDI is, at heart, a video format and it cannot support the future of audio. Channel-based audio is heading for replacement by carriage of the objects that make up the channels. Multiple languages, emergency audio, and services for the blind are all competing for space in broadcast services. These expanded audio services provide flexibility and enhanced consumer experiences for broadcast and OTT services and even handheld devices. AES67, Livewire+ and related standards offer a path to making all of this work, including lip sync! AES and SMPTE are working together, and the results will enable sub-sample accurate linking of video and Audio over IP (which has existed in radio for over a decade and is growing in TV). None of this requires audio and video to be glued together until final delivery.
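To make "sub-sample accurate" concrete, here is a minimal sketch of the arithmetic involved when audio samples and video frames are both counted against one shared clock. The 48 kHz audio rate and the 30000/1001 (29.97 fps) frame rate are assumptions chosen for illustration, and the function names are hypothetical rather than taken from any standard.

```python
from fractions import Fraction

AUDIO_RATE_HZ = 48_000                     # assumed audio sample rate
FRAME_RATE_HZ = Fraction(30_000, 1_001)    # assumed 29.97 fps video


def samples_per_frame() -> Fraction:
    """Exact number of audio samples that span one video frame."""
    return AUDIO_RATE_HZ / FRAME_RATE_HZ


def first_sample_of_frame(frame_index: int) -> Fraction:
    """Audio sample position (possibly fractional) where a frame starts."""
    return frame_index * samples_per_frame()


def offset_in_ms(samples: int) -> float:
    """Convert an audio offset in samples to milliseconds."""
    return samples * 1000 / AUDIO_RATE_HZ


if __name__ == "__main__":
    print(samples_per_frame())         # samples per frame as an exact fraction
    print(first_sample_of_frame(5))    # frame boundaries realign every 5 frames
    print(offset_in_ms(1))             # duration of a single sample in ms
```

Because the ratio is not a whole number, frame boundaries fall between samples on most frames. Exact fractions avoid the rounding drift that floating-point accumulation would introduce over a long program.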

The Past

Since the first analog television facilities were constructed, the broadcast chain has consisted of a string of devices. Each device was designed to accomplish a highly specialized task. Closed captioning, stills, graphics, squeeze, crawls, bugs, audio processing and video encoding are each done in a dedicated piece of hardware. There are also a large number of utility, or “glue”, products such as frame syncs, distribution amplifiers, and audio de-embedders and re-embedders, along with the constant need to maintain audio and video synchronization. Along with all the dedicated-function hardware devices come a large number of signal types, including video, analog audio, digital audio, encoded audio, RS-422 control, metadata, time code, GPI, video sync, digital audio sync, and Ethernet. And, if this is not enough to keep track of, there are also a staggering number of protocols, cable types, and connector types required. Designing a system with all of these specialized products and signal types is not intuitive. It is installation intensive. There are many points of failure.

The Present

A consolidation of functionality has been going on in the television broadcast industry.  It is affecting everything from video switchers to audio consoles, signal routing, and the air chain.  This consolidation is reducing the number and types of hardware devices, cables, connectors and protocols needed to deliver a finished program to air.  There are two major advances that are driving this consolidation.  First is widespread use of Ethernet for file distribution, device control, real time video and real time audio delivery.  Second is the increasing power and storage of open, reliable, IT platforms.    

Large increases in computing power have led to big advances in both video and audio processing. In many cases the large leaps in relatively inexpensive computing power have allowed video and audio processing to evolve from one specialized box per function to multiple functions in a single box.   All of these advances allow the broadcaster to use fewer devices in the air-chain.  The advantages of these advances are: higher density, reduced space requirements, less AC power, reduced cooling requirements, less wiring, better system management and faster design and installation.  

The Future

In many regions of the world 5.1 channel audio is already common and is expected by the audience. Standards for the broadcast of 7.1 channels, and more, are currently being worked on. The idea of delivering an immersive audio experience to the home, by providing audio objects and metadata that can place these objects in anything from stereo to 11 (or more) channels of playback, is already well developed. An IP infrastructure can be expanded to handle additional audio objects and associated metadata with almost no practical limit.
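A rough calculation shows why adding audio channels puts little strain on an IP network. The sketch below estimates the on-wire bit rate of one uncompressed audio stream and how many such streams fit on a link. The figures are assumptions for illustration: 48 kHz sampling, 24-bit linear PCM and 1 ms packets (settings commonly used with AES67), standard RTP/UDP/IPv4/Ethernet header sizes with no VLAN tag, and an arbitrary 50% cap on link utilisation. The function names are hypothetical.

```python
SAMPLE_RATE_HZ = 48_000      # assumed sample rate
BYTES_PER_SAMPLE = 3         # assumed 24-bit linear PCM
PACKETS_PER_SECOND = 1_000   # assumed 1 ms packet time

# Per-packet overhead in bytes.
RTP_HEADER = 12
UDP_HEADER = 8
IPV4_HEADER = 20
# Ethernet header (14) + FCS (4) + preamble/SFD (8) + inter-frame gap (12).
ETHERNET_ON_WIRE = 38


def stream_bits_per_second(channels: int) -> int:
    """On-wire bit rate of one stream carrying `channels` audio channels."""
    samples_per_packet = SAMPLE_RATE_HZ // PACKETS_PER_SECOND
    payload = samples_per_packet * BYTES_PER_SAMPLE * channels
    packet = payload + RTP_HEADER + UDP_HEADER + IPV4_HEADER + ETHERNET_ON_WIRE
    return packet * 8 * PACKETS_PER_SECOND


def streams_per_link(link_bps: int, channels: int, utilisation: float = 0.5) -> int:
    """Streams that fit on a link while using only a fraction of its capacity."""
    return int(link_bps * utilisation // stream_bits_per_second(channels))


if __name__ == "__main__":
    bps = stream_bits_per_second(8)
    print(f"8-channel stream: {bps / 1e6:.2f} Mb/s")
    print("streams on half of a 1 Gb/s link:", streams_per_link(1_000_000_000, 8))
```

Under these assumptions an eight-channel stream needs just under 10 Mb/s, so a single gigabit port carries several hundred channels with headroom to spare.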

There are many reasons for the interest in increasing the number of audio channels that can be broadcast with a video program. Increased realism, or an immersive audio experience, is only one of them. The ability to simultaneously deliver home and away team commentary with venue atmosphere is an example of a new way to serve the large audience for sports. Descriptive audio tracks for the sight impaired are another capability. Handling multiple languages and multiple levels of emergency information is another. The ability to deliver audio as a combination of basic tracks and additional audio objects will allow the audio presentation to be adapted, on the fly, to match the audio system of each viewer. This new audio model can accommodate stereo and various multichannel systems in the home, as well as OTT viewing and delivery to the myriad home and portable devices that viewers use. The new generation of multichannel and object-based audio delivery can provide a personalized audio experience on anything from earbuds to a handheld device, laptop, TV, or 11+ channel home theater environment. AoIP can provide the flexibility broadcasters need to deliver programs with an expanding number of audio channels, audio objects and metadata.
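The idea of adapting one set of audio objects to different playback systems can be illustrated with a small sketch. It uses ordinary constant-power panning to place each object in a stereo field and falls back to a plain sum for mono. This is a generic textbook technique, not the renderer of any particular object-audio standard, and the `AudioObject` fields (`pan`, `gain`) are hypothetical metadata invented for the example.

```python
import math
from dataclasses import dataclass


@dataclass
class AudioObject:
    name: str
    pan: float          # hypothetical metadata: -1.0 full left, +1.0 full right
    gain: float = 1.0   # hypothetical metadata: linear gain


def stereo_gains(obj: AudioObject) -> tuple[float, float]:
    """Constant-power pan: left^2 + right^2 stays equal to gain^2."""
    angle = (obj.pan + 1.0) * math.pi / 4.0   # maps [-1, 1] to [0, pi/2]
    return obj.gain * math.cos(angle), obj.gain * math.sin(angle)


def render_gains(objects: list[AudioObject], layout: str) -> dict[str, tuple[float, ...]]:
    """Per-object speaker gains for the listener's layout ('mono' or 'stereo')."""
    if layout == "mono":
        return {o.name: (o.gain,) for o in objects}
    if layout == "stereo":
        return {o.name: stereo_gains(o) for o in objects}
    raise ValueError(f"unsupported layout: {layout}")


if __name__ == "__main__":
    programme = [
        AudioObject("venue atmosphere", pan=0.0),
        AudioObject("home commentary", pan=-0.3, gain=0.8),
    ]
    print(render_gains(programme, "stereo"))
    print(render_gains(programme, "mono"))
```

The same object list and metadata feed both layouts; only the final rendering step changes, which is what lets one delivered program serve earbuds and a home theater alike.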

In addition to the demands of new audio services there are now many formats for video over IP that are competing for attention.  An interface to standardized AoIP for SDI video, and any other new video format carried over any medium, offers many advantages.

Separating video from AoIP streams has several consequences. Video does not need to be run through a facility to reach loudness management and audio processing devices. Network cables are all that is needed for audio distribution. The video/audio interface is separate from the audio processing platform and can be changed, as needed, while maintaining the same audio backend. Likewise, audio processing can be changed without changing the video interface. How many times in the past has every item in the air chain been replaced to accommodate a new video format? The industry has gone from composite to 270 Mb/s video, to 1.5 Gb/s video, to 3 Gb/s video, and now 4K and 8K are already here.

AoIP interfaces that provide GPIO, AES digital audio, analog audio and sync connections can be placed anywhere within reach of the network. That means it is possible to place interfaces of the appropriate type right where they are needed. This can eliminate separate runs of sync, RS-422, time code, GPI and audio cabling. All of this functionality is handled by the same AoIP connection. It also means that national emergency announcements, local emergency audio, and local audio cut-ins can easily be added using AoIP interfaces at the edges of the AoIP network.

AoIP also provides universal contribution and access.  That is, a dedicated audio router is no longer required to allow access to a signal on an AoIP network.  Any AoIP enabled device, or a computer with an AoIP driver, can put audio on the network and receive audio from the network.  The audio output of control rooms, edit bays, and studios can be used anywhere and heard anywhere. 


Ethernet control of many broadcast building blocks is already common. The growing capabilities of Ethernet for delivering broadcast content and controlling broadcast hardware are making it possible to consider broadcast plants with far fewer cables and cable types than has ever been possible before. AoIP makes it possible to build reliable and flexible facilities and reduce costs too. The flexibility to easily expand an AoIP-based system to accommodate new audio requirements and new video formats will be critical in the coming years.

Topics: AES67, Audio over IP, AoIP for Television
