Implementing startOffsetTime for HTML5
In this first of two technical blog posts on our recent work on P2P Next, we explain why and how we implemented the HTML5 media element attribute startOffsetTime in Firefox to enable accurate synchronisation of out-of-band timestamped metadata with media streamed live to a browser over the internet.
First we'll explain our rationale for why we need the functionality this attribute makes possible. Then, we'll go into some technical detail as to how Firefox and Chrome currently interpret the specification. We'll explain how we built a proof of concept implementation in Firefox. Finally we'll state our position on the interpretation of the specification and highlight some challenges in getting this implemented more generally.
The reason for setting this out is that we'd like to see consistent support for startOffsetTime across all commonly used codecs and for browser vendors to bring their implementations into line with the published HTML5 media elements specification. There are ambiguities in the specification itself, such as the interpretation of 'earliest seekable position', which could be clarified, especially with respect to continuous live streaming media. Browser vendors need to agree on a common interpretation of attributes such as currentTime so others can experiment with the exciting possibilities this new technology is opening up.
Background
One of the recurring themes of our work over the past few years has been synchronising web-based media with audio/video.
We've tried various techniques, each with their own pros and cons. Visualising radio, for example, relied on the fact that audio streamed live on the web is delayed relative to live FM by about 30 seconds. The solution we implemented was quite rough and ready and didn't take variable network latency into account - not everyone will experience the same amount of delay and buffering introduces even more unpredictability. We compensated by designing the experience not to depend on accurate synchronisation. Still, we knew we could do better.
For our Autumnwatch trial, we had a person manually trigger the synchronised events against a live broadcast. While this was pretty accurate, it's obviously costly in terms of people and doesn't address the fact that DVB and live streamed video are delayed with respect to live TV.
The Secret Fortune trial carried out last year used audio fingerprinting. This can be a reasonably accurate way of synchronising with live TV but has issues. In particular, it's costly to implement and does not handle network latency and buffering well.
Analogue broadcasts and DVB/DAB are reasonably predictable, though receiver decoding and onboard digital image processing introduce their own delays. A trickier problem to solve is variable network latency over the internet.
What do we mean by 'out-of-band metadata'?
In the context of media streams, 'out-of-band' data means any data not sent in the same data stream as the audio and video. For example, the Matroska container format allows you to embed subtitles in-band along with the audio and video data so you don't need any other files to view subtitles when you play the video. At the same time, it means the media file must contain subtitles for all the possible languages you might want to use and that adding or changing subtitles means re-encoding the file.
An example of out-of-band data would be the subtitle files you can load for a film in a media player like VLC. This data is not contained in the same file as the video. This makes it easier to change subtitles or add new languages - you just need to distribute the updated .SRT file on its own.
Why out-of-band timed metadata?
So why do we want to enable out-of-band timed metadata for live streaming? The primary reason is to support 'second screen' applications for live broadcasts viewed in a web browser.
One important potential application for broadcasters is for live events e.g. sports events such as the Olympic Games. Live video is also becoming more widely used by non-broadcasters, with sites that allow anyone with a webcam to broadcast their own live video stream over the internet.
While there are other standards for interactivity alongside video (e.g. on the web and for internet connected TVs) these haven't yet tackled the problem of true live synchronisation.
Some broadcasters have developed games that can be played on second screens alongside their more popular brands (e.g. Channel 4's Million Pound Drop, and BBC R&D's trial with Secret Fortune) but these rely on the interactivity being triggered from a central source with all devices remaining in sync rather than the devices being synchronised to the video itself.
The BBC has a particular interest in live events, which are likely to be an area where national broadcasters maintain their unique role for some time to come. However, the BBC also has a keen interest in on-demand media, leading the UK market with the iPlayer, so it seems natural to look for ways to move seamlessly from one to the other, which do not currently exist.
What do we need to synchronise with live streams?
To synchronise with a live stream, we need to share a reference clock between the server-side and the client (browser).
Consider how audio and video are synchronised with each other when playing back a media stream. To simplify a little, each audio and video frame has a presentation timestamp in relation to a shared reference clock. On playback, a clock master, usually the sound card which provides a high resolution timer signal (e.g. 44.1 or 48 kHz), is used to calibrate the reference clock and drive the audio and video pipelines. The audio frames are synchronised to this clock master to provide continuous audio while the video frames are served up on a best effort basis (as dropping video frames is less jarring than choppy sound).
When metadata such as subtitles are embedded in an AV container like Matroska or Ogg (using libkate), these metadata are treated in much the same way. They too have presentation timestamps, usually keyed to specific video frames. The main difference with AV frames is that subtitles are discrete events with durations that do not form a continuous timeline. However, they share the same reference clock and are driven by the same clock master.
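The scheduling decision described above can be sketched as follows. This is a simplified illustration and not code from any real player: the names FrameAction and scheduleVideoFrame and the 40 ms tolerance are our own assumptions. The audio clock acts as the master, and each video frame (or subtitle event) is compared against it.

```cpp
#include <cassert>

// What to do with a video frame relative to the audio master clock.
enum class FrameAction { Wait, Present, Drop };

// framePtsSec:   presentation timestamp of the frame, in seconds.
// audioClockSec: current position of the audio clock master, in seconds.
// toleranceSec:  how late a frame may be and still be shown
//                (40 ms is an assumed value, roughly one frame at 25 fps).
FrameAction scheduleVideoFrame(double framePtsSec, double audioClockSec,
                               double toleranceSec = 0.040) {
    assert(toleranceSec >= 0.0);
    const double lateness = audioClockSec - framePtsSec;
    if (lateness < 0.0) {
        return FrameAction::Wait;     // frame is not due yet
    }
    if (lateness <= toleranceSec) {
        return FrameAction::Present;  // due now, show it
    }
    return FrameAction::Drop;         // too late, skip it to keep up with audio
}
```

The audio is never adjusted to suit the video, which is why late video frames are dropped rather than delayed.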
Why can't we use NTP?
NTP (Network Time Protocol) is a widely used protocol to synchronise a computer's system clock with a remote clock. At first glance, it might seem that this is all we need: make sure the client clock is synchronised to the same reference clock as the server clock and Bob's your uncle, everything is synchronised. Unfortunately, this isn't the case for a number of reasons.
The essential problem is that the server clock is not the same as the stream clock. Even if the server and the stream are synchronised on the server-side (an issue in itself), network latency and buffering will cause unpredictable delays so that by the time the stream reaches the client, it will no longer be in sync with the remote clock. This problem exists for any remote external master clock. Another more subtle problem is that NTP does not guarantee monotonic time - it will occasionally adjust the clock backwards, which would result in stuttering video playback.
Issues with live synchronisation
For on-demand video it's relatively easy to synchronise interactivity as the current playback time is defined in terms of an offset from the start of the media and the client can receive the whole package of timed events in one go.
For live streaming video there are a number of challenges to address:
- A user can join the stream at any time so they won't have received the history of events that have already taken place which may be critical to the display of the interactive media at that point
- They may have connected to one of many stream servers which has started streaming at any time in the past
- Most of the events you want to synchronise with the live media stream may not have happened by the time the user joins the stream so you cannot provide them up front (though you may want to provide expected events such as programme changes in advance)
- In the majority of cases users will join part way through a stream, as there will nearly always be a back history of events of some kind (e.g. social media commentary on a live programme may begin long before the broadcast event begins)
- For live streaming there is a similar issue when a user hits pause: a client device can be configured to record all events during the time a media stream is paused, but unless it has a way to resynchronise to the same clock used by that media stream it will have no way to resynchronise those events
- Furthermore, in a production environment, media streams will usually be served by multiple streaming servers which will have been started at different times. So each stream will need its own clock reference
HTML5 timeline
Current implementations of HTML5 media elements on browsers like Firefox and Chrome expose the media timer (via the timeupdate event) so you can attach an event handler and use that to synchronise external timed metadata. The origin of this timeline, in all existing implementations, is time zero. This is fine for discrete fixed size or on-demand media as such media have a definite origin and duration, i.e. they start at zero time and all subsequent times are relative to that start time. This makes it fairly straightforward to synchronise external timed metadata. We can simply stamp the metadata with a time relative to the start of the media, then fire the event at the right time in the browser.
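As a minimal sketch of that on-demand case, the logic that would run on each timer update looks something like the following. In a browser this would live in a JavaScript timeupdate handler. It is written in C++ here only to match the other code in this post, and Cue and dueCues are hypothetical names rather than part of any browser API.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// A piece of timed metadata stamped relative to the start of the media.
struct Cue {
    double timeSec;       // seconds from media time zero
    std::string payload;  // whatever the event carries
};

// Returns the cues that became due since the previous timer update, i.e.
// those with previousSec < timeSec <= currentSec. The cues must be sorted
// by time. Pass a negative previousSec on the first call so that a cue at
// time zero is not missed.
std::vector<Cue> dueCues(const std::vector<Cue>& cues,
                         double previousSec, double currentSec) {
    assert(std::is_sorted(cues.begin(), cues.end(),
                          [](const Cue& a, const Cue& b) {
                              return a.timeSec < b.timeSec;
                          }));
    std::vector<Cue> due;
    for (const Cue& cue : cues) {
        if (cue.timeSec > currentSec) {
            break;  // sorted input: nothing later can be due yet
        }
        if (cue.timeSec > previousSec) {
            due.push_back(cue);
        }
    }
    return due;
}
```

This only works because every cue time and every timer value is measured from the same origin, the start of the file.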
With live streaming media, things are not so obvious. What do we consider to be the zero time of the media? Is it the time we started streaming the media (this is Chrome's current interpretation)? Or is it the time the browser joined the stream (this is Firefox's)? Note that these are both essentially arbitrary times as a stream can be started at any time (due to failover or restarting, etc.) and a browser can join at any time. In either case, how do we convert the relative time into the stream into a time we can stamp on external metadata?
We need to know the time corresponding to the first frame served on that specific stream. Note that there can be more than one streaming server serving the same media for failover or load balancing purposes. So we need a separate time origin for each (they won't all start at exactly the same time).
If we use the system's wallclock time as the reference clock, we can propagate the server time to the client and synchronise events based on the server's clock. This works where we want to synchronise with the server clock but that is a limited use case. More generally, we want to derive the clock reference from the input media (such as a DVB programme reference clock) and share that with the output stream and the metadata. This enables us to synchronise pre-prepared timed events along with timed metadata generated at transmission time (for example in the studio gallery) with the live stream.
The HTML5 media elements specification specifically addresses our use case in the shape of the startOffsetTime attribute.
The startOffsetTime attribute must return a new Date object representing the current timeline offset.
which is defined as:
Some video files also have an explicit date and time corresponding to the zero time in the media timeline, known as the timeline offset.
This sounds great - just what we need. Unfortunately, there's just one snag: none of the open source browsers has implemented it yet.
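To illustrate what a timeline offset makes possible, here is a minimal sketch of how it could be combined with the playback position. It assumes the offset is available as milliseconds since the Unix epoch and that the playback position is in seconds measured from that offset. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Absolute time, in milliseconds since the Unix epoch, of the frame
// currently being shown.
//   timelineOffsetMs: date and time of media time zero (assumed to be the
//                     value startOffsetTime would report).
//   currentTimeSec:   playback position in seconds on the media timeline.
int64_t absoluteTimeMs(int64_t timelineOffsetMs, double currentTimeSec) {
    assert(currentTimeSec >= 0.0);
    return timelineOffsetMs + std::llround(currentTimeSec * 1000.0);
}

// True once playback has reached an event stamped with an absolute time.
bool isEventDue(int64_t eventTimeMs, int64_t timelineOffsetMs,
                double currentTimeSec) {
    return eventTimeMs <= absoluteTimeMs(timelineOffsetMs, currentTimeSec);
}
```

Because the comparison is made in absolute time, two viewers connected to streaming servers that started at different times would still fire the same event against the same video frame, provided each stream carries its own correct offset.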
What we did to implement startOffsetTime
Now, to make this work we need to do two things: 1) provide a timeline offset in the stream on the server side and 2) interpret that offset in the browser codec and make it available to Javascript.
Due to our interest in HTML5, we wanted to use an open container and codecs, which for streaming meant either WebM or Ogg Theora + Vorbis.
WebM is an open video format that marries the VP8 video codec with the Vorbis audio codec in a media container format. As a version of the Matroska container format, WebM supports setting an origin for the timeline in the form of the DateUTC field in the header.
Ogg also supports setting a timeline origin in the UTC header field of an Ogg Skeleton bitstream, which is a "logical bitstream within an Ogg stream that contains information about the other encapsulated logical bitstreams".
In practice, neither Firefox nor Chrome actually uses this field in either WebM or Ogg. To demonstrate our use case we decided to implement the DateUTC field in WebM because it was simpler than learning how to create and decode a separate logical bitstream in an Ogg container.
WebM DateUTC
To find out how to implement this, we needed to dig into the specifications. According to the WebM documentation, DateUTC is 'Supported' so we are on safe ground implementing it.
The Matroska specification describes the DateUTC header field as "Date of the origin of timecode (value 0), i.e. production date." The date type it uses is defined as:
Signed, 64-bit (8 byte) integer describing the distance in nanoseconds to the beginning of the millennium (2001-01-01 00:00:00 UTC).
As the HTML5 specification states that startOffsetTime is a JavaScript Date object based on the Unix epoch (1 Jan 1970) with a precision of milliseconds, we had to convert between the two standards.
Patching gstreamer's matroskamux
To serve the live streams, we used Flumotion, which wraps web-streaming around the GStreamer multimedia framework.
As we are using GStreamer with Flumotion to encode into the WebM format, we needed to check the implementation of matroskamux to see how it handled the DateUTC field.
matroskamux is hard-wired to write the current time (i.e. the time at which the component is instantiated) into the DateUTC field. This isn't really what we needed so we modified this component to add a date-utc property which we could set when we started up the encoding pipeline. We also patched it to broadcast each buffer's timestamp via UDP to be picked up by our event server to synchronise the event stream.
While the first modification is generally useful, the latter is really only appropriate to our experimental set up. Ideally, we want to set the timeline origin from a variety of sources, the most useful being the input media stream. A specific example would be propagating the programme reference clock from an MPEG-TS input stream. We would like to investigate this in future work.
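Whatever the source of the origin, the value written into DateUTC has to be in Matroska's date format: nanoseconds relative to 2001-01-01 00:00:00 UTC. The following is a minimal sketch of that conversion for an origin supplied as Unix milliseconds. It is not the code from our matroskamux patch, and the function name is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Seconds between the Unix epoch (1970-01-01) and the Matroska epoch
// (2001-01-01), both at 00:00:00 UTC.
constexpr int64_t kMatroskaEpochUnixSec = 978307200LL;
constexpr int64_t kMsecPerSec = 1000LL;
constexpr int64_t kNsecPerMsec = 1000000LL;

// Converts milliseconds since the Unix epoch into a Matroska date value:
// signed nanoseconds relative to the Matroska epoch.
int64_t unixMsToMatroskaDate(int64_t unixMs) {
    const int64_t msSinceMatroskaEpoch =
        unixMs - kMatroskaEpochUnixSec * kMsecPerSec;
    // Scaling to nanoseconds must stay within the signed 64-bit range.
    assert(msSinceMatroskaEpoch <= INT64_MAX / kNsecPerMsec);
    assert(msSinceMatroskaEpoch >= INT64_MIN / kNsecPerMsec);
    return msSinceMatroskaEpoch * kNsecPerMsec;
}
```

Times before 2001 come out negative, which the signed 64-bit date type allows.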
Firefox patches
On the Firefox side, we needed to implement both the DOM interface to startOffsetTime and the decoder support behind it. Implementing the DOM interface was quite straightforward. We were able to read through the implementation of currentTime to see how the DOM connected up with the decoder. For example, the code that reads the startOffsetTime attribute looks like this:
/* readonly attribute double startOffsetTime; */
NS_IMETHODIMP nsHTMLMediaElement::GetStartOffsetTime(double *aStartOffsetTime)
{
  *aStartOffsetTime = mDecoder ? (mDecoder->GetStartOffsetTime() / 1000.0) : 0;
  return NS_OK;
}
Changes to libnestegg
The code for reading the DateUTC field is only slightly more complicated due to the difference in epochs between Matroska and HTML5:
int64_t date_utc = 0;
r = nestegg_date_utc(mContext, &date_utc);
if (r == 0) {
  // convert from matroska epoch to unix epoch
  // and nanoseconds to milliseconds
  const int64_t NSEC_PER_SEC = 1000000000LL;
  const int64_t EBML_DATE_OFFSET = 978307200LL * NSEC_PER_SEC;
  const int64_t NSEC_PER_MSEC = 1000000LL;
  date_utc += EBML_DATE_OFFSET;
  date_utc /= NSEC_PER_MSEC;
  ReentrantMonitorAutoEnter mon(mDecoder->GetReentrantMonitor());
  mDecoder->GetStateMachine()->SetDateUTC(date_utc);
}
where nestegg_date_utc is a function we added to read the DateUTC field out of the Matroska header.
Today - proof of concept. Tomorrow...?
It turned out to be quite straightforward to implement startOffsetTime in Firefox and matroskamux. The existing specifications for WebM, Ogg and HTML5 provide the necessary data definitions - we just needed to hook them up together. Being able to set our own origin to the media timeline greatly simplifies the task of synchronising to live media over the web.
Our proof of concept is just a start. We'd like to see the startOffsetTime attribute implemented in all HTML5 compliant browsers and major encoders. However, it's not just a matter of copying a field from a stream into a data structure. We all need to agree on what these bits of data mean.
In the next post, we'll look at how Firefox and Chrome differ in their interpretations of the currentTime attribute and why we think Firefox is right.
Comment number 1.
At 1st Feb 2012, Chris Howson wrote: Thanks for the insight into the browser-based approach for synchronising Internet media with audio/video. I have been involved in similar research, though aimed more at hybrid broadcast-broadband delivery. We synchronise second screen content to broadcast video by means of a timeline component added to the broadcast MPEG2-TS; rather like subtitle insertion. This timeline expresses the progress of time in the ongoing program. By interrogating the rendering device (e.g. set-top box), a second screen application is able to determine the temporal position of the displayed broadcast video and to synchronise its web-sourced content accordingly. Our work has targeted use cases with exacting synchronisation requirements, such as an alternative view on a tablet accompanying a broadcast sport or music event on the TV. More detail may be found in a separate article.
Whilst our work has focused on hybrid delivery, it could also be employed for your Internet streaming use case, as the timeline may be transported over IP protocols. What I find appealing about the HTML5 solution, though, is compatibility with the ubiquitous web browser, once startOffsetTime is implemented.
Here’s looking forward to your next post.
Comment number 2.
At 1st Feb 2012, Odin Hørthe Omdal wrote: Oh, great stuff!
You might recall my name from the different bugs you linked. I have since stopped my small HTML5 live video streaming business, and came to work for Opera Software instead.
However, I'm still very interested in startOffsetTime, and a big part of why I started in a browser company was because I found web standards very interesting to follow. ;-)
Comment number 3.
At 5th Jun 2012, Odin Hørthe Omdal wrote: startOffsetTime has been renamed to the much clearer startDate. And it's not needed for most basic live streaming + synchronization now (not involving CDNs and whatnot), because currentTime will give you how long it was since the stream was started.