Ogg is a multimedia container format invented by the Xiph.Org Foundation. It is also a deeply flawed format. One of its many flaws relates to timestamps, the aspect of Ogg I shall explore in this article.
The Ogg format splits elementary stream data into a sequence of packets which are then distributed arbitrarily across pages. A page can contain any number of packets, and a packet can span any number of pages. This two-level packetisation scheme is used since the packet headers would otherwise, due to design shortfalls elsewhere, become prohibitively large.
Timestamps in Ogg
Each Ogg page (not packet) header contains a timestamp, or granule position in Ogg terms, encoded as a 64-bit number. The precise interpretation of this number is not defined by the Ogg specification; it depends on the codec used for each elementary stream. The specification does, however, tell us one thing:
The position specified is the total samples encoded after including all packets finished on this page (packets begun on this page but continuing on to the next page do not count).
The meaning of samples is, again, left unspecified. It is merely suggested that it could refer to video frames or audio PCM samples.
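To make the location of this field concrete, here is a minimal Python sketch that reads the granule position from a single page. It assumes the fixed 27-byte page header layout of RFC 3533 (capture pattern, version, header type, granule position, serial number, sequence number, checksum, segment count, all little-endian). The function name and the hand-built example page are illustrative only and are not taken from any library.

```python
import struct

# Fixed-size part of an Ogg page header: 27 bytes, little-endian (RFC 3533).
PAGE_HEADER = struct.Struct("<4sBBqIIIB")


def read_granule_position(page: bytes) -> int:
    """Return the 64-bit granule position stored in one Ogg page header."""
    if len(page) < PAGE_HEADER.size:
        raise ValueError("truncated page header")
    (capture, version, _header_type, granule_position,
     _serial, _sequence, _checksum, _segment_count) = PAGE_HEADER.unpack_from(page)
    if capture != b"OggS" or version != 0:
        raise ValueError("not an Ogg page")
    return granule_position


# Hand-built header for illustration: granule position 48000, no segments.
example_page = PAGE_HEADER.pack(b"OggS", 0, 0, 48000, 1, 0, 0, 0)
print(read_granule_position(example_page))  # 48000
```

Note that the number returned is all the container provides; what it means is decided by the codec mapping.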
Timestamping the end of packets, instead of the start, is impractical for a number of reasons including, but not limited to, the following:
- Scheduling decoded samples for playback is more easily done based on the desired start time than on the end time.
- Virtually every other container format ties timestamps to the start of the first following sample. Doing it differently only complicates players and other tools supporting multiple formats without providing any advantage.
- Inferring the timestamp of the first sample of the stream is impossible without first decoding, at least partially, every packet in the first page.
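The last point can be illustrated with a small Python sketch. It assumes a sample-counting granule position of the kind described above, and that the caller has already obtained the decoded sample count of every packet completed on the page, which is exactly the step that requires at least partial decoding. The function and parameter names are hypothetical.

```python
def packet_start_positions(page_granule, samples_per_packet):
    """Derive start positions by working backwards from the page's end position.

    page_granule       -- granule position of the page (end of its last packet)
    samples_per_packet -- decoded sample count of each packet completed on the
                          page, in stream order (requires inspecting each packet)
    """
    starts = []
    position = page_granule
    for count in reversed(samples_per_packet):
        position -= count          # step back over one packet
        starts.append(position)
    starts.reverse()               # restore stream order
    return starts


# Three packets of 512, 1024 and 512 samples ending at position 2048.
print(packet_start_positions(2048, [512, 1024, 512]))  # [0, 512, 1536]
```

With start-based timestamps, none of this bookkeeping would be needed to find where the first packet begins.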
As mentioned previously, the meaning of the 64-bit timestamps associated with an elementary stream depends on the codec of the stream. I conducted a survey of codecs with defined Ogg mappings looking specifically at their timestamp definitions.
The Vorbis specification includes a convoluted definition of the granule position. In summary, it defines the timestamp as the number of PCM samples obtained by decoding the stream from the beginning, up through the last packet completed on the page.
The FLAC-in-Ogg specification defines its Ogg timestamp as being “same as Vorbis”, and for clarification offers a broken link to the Vorbis spec.
The Ogg mapping for Speex states only that “the granulepos is the number of the last sample encoded in that packet.” There is not even an explanation of which packet this refers to. Speex being a Xiph-made codec, a reasonable assumption is that semantics equal to those for Vorbis and FLAC are intended.
Theora, the much-hyped video codec abandoned by On2 and adopted by Xiph, naturally needs its very own timestamp format. In the usual Ogg manner, the timestamp refers to the end of the display interval of the last frame obtained after decoding the last packet of the page, but not without a twist. The 64-bit timestamp is split into two fields. The first of these fields encodes the frame number, starting from one, of the key frame most recently preceding the frame to which the timestamp applies. The second timestamp field encodes the number of frames, this time starting from zero, since the most recent key frame. The bit position of the field split is specified by the Theora stream header.
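A minimal Python sketch of this split follows. It assumes the first field occupies the high-order bits and the second the low-order bits, with the split position passed in as `shift`, a parameter name chosen here for the value read from the stream header. The frame rate in the usage example is likewise an assumed value, not something the granule position carries.

```python
from fractions import Fraction


def split_theora_granule(granule_position: int, shift: int):
    """Split a Theora granule position into its two fields."""
    keyframe_number = granule_position >> shift               # one-based
    frames_since_keyframe = granule_position & ((1 << shift) - 1)
    return keyframe_number, frames_since_keyframe


def theora_end_time(granule_position: int, shift: int, frame_rate: Fraction):
    """End of the display interval, in seconds, of the timestamped frame."""
    keyframe_number, frames_since_keyframe = split_theora_granule(
        granule_position, shift)
    frames_shown = keyframe_number + frames_since_keyframe
    return frames_shown / frame_rate


# Key frame number 65, three frames later, split at bit 6.
granule = (65 << 6) | 3
print(split_theora_granule(granule, 6))  # (65, 3)
```

Because the split position lives in the stream header, a page's timestamp cannot be interpreted in isolation; the header must have been parsed first.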
Of the handful of codecs including a specification for use in the Ogg container, Dirac stands out as the only one not invented/branded by Xiph. Rather, it is a product of the BBC.
The Dirac-in-Ogg specification presents a confusing read. First, we are informed that
The unit of encapsulation for Ogg is the packet; a packet of Dirac shall contain:
— zero or more non-picture Dirac data units
followed by a single:
— Dirac picture data unit, or
— Dirac end-of-sequence data unit
NOTE It follows that no Dirac data unit shall span multiple packets.
Next, we are treated to this rule:
In a logical stream of Dirac, an Ogg page should not terminate multiple Ogg packets.
So far, so good. Proceeding to the next sub-clause, still on the same page, things get interesting:
The granule position applies to the picture contained within the first packet that terminates in an Ogg page; any subsequent packets that terminate within the same page do not have a granule position.
Stop right there. Only moments ago, we were told that at most one packet may terminate in any page. There can be no “subsequent packets that terminate within the same page.” That sentence provides nothing but confusion. Perhaps an earlier version of the specification allowed multiple packets per page, and this statement was accidentally left behind.
Finally, we arrive at the definition of the timestamp value itself. As with Theora, the 64-bit value is made up of several sub-fields. The Dirac spec is agreeably terse in its description:
The granule position is composed of the following:
- Picture number (in display order). When picture_coding_mode = 0 (progressive), pt increments by two for each picture in display order. When picture_coding_mode = 1 (interlace), pt increments by one for each field in display order.
- Number of pictures in coded order (equal to number of packets) since an appropriate sync point that allows for correct decoding of pt. This field is split into two parts, dist_h and dist_l. dist = (dist_h << 8)|dist_l
- Delay (in pictures) between decoding time and presentation time of pt.
These values shall be packed into granule position as follows:

     6                               3 3 2      2 2
     3                               1 0 9      2 1           9 8 7      0
    +---------------------------------+-+--------+-------------+-+--------+
    |           pt - delay            |0| dist_h |    delay    |0| dist_l |
    +---------------------------------+-+--------+-------------+-+--------+
    |<---------------- high_word --------------->|<------ low_word ------>|
In other words, the stored value encodes the DTS of the frame and the difference between DTS and PTS. What the text fails to relay is whether this time refers to the start or end of the display interval of the frame.
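The packing can be sketched in Python as follows. The bit positions are read off the bit labels in the spec's diagram: dist_l in bits 0-7, a zero bit at 8, delay in bits 9-21, dist_h in bits 22-29, a zero bit at 30, and pt - delay in bits 31-63. The function names and the returned dictionary are illustrative choices, and the values are treated as non-negative integers.

```python
def pack_dirac_granule(pt: int, delay: int, dist: int) -> int:
    """Pack pt, delay and dist into a granule position (layout assumed above)."""
    dist_h = (dist >> 8) & 0xFF
    dist_l = dist & 0xFF
    return ((pt - delay) << 31) | (dist_h << 22) | ((delay & 0x1FFF) << 9) | dist_l


def unpack_dirac_granule(granule_position: int) -> dict:
    """Recover the sub-fields of a Dirac granule position."""
    dist_l = granule_position & 0xFF             # bits 0-7
    delay = (granule_position >> 9) & 0x1FFF     # bits 9-21
    dist_h = (granule_position >> 22) & 0xFF     # bits 22-29
    pt_minus_delay = granule_position >> 31      # bits 31-63
    return {
        "dts": pt_minus_delay,                   # decode time, in pt units
        "pt": pt_minus_delay + delay,            # presentation picture number
        "delay": delay,
        "dist": (dist_h << 8) | dist_l,
    }


fields = unpack_dirac_granule(pack_dirac_granule(pt=10, delay=2, dist=0x0123))
print(fields["dts"], fields["pt"], fields["dist"])  # 8 10 291
```

Nothing in this layout indicates whether the resulting time marks the start or the end of the display interval, so the sketch cannot settle that question either.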
Once despised, now embraced, by Xiph, OGM is a scheme for storing any Microsoft DirectShow-compatible data in an Ogg container. A formal specification does not exist, so the only means of obtaining information about this format is to read the oggds source code.
The oggds code incorrectly relates Vorbis timestamps to the beginning of packets, not the end, and non-Vorbis packets are handled in the same fashion. The practical implication of this is that software must treat OGM streams specially, even when they contain Vorbis audio.
Criticism of Ogg’s timestamping approach is all well and fair – it was developed with only Vorbis in mind, and the other codecs and their respective requirements came later, thus requiring e.g. the complex mappings for Theora and Dirac.
However, I’d like to clarify some history that is mentioned on the side here:
* Dirac is not the only codec developed outside Xiph – every codec apart from Vorbis was originally developed outside Xiph and only later gained an Ogg mapping and joined Xiph. FLAC indeed continues to have a life of its own, with .flac files not encapsulated in Ogg, and Speex is being used more in VoIP applications without the Ogg encapsulation.
* OGM is not really “embraced” by Xiph.Org – the oggds code, after having been donated by Tobias, is in the Xiph svn repository, but there is no maintainer. In fact, for Windows DirectShow filter support, the oggcodecs code is now the preferred code, and for SRT in Ogg Theora … well, there’ll be something new soon.
I suggest you get involved with the xiph community at [email protected] to express your problems and help fix some of the issues you mentioned. It’s an open source community after all, so open for input. :-)
You are correct about the non-Xiph origins of some of the codecs. However, Dirac is to my knowledge the only codec with an Ogg mapping not linked from the xiph.org web page.