Ogg timestamps explored

Ogg is the name of a multimedia container format invented by the Xiph Foundation. It is also a deeply flawed format. One of its many flaws relates to timestamps, an aspect of Ogg I shall explore in this article.

Ogg structure

The Ogg format splits elementary stream data into a sequence of packets which are then distributed arbitrarily across pages. A page can contain any number of packets, and a packet can span any number of pages. This two-level packetisation scheme is used since the packet headers would otherwise, due to design shortfalls elsewhere, become prohibitively large.
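
For the curious, the page header looks roughly like this. Take it as an illustrative sketch only; a real parser must read the fields byte by byte, all multi-byte values being little endian.

#include <stdint.h>

struct ogg_page_header {
    char     capture_pattern[4];  /* "OggS" */
    uint8_t  version;             /* always 0 */
    uint8_t  header_type;         /* continued-packet, BOS and EOS flags */
    int64_t  granule_position;    /* the 64-bit timestamp discussed below;
                                     -1 if no packet finishes on this page */
    uint32_t serial_number;       /* identifies the elementary stream */
    uint32_t page_sequence;       /* page counter within the stream */
    uint32_t checksum;            /* CRC covering the entire page */
    uint8_t  segment_count;       /* number of lacing values that follow */
    /* segment_count lacing values follow; a packet is the concatenation of
       consecutive segments and ends at the first segment shorter than 255
       bytes, which is how packets come to span any number of pages */
};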

Timestamps in Ogg

Each Ogg page (not packet) header contains a timestamp, or granule position in Ogg terms, encoded as a 64-bit number. The precise interpretation of this number is not defined by the Ogg specification; it depends on the codec used for each elementary stream. The specification does, however, tell us one thing:

“The position specified is the total samples encoded after including all packets finished on this page (packets begun on this page but continuing on to the next page do not count).”

The meaning of samples is, again, left unspecified. It is merely suggested that it could refer to video frames or audio PCM samples.

Timestamping the end of packets, instead of the start, is impractical for a number of reasons including, but not limited to, the following:

  • Scheduling decoded samples for playback is more easily done based on the desired start time than on the end time.
  • Virtually every other container format ties timestamps to the start of the first following sample. Doing it differently only complicates players and other tools supporting multiple formats without providing any advantage.
  • Inferring the timestamp of the first sample of the stream is impossible without first decoding, at least partially, every packet in the first page.
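
To make the last point concrete, here is a minimal sketch of what recovering the start time of the stream entails, assuming an audio stream whose granule position counts PCM samples. The types and the packet_duration() helper are hypothetical stand-ins for whatever partial decoding the codec in question requires.

#include <stdint.h>

/* Hypothetical packet handle and duration query; a real implementation must
   inspect or partially decode each packet to learn how many samples it
   produces. */
struct packet;
int64_t packet_duration(const struct packet *pkt);

/* Start time, in seconds, of the first packet of the first page. The granule
   position marks the end of the last packet finished on the page, so the
   durations of all packets finished on it must be subtracted. */
double stream_start_time(int64_t granulepos,
                         const struct packet *const packets[], int count,
                         int sample_rate)
{
    int64_t samples = 0;
    for (int i = 0; i < count; i++)
        samples += packet_duration(packets[i]);
    return (double)(granulepos - samples) / sample_rate;
}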

Timestamp interpretation

As mentioned previously, the meaning of the 64-bit timestamps associated with an elementary stream depends on the codec of the stream. I conducted a survey of codecs with defined Ogg mappings, looking specifically at their timestamp definitions.
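
As a taste of what these codec-specific definitions look like, the Theora mapping, as I understand it, packs two values into the granule position: the frame number of the most recent keyframe in the upper bits, and the number of frames since that keyframe in the lower bits, the split point being given by the stream's info header. A sketch of unpacking it:

#include <stdint.h>

/* granule_shift comes from the Theora info header; the exact rounding of
   the conversion to seconds is glossed over here. */
int64_t theora_frame_number(int64_t granulepos, int granule_shift)
{
    int64_t keyframe = granulepos >> granule_shift;
    int64_t delta    = granulepos & (((int64_t)1 << granule_shift) - 1);
    return keyframe + delta;   /* absolute frame number */
}

The time in seconds is then this frame number scaled by the frame rate, itself a rational number found in the same header.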

GCC inline asm annoyance

Doing some PowerPC work recently, I wanted to use the lwbrx instruction, which loads a little endian word from memory. A simple asm statement wrapped in an inline function seemed like the simplest way to do this.

The lwbrx instruction comes with a minor limitation. It is only available in X-form, that is, the effective address is formed by adding the values of two register operands. Normal load instructions also have a D-form, which computes the effective address by adding an immediate offset to a register operand.

This means that my asm statement cannot use a normal “m” constraint for the memory operand, as this would allow GCC to use D-form addressing, which this instruction does not allow. I thus go in search of a special constraint to request X-form. GCC inline assembler supports a number of machine-specific constraints to cover situations like this one. To my dismay, the manual makes no mention of a suitable constraint to use.
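
To illustrate the problem, this is roughly what the naive version looks like. With a plain “m” constraint, GCC is free to choose a D-form address such as 8(r9), an operand the assembler will in general reject for lwbrx.

static inline uint32_t load_le32_naive(const uint32_t *p)
{
    uint32_t v;
    /* "m" permits D-form addressing, which lwbrx does not support */
    asm ("lwbrx %0, %1" : "=r"(v) : "m"(*p));
    return v;
}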

Not giving up hope, I head for Google. Google always has answers. Almost always. None of the queries I can think of return a useful result. My quest finally comes to an end with the GCC machine description for PowerPC. This cryptic file suggests an (undocumented) “Z” constraint might work.

My first attempt at using the newly discovered “Z” constraint fails. The compiler still generates D-form address operands. Another examination of the machine description provides the answer. When referring to the operand, I must use %y0 in place of the usual %0. Needless to say, documentation explaining this syntax is nowhere to be found.

After spending the better part of an hour on a task I expected to take no more than five minutes, I finally arrive at a working solution:

static inline uint32_t load_le32(const uint32_t *p)
{
    uint32_t v;
    /* "Z" requests an X-form (indexed) memory operand, and "%y1" prints it
       as the register pair lwbrx expects */
    asm ("lwbrx %0, %y1" : "=r"(v) : "Z"(*p));
    return v;
}
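
For completeness, the byte-reversed store appears to work the same way, with the memory operand on the output side; a sketch assuming the same “Z”/%y trick carries over to stwbrx:

static inline void store_le32(uint32_t *p, uint32_t v)
{
    /* stwbrx stores a byte-reversed word; the X-form operand is written
       exactly as in the load above */
    asm ("stwbrx %1, %y0" : "=Z"(*p) : "r"(v));
}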

ARM wish-list

Some time ago, I was asked for a multimedia hacker’s wish-list for a future ARM processor, in particular regarding the NEON vector and floating-point coprocessor. This is my list.

  1. Saturating unsigned+signed add/subtract.
    With the current instruction set, this operation requires six instructions (2x VMOVL, 2x VADDW, 2x VQMOVUN) and two extra registers (one if optimal scheduling is not required) for 128-bit vectors. Furthermore, this is a frequently occurring operation, for instance in the H.264 loop filter (see the sketch after this list).
  2. More registers.
    Having another, say, 8 vector registers would be very handy. Encoding this in the existing instructions would of course be tricky, if at all possible. A special VMOV and/or VSWP instruction to access the high registers would be an acceptable compromise, and would certainly be better than using scratch memory. An alternative option could be to make the high half of the existing register file banked. This could perhaps even be done in some clever way allowing the OS to skip save/restore of these registers for processes that never use them.
  3. 256-bit operations.
    8-element vectors are frequently used in video processing. One example is the ubiquitous 8×8 IDCT. In some instances, 32 bits per element are required in intermediate values to maintain adequate precision. The 8×8 IDCT is once again an example. In these cases, support for 8×32-bit vectors would clearly be an advantage.
  4. Vector sum.
    The sum of all elements in a vector is computed as part of many algorithms, for instance anything involving a dot product, or motion estimation in video encoding. Presently, the only option is to use a sequence of 3 or 4 VPADD instructions (again, see the sketch after this list).
  5. Transposed load/store.
    When performing the same operation on each of a set of rows, one must load values row-wise into registers, and then transpose the registers before using the vector arithmetic instructions. When done computing, the values are again transposed before being stored row-wise. A set of load/store instructions transferring data between rows in memory and “columns” in the register file would save the cost of the transposing operations.
  6. Improved NEON to ARM transfer.
    On Cortex-A8, transferring a 32-bit value from NEON to an ARM register takes a minimum of 20 clock cycles, during which time any normal access to the ARM register file will stall. This delay makes some potential use cases for NEON practically worthless. I am told this has been addressed in the almost-ready Cortex-A9.
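
To put some flesh on items 1 and 4, here are sketches of what the current instruction set forces one to write, expressed with NEON intrinsics; the instruction names in the comments are what a compiler should map them to, and the helper names are of course my own.

#include <arm_neon.h>

/* Item 1: unsigned + signed add with unsigned saturation, 16 bytes at a
   time. Six real instructions (2x VMOVL, 2x VADDW, 2x VQMOVUN); the
   vget/vcombine operations only name register halves and cost nothing. */
static inline uint8x16_t add_u8_s8_sat(uint8x16_t u, int8x16_t s)
{
    int16x8_t lo = vreinterpretq_s16_u16(vmovl_u8(vget_low_u8(u)));  /* VMOVL.U8 */
    int16x8_t hi = vreinterpretq_s16_u16(vmovl_u8(vget_high_u8(u))); /* VMOVL.U8 */
    lo = vaddw_s8(lo, vget_low_s8(s));                               /* VADDW.S8 */
    hi = vaddw_s8(hi, vget_high_s8(s));                              /* VADDW.S8 */
    return vcombine_u8(vqmovun_s16(lo), vqmovun_s16(hi));            /* 2x VQMOVUN.S16 */
}

/* Item 4: sum of the eight 16-bit elements of a vector using a chain of
   three VPADD instructions, overflow ignored for brevity. */
static inline int16_t sum_s16x8(int16x8_t v)
{
    int16x4_t s = vpadd_s16(vget_low_s16(v), vget_high_s16(v)); /* VPADD.I16 */
    s = vpadd_s16(s, s);                                        /* VPADD.I16 */
    s = vpadd_s16(s, s);                                        /* VPADD.I16 */
    return vget_lane_s16(s, 0);
}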

On malice and stupidity

In my previous post, I attributed a quotation to one Robert J. Hanlon. This quotation, known as Hanlon’s Razor, deserves a little more attention.

Firstly, I altered the phrase slightly compared to its most common form, “Never attribute to malice that which can be adequately explained by stupidity,” substituting incompetence as the final word. I did this simply because I found this form more suitable in the context.

Secondly, the origin of this adage is disputable. A selection of alternatives follows.

  • In his 1980 book Murphy’s Law Book Two: More Reasons why Things Go Wrong!, Arthur Bloch credits Robert J. Hanlon as the creator, citing the above version.
  • Bill Clarke claims to have coined the phrase in 1974, in the story Axioms of a Mad Poet, which he published that year.
  • In the short story Logic of Empire (1941) by Robert A. Heinlein, a similar phrase appears: “You have attributed conditions to villainy that simply result from stupidity.”
  • Napoleon Bonaparte allegedly uttered the words “Never ascribe to malice that which is adequately explained by incompetence,” although accurate references do not appear to exist.

Perhaps there is some truth to the saying that great minds think alike.

CodeSourcery’s defence

Having covered the spectacular failure of CodeSourcery’s latest ARM compiler a few days ago, I was engaged in a curious debate on IRC with one of their employees. Fiercely denying the problem at first, he eventually offered an explanation: they do not test the compiler output on real hardware; they use QEMU.

QEMU is a CPU emulator supporting a variety of targets. While great for casual development, and for running foreign applications, it is certainly no substitute for real hardware when testing a compiler. Like any piece of software, an emulator is bound to have a few errors, and as it happens, QEMU has known bugs in its handling of the NEON instruction set. Our friend at CodeSourcery should be well aware of these, also being a QEMU developer.

The use of emulators was explained as a necessity due to real hardware not being available. To be fair, CodeSourcery does develop against new hardware before it exists, so some reliance on emulators is unavoidable. This is, however, not the case this time. The Beagleboard was made available to selected developers quite some time ago (I have had one since May, others still longer), and is now being sold by the thousands. CodeSourcery developers, so I am told, were also given an offer of a free board, an offer they chose to refuse.

What does all this mean? Did Murphy decide to inflict maximum bad luck on the hard-working developers, or is there perhaps a larger conspiracy at work? I shall not attempt to speculate in this matter. I will merely repeat this excellent piece of advice given by Robert J. Hanlon: Never attribute to malice that which can be adequately explained by incompetence.