ARM-NEON memory hazards

The NEON coprocessor found in the Cortex-A8 operates asynchronously from the ARM pipeline, receiving its instructions from the ARM execution unit through a 16-entry FIFO. Furthermore, the NEON unit has its own load/store unit. This suggests that some mechanism exists to resolve data hazards between the ARM and NEON units such that memory operations appear as if the instructions were executed entirely in order.

Although clearly important with a view to code optimisation, the Cortex-A8 Technical Reference Manual unfortunately does not mention any details about these hazards. In fact, it does not mention them at all.

To sched some light on the situation, I ran a simple benchmark to determine two important parameters of ARM-NEON memory hazard resolution: granularity and latency.

Continue reading

CodeSourcery fails again

The bug I discovered in CodeSourcery’s 2008q3 release of their GCC version was apparently deemed serious enough for the company to publish an updated release, tagged 2008q3-72, earlier this week. I took it for a test drive.

Since last time, I have updated the FFmpeg regression test scripts, enabling a cross-build to be easily tested on the target device. For the compiler test this means that much more code will be checked for correct operation compared to the rather limited tests I performed on previous versions. Having verified all tests passing when built with the 2007q3 release, I proceeded with the new 2008q3-72 compiler.

All but one of the FFmpeg regression tests passed. Converting a colour image to 1-bit monochrome format failed. A few minutes of detective work revealed the erroneous code, and a simple test case was easily extracted.

The test case looks strikingly familiar:

extern unsigned char dst[512] __attribute__((aligned(8)));
extern unsigned char src[512] __attribute__((aligned(8)));

void array_shift(void)
    int i;
    for (i = 0; i < 512; i++)
        dst[i] = src[i] >> 7;

Continue reading

Ogg timestamps explored

Ogg is the name of a multimedia container format invented by the Xiph Foundation. Moreover, it is a deeply flawed format. One of its many flaws relates to timestamps, an aspect of Ogg I shall explore in this article.

Ogg structure

The Ogg format splits elementary stream data into a sequence of packets which are then distributed arbitrarily across pages. A page can contain any number of packets, and a packet can span any number of pages. This two-level packetisation scheme is used since the packet headers would otherwise, due to design shortfalls elsewhere, become prohibitively large.

Timestamps in Ogg

Each Ogg page (not packet) header contains a timestamp, or granule position in Ogg terms, encoded as a 64-bit number. The precise interpretation of this number is not defined by the Ogg specification; it depends on the codec used for each elementary stream. The specification does, however, tell us one thing:

The position specified is the total samples encoded after including all packets finished on this page (packets begun on this page but continuing on to the next page do not count).

The meaning of samples is, again, left unspecified. It is merely suggested that it could refer to video frames or audio PCM samples.

Timestamping the end of packets, instead of the start, is impractical for a number of reasons including, but not limited to, the following:

  • Scheduling decoded samples for playback is more easily done based on the desired start time than on the end time.
  • Virtually every other container format ties timestamps to the start of the first following sample. Doing it differently only complicates players and other tools supporting multiple formats without providing any advantage.
  • Inferring the timestamp of the first sample of the stream is impossible without first decoding, at least partially, every packet in the first page.

Timestamp interpretation

As mentioned previously, the meaning of the 64-bit timestamps associated with an elementary stream depends on the codec of the stream. I conducted a survey of codecs with defined Ogg mappings looking specifically at their timestamp definitions. Continue reading

GCC inline asm annoyance

Doing some PowerPC work recently, I wanted to use the lwbrx instruction, which loads a little endian word from memory. A simple asm statement wrapped in an inline function seemed like the simplest way to do this.

The lwbrx instruction comes with a minor limitation. It is only available in X-form, that is, the effective address is formed by adding the values of two register operands. Normal load instructions also have a D-form, which computes the effective address by adding an immediate offset to a register operand.

This means that my asm statement cannot use a normal “m” constraint for the memory operand, as this would allow GCC to use D-form addressing, which this instruction does not allow. I thus go in search of a special constraint to request X-form. GCC inline assembler supports a number of machine-specific constraints to cover situations like this one. To my dismay, the manual makes no mention of a suitable contraint to use.

Not giving up hope, I head for Google. Google always has answers. Almost always. None of the queries I can think of return a useful result. My quest finally comes to an end with the GCC machine description for PowerPC. This cryptic file suggests an (undocumented) “Z” constraint might work.

My first attempt at using the newly discovered “Z” constraint fails. The compiler still generates D-form address operands. Another examination of the machine description provides the answer. When referring to the operand, I must use %y0 in place of the usual %0. Needless to say, documentation explaining this syntax is nowhere to be found.

After spending the better part of an hour on a task I expected to take no more than five minutes, I finally arrive at a working solution:

static inline uint32_t load_le32(const uint32_t *p)
    uint32_t v;
    asm ("lwbrx %0, %y1" : "=r"(v) : "Z"(*p));
    return v;

ARM wish-list

Some time ago, I was asked for a multimedia hacker’s wish-list for a future ARM processor, in particular regarding the NEON vector and floating-point coprocessor. This is my list.

  1. Saturating unsigned+signed add/subtract.
    With the current instruction set, this operation requires six instructions (2x VMOVL, 2x VADDW, 2x VQMOVUN) and two extra registers (one if optimal scheduling is not required) for 128-bit vectors. Furthermore, this is a frequently occuring operation, for instance in the H.264 loop filter.
  2. More registers.
    Having another, say, 8 vector registers would be very handy. Encoding this in the existing instructions would of course be tricky, if at all possible. A special VMOV and/or VSWP instruction to access the high registers would be an acceptable compromise, and would certainly be better than using scratch memory. An alternative option could be to make the high half of the existing register file banked. This could perhaps even be done in some clever way allowing the OS to skip save/restore of these registers for processes that never use them.
  3. 256-bit operations.
    8-element vectors are frequently used in video processing. One example is the ubiquitous 8×8 IDCT. In some instances, 32 bits per element are required in intermediate values to maintain adequate precision. The 8×8 IDCT is once again an example. In these cases, support for 8×32-bit vectors would clearly be an advantage.
  4. Vector sum.
    The sum of all elements in a vector is computed as a part of many algorithms, for instance anything involving a dot product and motion estimation in video encoding. Presently, the only option is to use a sequence of 3 or 4 VPADD instructions.
  5. Transposed load/store.
    When performing the same operation on each of a set of rows, one must load values row-wise into registers, and then transpose the registers before using the vector arithmetic instructions. When done computing, the values are again transposed before being stored row-wise. A set of load/store instructions transferring data between rows in memory and “columns” in the register file would save the cost of the transposing operations.
  6. Improved NEON to ARM transfer.
    On Cortex-A8, transferring a 32-bit value from NEON to an ARM register takes a minimum of 20 clock cycles, during which time any normal access to the ARM register file will stall. This delay makes some potential use cases for NEON practically worthless. I am told this has been addressed in the almost-ready Cortex-A9.