The bug I discovered in CodeSourcery’s 2008q3 release of their GCC version was apparently deemed serious enough for the company to publish an updated release, tagged 2008q3-72, earlier this week. I took it for a test drive.
Since last time, I have updated the FFmpeg regression test scripts, making it easy to run the tests for a cross-build on the target device. For the compiler this means that much more code is checked for correct operation than in the rather limited tests I performed on previous versions. Having verified that all tests pass when built with the 2007q3 release, I proceeded with the new 2008q3-72 compiler.
All but one of the FFmpeg regression tests passed. Converting a colour image to 1-bit monochrome format failed. A few minutes of detective work revealed the erroneous code, and a simple test case was easily extracted.
The test case looks strikingly familiar:
extern unsigned char dst[512] __attribute__((aligned(8)));
extern unsigned char src[512] __attribute__((aligned(8)));

void array_shift(void)
{
    int i;
    for (i = 0; i < 512; i++)
        dst[i] = src[i] >> 7;
}
The aligned(8) attribute is not required to trigger the bug; it merely removes some clutter from the generated assembler. Slightly edited for readability, the assembler output from the compiler looks like this:
array_shift:
        movw     ip, #:lower16:dst
        movw     r0, #:lower16:src
        movt     ip, #:upper16:dst
        movt     r0, #:upper16:src
        vmov.i32 d17, #249      @ v8qi
        mov      r1, #0
.L2:
        add      r2, ip, r1
        add      r3, r0, r1
        add      r1, r1, #8
        vldr     d16, [r3]
        cmp      r1, #512
        vshl.u8  d16, d16, d17
        vstr     d16, [r2]
        bne      .L2
        bx       lr
The vectoriser has done its job and decided to use NEON vector operations to process 8 elements in parallel. The mysterious-looking constant 249 is simply the 8-bit representation of -7; NEON has no variable right-shift instruction, so a right shift is expressed as a vshl with a negative per-element shift count. The error is in using the vmov.i32 instruction, which writes the immediate value into every 32-bit element of the destination register. Using the resulting vector as the shift amount for vshl.u8, which operates on vectors of 8-bit data, clearly will not work as intended: only one in four elements of the array is shifted, the rest being copied unchanged.

The v8qi annotation next to the incorrect instruction is of particular interest. It indicates that the compiler in fact intended to create an 8-element vector of 8-bit values; the translation of this operation into an assembler instruction has simply gone horribly wrong. A vmov.i8 instruction would have been correct.
As an experiment, I changed the arrays to unsigned short, i.e. 16-bit, elements. This is what the compiler produced:
array_shift:
        movw     ip, #:lower16:dst
        movw     r0, #:lower16:src
        movt     ip, #:upper16:dst
        movt     r0, #:upper16:src
        mov      r1, #0
        vldr     d17, .L6
.L2:
        add      r2, ip, r1
        add      r3, r0, r1
        add      r1, r1, #8
        vldr     d16, [r3, #0]
        cmp      r1, #1024
        vshl.u16 d16, d16, d17
        vstr     d16, [r2, #0]
        bne      .L2
        bx       lr
.L7:
        .align 3
.L6:
        .short  -7
        .short  0
        .short  -7
        .short  0
The immediate operand of the vmov instruction is limited to 8 bits, so the compiler has decided to load the constant vector from a literal pool following the function. The constant it has placed there is perfectly analogous to the flawed value from the first test: the 16-bit representation of -7 zero-extended into a vector of 32-bit elements.
Finally, I replaced the right shift with a left shift. To my astonishment, the compiler generated the correct vmov.i8 instruction (with a constant of +7). It even repeated this feat with 16-bit arrays.
CodeSourcery insist they subject every compiler release to an extensive test suite. Evidently that suite does not cover the right-shift operator.
That’s disappointing, but expected since NEON support is not that old.
I wonder how the various buggy tests you find behave in qemu, which has so many bugs in its NEON implementation.
Mans —
Thank you for continuing to exercise Sourcery G++ with FFMPEG.
Of course, we’re disappointed that you found a defect in the latest
release.
(In the future, it would be helpful if you could post any defects to
our public mailing list here:
http://www.codesourcery.com/archives/arm-gnu/maillist.html
so that we are sure to see them. It’s very helpful to have both
preprocessed source code and command-line options with defect
reports.)
We have been able to reproduce this bug, though not with the source
code you posted. It looks like the case that triggers the defect uses
“unsigned char” rather than “unsigned short” as in your posting. With
that change, and the options “-mcpu=cortex-a8 -mfpu=neon -O3
-mfloat-abi=softfp”, we do see the bug.
You can work around this bug by using “-fno-tree-vectorize” or “-O2”,
both of which will disable automatic vectorization. You can use those
options just on the affected kernels in FFMPEG, while still getting
the benefits of vectorization elsewhere.
As you know, we made available to you a preview version of this latest
release specifically so that you could test FFMPEG with it. You
reported that:
> I’ve done some simple testing, and I’m not finding any bugs this
> time. The test case from before works, and FFmpeg builds and runs
> correctly.
We’ve verified that the same bug that you reported here was indeed
present in that preview release. It sounds like you’ve broadened your
test coverage since then. Obviously, we’re sympathetic to the problem
of testing large, complex software packages; it’s very difficult to
cover all possible cases. We’re continually expanding our test
coverage as well.
We run hundreds of thousands (or millions, depending how you count) of
test cases on every release, both on real hardware (including
Cortex-A8) and in simulation. Those include the DejaGNU regression
testsuites that are part of the FSF sources for the tools, as well as
(depending on the exact release), testsuites and benchmarks like
Plum-Hall, Dhrystone, EEMBC, and SPEC.
We maintain a database of failures in these testsuites, and analyze
new failures to determine their origin. We also do manual testing of
each release in order to provide a “sanity check”. The pass rates in
our testing are generally 99.99% or higher, with the remainder of the
failures explained by vagaries of the test environment.
The vectorization testsuite in DejaGNU does test the compiler’s
ability to vectorize a variety of loops, and, further, that the
vectorized loops run correctly. We know that on NEON our releases
vectorize the same set of loops from these test cases that are
vectorized on other platforms with similar vector capabilities. We
also know that the vectorized loops run correctly on Cortex-A8.
Unfortunately, the testsuites that we have do not cover the test case
which you posted here. Therefore, when we fix this defect, we will
add a new test case to the DejaGNU testsuite, ensuring that the defect
does not reoccur in future releases.
Unfortunately, the GNU toolchain does have defects. The tools
community (in which CodeSourcery is an active participant) attempts to
correct as many defects as possible — both in the context of
volunteer and commercial development — but, inevitably, every release
does have defects. A quick search of the public GCC bug database
indicates 236 open wrong-code bugs; these are all cases where GCC is
known to generate incorrect code on a valid input program, as in your
test case here.
We will continue to do our best to improve the quality of GCC in
general, and our releases in particular. We have entered the defect
you report into our internal defect tracker and expect that it will be
corrected in a future release.
Best regards,
—
Mark Mitchell
CodeSourcery
I accidentally pasted the code with “unsigned short” arrays; this has been corrected. That said, both fail in pretty much the same way, as I explain further on.
Regarding workarounds, I am well aware that -fno-tree-vectorize will disable the broken optimisation. However, having vectorisation with such severe bugs enabled by default at -O3 is not acceptable in my opinion.
The “preview release” of which you speak was nothing other than an evaluation download of the commercial package, and it is correct that I did not immediately find any errors in that release. As I said in the email you quoted above, my test case for the other bug passed, and FFmpeg worked correctly in a few simple tests. I have since added the ability to run the full FFmpeg regression test suite on the target board, which enabled me to find the latest bug.
As for testing, the number of tests, even if inflated into the millions (a number I simply do not believe), is irrelevant. What matters is coverage: every feature should be exercised at least once during testing. Specifically, if right-shifting every element of an array is recognised as vectorisable, that case should be tested. It is evident that this did not happen.
To the remainder of your game with test suite names and numbers, I can think of no response more appropriate than the words of the brilliant Dijkstra: "Testing shows the presence, not the absence of bugs."