Bug 1130438 - meep fails 1 test on old CPU
Status: RESOLVED NORESPONSE
Classification: openSUSE
Product: openSUSE Tumbleweed
Component: Other
Version: Current
Hardware: Other openSUSE Factory
Priority: P5 - None
Severity: Normal
Target Milestone: ---
Assigned To: Jonathan Brielmaier
Reported: 2019-03-25 19:21 UTC by Bernhard Wiedemann
Modified: 2019-06-11 08:42 UTC (History)

See Also:
Found By: Development
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Description Bernhard Wiedemann 2019-03-25 19:21:09 UTC
While working on reproducible builds for openSUSE, I found that
the 2D_convergence test sometimes fails, possibly depending on build host load.

diff between a good and a bad run:
 Frequency difference with a of 15 is 0.0153091/15/15
-frequency for a=10 is 0.180252, 0 (shifted), 0.0901258 (mean)
+frequency for a=10 is 0.180252, 0.181218 (shifted), 0.180735 (mean)
 Unshifted freq error is 0.0307579/10/10
-Shifted freq error is -17.9944/10/10
-meep: Frequency doesn't converge properly with a.
-FAIL 2D_convergence (exit status: 1)
+Shifted freq error is 0.127421/10/10

Please investigate, fix and/or coordinate with upstream devs.
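
The numbers in the diff above are consistent with a scaled error of the form (freq - reference) * a * a. That formula is inferred from the log lines only; it is an assumption, not taken from the meep test source. Under that assumption, a minimal sketch recovers the implied reference frequency and shows that the bad run corresponds to a shifted frequency of exactly 0:

```python
# Values copied from the log for a = 10.
a = 10
freq_unshifted = 0.180252
unshifted_error = 0.0307579


def scaled_error(freq, reference, a):
    # ASSUMPTION: the test prints (freq - reference) * a^2.
    # This form is inferred from the log, not from the meep sources.
    return (freq - reference) * a * a


# Reference frequency implied by the "Unshifted freq error" line.
reference = freq_unshifted - unshifted_error / (a * a)

# Bad run: the shifted frequency was reported as 0.
bad_shifted_error = scaled_error(0.0, reference, a)
bad_mean = (freq_unshifted + 0.0) / 2

# Good run: the shifted frequency was reported as 0.181218.
good_shifted_error = scaled_error(0.181218, reference, a)

print(f"implied reference frequency: {reference:.6f}")
print(f"bad run:  shifted error {bad_shifted_error:.4f}, mean {bad_mean:.7f}")
print(f"good run: shifted error {good_shifted_error:.4f}")
```

Within the rounding of the logged inputs, this reproduces the -17.9944 error and the 0.0901258 mean of the bad run. That suggests the convergence formula behaves the same in both runs and that the shifted calculation itself returned 0.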
Comment 1 Bernhard Wiedemann 2019-03-25 20:32:29 UTC
Seems to be related to the build machine's CPU type.
On a newer DDR4-era machine I had to use
osc build --noservice --vm-type=kvm --build-opt=--vm-custom-opt=-cpu\ qemu64
to trigger the test failure.

On a DDR3-era machine from 2010 a plain
osc build
always failed.
Comment 2 Jonathan Brielmaier 2019-03-25 21:50:13 UTC
Does this still apply with https://build.opensuse.org/request/show/686091 ?

In my experience 2D_convergence fails sometimes; I may have had a machine where it always fails. I'll try to find out which one it was. For the moment I'm fine with disabling the test and filing a bug upstream with all the information we gathered :)
Comment 3 Bernhard Wiedemann 2019-03-26 00:08:15 UTC
Same test failure with your 1.8.0 package.

More playing with the -cpu param showed that it needs all these CPU flags:
+avx,+avx2,+fma,+xsave,+xsaveopt
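
For anyone who wants to repeat the experiment, here is a rough sketch of a helper that assembles the osc invocation from Comment 1 with extra CPU flags appended. The helper name and the choice of qemu64 as the base model for the flag experiments are assumptions; only the osc options and the flag list come from this report.

```python
import shlex

# Flag list reported in this comment.
REQUIRED_FLAGS = ["avx", "avx2", "fma", "xsave", "xsaveopt"]


def osc_build_command(cpu_model="qemu64", flags=()):
    """Build the argv for an osc KVM build with a custom QEMU CPU model.

    Hypothetical helper. QEMU takes "-cpu MODEL,+flag1,+flag2", and the
    whole "-cpu ..." string is passed through as one build option.
    """
    cpu_spec = ",".join([cpu_model] + [f"+{flag}" for flag in flags])
    return [
        "osc", "build", "--noservice", "--vm-type=kvm",
        f"--build-opt=--vm-custom-opt=-cpu {cpu_spec}",
    ]


baseline_cmd = osc_build_command()
with_flags_cmd = osc_build_command(flags=REQUIRED_FLAGS)

# shlex.join quotes the embedded space so the line can be pasted into a shell.
print(shlex.join(baseline_cmd))
print(shlex.join(with_flags_cmd))
```

Dropping one flag at a time from REQUIRED_FLAGS would be one way to check whether all five are really needed.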
Comment 4 Jonathan Brielmaier 2019-03-26 10:06:49 UTC
(In reply to Bernhard Wiedemann from comment #3)
> More playing with the -cpu param showed that it needs all these CPU flags:
> +avx,+avx2,+fma,+xsave,+xsaveopt

So does the test fail with these flags, or are they required for the test to pass?

Then we should disable the test conditionally when the requirements for a stable pass are not met.
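
A conditional skip along these lines could be sketched as follows. This assumes an x86 Linux build host, where /proc/cpuinfo lists CPU features on a "flags" line. The function names are hypothetical, and the flag list is the one from comment #3.

```python
# Flags from comment #3; treated here as the requirement for a stable pass.
REQUIRED_FLAGS = {"avx", "avx2", "fma", "xsave", "xsaveopt"}


def parse_cpu_flags(cpuinfo_text):
    """Return the flag set of the first CPU in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text, required=REQUIRED_FLAGS):
    """Return the required flags that the CPU does not advertise."""
    return set(required) - parse_cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as fh:
            text = fh.read()
    except OSError:
        text = ""  # unreadable: every required flag is treated as missing
    missing = missing_flags(text)
    if missing:
        print("skip 2D_convergence, missing flags:", " ".join(sorted(missing)))
    else:
        print("run 2D_convergence")
```

A spec file could run such a check in %check and leave 2D_convergence out of the test list when flags are missing. How the test list is filtered depends on the package's build setup and is not shown here.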
Comment 5 Bernhard Wiedemann 2019-03-26 11:49:51 UTC
The CPU flags are currently required to make the test pass.
Comment 6 Jonathan Brielmaier 2019-03-26 17:30:34 UTC
I already filed a bug upstream in February, as 2D_convergence failed on Tumbleweed but passed on Leap 15.0 on the same machine:
https://github.com/NanoComp/meep/issues/727

So I assumed back then that the different build toolchain (gcc, glibc) produces different code for the test, which in the end causes the failure.
Comment 7 Jonathan Brielmaier 2019-05-27 11:14:44 UTC
In the meantime they released meep 1.9.0. I packaged it here:
home:jbrielmaier:branches:science/meep

I ran the tests on different machines.

My workstation, Intel i7-3770, avx, no avx2, no fma
TW:        PASS
Leap 15.1: PASS

My laptop, Intel i7-6600U, avx, avx2, fma
TW:        FAIL
Leap 15.1: PASS

Intel Xeon E5-2650Lv3, avx, avx2, fma
TW:        FAIL
Leap 15.1: PASS

AMD Opteron 8218, no avx, no avx2, no fma
TW:        PASS
Leap 15.1: PASS
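
The results above can be tabulated to make the pattern explicit. The sketch below encodes only the four machines listed here and checks which ones fail on each distribution. It shows a correlation in a small sample, not a root cause.

```python
# (CPU, advertised flags of interest, outcome per distribution), as listed above.
results = [
    ("Intel i7-3770", {"avx"},
     {"TW": "PASS", "Leap 15.1": "PASS"}),
    ("Intel i7-6600U", {"avx", "avx2", "fma"},
     {"TW": "FAIL", "Leap 15.1": "PASS"}),
    ("Intel Xeon E5-2650Lv3", {"avx", "avx2", "fma"},
     {"TW": "FAIL", "Leap 15.1": "PASS"}),
    ("AMD Opteron 8218", set(),
     {"TW": "PASS", "Leap 15.1": "PASS"}),
]


def failures_on(distro):
    """Return the CPUs whose test run failed on the given distribution."""
    return [cpu for cpu, _, outcome in results if outcome[distro] == "FAIL"]


# CPUs that advertise both avx2 and fma.
with_avx2_fma = [cpu for cpu, flags, _ in results if {"avx2", "fma"} <= flags]

tw_failures = failures_on("TW")
leap_failures = failures_on("Leap 15.1")

print("TW failures:       ", tw_failures)
print("Leap 15.1 failures:", leap_failures)
print("avx2+fma machines: ", with_avx2_fma)
```

In this sample the Tumbleweed failures coincide exactly with the machines that have both avx2 and fma, while Leap 15.1 passes everywhere.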

As we didn't find the root cause of this test failure and upstream didn't come up with a solution, I just went ahead and disabled this test on Tumbleweed:
https://build.opensuse.org/request/show/705648
Comment 8 Jonathan Brielmaier 2019-06-11 08:42:39 UTC
Test was disabled in https://build.opensuse.org/request/show/707898

I'm marking it as NORESPONSE as we didn't get any help from upstream and I won't invest any more time here. Feel free to reopen the bug and find a proper solution :)