Bug 957061 - [nvidia binary] segfault when using TSX (__lll_unlock_elision) (affects plasma5 screen unlocking)
Status: RESOLVED FIXED
Duplicates: 960574
Alias: None
Product: openSUSE Distribution
Classification: openSUSE
Component: X11 3rd Party Driver (show other bugs)
Version: Leap 42.1
Hardware: Other Other
Priority: P3 - Medium, Severity: Normal
Target Milestone: ---
Assignee: Stefan Dirsch
QA Contact: Stefan Dirsch
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-11-29 22:53 UTC by Nico Kruber
Modified: 2016-11-28 16:21 UTC (History)

See Also:
Found By: ---
Services Priority:
Business Priority:
Blocker: ---
Marketing QA Status: ---
IT Deployment: ---


Description Nico Kruber 2015-11-29 22:53:30 UTC
After installing the NVIDIA binary driver (version 358.16 "the hard way" or 352.55 from the repository), I am unable to unlock my plasma5 session once the screen locker kicks in. This is not due to #931296, since kcheckpass works for me. After I submit my password, a black screen is shown for about a second and then I'm back at the screen locker.

I dug up several bug reports, but most of them are related to the kcheckpass and PAM problem. This one is not, as the backtrace below shows. It seems that Arch users were affected as well and solved it with new Intel microcode (https://bbs.archlinux.org/viewtopic.php?id=196536); however, that was on a Haswell system, while mine is Skylake. I tried the supplied 4.1.12 kernel as well as 4.3.0 from Kernel:Stable. I'll post an update after trying the new microcode from the Base_System repo (if I can get it to be used).


related:
https://bugs.kde.org/show_bug.cgi?id=346938
https://bugs.kde.org/show_bug.cgi?id=346525

FYI (although I don't think this is related): I have an Asus Strix GTX 970.


=====================================

> gdb /usr/lib64/libexec/kscreenlocker_greet
GNU gdb (GDB; %maintenance_distribution) 7.9.1
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-suse-linux".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://bugs.opensuse.org/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /usr/lib64/libexec/kscreenlocker_greet...(no debugging symbols found)...done.
Missing separate debuginfos, use: zypper install plasma5-workspace-debuginfo-5.4.3-122.1.x86_64
(gdb) run
Starting program: /usr/lib64/libexec/kscreenlocker_greet 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7fffdb6a9700 (LWP 3260)]
qml: No Fill10 element found in your theme's battery.svg - Using legacy 20% steps for battery icon
file:///usr/share/plasma/look-and-feel/org.openSUSE.desktop/contents/components/InfoPane.qml:52:22: Unable to assign [undefined] to int
file:///usr/share/plasma/look-and-feel/org.openSUSE.desktop/contents/lockscreen/LockScreen.qml:165: TypeError: Cannot read property 'showPassword' of undefined
file:///usr/share/plasma/look-and-feel/org.openSUSE.desktop/contents/lockscreen/LockScreen.qml:207: TypeError: Cannot read property 'ButtonLabel' of undefined
Locked at 1448836583
org.kde.keyboardLayout: Layouts list changed:  ("de")
file:///usr/share/plasma/look-and-feel/org.openSUSE.desktop/contents/components/UserDelegate.qml:82:9: QML Image: Cannot open: file:///usr/share/plasma/look-and-feel/org.openSUSE.desktop/contents/components/user-identity
file:///usr/share/plasma/look-and-feel/org.openSUSE.desktop/contents/components/UserDelegate.qml:82:9: QML Image: Cannot open: file:///usr/share/plasma/look-and-feel/org.openSUSE.desktop/contents/components/system-log-out
file:///usr/share/plasma/look-and-feel/org.openSUSE.desktop/contents/components/UserDelegate.qml:82:9: QML Image: Cannot open: file:///usr/share/plasma/look-and-feel/org.openSUSE.desktop/contents/components/system-switch-user
Detaching after fork from child process 3263.
[Thread 0x7fffdb6a9700 (LWP 3260) exited]

Program received signal SIGSEGV, Segmentation fault.
0x00007fffee6d41b8 in __lll_unlock_elision () from /lib64/libpthread.so.0
(gdb) bt
#0  0x00007fffee6d41b8 in __lll_unlock_elision () at /lib64/libpthread.so.0
#1  0x00007fffd80ffccc in  () at /usr/lib64/libEGL_nvidia.so.0
#2  0x00007fffd808d252 in  () at /usr/lib64/libEGL_nvidia.so.0
#3  0x00007fffffffdc70 in  ()
#4  0x00007fffd81150b1 in  () at /usr/lib64/libEGL_nvidia.so.0
#5  0x0000000000000000 in  ()
(gdb) 


=====================================

cat /proc/cpuinfo 
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
stepping        : 3
microcode       : 0x33
cpu MHz         : 800.000
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm hwp hwp_notify hwp_act_window hwp_epp intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1
bugs            :
bogomips        : 6383.86
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 1
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
stepping        : 3
microcode       : 0x33
cpu MHz         : 800.000
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 4
apicid          : 2
initial apicid  : 2
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm hwp hwp_notify hwp_act_window hwp_epp intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1
bugs            :
bogomips        : 6383.86
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 2
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
stepping        : 3
microcode       : 0x33
cpu MHz         : 800.000
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 2
cpu cores       : 4
apicid          : 4
initial apicid  : 4
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm hwp hwp_notify hwp_act_window hwp_epp intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1
bugs            :
bogomips        : 6383.86
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:

processor       : 3
vendor_id       : GenuineIntel
cpu family      : 6
model           : 94
model name      : Intel(R) Core(TM) i5-6500 CPU @ 3.20GHz
stepping        : 3
microcode       : 0x33
cpu MHz         : 800.000
cache size      : 6144 KB
physical id     : 0
siblings        : 4
core id         : 3
cpu cores       : 4
apicid          : 6
initial apicid  : 6
fpu             : yes
fpu_exception   : yes
cpuid level     : 22
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch ida arat epb pln pts dtherm hwp hwp_notify hwp_act_window hwp_epp intel_pt tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm mpx rdseed adx smap clflushopt xsaveopt xsavec xgetbv1
bugs            :
bogomips        : 6383.86
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
Comment 1 Nico Kruber 2015-11-29 23:14:55 UTC
OK, even after installing the new ucode-intel (and ucode-intel-blob) packages from Base:System (which re-creates the initrd), the microcode revision is unchanged: either there are no updates for this CPU, or the file is not being used (although mkinitrd reported the use of GenuineIntel.bin).

# dmesg | grep microcode
[    1.479081] microcode: CPU0 sig=0x506e3, pf=0x2, revision=0x33
[    1.479090] microcode: CPU1 sig=0x506e3, pf=0x2, revision=0x33
[    1.479094] microcode: CPU2 sig=0x506e3, pf=0x2, revision=0x33
[    1.479101] microcode: CPU3 sig=0x506e3, pf=0x2, revision=0x33
[    1.479136] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba


Anyway, I currently don't see any workaround except for installing the nouveau driver, which has worse game performance and draws ~12 W more power at idle :(
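
To compare the loaded microcode revision before and after a ucode-intel update, the revision can be pulled out of the dmesg lines above. A minimal sketch: `microcode_revisions` is a made-up helper name, and the pattern only assumes the line format shown in this comment.

```shell
# Hypothetical helper: read dmesg-style text on stdin and print one
# "CPUn revision" pair per per-CPU microcode line.
microcode_revisions() {
    # Lines without "CPUn sig=... revision=..." (e.g. the driver banner)
    # do not match and are dropped because of -n ... p.
    sed -n 's/.*microcode: \(CPU[0-9]*\) sig=.* revision=\(0x[0-9a-f]*\).*/\1 \2/p'
}
```

Running `dmesg | microcode_revisions` before and after the update and comparing the two outputs shows whether a newer revision was actually loaded.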
Comment 2 Nico Kruber 2015-11-30 09:41:29 UTC
FYI: according to iucode_tool, the 20151106 microcode release from https://downloadcenter.intel.com/search?keyword=Linux*+Processor+Microcode+Data+File contains no microcode update for my CPU:

> /usr/sbin/iucode_tool -tb -lS <path_to_extracted_microcodes>
/usr/sbin/iucode_tool: system has processor(s) with signature 0x000506e3
selected microcodes:
>
Comment 3 Nico Kruber 2015-11-30 23:37:16 UTC
Actually, the problem does not only affect plasma5: today I encountered it with gnuplot, simply by starting the CLI and exiting right afterwards (without any input).

I did (finally) find a bug report in the NVIDIA forums, though:
https://devtalk.nvidia.com/default/topic/893325/linux/newest-and-beta-linux-driver-causing-segmentation-fault-core-dumped-on-all-skylake-platforms/

I'm aware that openSUSE probably cannot change this and that we need to wait for NVIDIA to fix the original issue. In the meantime, maybe this should be mentioned in https://en.opensuse.org/openSUSE:Most_annoying_bugs_42.1 or the release notes so people do not blame the distribution.


Nico
Comment 4 Nico Kruber 2015-12-01 00:16:36 UTC
FYI: a temporary workaround is to re-install the following packages _AFTER_ the nvidia binary installer has run:

Mesa-libEGL1 Mesa-libEGL1-32bit

-> this will set the Mesa EGL libs to be used which do not have this bug
Comment 5 Tomáš Chvátal 2015-12-08 12:17:40 UTC
This looks like a packaging problem in the nvidia drivers... Stefan, any idea?
Comment 6 Stefan Dirsch 2015-12-08 13:40:57 UTC
(In reply to Nico Kruber from comment #4)
> FYI: a temporary workaround is to re-install the following packages _AFTER_
> the nvidia binary installer has run:
> 
> Mesa-libEGL1 Mesa-libEGL1-32bit
> 
> -> this will set the Mesa EGL libs to be used which do not have this bug

I seriously doubt this. When the nvidia-glG0X package is installed, the libEGL libs by NVIDIA are preferred (via the /usr/X11R6/lib{,64} entries in /etc/ld.so.conf.d/nvidia-gfxG0X). Reinstalling the Mesa EGL lib packages doesn't change this.
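
One way to see which EGL libraries the dynamic loader knows about is to filter the loader cache listing. A minimal sketch, assuming the usual `ldconfig -p` line format (`name (flags) => path`); `egl_candidates` is a hypothetical helper name.

```shell
# Hypothetical helper: read `ldconfig -p` style text on stdin and print
# the resolved path of every cache entry whose name starts with "libEGL"
# (this also matches vendor libraries such as libEGL_nvidia.so.0).
egl_candidates() {
    awk '$1 ~ /^libEGL/ && $(NF-1) == "=>" { print $NF }'
}
```

`ldconfig -p | egl_candidates` then lists the cached libEGL paths, which makes it visible whether the NVIDIA or the Mesa copies are registered.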
Comment 7 Stefan Dirsch 2015-12-08 13:52:53 UTC
Ok. Not sure how I can help here. If the CPU claims to have the TSX flag, the software tries to use it. Who's wrong here? Firmware maybe still? Compiler? glibc? Or nvidia libs themselves? I don't know. I'm adding my contact at NVIDIA.
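
The TSX capability referred to here appears as the `hle` and `rtm` entries in the flags line of the /proc/cpuinfo dump in the description. A small sketch for checking another machine; `tsx_flags` is a made-up helper name.

```shell
# Hypothetical helper: read cpuinfo-style text on stdin and print the
# TSX-related flags (hle, rtm) found on the first "flags" line.
tsx_flags() {
    awk '
        /^flags/ {
            out = ""
            # Walk all whitespace-separated fields and keep exact matches only.
            for (i = 1; i <= NF; i++)
                if ($i == "hle" || $i == "rtm")
                    out = out (out == "" ? "" : " ") $i
            print out
            exit    # all cores report the same flags; one line is enough
        }'
}
```

`tsx_flags < /proc/cpuinfo` prints `hle rtm` for the CPU from the description and nothing for a CPU that advertises neither flag.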
Comment 8 Stefan Dirsch 2015-12-08 13:54:29 UTC
Daniel, are you aware of this issue?
Comment 9 Nico Kruber 2015-12-08 14:05:10 UTC
FYI: the people over at Arch Linux [1] claim that the next NVidia driver will have this bug fixed. Unfortunately, the driver is not out yet.

[1] https://bugs.archlinux.org/task/46064?project=1
Comment 10 Stefan Dirsch 2015-12-08 14:28:04 UTC
Possible temporary workaround:

Add

  /lib64/noelision

to /etc/ld.so.conf on an affected system. Remove this entry again once NVIDIA has fixed the issue in their drivers.

Does this help?
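
The workaround can be applied idempotently with a small helper. This is a sketch only: `add_conf_entry` is a hypothetical name, the entry is simply appended (the comment does not say whether its position in the file matters), and running `ldconfig` afterwards to rebuild the loader cache is an assumption based on how ld.so.conf changes normally take effect.

```shell
# Hypothetical helper: append ENTRY as its own line to CONF_FILE unless
# an identical line is already present.
add_conf_entry() {
    conf_file=$1
    entry=$2
    # -x: match the whole line, -F: literal string, -q: no output
    if ! grep -qxF -- "$entry" "$conf_file" 2>/dev/null; then
        printf '%s\n' "$entry" >> "$conf_file"
    fi
}
```

As root: `add_conf_entry /etc/ld.so.conf /lib64/noelision && ldconfig`. Once a fixed driver is installed, delete the line again and re-run `ldconfig`.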
Comment 11 Nico Kruber 2015-12-08 20:29:35 UTC
(In reply to Stefan Dirsch from comment #10)
> Add
> 
>   /lib64/noelision
> 
> to /etc/ld.so.conf 

Thank you for this (better) workaround. It seems to work.
Comment 12 Daniel Dadap 2015-12-08 22:34:48 UTC
In response to Stefan's question in comment #8: yes, NVIDIA is aware of the issue, and a fix has been identified and will hopefully be released soon.
Comment 13 Stefan Dirsch 2015-12-09 10:00:53 UTC
Nico, thanks for the quick response.
Daniel, good to hear this. :-)
Comment 15 Oliver Neukum 2015-12-09 10:26:26 UTC
(In reply to Daniel Dadap from comment #12)
> In response to Stefan's question in comment #8: yes, NVIDIA is aware of the
> issue, and a fix has been identified and will hopefully be released soon.

Hi, what is the fix? It seems to me that the root cause is in libpthread.
Comment 16 Daniel Dadap 2015-12-10 01:31:27 UTC
(In reply to Oliver Neukum from comment #15)
> Hi, what is the fix? It seems to me that the root cause is in libpthread.

A lock was being destroyed twice, which is undefined behavior. This was a bug in the NVIDIA EGL driver, which so far seems to result in adverse effects only when lock elision is enabled in glibc.
Comment 17 Dr. Werner Fink 2016-01-04 19:08:24 UTC
*** Bug 960574 has been marked as a duplicate of this bug. ***
Comment 18 Nico Kruber 2016-01-13 16:27:01 UTC
Finally, version 361.18 (beta) [1] solves this issue and is usable (as opposed to 361.16 beta).

However, due to the integration of the OpenGL Vendor-Neutral Driver (GLVND) infrastructure, I had to re-install (via forced update) all xorg- and mesa-related packages, since the old nvidia driver installer had removed some libraries. This may be different if the driver is installed via packages, where only a few package re-installs may be required (e.g. the libgl-related ones), but I wanted to be on the safe side.

Should we close this bug, or should we wait for new rpm packages to be available in the repo?


[1] http://www.geforce.com/drivers/beta-legacy, http://www.geforce.com/drivers/results/97474
Comment 19 Dr. Werner Fink 2016-01-14 08:02:42 UTC
(In reply to Nico Kruber from comment #18)

> Should we close this bug, or should we wait for new rpm packages to be 
> available in the repo?

For most users the rpm way is required, I guess, because the fine art of deep debugging is a special ability.
Comment 21 Stefan Dirsch 2016-01-26 14:10:48 UTC
Packages with the fixed driver (352.79) have been prepared and are expected to be released in the repo by the end of this week.
Comment 22 Stefan Dirsch 2016-01-26 14:11:15 UTC
Closing as fixed.