Bugzilla – Bug 903874
Illegal instruction on Core-i5 after s2ram with microcode 0x1c
Last modified: 2015-02-25 15:52:31 UTC
Upgrading from OS-13.1 to OS-13.2 on two different Computers, I observed on terminal: /sbin/yast2: line 440: 8269 Illegal instruction $ybindir/y2base $module "$@" "$SELECTED_GUI" $Y2_GEOMETRY $Y2UI_ARGS while using YaST. I am also observing this 'Illegal instruction' sometimes by issuing a zypper command. Immediately after repeating the command, it mostly works. I think the same behavior/error appears by starting apps as a user, which sometimes simply do not start, only if one starts again, e.g. often upon starting okular. This I have observed on two different computers. Michael
Could you please give the output from cat /proc/cpuinfo I remember that there was a problem with ruby and old processors because of CPU detection happening at compile-time. can you also reproduce this by calling yast2 manually?
Here the output of: su-wptnb8 /root: cat /proc/cpuinfo processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 60 model name : Intel(R) Core(TM) i5-4670S CPU @ 3.10GHz stepping : 3 microcode : 0x1c cpu MHz : 1010.648 cache size : 6144 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid bogomips : 6186.22 clflush size : 64 cache_alignment : 64 address sizes : 39 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 60 model name : Intel(R) Core(TM) i5-4670S CPU @ 3.10GHz stepping : 3 microcode : 0x1c cpu MHz : 899.968 cache size : 6144 KB physical id : 0 siblings : 4 core id : 1 cpu cores : 4 apicid : 2 initial apicid : 2 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid bogomips : 6186.22 clflush size : 64 cache_alignment : 64 address sizes : 39 bits physical, 48 bits virtual power management: processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 60 model name : Intel(R) Core(TM) i5-4670S CPU @ 3.10GHz stepping : 3 microcode : 0x1c cpu MHz : 926.730 cache size : 6144 KB physical id : 0 siblings : 4 core id : 2 cpu cores : 4 apicid : 4 initial apicid : 4 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid bogomips : 6186.22 clflush size : 64 cache_alignment : 64 address sizes : 39 bits physical, 48 bits virtual power management: processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 60 model name : Intel(R) Core(TM) i5-4670S CPU @ 3.10GHz stepping : 3 microcode : 0x1c cpu MHz : 991.273 cache size : 6144 KB physical id : 0 siblings : 4 core id : 3 cpu cores : 4 apicid : 6 initial apicid : 6 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid bogomips : 6186.22 clflush size : 64 cache_alignment : 64 address sizes : 39 bits physical, 48 bits virtual power management:
The problem is, it happens not all the time. As an observation, it happens often after a while not using this command. And indeed, I always start yast2 from terminal. But as I mentioned, I have the feeling, that other programs are as well affected. Every once a while starting an app (e.g. firefox, okular) those programs do not start only the KDE-start-icon is jumping and only a second klick do really start the program.
the ruby problem was about SSE2. his machine has SSE2. and we patched the ruby build, to not include the SSE2 instructions. this smells more like broken memory/cpu.
If it is a broken cpu/memory, it must be related in some sense to OS-13.2, because I am running this computer since OS-13.1 without any issue and it happened immediately after upgrading.
FWIW, the SSE2 bug was https://bugzilla.opensuse.org/show_bug.cgi?id=889714 You may want to install memtest and reboot to run it from the boot menu: sudo zypper in memtest86+
Ok, I have made the memtest, and it completed without errors.
What can also be done to further trace this down: issue on the shell before the crashing commands: ulimit -c unlimited On crash this will produce a "core" file which you can attach here and one can use it (with debuginfo packages) to locate the faulty instruction
OK, I will try it with ulimit, here is another command with this issue: /sbin/service: line 68: 15282 Illegal instruction systemctl "$2" --full "$1.service" which has appeared while using zypper up.
May be it is related to suspend/resume cycles? I have observed, if I do not suspend the computer, those errors actually do not appear. My computer is 24/7 on and after 7 days without suspending, everything works pretty well. As soon as I resume, the error appears within a few hours/minutes (depends on workload). This is more or less reproduce able. M
Is that suspend-to-RAM or suspend-to-disk? Can you try out the other variant to see if that also causes this problem? The command is s2ram or s2disk (needing some setup in /etc/suspend.conf) My guess is that the problem is somewhere between the hardware, ACPI and the kernel maybe with some corruption of memory happening. To give us a better idea about the hardware, please attach the output of dmidecode
Created attachment 614512 [details] dmidecode.txt
It is s2ram where problem appears. I have: su-wptnb8 /root: free total used free shared buffers cached Mem: 16371812 11054944 5316868 274136 544536 6035040 -/+ buffers/cache: 4475368 11896444 Swap: 4192252 0 4192252 Attached is the dmidecode-output as a file.
Could you enable the core dump generation and analyze where it crashes? This should be the first step.
I see the same issue on my home machine, similar CPU, same microcode level and stepping: cpu family : 6 model : 60 model name : Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz stepping : 3 microcode : 0x1c Definitely corelated with suspend to RAM. I'll enable core dump writing now.
Created attachment 616798 [details] core file of a crashed "okular" binary after s2ram Funny enough, it will start flawlessly on the second attempt...
This behaviour actually does happen for all commands you issue. On console you have to type sometimes two or three times and eventually it works. Ans sometimes you have two click more on okular or other apps.
(In reply to Joerg Reuter from comment #15) > I see the same issue on my home machine, similar CPU, same microcode level > and stepping: > > cpu family : 6 > model : 60 > model name : Intel(R) Core(TM) i5-4570 CPU @ 3.20GHz > stepping : 3 > microcode : 0x1c > > Definitely corelated with suspend to RAM. I'll enable core dump writing now. Hrm, relevant with microcode? What if you downgrade/upgrade ucode-intel package? I suspected drm/i915 problem at first, but it was about S4 in the past...
I've now removed the ucode-intel package, recreated the initrd (IIRC the microcode gets loaded from there, but I may be mistaken), rebooted, lo and behold: microcode 0x16 as loaded by the BIOS does NOT seem to show this behaviour after each of two suspended sessions. So this appears to be microcode related, indeed.
Oh, I didn't expect so much my blind shot really hits. For checking whether it's a kernel side regression or a microcode regression, could you try the following? - Test openSUSE 13.1 kernel and SLE12 kernel with the latest ucode-intel (20140913) enabled - Downgrade to openSUSE 13.1-Update ucode-intel package (20130906): http://download.opensuse.org/update/13.1/x86_64/ucode-intel-20130906-6.2.x86_64.rpm BTW, yes, recreating initrd is needed (unless it's triggered automatically by rpm). You can check the output of lsinitrd whether it includes the early microcode or not.
I finally found the time to test it: - With ucode-intel-20130906-6.2 the microcode level is still 0x16, thus no issues after resume (yes, tested.) - With ucode-intel-20140913-1.1 / microcode level 0x1c, each - kernel-desktop-3.11.10-21.1 / oS 13.1 - kernel-default-3.12.28-4.6 / SLE 12 - kernel-desktop-3.16.6-2.1 / oS 13.2 shows segfaults after s2ram.
Thanks Joerg. So, this looks really like a regression of CPU microcode, at least, for certain model/revision. Keve, could you ask your team to investigate this?
Can you please use gdb on any of the segfaulting programs and dump the faulting instruction? x/i $pc Thanks.
Each and every segfault I've seen happens at: (gdb) x/i $pc => 0x7fa42aa9312b <__lll_lock_elision+75>: xbeginq 0x7fa42aa93131 <__lll_lock_elision+81> (gdb)
(__lll_lock_elision() is a function of /lib64/libpthread.so.0)
*google-di-google* https://lkml.org/lkml/2014/9/18/218 (via https://bugzilla.redhat.com/show_bug.cgi?id=1146967) *headdesk*
Ok this is a known problem. What happens is that the ucode update disables TSX, so disables a bit in the CPUID. It also makes the TSX instructions fault. But you have a TSX enabled platform and glibc pthread is using TSX. It only checks the CPUID at the beginning. So if you load the ucode update after you started a program using pthreads it has already checked TSX CPUID and may be using it, and suddenly it gets disabled, so it gets these faults. The recommended way to handle it is to load this update only with the early microcode update from the initrd. This does the update when nothing is running that could be already using TSX. Alternatively you can disable TSX in the glibc. The first option is better.
In this case, the new microcode got loaded from the initrd already, though. Why is glibc using affected instructions after wake-up from suspend to RAM all of a sudden?
Just to clarify you start the programs immediately after s2ram and they fault? If you start them like 30s later does it still happen? If yes: Somehow you must be racing with a microcode update. When the CPU comes out of S3 the microcode has been reset, so it gets updated. It could be that some of the CPUs being onlined update later.
Could you please upload dmesg after failure?
Created attachment 617916 [details] dmesg after wake-up Sorry that it takes a while, I'm reproducing this on my workstation at home. BTW, I was still getting segfaults seven minutes after wake-up...
And the processes that segfault minutes later also have been started minutes later, or could they have started before or during the suspend?
As far as I can see only newly started processes are affected (the crashing sshd is that of a new SSH session, the sshd process that listens to incoming connections keeps working, as well as ssh sessions started before suspend)
So how and when does glibc autotect if it can use TSX / xbeginq ? And how and when is microcode re-loaded after resume from S3? Because it apears as if glibc thinks, it is OK to use TSX, but when it actually does, the CPU disagrees and the process gets a SIGILL.
(In reply to Bernhard Wiedemann from comment #34) > So how and when does glibc autotect if it can use TSX / xbeginq ? It checks CPUID (RTM bit) AFAIK. > And how and when is microcode re-loaded after resume from S3? The microcode is reloaded at enable_nonboot_cpu(). It invokes load_ucode_ap() in cpu_init() for each secondary CPUs, and this loads the early microcode in the end. You can see the dmesg output: % grep microcode dmesg-after-wakeup.txt [ 0.000000] CPU0 microcode updated early to revision 0x1c, date = 2014-07-03 [ 0.057385] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03 [ 0.073484] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03 [ 0.089577] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03 [ 0.779579] microcode: CPU0 sig=0x306c3, pf=0x2, revision=0x1c [ 0.779585] microcode: CPU1 sig=0x306c3, pf=0x2, revision=0x1c [ 0.779589] microcode: CPU2 sig=0x306c3, pf=0x2, revision=0x1c [ 0.779595] microcode: CPU3 sig=0x306c3, pf=0x2, revision=0x1c [ 0.779633] microcode: Microcode Update Driver: v2.00 <tigran@aivazian.fsnet.co.uk>, Peter Oruba [ 106.931202] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03 [ 106.945339] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03 [ 106.959480] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03 The last three entries should correspond to the microcode application at resume. > Because it apears as if glibc thinks, it is OK to use TSX, > but when it actually does, the CPU disagrees and the process gets a SIGILL. Yes, that's the puzzling part. All look fine only judging from the log, but something is missing...
[ 106.931202] CPU1 microcode updated early to revision 0x1c, date = 2014-07-03 [ 106.945339] CPU2 microcode updated early to revision 0x1c, date = 2014-07-03 [ 106.959480] CPU3 microcode updated early to revision 0x1c, date = 2014-07-03 [ 121.706665] do_trap: 129 callbacks suppressed Illegal instr happens 15 seconds after microcode updated on CPU1-CPU3. So it's unlikely to be a race condition issue. Does S4 suspend to disk hit the issue?
Perhaps not all cores get correctly updated. Could you install msr-tools and run sudo /usr/sbin/rdmsr -a 0x8b before and after the suspend? The number should be always the same and show the same signature as grep microcode /proc/cpuinfo
Hibernate / suspend to disk doesn't work on this machine (swap partition too small.) rdmsr -a 0x8b output before suspend: 1c00000000 1c00000000 1c00000000 1c00000000 after suspend: 1600000000 1c00000000 1c00000000 1c00000000 Woops.
(In reply to Joerg Reuter from comment #38) > Hibernate / suspend to disk doesn't work on this machine (swap partition too > small.) > > rdmsr -a 0x8b output before suspend: > > 1c00000000 > 1c00000000 > 1c00000000 > 1c00000000 > > after suspend: > > 1600000000 > 1c00000000 > 1c00000000 > 1c00000000 > > Woops. So the microcode update is needed for the boot CPU at S3 resume, too? The dmesg showed only the microcode uploads of non-boot CPUs at resume indeed. Adding more relevant people into the loop...
This should all be fixed upstream. It is only in Linus' tree currently so if you're feeling particularly brave, you could try one of the vanilla KOTDs: http://kernel.opensuse.org/packages/vanilla
The current KotD doesn't even properly suspends my machine (screen goes dark, USB device get sent to sleep, but power LED stays on), thus I can't test resume with it...
I'm not surprised. Ok, I'll have backports for 3.16 (oS 13.2) soon, you could test them then ontop of the oS kernel instead. I'll let you know.
Ok, I've backported the upstream fixes, Joerg, can you test please: http://beta.suse.com/private/bpetkov/kernel-3.16.7_default+-6.x86_64.rpm ? Thanks.
Looks good, after three resumes so far all CPUs got their microcode updated correctly: lycaon:~ # rdmsr -a 0x8b 1c00000000 1c00000000 1c00000000 1c00000000 lycaon:~ #
Good, thanks for testing. Closing.
I'm experiencing this bug on an up-to-date 13.2 x64 system with a very similar Haswell processor, despite the fact that it was fixed upstream. Could something be wrong with my configuration to cause this, or is it a possible regression? kernel 3.16.7-7-desktop ucode-intel 20140913-4.1 Output of cat /proc/cpuinfo: processor : 0 vendor_id : GenuineIntel cpu family : 6 model : 60 model name : Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz stepping : 3 microcode : 0x1c cpu MHz : 3699.093 cache size : 6144 KB physical id : 0 siblings : 4 core id : 0 cpu cores : 4 apicid : 0 initial apicid : 0 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid bogomips : 6798.47 clflush size : 64 cache_alignment : 64 address sizes : 39 bits physical, 48 bits virtual power management: processor : 1 vendor_id : GenuineIntel cpu family : 6 model : 60 model name : Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz stepping : 3 microcode : 0x1c cpu MHz : 2200.039 cache size : 6144 KB physical id : 0 siblings : 4 core id : 1 cpu cores : 4 apicid : 2 initial apicid : 2 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid bogomips : 6798.47 clflush size : 64 cache_alignment : 64 address sizes : 39 bits physical, 48 bits virtual power management: processor : 2 vendor_id : GenuineIntel cpu family : 6 model : 60 model name : Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz stepping : 3 microcode : 0x1c cpu MHz : 2203.757 cache size : 6144 KB physical id : 0 siblings : 4 core id : 2 cpu cores : 4 apicid : 4 initial apicid : 4 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid bogomips : 6798.47 clflush size : 64 cache_alignment : 64 address sizes : 39 bits physical, 48 bits virtual power management: processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 60 model name : Intel(R) Core(TM) i5-4670 CPU @ 3.40GHz stepping : 3 microcode : 0x1c cpu MHz : 2305.625 cache size : 6144 KB physical id : 0 siblings : 4 core id : 3 cpu cores : 4 apicid : 6 initial apicid : 6 fpu : yes fpu_exception : yes cpuid level : 13 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid bogomips : 6798.47 clflush size : 64 cache_alignment : 64 address sizes : 39 bits physical, 48 bits virtual power management:
(In reply to Ryan Kingsbury from comment #46) > I'm experiencing this bug on an up-to-date 13.2 x64 system with a very > similar Haswell processor, despite the fact that it was fixed upstream. > Could something be wrong with my configuration to cause this, or is it a > possible regression? More likely the backported patches haven't trickled down to the kernel update package. You can try the KOTD from here in the meantime: http://kernel.opensuse.org/packages/openSUSE-13.2
For my system the latest offical SuSE-patch resolves the problem and everything is working.
Hmm...latest official updates definitely do not solve it for me. I reinstalled kernel-desktop and ucode-intel just to be sure. How can I determine whether the patches have actually made it into the official updates? This bug is very disruptive for my workflow and I don't want to wait around for an update if it's actually a different issue.
Sory, for me it is also not solved. I have mad a mistake. The error usually appears for me only after I have used the computer for a while and a lot of memory is used. So the bug actually appears, right yesterday morning after a resume. Anyway, so far I have only applied the official OpenSuSE-13.2 patches, may be they are not yet included.
As I said already in comment#47, the patches are not in the official updates yet. And as I said in that same comment, you should try the Kernel Of The Day here: http://kernel.opensuse.org/packages/openSUSE-13.2 There are instructions on that page how to install it.